previous project - kindle annotations
by Harshvardhan J. Pandit
is part of: klip - Kindle Annotations
automation kindle python regex
Background
I own a Kindle 4, which my mother gifted me 5-6 years back. Since then,
I've read hundreds of books on it and saved thousands of annotations.
I've lost all these annotations twice: once when I accidentally emptied
my entire Kindle, and again when I did the same thing a second time.
Since then, I've come to realise that the annotations are stored in a
single file called clippings.txt located in Documents. The format of
this file changes with each iteration of the Kindle, and sometimes with
certain updates.
This was the time I was under the influence of regex. Not to sound
idiotic, but I liked the power of its expressions, and as is the case
with a hammer in hand, everything looked like a nail. So I engineered a
way to parse each individual annotation out of the file using regex.
The project was in Python, and the only module used was re. I engineered
the solution using a mangled version of a state machine that iterated
over each line and, depending on what it had parsed before, executed
some action.
Format of clippings
A typical annotation on the Kindle 4 looks something like this:
==========
The Fountainhead (Ayn Rand)
- Highlight Loc. 13169 | Added on Saturday, 26 July 14 21:37:48 GMT+01:00
I could die for you. But I couldn’t and wouldn’t live for you.”
==========
All sections (or annotations) are separated by a line populated with
only the character =. This means that whenever the parser encounters a
line starting with =, it assumes that this is the start of an annotation.
This is followed by the title of the book with the name of the author
enclosed in brackets. After that comes the type of annotation, which can
be a highlight or a bookmark, followed by the location of that annotation
in the book and the date it was added, separated by |. This is followed
by a blank line and then the text of the annotation.
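To make the layout concrete, here is a minimal sketch that pulls the fields out of the two header lines. The names TITLE_RE, META_RE and parse_header are hypothetical and not from the project, the patterns are written only against the sample entry above, and the Bookmark alternative assumes bookmarks share the same line layout.

```python
import re

# Hypothetical patterns, written only against the sample entry shown above.
TITLE_RE = re.compile(r"^(?P<title>.+?)\s*\((?P<author>[^()]*)\)$")
META_RE = re.compile(
    r"^- (?P<kind>Highlight|Bookmark) Loc\. (?P<location>[\d-]+)"
    r"\s*\|\s*Added on (?P<added>.+)$"
)


def parse_header(title_line, meta_line):
    """Return a dict of header fields, or None if either line fails to match."""
    title = TITLE_RE.match(title_line.strip())
    meta = META_RE.match(meta_line.strip())
    if title is None or meta is None:
        return None
    # Merge the named groups of both matches into one record.
    return {**title.groupdict(), **meta.groupdict()}


header = parse_header(
    "The Fountainhead (Ayn Rand)",
    "- Highlight Loc. 13169 | Added on Saturday, 26 July 14 21:37:48 GMT+01:00",
)
print(header)
```

The date is kept as a plain string here, since its format is one of the things that varies between Kindle versions.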
State machine
The start state checks whether the line starts with a = character. If it
does, it signals the start of an annotation, and the next line holds the
book title and author. After that, it needs to check whether the
annotation is a highlight, which can be done by checking whether the
line starts with - Highlight. If it does, skip the blank line and gobble
the text of the annotation.
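{
# a separator line marks the start of an annotation
'check_breakpoint': lambda x: x.startswith('='),
# only highlights carry text worth extracting
'check_is_highlight': lambda x: x.startswith('- Highlight'),
# raw strings keep the backslashes intact for the re module
# group 1 is the book title, group 2 is the author inside the brackets
'book_info_regex': r"^([a-zA-Z']+\s*[a-zA-Z\._';\:,\s\d]*)\((.*)\)$",
# the highlight text is simply the whole line
'highlight_regex': r'^(.*)$',
}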
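For illustration, the line-by-line walk described above could look roughly like the sketch below. This is not the project's code: the function name parse_clippings and the state names are hypothetical, and the sketch assumes every entry follows the sample layout (separator, title line, type line, optional blank line, text).

```python
def parse_clippings(lines):
    """Yield (title_line, text) for each highlight, one input line at a time."""
    state = "seek_separator"
    title, text = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith("="):
            # A separator closes any highlight in progress and opens the next entry.
            if state == "text" and text:
                yield title, " ".join(text)
            state, title, text = "title", None, []
        elif state == "title":
            title, state = line, "meta"
        elif state == "meta":
            # Only highlights are kept; other entries are skipped until the next separator.
            state = "text" if line.startswith("- Highlight") else "seek_separator"
        elif state == "text" and line:
            # Blank lines are ignored, so the blank line before the text is skipped.
            text.append(line)
    # Flush the last highlight if the input does not end with a separator.
    if state == "text" and text:
        yield title, " ".join(text)


sample = """==========
The Fountainhead (Ayn Rand)
- Highlight Loc. 13169 | Added on Saturday, 26 July 14 21:37:48 GMT+01:00

I could die for you.
==========
Some Book (Some Author)
- Bookmark Loc. 12 | Added on Sunday, 27 July 14 10:00:00 GMT+01:00
==========
"""
highlights = list(parse_clippings(sample.splitlines()))
print(highlights)
```

Each branch corresponds to one state, so what happens to a line depends only on what was parsed before it.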
My naive previous self did not understand that the entire annotation could have been extracted using a single regular expression. Nevertheless, the code can be found on GitHub.
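As a rough idea of what that single expression might look like, here is a sketch written against the sample entry above. The name ENTRY_RE and the group names are hypothetical, and the pattern assumes newlines are already normalised to \n, that the highlight text sits on one line, and that the blank line before the text may or may not be present.

```python
import re

# Hypothetical single pattern covering one whole highlight entry.
ENTRY_RE = re.compile(
    r"^=+[ \t]*\n"                                               # separator line
    r"(?P<title>[^\n]*?)[ \t]*\((?P<author>[^()\n]*)\)[ \t]*\n"  # title (author)
    r"- Highlight Loc\. (?P<location>[\d-]+)[ \t]*\|[ \t]*"      # type and location
    r"Added on (?P<added>[^\n]*)\n"                              # date string
    r"\n?"                                                       # optional blank line
    r"(?P<text>[^\n]*)$",                                        # highlighted text
    re.MULTILINE,
)

sample = """==========
The Fountainhead (Ayn Rand)
- Highlight Loc. 13169 | Added on Saturday, 26 July 14 21:37:48 GMT+01:00

I could die for you.
==========
Some Book (Some Author)
- Bookmark Loc. 12 | Added on Sunday, 27 July 14 10:00:00 GMT+01:00
==========
"""

entries = [m.groupdict() for m in ENTRY_RE.finditer(sample)]
print(entries)
```

Bookmarks simply fail to match and are skipped, which replaces the explicit highlight check of the state machine.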