previous project - kindle annotations

The previous project of parsing kindle annotations
published: Fri May 26 2017 (updated: Fri May 26 2017)
by Harshvardhan J. Pandit
is part of: klip - Kindle Annotations
automation kindle python regex

Background

I own a Kindle 4, which my mother gifted me 5-6 years ago. Since then, I've read hundreds of books on it and saved thousands of annotations. I've lost all of these annotations twice. The first time was when I accidentally wiped my entire Kindle, and the second time was when I did the same thing again. Since then, I've come to realise that the annotations are stored in a single file called clippings.txt located in Documents. The format of this file changes with each generation of the Kindle, and sometimes with certain software updates.

This was the time when I was under the influence of regex. Not to sound idiotic, but I liked the power of its expressions, and as is often the case, with a hammer in my hand, everything looked like a nail. So I engineered a way to parse each individual annotation out of the file using regex. The project was written in Python, and the only module used was re. The solution was a somewhat mangled state machine that iterated over each line and, depending on what it had parsed before, executed some action.

Format of clippings

A typical annotation on the Kindle 4 looks something like this:

==========
The Fountainhead (Ayn Rand)
- Highlight Loc. 13169  | Added on Saturday, 26 July 14 21:37:48 GMT+01:00

I could die for you. But I couldn’t and wouldn’t live for you.”
==========

All sections (or annotations) are separated by a line populated with only the character =. This means that whenever the parser encounters a line starting with =, it assumes that this is the start of an annotation. This is followed by the title of the book, with the name of the author enclosed in parentheses. After that comes the type of annotation, which can be a highlight or a bookmark, along with the location of that annotation in the book and the date it was added, the two separated by |. This is followed by a blank line and then the text of the annotation.
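As a rough sketch (this is not the original code), the two header lines described above could be picked apart like this. The names TITLE_RE, META_RE and parse_header are made up for illustration, and the patterns assume the Kindle 4 layout from the sample; other Kindle generations would likely need different patterns.

```python
import re

# Assumed layout: "<title> (<author>)"
TITLE_RE = re.compile(r"^(?P<title>.+?)\s*\((?P<author>[^()]*)\)\s*$")

# Assumed layout: "- <kind> Loc. <location>  | Added on <date>"
META_RE = re.compile(
    r"^- (?P<kind>Highlight|Bookmark)\s+"
    r"(?:Loc\. (?P<location>[\d-]+)\s*)?"
    r"\|\s*Added on (?P<added>.+)$"
)


def parse_header(title_line, meta_line):
    """Return a dict of header fields, or None if either line does not match."""
    title = TITLE_RE.match(title_line.strip())
    meta = META_RE.match(meta_line.strip())
    if title is None or meta is None:
        return None
    return {**title.groupdict(), **meta.groupdict()}


header = parse_header(
    "The Fountainhead (Ayn Rand)",
    "- Highlight Loc. 13169  | Added on Saturday, 26 July 14 21:37:48 GMT+01:00",
)
print(header)
```

The date is kept as a plain string here, since its exact format is one of the things that varies between devices.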

State machine

The start state checks whether the line starts with a = character. If it does, it signals the start of an annotation. The next line holds the title and author of the book. After that, the parser needs to check whether the annotation is a highlight, which can be done by checking whether the line starts with - Highlight. If it does, it skips the blank line and gobbles the text of the annotation.

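{
    # True when the line is a '==========' separator between annotations
    'check_breakpoint': lambda x: x.startswith('='),
    # True when the metadata line describes a highlight
    'check_is_highlight': lambda x: x.startswith('- Highlight'),
    # Group 1 is the book title, group 2 the author (raw string keeps \s and \d intact)
    'book_info_regex': r"^([a-zA-Z']+\s*[a-zA-Z\._';\:,\s\d]*)\((.*)\)$",
    # The highlight text is simply the whole line
    'highlight_regex': r'^(.*)$',
}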
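To make the flow concrete, here is a minimal sketch of a line-by-line state machine in the same spirit. It is a reconstruction from the description above, not the code from the repository; the function name parse_clippings and the state names are made up for illustration, and it only keeps highlights.

```python
def parse_clippings(lines):
    """Yield (book_line, highlight_text) pairs from an iterable of lines."""
    state = "book"  # waiting for a title line
    book = None
    text = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("="):
            # Separator: emit what was collected and start over.
            if book is not None and text:
                yield book, "\n".join(text)
            state, book, text = "book", None, []
        elif state == "book":
            if line.strip():
                book = line.strip()
                state = "meta"
        elif state == "meta":
            # Only highlights are kept; anything else is skipped
            # until the next separator.
            state = "blank" if line.startswith("- Highlight") else "skip"
        elif state == "blank":
            # Normally an empty line; keep it only if it holds text.
            if line.strip():
                text.append(line)
            state = "text"
        elif state == "text":
            text.append(line)
    # Handle a file that does not end with a separator.
    if book is not None and text:
        yield book, "\n".join(text)


sample = [
    "==========",
    "The Fountainhead (Ayn Rand)",
    "- Highlight Loc. 13169  | Added on Saturday, 26 July 14 21:37:48 GMT+01:00",
    "",
    "I could die for you. But I couldn’t and wouldn’t live for you.”",
    "==========",
]
results = list(parse_clippings(sample))
print(results)
```

Treating the = line as "emit and reset" means it works whether the separator comes before or after each annotation.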

My naive previous self did not understand that the entire annotation could have been extracted using a single regular expression. Nevertheless, the code can be found on GitHub.
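For the curious, a single-expression version might look roughly like the sketch below. The pattern is an assumption built from the Kindle 4 sample above, and ANNOTATION_RE and extract_annotations are illustrative names, not part of the original project.

```python
import re

# One pattern for a whole annotation: title line, metadata line,
# blank line, text, and the closing separator.
ANNOTATION_RE = re.compile(
    r"^(?P<title>.+?) \((?P<author>[^()\n]*)\)\n"
    r"- (?P<kind>Highlight|Bookmark) (?:Loc\. (?P<location>[\d-]+))?[ \t]*"
    r"\| Added on (?P<added>.+)\n"
    r"\n"
    r"(?P<text>(?:.*\n)*?)"
    r"={10}$",
    re.MULTILINE,
)


def extract_annotations(raw):
    """Return one dict per annotation found in the file contents."""
    raw = raw.replace("\r\n", "\n")  # normalise line endings first
    return [
        {key: (value or "").strip() for key, value in m.groupdict().items()}
        for m in ANNOTATION_RE.finditer(raw)
    ]


sample = (
    "==========\n"
    "The Fountainhead (Ayn Rand)\n"
    "- Highlight Loc. 13169  | Added on Saturday, 26 July 14 21:37:48 GMT+01:00\n"
    "\n"
    "I could die for you. But I couldn’t and wouldn’t live for you.”\n"
    "==========\n"
)
annotations = extract_annotations(sample)
print(annotations)
```

The text group is lazy, so it stops at the first line made up of ten = characters, which lets finditer walk through a file one annotation at a time.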