I found out about imdbnator.com - it's a nifty little website that reads a folder containing movie files, parses them, and presents a neat index of movie metadata from IMDb. That's exactly what HDD-indexer aims to do. Or does. Whatever.
The point is, it's great that such a tool exists out there, doesn't require any form of installation, and serves up a pretty list of movies on disk. It even involves movie posters! As of now, HDD-indexer has only the basic movie metadata such as title, release date, and a few ratings. I do plan to eventually add more fields to it, but I think those are secondary compared to finishing all the modules first.
Being aware of imdbnator.com has given me new aims and goals as to what I want to do with HDD-indexer. I emailed the developer, looking to collaborate or get some sort of help, but that didn't happen. The developer (great guy) informed me that he has no plans to open source his project any time soon. Upon persisting, he did give me a few tips, tricks, and things to think about.
Levenshtein distance - using this to get the most appropriate movie title based on filename. I had planned on using this anyhow. The idea came to me when I was looking at the output of the metadata_by_title functions that try to identify a movie based on its filename. It gave various options for titles, and some of them were way off the mark. I thought about using string similarity to get the closest match possible, and assign that as the identified movie. This search led me to the Levenshtein algorithm, which is the best candidate. It is fast, it is simple, and can be used directly on the database (SQL queries) if need be.
Offline database - here's where the tricky part comes in. As per the suggestion, having an offline copy of the database allows much faster search and resolutions. Sounds great, I know it works, and is a great model for static servers. But HDD-indexer is a portable utility that moves with the disk. So I need all that amount of data in a portable database that can be easily loaded into the program. SQLite struggles with huge amounts of data. So I'm in a pickle here as to how to go about it.
I only need the database for READ access, since I'm not going to change any information on it. All the writes and changes are going to be in another database, which will be the user's database. This read-only approach makes most sense because SQLite has a global write lock, which means it locks the entire database when writing. Not good for performance. Having a separate database with read-only access would allow concurrent multiple threads to access it without any problems. Only the primary keys referring to rows in this IMDb database could be saved in memory, and then later written to the user database in a single thread.
Integrating the database into Django seems to be possible. It supports multiple databases which are separate and can adhere to different models. It also supports a (great) tool called inspectdb, that creates models based on existing database schemas. So theoretically, I can construct the IMDb database, use inspectdb to create models for it, and use these in HDD-indexer. The only thing that is uncertain over here is the kind of performance it would return.
Integrating such an offline flow is great, because Internet over here still seems like a luxury rather than a commodity. How well this approach integrates with the online-metadata approach is something that I should be able to understand well as I develop this later on. For now, I'm more focused on the v0.3 release, which would introduce the Exporter and Organizer modules. I'm particularly excited about the Organizer, because it could (in theory) organize your entire movie collection like iTunes does with the help of metadata. For starters, there will be basic organization capabilities like release year and alphabetically. Later on, I plan to organize using complex conditions such as genre, series, and directors.