HDD-indexer v0.1

First implementation

The first deployment of this utility focuses on basic operations: identifying movies on disk and downloading their metadata. Along with these operations, the setup and browser interface are kept simple to understand and use. The setup takes care of installing dependencies and basic configuration, and is split into two parts: one run in the shell, and the second in the browser. The shell part installs dependencies, creates the database with all related tables and schema, creates an admin account, starts the server, and opens a browser window to complete the remaining steps. This keeps installation and usage the same experience on all platforms.

The HDD Root is the mount directory (*NIX) or drive (Windows) where the disk can be accessed. It is an absolute path, so that the disk can be reached on the system it is mounted on. The Movie Folder is the place where all movie files are stored. Its path is stored relative to the HDD Root, so that the configuration remains the same across systems. The organisation and placement of contents on the disk stay the same irrespective of where it is attached; the only thing that changes is the place where the disk is accessed, which is the HDD Root path. Therefore, whenever the disk is attached to a system, the root path needs to be configured. Sometimes this path changes even when the disk is attached to the same system on which the root path was configured. In the future, the utility could possibly check whether the HDD is mounted on the same path or the root path has changed. This could (maybe) be done by checking the path the server is running from when it starts execution. If the program is on the disk, the root path will be the leading part of the server’s execution path. This check can be subverted with directories and symlinks arranged to mimic a valid disk path. But this case can be ignored, since it is unlikely that someone will tamper with the configuration in this manner, and even if they do, they only reduce or prevent their own use of the utility.
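The root path check described above is not implemented yet, but a minimal sketch of it could look like the following. The function name `root_matches` and its arguments are hypothetical. It compares whole path components instead of raw strings, because a plain substring test would also accept a sibling directory such as `/mnt/disk2` for the root `/mnt/disk`.

```python
from pathlib import Path


def root_matches(server_path, hdd_root):
    """Return True if server_path is hdd_root itself or lies below it.

    Hypothetical helper: both paths are resolved first, so relative
    segments and symlinks are normalised before comparing.
    """
    server = Path(server_path).resolve()
    root = Path(hdd_root).resolve()
    # Path.parents yields every ancestor directory of the server path.
    return server == root or root in server.parents
```

If this returns False when the server starts, the stored root path is stale and the user could be asked to configure it again.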

The only working modules at this time are browse, crawl, and load. The browse module allows viewing the movies already present in the database. It uses Django’s handy admin interface, which presents objects stored in the database in a simple tabular layout. The results can be sorted by column and filtered by field. In this case, movies can be filtered by release date or by actor and director. The movies section shows the title, release date, path relative to the movie folder, IMDb rating, Rotten Tomatoes rating, and Metascore. By default, movies are sorted by title. Filtering by release date supports years as well as months, which is handy for checking which of the latest movies are on the disk. Filtering by an actor or director lets me check which of their movies are on the disk. Together, these cover the needs of a basic organisation of movies based on their metadata.

The crawl module looks for movie files in the configured movie folder. It identifies a file as a movie based on its extension (a video file type) and its size (more than 300MB), and saves it to the database along with its relative path. This relative path is calculated from the movie folder. This abstraction allows the movie folder to be moved around, renamed, and copied while the movie paths remain consistent. To get the absolute path of any movie file, its relative path is appended to the movie folder’s relative path, which in turn is appended to the root path. While tedious to calculate and evaluate, this abstraction greatly simplifies the code that needs the absolute path of every file on different systems. The crawl module is usually fast and simple. Identified files are immediately added to the database. When viewed, each shows only its filename as the title, along with its relative path.
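As a rough sketch of the crawl step and the path composition, assuming a small illustrative set of video extensions (the real list is not given here) and hypothetical function names:

```python
import os

# Assumed, illustrative extension list; the actual set used may differ.
VIDEO_EXTENSIONS = {".mp4", ".mkv", ".avi", ".mov", ".wmv"}
MIN_SIZE = 300 * 1024 * 1024  # the 300MB threshold, in bytes


def absolute_path(hdd_root, movie_folder, relative_path):
    """Compose root path + movie folder + path relative to the movie folder."""
    return os.path.join(hdd_root, movie_folder, relative_path)


def crawl(hdd_root, movie_folder, min_size=MIN_SIZE):
    """Yield the path, relative to the movie folder, of every likely movie."""
    base = os.path.join(hdd_root, movie_folder)
    for dirpath, _dirnames, filenames in os.walk(base):
        for name in filenames:
            full = os.path.join(dirpath, name)
            extension = os.path.splitext(name)[1].lower()
            # A file counts as a movie only if both checks pass.
            if extension in VIDEO_EXTENSIONS and os.path.getsize(full) > min_size:
                yield os.path.relpath(full, base)
```

Because only the relative path is yielded, the same database rows stay valid when the disk is mounted elsewhere; only the `hdd_root` argument passed to `absolute_path` changes.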

The load module downloads metadata from online sources such as the Open Movie Database and The Movie Database. If a movie entry in the database has an IMDb ID associated with it, that ID is used to retrieve its metadata. If the ID is absent, a search is performed based on the filename, which is assumed to be the title. If a movie is identified, its IMDb ID is retrieved and used to retrieve the metadata. If the movie cannot be identified by its filename, Open Subtitles is used to identify the movie and retrieve its IMDb ID. Open Subtitles is an online subtitles library/database that can identify a movie by calculating a hash of the movie file and matching it against its database.
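The lookup order can be sketched as a simple fallback chain. This is only an illustration: the movie is represented as a plain dict, and `search_by_title` and `identify_by_hash` are hypothetical callables standing in for the title search and the Open Subtitles hash lookup, not real API functions.

```python
def resolve_imdb_id(movie, search_by_title, identify_by_hash):
    """Return an IMDb ID for the movie, or None if every step fails.

    movie is assumed to be a dict with 'title' and 'path' keys and an
    optional 'imdb_id'. Both callables return an ID string or None.
    """
    # 1. Use the ID already stored with the database entry, if any.
    if movie.get("imdb_id"):
        return movie["imdb_id"]
    # 2. Search by the filename, which is assumed to be the title.
    imdb_id = search_by_title(movie["title"])
    if imdb_id:
        return imdb_id
    # 3. Fall back to identifying the file by its hash.
    return identify_by_hash(movie["path"])
```

Passing the two lookups in as arguments keeps the order of attempts in one place and lets the chain be exercised without any network access.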

The crawl and load (or loader) modules, when started, create a new non-blocking background thread so that the browser and other operations remain available for use. This thread sets up the parameters needed to access the online sources via their APIs. The actual downloading is done in individual threads, which currently number five (5). This number was selected as the best choice after testing with various numbers of threads (1, 2, 5, 10, 20). A low number of threads takes more time, whereas a large number of threads downloads the data much faster but blocks the database when writing. SQLite cannot handle this large load of write-locks and fails. To solve this, I only download the data in threads and, when all threads have finished, save the movie data into the database in a single thread. To simplify object management, queues are used to store objects as threads retrieve and operate on them. If a large number of movie objects are in the database, the queues and downloaded data take up more memory, which causes the loader to become sluggish towards the latter half of operations. To prevent this, I put 5 objects in the queue at a time, the same number as there are threads. This completes the operations faster, since each thread has less load to complete, and doesn’t clog up the server or database memory. Using such a division of labor also makes sure that if an error occurs that breaks the flow, the previously processed objects are already stored in the database. The results and effects might (most likely will) change when a different, dedicated database is used. It would be interesting to use a document database to store the movie data, since it allows much more freedom to store arbitrary fields.
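The batching scheme can be sketched roughly as follows. This is a simplified illustration, not the actual loader: `fetch` and `save` are hypothetical callables standing in for the metadata download and the database write, and a failed download is simply skipped.

```python
import queue
import threading

NUM_THREADS = 5  # the thread count settled on above


def worker(pending, done, fetch):
    """Download metadata for queued movies until the queue is empty."""
    while True:
        try:
            movie = pending.get_nowait()
        except queue.Empty:
            return
        try:
            done.put((movie, fetch(movie)))
        except Exception:
            done.put((movie, None))  # failed download, skipped when saving


def load_all(movies, fetch, save, num_threads=NUM_THREADS):
    """Process movies in batches: threaded downloads, single-threaded saves."""
    for start in range(0, len(movies), num_threads):
        batch = movies[start:start + num_threads]
        pending, done = queue.Queue(), queue.Queue()
        for movie in batch:
            pending.put(movie)
        threads = [
            threading.Thread(target=worker, args=(pending, done, fetch))
            for _ in batch
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        # Every download in this batch has finished, so all writes happen
        # here in the calling thread and never contend for the database.
        while not done.empty():
            movie, metadata = done.get()
            if metadata is not None:
                save(movie, metadata)
```

Since each batch is saved before the next one starts, an error that interrupts the run loses at most the batch in flight, and memory use stays bounded by the batch size.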

The status of the loader is continuously reflected in the browser via GET/POST requests that report the movies evaluated, metadata downloaded, and movies skipped. The user controls the operation through the START/STOP button in the browser. Sometimes, for a cause that is currently unchecked and unverified, the loader gets stuck: it continues to run and respond, but without downloading metadata. I call this operation freezing, since it appears to be stuck in some operation while still running and responding. In such cases, the process can be stopped via the button in the browser, and it seems safe to stop it at any time.
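One way to share the counters and the stop request between the loader thread and the request handlers is a small lock-guarded object like the one below. This is a sketch under my own assumptions; the class name `LoaderStatus` and its methods are hypothetical and do not describe the current code.

```python
import threading


class LoaderStatus:
    """Thread-safe progress counters plus a stop flag for the loader."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._counts = {"evaluated": 0, "downloaded": 0, "skipped": 0}

    def record(self, downloaded):
        """Count one evaluated movie as either downloaded or skipped."""
        with self._lock:
            self._counts["evaluated"] += 1
            self._counts["downloaded" if downloaded else "skipped"] += 1

    def snapshot(self):
        """Return a copy of the counters, e.g. for a status response."""
        with self._lock:
            return dict(self._counts)

    def request_stop(self):
        """Called when the user presses STOP in the browser."""
        self._stop.set()

    def should_stop(self):
        """Checked by the loader between movies or batches."""
        return self._stop.is_set()
```

The loader would call `should_stop()` between batches, which fits the observation that stopping at any time appears safe: work already saved is left untouched.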

The current version of HDD-indexer is certainly useful on its own, but there are a lot of features and operations that are yet to be developed. I have several ideas as to what these could be, and how I can implement them.