pdf2slideshow: Convert your PDFs to HTML Slideshows

Creating a nifty script to convert PDF into HTML slideshow
by Harshvardhan J. Pandit
is part of: web development
presentation web-dev
pdf2slideshow GitHub repo and example
Update 16-JAN-2022: batch script to process folders with PDFs.

SlideShare is a popular platform for sharing presentations on the web. It allows one to upload a PDF or a PPTX, and displays it as a set of slides one can navigate. It grew in popularity because it offered a simple interface, functional viewing of the slides, and didn't have any storage limits or requirements. Over time, it was bought by Scribd, and became part of its corporate offering. Though it is still free, there is always the issue of control (by a company) and lock-in (of my data).

As an alternative, GitHub Pages is a free offering where a git repository can be rendered as a webpage. The good part of this is that there is no necessary lock-in. I can easily take that repository and host it somewhere else without loss of functionality. GitHub Pages also offers file hosting, which means it can store PDFs and PPTXs, and serve them for download. When using these features to host presentations, one gets a link to share with others for downloading the slides (PDF), but no nice-looking slideshow viewer like SlideShare.

My slides used to be hosted on SlideShare, and I eventually moved them to a GitHub repo so that I have control of my data and can choose how to provide it. Since these are PDFs, they can only be downloaded or viewed in the browser (as PDFs). Some others, such as Ruben Taelman, write their presentations in text, and render them natively using a HTML/JS engine such as shower.js or reveal.js. As an example, see Ruben's slides and their source.

While I wanted to create something similarly convenient and web-native for my presentations, the task got pushed further and further down the queue, until today, when I came across Daniel Gayo-Avello's tweet asking about an alternative to SlideShare. My knee-jerk reply was that of course you can convert a PDF to images and host them as a slideshow. But after posting it, I started wondering how one would go about doing that. Surely something that does this must already exist.

Turns out I couldn't really find anything that does exactly this in the 15 minutes I spent searching with terms like "PDF to image slideshow". I'm still certain someone somewhere must have had the same thought and there's an elegant and ready-to-use solution sitting somewhere, probably on GitHub or in a Stack Overflow answer. Regardless, I decided to create my own, because procrastination is the necessity of inventions.

Converting PDF to Images

There are lots of ways to convert a PDF to images. When in doubt, always look towards ImageMagick or Pandoc for conversion related tasks. In this case, though there are ways to extract or print a PDF as images in both of these, we're looking for something more convenient and easily scriptable without messing around with variables and settings. The solution turned out to be in Poppler, a powerful set of tools for working with PDFs. The specific tool is called pdftoppm, and converts PDFs to various image formats. See its man page for an overview.

Using pdftoppm, converting a PDF to images is as easy as calling it on the file, specifying the output format, and specifying the prefix for the image filenames. The command below exports each page of the specified PDF as a PNG image named slide-NN.png, where NN is the page number.

$> pdftoppm <PDF> slide -png
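
As a hedged aside: pdftoppm renders at 150 DPI by default, and its -r option sets the resolution, so lowering it is one way to get smaller images straight away. The sketch below wraps the call in a small function. The function name pdf_to_pngs, the fallback of 96 DPI, and the file names in the usage comment are assumptions made for illustration; they are not part of the original script.

```shell
#!/usr/bin/env bash
# Convert a PDF into one PNG per page at a chosen resolution.
#   $1 = input PDF, $2 = output folder, $3 = DPI (optional; 96 is an arbitrary pick)
pdf_to_pngs() {
    local pdf=$1 outdir=$2 dpi=${3:-96}
    [ -f "$pdf" ] || { echo "no such PDF: $pdf" >&2; return 1; }
    command -v pdftoppm >/dev/null || { echo "pdftoppm not found" >&2; return 1; }
    mkdir -p "$outdir"
    # -r sets the resolution in DPI; -png selects PNG output.
    pdftoppm -r "$dpi" -png "$pdf" "$outdir/slide"
}

# Usage (file names are illustrative):
#   pdf_to_pngs talk.pdf ./talk 96
```

Whether a lower resolution is acceptable depends on how much small text the slides contain, so it is worth checking a few pages by eye.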

Compressing Images

For my sample PDF, which was about 10MB, the total size of pictures generated was about 25MB, which is too high for a reasonable slideshow on the web. To ensure images don't get too heavy, one could use other PDF to image generation tools that offer a lot of options to bring the size down using resolution, colour depth, and other technical tricks. For simplicity, I wanted something like TinyPNG where you just throw an image at it and it reduces its file size.

I found two utilities that were straightforward to use and available in the package managers. The first was OptiPNG, which has optimisation levels for aggressive compression, but its results were not that efficient. The second was pngquant, which turned out to be the better tool based on output sizes. Neither of these was as efficient as TinyPNG, but their results were acceptable.

pngquant also lets you trade speed for compression through its speed setting (-s), which ranges from 1 (slowest, highest compression) to 11 (fastest, least compression). The example below takes a file and overwrites it with a compressed version: -o names the output file, and -f forces pngquant to overwrite it.

$> pngquant <file> -s 1 -f -o <file>
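
To see how much the compression actually saves, one option is to total the PNG sizes before and after. The following is a minimal sketch under stated assumptions: the function names total_png_bytes and compress_pngs and the folder name slides are made up for illustration, and the pngquant call is the same one shown above.

```shell
#!/usr/bin/env bash
# Sum the sizes (in bytes) of all PNG files directly inside a folder.
total_png_bytes() {
    local dir=$1 total=0 size f
    for f in "$dir"/*.png; do
        [ -e "$f" ] || continue   # the glob matched nothing
        size=$(wc -c < "$f")
        total=$((total + size))
    done
    echo "$total"
}

# Compress every PNG in a folder in place with pngquant.
compress_pngs() {
    local dir=$1 f
    for f in "$dir"/*.png; do
        [ -e "$f" ] || continue
        pngquant "$f" -s 1 -f -o "$f"
    done
}

# Usage (the folder name "slides" is illustrative):
#   before=$(total_png_bytes slides)
#   compress_pngs slides
#   after=$(total_png_bytes slides)
#   echo "PNG total: $before -> $after bytes"
```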

Embedding Images into a HTML Slideshow

Finally, to create a slideshow, I used reveal.js to do most of the heavy lifting. I downloaded its simple minimal example, and inserted links to the images in the HTML. That's it. The reveal.js library does all the work of displaying the images correctly, navigating between them, and rendering them on different display sizes.

To correctly insert images as slides with the least amount of work possible, I sorted the images in reverse order (last page of the PDF to first) based on their filename, and inserted each one as a HTML snippet at the specific line in the file where slides are supposed to go. The reverse order matters: every insertion at the same line pushes the previously inserted snippets down, so inserting the last page first leaves the first page on top. In my case, the HTML template had slides to be inserted at line 16. To actually insert the text, the usual text tools on *NIX environments do the job (awk, sed). See the example below.

$> sed -i "16i <section><img src='IMAGE.png'></section>" <template.html>
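
The effect of inserting in reverse order at a fixed line can be checked on a throwaway file. This is only a sketch: the two-line template and the image names are made up, the insertion point is line 2 rather than 16, and sed -i with a one-line i command is assumed to be GNU sed (BSD/macOS sed needs a different invocation).

```shell
#!/usr/bin/env bash
# Throwaway template standing in for the real one.
tmp=$(mktemp)
printf '%s\n' '<div class="slides">' '</div>' > "$tmp"

# Insert the last slide first; each later insert lands above the previous one.
for img in slide-3.png slide-2.png slide-1.png; do
    sed -i "2i <section><img src='$img'></section>" "$tmp"
done

result=$(cat "$tmp")
echo "$result"
rm -f "$tmp"
```

The printed file lists slide-1, slide-2, slide-3 in that order between the opening and closing div.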

To instead use shower.js as the slideshow library, download its minimal version, figure out the line number where to insert the slide and in what format, and modify the sed usage accordingly. For example, I would guess that shower.js uses a similar section tag to indicate each tile, but with class="slide" annotation.

Automating and Packaging using Bash script

For convenience and repeatability, the entire thing can be put in a bash script. This also helps in applying the script to different PDFs, or choosing different templates (e.g. specifying year or styles) and rendering options (e.g. themes).

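#!/usr/bin/env bash
# Arguments: $1 = input PDF, $2 = output folder

# Step1: convert PDF to images
pdftoppm "$1" "${2}/slide" -png

# Step2: compress PNG images for the web
for f in "${2}"/*.png; do
    pngquant "$f" -s 1 -f -o "$f"
done

# Step3: insert link to images in HTML template
cp ./template.html "${2}/index.html"
# Reverse order: each insert at line 16 pushes the earlier inserts down,
# so the first page ends up on top.
# Note: the ls-based listing assumes file names without whitespace.
slides=($(ls -1 "${2}"/*.png | sort -r))
for f in "${slides[@]}"; do
    f=$(basename "$f")
    img="<section><img src='$f'></section>"
    sed -i "16i $img" "${2}/index.html"
done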

As with all scripts, be wary of simply downloading someone else's code and running it. Always read through the script to at least see if it makes sense or is a bunch of obfuscated code that hides disasters inside. My go-to rule is that if a script doesn't have comments, I won't bother figuring it out unless it is absolutely essential. A script executed incorrectly can cause all sorts of pain, like deleted data or broken systems.

Batch script for handling folders with PDFs

The original script worked with a single PDF and folder, which means converting all PDFs manually is a lot of work. Automation is nice. Hence a batch script that navigates through folders and processes the PDFs in each sub-folder to generate a HTML slideshow in each of those folders.

Some nice-to-have features make this chore more convenient. One is ignoring git folders and other hidden folders, so that the script can be safely run on folders under version control. Another is checking whether a HTML slideshow already exists and, if so, skipping that PDF to avoid repeating the processing.
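
The two features described above could look roughly like the following. This is a sketch and not the batch script from the repository: the function name pending_pdfs, the script name ./convert.sh, and the folder name in the usage comment are hypothetical, and the check assumes the slideshow is always written as index.html next to the PDF.

```shell
#!/usr/bin/env bash
# List the PDFs under a root folder that still need a slideshow.
# Hidden folders (.git and the like) are pruned, and a PDF is skipped
# when its folder already contains an index.html.
pending_pdfs() {
    local root=$1 pdf
    # -mindepth 1 keeps the root itself (e.g. ".") from matching '.*'.
    find "$root" -mindepth 1 -type d -name '.*' -prune -o \
        -type f -name '*.pdf' -print | sort |
    while IFS= read -r pdf; do
        [ -e "$(dirname "$pdf")/index.html" ] || printf '%s\n' "$pdf"
    done
}

# Usage (script and folder names are hypothetical):
#   pending_pdfs ./presentations | while IFS= read -r pdf; do
#       ./convert.sh "$pdf" "$(dirname "$pdf")"
#   done
```

Listing the pending PDFs separately from converting them makes it easy to do a dry run first and see what would be processed.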

It is also possible to generate an index page by using the tree command which automatically generates a HTML page with hyperlinks to files. The following example creates a tree listing for the folder. Clicking on the index.html link or the folder will open the HTML slideshow, while clicking on the other files will download them.

$> tree -r -I '*.png' -H '<URL>' > page.html

The use of parameters is as follows:

  • -r sorts the items in reverse order; for folders organised by year this will cause the most recent year to be listed first.
  • -I '*.png' will ignore all PNG files in the folder. This helps clean up the images and only list the 'slideshow' or 'download' files.
  • -H URL specifies HTML as the output and indicates the URL (which can be absolute or relative) for hyperlinks to files. A good option is to generate the output file in the same folder and use relative links (./) so the folder can be hosted anywhere.
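
Putting the relative-link suggestion into practice might look like the sketch below. The function name make_index and the folder name in the usage comment are assumptions for illustration; the tree options are the ones explained above, with . as the base URL so that links stay relative to the folder.

```shell
#!/usr/bin/env bash
# Write page.html inside the given folder, with links relative to it.
make_index() {
    local dir=$1
    command -v tree >/dev/null || { echo "tree not found" >&2; return 1; }
    # Run inside the folder so that '.' as the base URL yields relative links.
    (cd "$dir" && tree -r -I '*.png' -H . > page.html)
}

# Usage (folder name is illustrative):
#   make_index ./presentations
```

Because page.html is created inside the folder being listed, it will show up in its own listing.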