RDF Website Generator

How harshp.com gets generated from RDF metadata
published:
by Harshvardhan J. Pandit
is part of: harshp.com
harshp.com hosting website

My website is currently based on using RDF to generate static web pages which are then served via GitHub Pages. This post is a documentation of the architecture and data models for how content gets converted to HTML.

Conceptual Motivation

A request is received containing a URL - the path for which content must be rendered and presented in the web browser. This could be the homepage, a specific post, or a list of items. For static web pages, the URLs map to the filepath i.e. a request for /a/b/c is translated to serve c.html in the folder /a/b relative to the root folder for that website.

Conventional static web page builders require content to be stored in some folder (/content), which is then compiled, or converted from one format to HTML, into the filepath needed to serve it at the URL it represents. So /content/a/b/c.md is compiled and served at /a/b/c.html. Here, there is an implicit understanding that the content folder represents an exact mapping between the files within it and the HTML files to be generated for serving.

Conventional static web page builders incorporate logic and functionality to enable creating dynamic content such as index pages, lists, and collections. These are either implicitly provided through inbuilt functions or require writing some code that represents a query and the rendering of its results.

What got me started down the route of writing my website generator using RDF was trying to write a fun little project for generating static web pages with features similar to the conventional builders. I started generating HTML content by using Jinja2 and a Python script. Then I wanted to create a list of my publications, which was easily done in JSON. But as my needs became more complex - such as associating code or preprint copies - I found myself struggling with creating a data model and using it within the script, because I constantly needed to edit the JSON, then the Python script, and then the HTML template and the pages generated. I could have gone with a simple CSV file and/or an SQLite file to store and query the data, but that had been done so many times before. So at some point, I decided to experiment with writing all metadata in RDF and using a SPARQL query to generate the necessary static web pages. For fun.

Ruben Verborgh has written a nice article about embedding metadata within webpages using RDFa and using an ETL pipeline to extract that metadata and do fun stuff with it - like querying, reasoning, and figuring out where the data and the metadata should live. Much of this work is inspired by Ruben's, and by Sarven's website building approach. After I had already completed work on my site, I found out about Walder, which takes a similar approach for providing an API and views over existing RDF data, and which Sabrina uses to manage her academic website. There's also OWLready2 and rdflib-orm, which provide convenient wrappers for using RDF and OWL in Python.

Where my approach differs is that I started with wanting to write my metadata in RDF for its appeal as a graph and query model, and ended up including information on how to associate content with IRIs, how these should be rendered, whether the rendering can be made programmatic, what queries and results need to be generated and how to associate them in a modular and reusable fashion, and how to use RDF within Python in the simplest form possible. It gave perspective on how easy or difficult it is to roll one's own RDF-based system using Python to work with documents and web publishing. The end result is a hacked together system which isn't elegant, but works, and has so far not given me much trouble. It is also quite easy to work with (for me), and I can adapt my wanton ideas with little effort. The aim wasn't to provide a product or even be competitive - it was simply to have fun creating my own website builder using semantic web - and I can say it has been fun to build something like this from scratch.

Basics: Page URLs as RDF IRIs

RDF is based on IRIs - which can be considered URLs for our purpose - as the unique identifiers by which each node or piece of 'data' is identified and referenced. For example, an IRI of <https://harshp.com/> refers to the website root, and is also the URL of the webpage it represents. I use this to treat every page or URL to be served on my website as an RDF node with the URL acting as its IRI.

To distinguish what IRIs should be served as URLs, and what should not be, I create a class called RenderedItem - which represents rendering the given IRI into a web page to be served with the IRI as its URL. So declaring <https://harshp.com/> as RenderedItem means there will be a webpage created in the root folder that will be served at that address. This permits control over which IRIs should be rendered and which should not.

For folders, the convention is to either have a foldername.html file present that is rendered or have an index.html file inside that folder that is used to render the URL. For root URLs, the only option is to have an index.html file. Therefore, if the RenderedItem does not specify what specific filepath is to be generated, a general rule of <IRI>.html is applied, unless the IRI ends in /, in which case the rule <IRI>/index.html is applied.
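The default rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name output_path is hypothetical and not taken from the generator, and the returned path is assumed to be relative to the site's root folder.

```python
from urllib.parse import urlparse


def output_path(iri: str) -> str:
    """Map an IRI to the relative path of the HTML file generated for it.

    Hypothetical helper: applies <IRI>/index.html when the IRI ends in "/"
    (or has no path at all) and <IRI>.html otherwise.
    """
    path = urlparse(iri).path.lstrip("/")
    if path == "" or path.endswith("/"):
        return f"{path}index.html"
    return f"{path}.html"
```

With this sketch, the site root maps to index.html, while a post such as https://harshp.com/blog/events/eswc-2018 maps to blog/events/eswc-2018.html.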

Attaching Content to an IRI

The post here is https://harshp.com/blog/events/eswc-2018 which is part of my blog. The first piece of metadata is to use this URL to write RDF metadata about it.

<https://harshp.com/blog/events/eswc-2018> a <https://harshp.com/blog>,
        hpcom:RenderedItem ;
    hpcom:tag hptag:academia,
        hptag:blog,
        hptag:conference,
        hptag:personal ;
    schema:author <https://harshp.com/me> ;
    schema:dateModified "2018-08-01T09:03:42"^^xsd:dateTime ;
    schema:datePublished "2018-06-10T05:00:00"^^xsd:dateTime ;
    schema:description "ESWC 2018 conference in Heraklion, Crete, Greece"@en ;
    schema:name "ESWC 2018"@en ;
    schema:url "https://harshp.com/blog/events/eswc-2018"^^xsd:anyURI .

I use schema.org as it is the dominant vocabulary on the web for annotating metadata, and is compatible with RDF (as seen above). For other properties, I used my own vocabulary (prefixed as hpcom) so that I can create arbitrary concepts as needed while I test and build stuff.

The declaration of URL as an instance of <https://harshp.com/blog> means the post is part of my Blog. This permits applying uniform design and control over all blog related items - such as by using different templates or CSS layouts for different sections of my website. I'll get to that later in the post.

The content is declared by associating the file where the content is stored along with information about its format. The file extension can be used to infer the file format as well, but I prefer it to be made explicit since it saves a step, and more importantly - it allows attaching things like functionality to the RDF representation of HTML as a format.

hpcom:content [ a hpcom:Content ;
            hpcom:contentFile "https://harshp.com/code/content/blog/events/eswc-2018.html" ;
            hpcom:contentFileFormat hpcom:formatHTML ] ;

Rendering IRI with Views

When the generator script encounters this metadata, it goes looking for a 'View' to render the file. This is done by first checking the metadata itself to see if it declares a View. In this case, the view is absent. So it looks in the data graph to see if any View is registered to handle that specific IRI. Failing that, it looks for a View that handles the parent class of which the IRI is an instance. Failing that, it looks for a View associated with the parent of the parent, and so on up the chain. If it finds nothing, it renders with the Generic View provided for all RDF objects.
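The lookup order described above can be sketched as follows. This is a minimal sketch that uses plain dictionaries instead of an RDF graph; the function name find_view, its parameters, and the "GenericView" placeholder are all assumptions made for illustration and are not the generator's actual code.

```python
from collections import deque


def find_view(iri, declared, by_target, types, superclasses, generic="GenericView"):
    """Pick a View for an IRI (hypothetical helper, plain dicts stand in for the graph).

    declared:     IRI -> View declared directly in the item's metadata
    by_target:    IRI or class -> View registered to handle it
    types:        IRI -> list of classes the IRI is an instance of
    superclasses: class -> list of its parent classes
    """
    # 1. A View declared in the item's own metadata takes priority.
    if iri in declared:
        return declared[iri]
    # 2. Next, a View registered for this specific IRI.
    if iri in by_target:
        return by_target[iri]
    # 3. Then the classes of the IRI, nearest first, moving up the parent chain.
    queue = deque(types.get(iri, []))
    seen = set()
    while queue:
        cls = queue.popleft()
        if cls in seen:
            continue
        seen.add(cls)
        if cls in by_target:
            return by_target[cls]
        queue.extend(superclasses.get(cls, []))
    # 4. Nothing matched, so fall back to the generic View.
    return generic
```

A breadth-first walk is one reasonable choice here because it prefers a View registered on a closer class over one registered on a more distant ancestor.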

In the case of this particular blog post, a View has been declared for all instances of the Blog as shown below:

hpview:BlogPostView a hpcom:View ;
    hpcom:view_target <https://harshp.com/blog> ;
    hpcom:view_template "https://harshp.com/code/templates/template_blog_post.jinja2" ;
    hpcom:view_renderer hpcom:Jinja2 .

This declares that the View uses Jinja2 as its rendering mechanism and that its template is found at the specified filepath. The generator then takes this template, uses it to render the content provided by the metadata, and exports the result as an HTML file.

Rendering with SPARQL query results

For blog posts, the content is already present in the HTML file and just needs to be rendered with the template for consistency and design considerations. For more dynamic content, such as the blog index page, a query needs to be issued to retrieve all the posts along with their metadata and to then render this information in the HTML.

For this purpose, Views can contain embedded SPARQL queries. Each query is executed and its results are passed to the template, which renders them to the specified filepath. For the blog, the metadata is as follows:

hpview:BlogPostView a hpcom:View ;
    hpcom:view_target <https://harshp.com/blog> ;
    hpcom:view_template "https://harshp.com/code/templates/template_blog_post.jinja2" ;
    hpcom:view_renderer hpcom:Jinja2 .

This shows two things: first, that a View can be specified directly in the metadata, and second, that not every IRI requires content. In this case, the template is sufficient because the data is provided by the SPARQL query associated with the View.

hpview:BlogIndexView a hpcom:View ;
    hpcom:view_template "https://harshp.com/code/templates/template_blog_index.jinja2" ;
    hpcom:view_renderer hpcom:Jinja2 ;
    hpcom:sparql [
        a hpcom:SparqlQuery ;
        rdfs:label "posts" ;
        hpcom:queryString """
            SELECT ?s WHERE {
                ?s a ?iri .
                ?s schema:datePublished ?date .
            } ORDER BY DESC(?date)
        """ ;
    ] .

This simple SPARQL query retrieves all instances of ?iri - which in this case is the IRI of the blog, i.e. <https://harshp.com/blog> - that have a date of publication, and orders them newest first. The generator replaces ?iri with the IRI of the node that the View is currently handling. If the same view were to be used for another type of blog, then its IRI would be used instead - thus providing reusability and modularity of content and processes.
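One way such a replacement could work is plain string substitution before the query is executed. The sketch below is an assumption about the mechanism, not the generator's code, and bind_variable is a hypothetical name. When using rdflib, the initBindings argument of Graph.query offers a way to bind variables without editing the query string.

```python
import re


def bind_variable(query: str, variable: str, iri: str) -> str:
    """Replace every occurrence of ?variable in a query with <iri>.

    Hypothetical helper. The trailing word boundary keeps ?iri from
    matching inside a longer variable name such as ?iri2.
    """
    pattern = re.compile(r"\?" + re.escape(variable) + r"\b")
    # A function replacement avoids backslash/group interpretation in the IRI.
    return pattern.sub(lambda _match: f"<{iri}>", query)
```

For the query above, binding iri to https://harshp.com/blog turns the first triple pattern into ?s a <https://harshp.com/blog> .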

Wrapper for dealing with RDF when writing HTML

The template specified for the blog index page needs to access the metadata for the nodes it is required to render or print in HTML. If all metadata to be printed were retrieved using SPARQL, the order of items retrieved would have to be kept consistent between the query and its usage in the HTML. Whenever additional metadata is needed, the query would also have to be changed, and a changed query can no longer be used generically for other IRIs.

Instead, the generator, after retrieving the query results, simply passes the IRIs to the HTML template (rendered using Jinja2) in the form of native Python objects which need no special consideration. To do this, I wrote an ORM for RDF (called rdform) which takes each IRI, creates an instance of an RDFS_Resource class for it, and starts putting data associated with it in the __dict__ variable used by Python to store information in an instance. RDF properties are then accessed as ordinary instance attributes, which makes them easy to use. So X rdfs:label "Y" becomes x.rdfs_label = y.

For annotations, such as string labels, it is trivial to create such records. For objects, it becomes more difficult since the IRIs need to be tracked globally. So instead of waiting for the SPARQL query to return and figure out the data and IRIs involved, the generator transforms all RDF data loaded from the data graph into RDFS_Resource objects and uses them to replace IRIs retrieved from a SPARQL query.

This means for any template or function written in Python, the SPARQL query only needs to retrieve the IRIs of the objects required, and the code can use any metadata as required without worrying about whether the query has retrieved that result. Ruben wrote about LDflex which takes this conceptually even further by actually retrieving results from remote IRIs as the data is being used.
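The idea behind the wrapper can be sketched as below. This is not the actual rdform code: the Resource class, build_resources, and the prefix_localname attribute naming are assumptions based on the description above, and triples are simplified to plain strings so the example has no dependencies.

```python
class Resource:
    """Hypothetical stand-in for the RDFS_Resource class described above."""

    def __init__(self, iri):
        self.iri = iri

    def __str__(self):
        return self.iri


def build_resources(triples, prefixes):
    """Turn (subject, predicate, object) string triples into linked objects."""

    def attr_name(predicate):
        # https://schema.org/name -> schema_name, given {"schema": "https://schema.org/"}
        for prefix, namespace in prefixes.items():
            if predicate.startswith(namespace):
                return f"{prefix}_{predicate[len(namespace):]}"
        return predicate

    # First pass: one object per subject IRI, tracked globally by IRI.
    resources = {s: Resource(s) for s, _, _ in triples}

    # Second pass: store values in __dict__, replacing known IRIs with objects.
    for s, p, o in triples:
        fields = resources[s].__dict__
        name = attr_name(p)
        value = resources.get(o, o)
        if name not in fields:
            fields[name] = value
        elif isinstance(fields[name], list):
            fields[name].append(value)
        else:
            # A repeated property becomes a list of values.
            fields[name] = [fields[name], value]
    return resources
```

Because every subject is converted up front, a query only has to return an IRI, and the template can then follow attributes such as post.schema_author.rdfs_label without a further query.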

With the IRIs and RDF objects converted to native Python instances, using them in a Jinja2 template becomes trivial:

{% block content %}
<ol class="list-index">
{% for itemlist in posts %}{% for item in itemlist %}
    <li><a href="{{ item.iri }}">{{ item.schema_name }}</a> <br/>
        <small><time datetime="{{ item.schema_datePublished }}">{{ item.schema_datePublished }}</time></small>&nbsp;
        <small>{% for tag in item.hpcom_tag|sort(attribute='iri') %}<a class="tag" href="{{tag.iri}}">{{tag}};</a>{% endfor %}</small> <br/>
        <small>{{ item.schema_description }}</small>
    </li>
{% endfor %}{% endfor %}
</ol>
{% endblock %}

Here, the data variable posts is generated from the rdfs:label declared within the View's SPARQL query. This further simplifies things as a View may contain multiple SPARQL queries, and each of them gets converted to a variable and is passed to the template for use. The conversion of RDF objects into Python instances also enables use of Jinja2 convenience features like filters, sorting of lists, and loop controls.
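How labelled queries might turn into template variables can be sketched as follows. The names build_template_context and run_query are hypothetical, and run_query is assumed to be any callable that executes a query string and returns rows. The sketch also assumes each row holds one entry per SELECT variable, which would explain the nested loop in the template above.

```python
def build_template_context(item, labelled_queries, run_query):
    """Return the variables handed to a template for one rendered item.

    labelled_queries maps each query's rdfs:label to its query string.
    """
    context = {"item": item}
    for label, query_string in labelled_queries.items():
        # Keep each row as a list so templates can loop over rows, then entries.
        context[label] = [list(row) for row in run_query(query_string)]
    return context
```

The resulting dictionary could then be passed to a Jinja2 template's render call as keyword arguments, so a query labelled "posts" is available in the template as posts.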

SPARQL convenience features - inserting parameters

A static SPARQL query can only do so much, and there are times when the query string needs to be modified to specify particular values - such as today's date or year. For this reason, I use the queryParam property to specify parameters associated with a query, which the generator uses to replace their occurrences within the string with specified functions or values.

hpcom:sparql [
        a hpcom:SparqlQuery ;
        rdfs:label "current_projects" ;
        hpcom:queryString """
            SELECT DISTINCT ?project ?name WHERE {
                ?project a <https://harshp.com/research/projects> .
                ?project schema:name ?name .
                ?project schema:startDate ?date .
                ?project schema:member ?role .
                ?role schema:endDate ?end_date .
                FILTER(?end_date > ?today) .
            } ORDER BY DESC(?date)
        """ ;
        hpcom:queryParam [
            hpcom:queryParamLabel "today" ;
            hpcom:queryParamValue "date-today" ;
        ] ;
    ] ;

Here, the variable label today and the value date-today are specified as a query parameter to be replaced, with today's date derived from the function registered as date-today.

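import datetime

from rdflib import Literal
from rdflib.namespace import XSD

def _today():
    # Current timestamp as a typed RDF literal, to be substituted into a query
    literal = Literal(datetime.datetime.now(), datatype=XSD.dateTime)
    return literal

# Maps queryParamValue names to the functions (or values) that replace them
SPARQL_ACTIONS = {
    'date-today': _today,
    # ...
}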

The variable SPARQL_ACTIONS holds such functions or values to be replaced whenever a query parameter is declared in a View.
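The substitution step itself is not shown in this post, so the following is only a sketch of how it could work. The function apply_params is a hypothetical name, the date literal is built as a plain string so the example needs no third-party library, and the actions table mirrors the shape of SPARQL_ACTIONS without being the real one.

```python
import datetime
import re

XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def _today():
    # Today's timestamp written as a typed SPARQL literal (string form).
    now = datetime.datetime.now().replace(microsecond=0)
    return f'"{now.isoformat()}"^^<{XSD_DATETIME}>'


# Illustrative table in the same shape as SPARQL_ACTIONS above.
ACTIONS = {"date-today": _today}


def apply_params(query, params, actions=ACTIONS):
    """Replace ?label in the query for each (label, value) parameter pair.

    If the value names an entry in the actions table, the entry is called
    (when callable) or used directly; otherwise the value itself is inserted.
    """
    for label, value in params:
        action = actions.get(value, value)
        replacement = action() if callable(action) else str(action)
        pattern = r"\?" + re.escape(label) + r"\b"
        query = re.sub(pattern, lambda _match: replacement, query)
    return query
```

Applied to the query above with the pair ("today", "date-today"), the filter would compare ?end_date against a literal timestamp instead of the unbound ?today variable.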

Common Metadata for all Pages

Since all pages are generated by me for my site, they have common metadata. But I can also have invited posts, or posts authored by others. Additionally, posts can have different parent posts, sections, and so on. Instead of writing multiple templates with the same repeated boilerplate, all of the common stuff is put into a base template which is used by Jinja2 to render all other templates. This permits writing some basic RDFa for all pages in the header section, which also works for SEO. It also permits linking pages together, i.e. providing links and references to other related pages and resources.

{% block desc %}
{% if item.schema_isPartOf %}
...
{% endif %}
{% if item.schema_subjectOf %}
...
{% endif %}
    
{% if item.rdfs_seeAlso %}
    ...
    {% if item.rdfs_seeAlso is sequence %}
        {% for article in item.rdfs_seeAlso %}
        ...
        {% endfor %}
    {% else %}
        ...
    {% endif %}
   ...
{% endif %}

The above example outlines an interesting problem associated with RDF - how to handle sequences and lists? Usually they are a pain to handle because of the first-rest iteration pattern. Instead, most common uses just declare multiple values for a property, such as: x prop a, b, c . which, when used in Python, returns a list for prop containing [a, b, c]. To iterate through this in Jinja2, the pattern is the same as iterating over a native Python list - just check if something is a sequence (and not a string) and use it as usual.
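An alternative to checking in every template is to normalise values before they reach it. The helper below is a hypothetical sketch, not part of the generator. It may be worth having because Jinja2's sequence test also returns true for plain strings, so a single string value needs to be told apart explicitly.

```python
def as_list(value):
    """Return a property value as a list, whether it held zero, one, or many values."""
    if value is None:
        return []
    if isinstance(value, (list, tuple, set)):
        return list(value)
    # A single value, including a single string, becomes a one-item list.
    return [value]
```

Registered as a Jinja2 filter or applied while building the Python objects, this would let a template always use a plain for loop over the property.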

Smarter Publication Management

Using the above principles and data models, my original task of how to manage my publications, and associate them with resources and items became much easier. Now all I had to do was create a data model for publication using RDF, declare a view for it, and publish it using a template.

Here is my PhD thesis in RDF metadata form:

<https://harshp.com/research/publications/035-representing-activities-processing-personal-data-consent-semweb-gdpr-compliance> a schema:ScholarlyArticle, hpcom:Thesis, hpcom:RenderedItem ;
    schema:name "Representing Activities associated with Processing of Personal Data and Consent using Semantic Web for GDPR Compliance"@en ;
    schema:description "PhD research showing use of semantic web in representing activities and consent for GDPR"@en ;
    hpcom:tag hptag:semantic-web ;
    hpcom:tag hptag:GDPR ;
    hpcom:tag hptag:provenance ;
    hpcom:tag hptag:consent ;
    hpcom:author_lead <https://harshp.com/me> ;
    schema:inSupportOf "PhD in Computer Science"@en ;
    schema:datePublished "2020-05-06T00:00:00"^^xsd:dateTime ;
    schema:identifier "hdl.handle.net/2262/92446"^^xsd:anyURI ;
    schema:url "https://hdl.handle.net/2262/92446"^^xsd:anyURI ;
    schema:publisher <https://www.tcd.ie/> ;
    schema:funder <http://example.com/ADAPT> ;
    hpcom:archived_version [ a hpcom:Link ; schema:name "harshp.com" ; schema:url "https://harshp.com/research/publications/035-representing-activities-processing-personal-data-consent-semweb-gdpr-compliance"^^xsd:anyURI ] ;
    hpcom:archived_version [ a hpcom:Link ; schema:name "web (HTML)" ; schema:url "https://harshp.com/research/phd-thesis"^^xsd:anyURI ] ;
    hpcom:archived_version [ a hpcom:Link ; schema:name "zenodo" ; schema:url "https://doi.org/10.5281/zenodo.3795513"^^xsd:anyURI ] ;
    hpcom:supplementary [ a hpcom:Link ; schema:name "viva slides" ; schema:url "https://www.slideshare.net/HarshvardhanPandit1/phd-viva-representing-activities-associated-with-processing-of-personal-data-and-consent-using-semantic-web-for-gdpr-compliance"^^xsd:anyURI ] ;
    hpcom:supplementary [ a hpcom:Link ; schema:name "repo" ; schema:url "https://github.com/coolharsh55/phd-thesis"^^xsd:anyURI ] ;
    hpcom:content [
        a hpcom:Content ;
        hpcom:contentFile "https://harshp.com/code/content/research/publications/035-representing-activities-processing-personal-data-consent-semweb-gdpr-compliance.html" ;
        hpcom:contentFileFormat hpcom:formatHTML ;
    ] ;
    hpcom:peer_reviewed false .

And here is the PhD Thesis in HTML form generated from this metadata. More examples: the list of all publications, which is generated using a SPARQL query, and my research activities.

Code

The code for my website is available at: GitHub: @coolharsh55.

  • generator.py contains the code for generating HTML pages and doing all of the chores specified here
  • rdform.py provides the convenience wrapper for using RDF objects as native Python instances
  • views.ttl contains all the views (those not declared in the RDF metadata itself)
  • research contains the metadata and content about research activities used here as an example
  • template_research.jinja2 contains the template used for rendering the research activities page