How to make the docs generator modular?

Ideas on how the DPV documentation generator could be made modular and easier to use
published:
by Harshvardhan J. Pandit
is part of: Data Privacy Vocabulary (DPV)
DPV DPVCG semantic-web Working Note

Motivation

(The real motivation is I need to do some coding to get away from the thinking and writing and emails that happen all week.)

The DPV specifications or vocabularies consist of RDF and HTML files that are generated in specific folders representing namespaces. The process goes something like this: we make changes in Google Sheets, which are then downloaded locally and extracted as CSVs using a python script (100.py); another script parses these and produces the RDF outputs (200.py); and a third script parses the RDF and produces the HTML outputs (300.py).
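For reference, a sketch of how this currently runs (invocations shown without arguments; the actual scripts may take options):

$>python3 100.py  # download Google Sheets and extract CSVs
$>python3 200.py  # parse CSVs, produce RDF outputs
$>python3 300.py  # parse RDF, produce HTML outputs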

The issue I’m addressing with this effort is that the current DPV documentation generator is a mammoth 2700-line configuration file along with 2000 lines of RDF and HTML generation code, which functions as a single monolithic executable where everything is produced together. This means that if someone changes one concept in one namespace, they have to produce ALL the outputs and then filter out the relevant ones and discard the rest. The goal, therefore, is to break down this monolithic architecture into something modular, i.e. for a given set of vocabularies, produce only the outputs that are relevant.

This has already been done for the CSV downloading script, but it is detached from the other implementations: it has its own notion of what vocabularies exist, how to download and extract them, and so on. On the other hand, the vocab_management.py script contains all the metadata and configurations for producing RDF and HTML, and at 2700 lines it is difficult to understand without spending a disproportionate amount of time. The modularity of the code should therefore also be reflected in the modularity of the configuration data structures that are linked to that code.

Requirements

  • It is capable of producing a single independent vocabulary in a modular fashion, e.g. the DPV, TECH, etc. vocabularies should be producible individually if needed.
  • It can produce multiple independent vocabularies e.g. both DPV and TECH should be produced at the same time.
  • It automatically deals with dependencies between vocabularies e.g. when producing TECH there may be concepts reused from DPV – which necessitates reading/loading DPV concepts in memory.
  • All configuration details for a given vocabulary are in the same place i.e. those related to CSV inputs, RDF and HTML outputs, templates, metadata, etc. are in a single location.
  • There are multiple files associated with configuration, such that each file deals with a specific vocabulary.
  • The CSVs produced have a consistent format that correlates with the associated vocabulary, e.g. dpv_filename and tech_filename.
  • There should be a single interface for the user that can download the CSVs, output RDF, and output HTML – individually or together.

Example Interface

Simple concept addition

We add some new concepts in the AI extension.

$>python3 generate --vocab=ai --download-csv --generate-rdf --generate-html
$>python3 generate -v=ai -csv -rdf -html

In the above, --vocab and -v are the long and short forms of the parameter that expresses the vocabulary we are working with. This parameter is mandatory. It can take the special value ALL to indicate all vocabularies.

The other parameters specify which actions to take, i.e. whether to download the CSVs, generate the RDF, or generate the HTML. They can be used individually or in combination (as above). The order of operation should always be CSV to RDF to HTML (as it is now). If, for some reason, only the CSV and HTML options are passed, then that is what should happen, i.e. download the CSVs and generate the HTML. Trust the user to know what they are doing.
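As an illustration, the parameter handling could be a thin argparse wrapper. This is a minimal sketch: the download_csv/generate_rdf/generate_html functions are hypothetical placeholders for the existing CSV/RDF/HTML logic, not the actual implementation.

# generate.py - a minimal CLI sketch; function names are illustrative
import argparse

def download_csv(vocabs):  # hypothetical hook into the CSV download step
    print('downloading CSVs for', vocabs)

def generate_rdf(vocabs):  # hypothetical hook into the RDF generation step
    print('generating RDF for', vocabs)

def generate_html(vocabs):  # hypothetical hook into the HTML generation step
    print('generating HTML for', vocabs)

parser = argparse.ArgumentParser(prog='generate')
parser.add_argument('--vocab', '-v', required=True,
                    help='comma-separated vocabularies, or ALL')
parser.add_argument('--download-csv', '-csv', dest='csv', action='store_true')
parser.add_argument('--generate-rdf', '-rdf', dest='rdf', action='store_true')
parser.add_argument('--generate-html', '-html', dest='html', action='store_true')
args = parser.parse_args()

vocabs = args.vocab.split(',')
# the order of operations is fixed as CSV, then RDF, then HTML,
# but each step runs only if the user asked for it
if args.csv:
    download_csv(vocabs)
if args.rdf:
    generate_rdf(vocabs)
if args.html:
    generate_html(vocabs)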

When dealing with the AI vocabulary, the tool should know to read DPV, TECH, and RISK as direct dependencies. DPV should be loaded first, as it is the foundational or core vocabulary for all others. Even though these dependencies are loaded, no outputs are produced for them.
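A minimal sketch of how this dependency resolution could work; the DEPENDENCIES mapping below is an assumption constructed from this example, not actual project data:

# dependencies per vocabulary (assumed for illustration)
DEPENDENCIES = {
    'dpv': [],
    'tech': ['dpv'],
    'risk': ['dpv'],
    'ai': ['dpv', 'tech', 'risk'],
}

def vocabs_to_load(targets):
    # returns dependencies plus targets, each vocabulary listed once;
    # only the targets themselves will have outputs produced
    ordered = []
    for target in targets:
        for vocab in DEPENDENCIES.get(target, []) + [target]:
            if vocab not in ordered:
                ordered.append(vocab)
    # DPV is the core vocabulary, so it always loads first
    if 'dpv' in ordered:
        ordered.remove('dpv')
        ordered.insert(0, 'dpv')
    return ordered

print(vocabs_to_load(['ai']))  # ['dpv', 'tech', 'risk', 'ai']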

Multi-vocabulary changes

We add a new concept in the TECH extension and then use it as a parent of a concept in the AI extension.

$>python3 generate --vocab=tech,ai -csv -rdf -html

Here both TECH and AI should have outputs produced with the same process as described in the simple scenario.

Primer, guides, and other documents

For the primer, guides, and other documents, which are not dependent on any vocabulary nor located in a specific versioned folder, the process should be simple and straightforward. There should be parameters to control each aspect and possible output.

$>python3 generate --primer
$>python3 generate --guide=ALL
$>python3 generate --guide=data-breach,security
$>python3 generate --mapping=ALL
$>python3 generate --mapping=odrl

Release

To produce the release, there should be a single command that does all the work (which is currently handled by a custom script). The release should be based on the currently configured DPV version in the configuration.

$>python3 generate --release

Config

There should be two distinct sets of configurations – one related to the overall status of the DPV vocabularies, with details like version number and release/draft status, and the other for each vocabulary to specify its own configurations and metadata. This can be done with a folder structure like the one below, which has the main configuration files at the folder level and further subfolders for vocabulary metadata. In the scripts, we can then use them as config.vocab.dpv.csv['filename'] and config.vocab.dpv.metadata (see the sketch after the listing). This also allows anyone wanting to update a small thing (e.g. fix a typo in a title) for a single vocabulary to do so by going to the specific file, or, when adding a new vocabulary, to copy the existing structure to a new file and update it.

.
├── collations.py
├── filepaths.py
├── guides
│   ├── consent-27560.py
│   ├── __init__.py
│   └── primer.py
├── __init__.py
├── mappings
│   ├── __init__.py
│   └── odrl.py
├── namespaces.py
├── serialisations.py
├── sparql_hooks.py
├── term-management.py
├── version.py
└── vocab
    ├── __init__.py
    ├── dpv.py
    ├── risk.py
    └── tech.py
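Since the configuration is plain python, the scripts can consume it with ordinary imports. A minimal sketch, assuming config/__init__.py imports its submodules (e.g. from . import vocab, version) and config/vocab/__init__.py does the same for dpv, risk, and tech:

import config

# read vocabulary-specific metadata (fields as defined in config/vocab/dpv.py)
print(config.vocab.dpv.metadata['dct:title'])

# iterate over the modules declared for a vocabulary
for module in config.vocab.dpv.modules:
    print(module)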

The choice to use python for this over something more specific and widely used for configurations, such as TOML (JSON is discarded as it is a data format, not a config format), is that the rest of the codebase is heavily python, and since the configurations effectively dictate how that python code behaves, there is no added advantage to having them in a non-pythonic format. Further, using something other than python increases the complexity of the codebase and requires the person using it to know and understand one more thing, which they might not know until working on this code. Python variables, on the other hand, can be directly used, manipulated, and even inspected for debugging with the same methods as the rest of the code. So until there is a significant advantage to be gained from switching to TOML or something else, such as if the code is migrated to Go or another faster language, the python files should be sufficient.

Within each of the vocabulary files, all relevant information and configurations should be present without having to visit other files to change how a single vocabulary behaves. This means all CSV/RDF/HTML configurations should reside here. For example, below is how the config for DPV could look. It is important to consolidate as much information as possible so that the total amount of information is smaller and more manageable, and so that information derivable from a core set is not duplicated just because it deals with a different part of the process (e.g. CSV/RDF/HTML all use the same module names, but these are currently declared in three different places).

# config/vocab/dpv.py
modules = ['m1', 'm2', 'm3']
# in CSV script, this is used to generate dpv_m1.csv, dpv_m2.csv, dpv_m3.csv
# in RDF script, this is used to generate dpv/modules/m1 ...
# in HTML script, this is used to read dpv/modules/m1 and generate m1 section in HTML

# however, there is more metadata for modules, so we use a dict
modules = {
    'm1': {
        'title': 'M1',
        'parser': 'taxonomy',  # method used to parse the CSV into RDF
        'source': {
            'gsheet_id': '...',  # ID of the Google Sheet
            # not all vocab modules have both classes and properties
            'classes': 'tab name',  # will generate CSV dpv_m1_classes.csv
            'properties': 'tab name',  # will generate CSV dpv_m1_properties.csv
        },
        "html_template": "path...", # optional - will generate HTML output for module
    }
}

folder = "/dpv"
name = "dpv"
# RDF path is <DPV_VERSION>/<folder>/<name>.ttl
html_template = f"{TEMPLATE_PATH}/template_dpv.jinja2"

metadata = {
    "dct:title": "Data Privacy Vocabulary (DPV)",
    "dct:description": "The Data Privacy Vocabulary (DPV) provides terms (classes and properties) to represent information about processing of personal data, for example - purposes, processing operations, personal data, technical and organisational measures.",
    "dct:created": "2022-08-18",
    "dct:modified": DPV_PUBLISH_DATE,
    "dct:creator": "Harshvardhan J. Pandit, Beatriz Esteves, Georg P. Krog, Paul Ryan, Delaram Golpayegani, Julian Flake",
    "schema:version": DPV_VERSION,
    "profile:isProfileOf": "",
    "bibo:status": "published",
}
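To make the consolidation concrete, below is a sketch of how each script could derive its inputs and outputs from the single modules declaration above. The path patterns follow the comments in the earlier example, and render() is a hypothetical stand-in for the HTML generation:

# continuing with the dpv.py config above (illustrative, not actual code)
for module, conf in modules.items():
    # CSV script: one file per declared tab, e.g. dpv_m1_classes.csv
    for kind in ('classes', 'properties'):
        if kind in conf['source']:
            csv_file = f"{name}_{module}_{kind}.csv"
    # RDF script: one module file under the vocab folder
    rdf_path = f"{DPV_VERSION}{folder}/modules/{module}.ttl"
    # HTML script: generate the module section only if a template is declared
    if 'html_template' in conf:
        render(conf['html_template'], module)  # render() is hypothetical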