How to make the docs generator modular?
published:
by Harshvardhan J. Pandit
is part of: Data Privacy Vocabulary (DPV)
DPV DPVCG semantic-web Working Note
Motivation
(The real motivation is I need to do some coding to get away from the thinking and writing and emails that happen all week.)
The DPV specifications or vocabularies consist of RDF and HTML files that are generated in specific folders representing namespaces. The process goes something like this: we make changes in Google Sheets, which are then downloaded locally and extracted as CSVs using a python script (100.py), then another script parses these and produces the RDF outputs (200.py), and then another script parses the RDF and produces the HTML outputs (300.py).
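For context, the three steps currently chain together roughly like this (a sketch based on the description above; the exact invocations may differ from the repository's scripts):
$>python3 100.py   # download Google Sheets and extract CSVs
$>python3 200.py   # parse the CSVs and produce RDF
$>python3 300.py   # parse the RDF and produce HTML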
The issue I’m addressing with this effort is that the current DPV documentation generator is a mammoth 2700-line configuration file along with 2000 lines of RDF and HTML generation code, which functions like a monolithic single executable where everything is produced together. This means if someone makes a change to one concept in one namespace, they have to deal with producing ALL the outputs and then filtering/selecting the ones that are relevant and discarding the others. Therefore, the current goal is to break down the monolithic architecture and to produce something that is modular, i.e. for a given set of vocabularies, we only produce the outputs that are relevant.
This has already been done for the CSV downloading script, but it is detached from the other implementations as it has its own notion of what vocabularies exist, how to download and extract those, etc. On the other hand, the vocab_management.py script contains all the metadata and configurations for producing RDF and HTML, and it is a massive file (2700 lines) which is difficult to understand without spending a disproportionate amount of time. Therefore, the modularity of the code should also be reflected in the modularity of the configuration data structures that are linked to that code.
Requirements
- It is capable of producing single independent vocabularies in a modular fashion e.g. DPV, TECH, etc. vocabularies should be produced individually if needed.
- It can produce multiple independent vocabularies e.g. both DPV and TECH should be produced at the same time.
- It automatically deals with dependencies between vocabularies e.g. when producing TECH there may be concepts reused from DPV – which necessitates reading/loading DPV concepts in memory.
- All configuration details for a given vocabulary are in the same place i.e. those related to CSV inputs, RDF and HTML outputs, templates, metadata, etc. are in a single location.
- There are multiple files associated with configuration such that each file deals with a specific vocabulary or configuration aspect.
- The CSVs produced have a consistent format that is correlated with the associated vocabulary e.g. dpv_filename and tech_filename (see the filename sketch after this list).
- There should be a single interface for the user to deal with that can download the CSVs, output RDF, and output HTML – individually or together.
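As a minimal sketch of that naming convention (the helper function is hypothetical; the dpv_m1_classes.csv pattern matches the config example later in this note):
def csv_filename(vocab, module, kind):
    # e.g. csv_filename('dpv', 'm1', 'classes') -> 'dpv_m1_classes.csv'
    return f"{vocab}_{module}_{kind}.csv"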
Example Interface
Simple concept addition
We add some new concepts in the AI extension.
$>python3 generate --vocab=ai --download-csv --generate-rdf --generate-html
$>python3 generate -v=ai -csv -rdf -html
In the above, --vocab and -v are long and short form parameters that express the vocabulary that we are working with. This param is mandatory. It can take the special value ALL to indicate all vocabularies.
The other parameters specify which action to take i.e. whether to download the CSVs, to generate RDF, or to generate the HTML. They can be used individually or in combination (as above). The order of operation should always be CSV to RDF to HTML (as it is now). If for some reason, only the CSV and HTML options are passed, then that is what should happen i.e. download CSV and generate HTML. Trust the user that they know what they are doing.
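A minimal sketch of this interface using argparse (the flag spellings are as above; the placeholder functions and wiring are assumptions, not an implementation):
import argparse

def download_csv(vocabs): ...   # placeholder for the CSV download step
def generate_rdf(vocabs): ...   # placeholder for the RDF generation step
def generate_html(vocabs): ...  # placeholder for the HTML generation step

parser = argparse.ArgumentParser(prog='generate')
parser.add_argument('-v', '--vocab', required=True,
                    help='comma-separated list of vocabularies, or ALL')
parser.add_argument('-csv', '--download-csv', action='store_true')
parser.add_argument('-rdf', '--generate-rdf', action='store_true')
parser.add_argument('-html', '--generate-html', action='store_true')
args = parser.parse_args()

vocabs = args.vocab.split(',')
# the order of operations is always CSV -> RDF -> HTML;
# each step runs only if its flag was passed (trust the user)
if args.download_csv:
    download_csv(vocabs)
if args.generate_rdf:
    generate_rdf(vocabs)
if args.generate_html:
    generate_html(vocabs)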
When dealing with the AI vocabulary, the tool should know to read DPV, TECH, and RISK as direct dependencies. The order of loading these should be DPV first and then the others, as DPV is the foundational or core vocabulary for all other vocabularies. Even though these vocabularies are loaded, no corresponding outputs are produced for them.
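A sketch of how this dependency loading could work (the dependency table and function are assumptions; in practice each vocabulary's config file would declare its own dependencies):
# hypothetical dependency table
DEPENDS_ON = {
    'dpv': [],
    'tech': ['dpv'],
    'risk': ['dpv'],
    'ai': ['dpv', 'tech', 'risk'],
}

def load_order(requested):
    # resolve dependencies transitively; DPV always ends up first
    # since every other vocabulary depends on it
    seen, order = set(), []
    def visit(vocab):
        if vocab in seen:
            return
        seen.add(vocab)
        for dep in DEPENDS_ON.get(vocab, []):
            visit(dep)
        order.append(vocab)
    for vocab in requested:
        visit(vocab)
    return order  # e.g. load_order(['ai']) -> ['dpv', 'tech', 'risk', 'ai']
# outputs are produced only for the requested vocabularies,
# the rest are merely loaded in memory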
Multi-vocabulary changes
We add a new concept in the TECH extension and then we use this as a parent in a concept in the AI extension.
$>python3 generate --vocab=tech,ai -csv -rdf -html
Here both TECH and AI should have outputs produced with the same process as described in the simple scenario.
Primer, guides, and other documents
For primer, guides, and other documents which are not dependent on any vocabulary nor are they located in a specific versioned folder, the process should be simple and straightforward. There should be parameters to control each aspect and possible output.
$>python3 generate --primer
$>python3 generate --guide=ALL
$>python3 generate --guide=data-breach,security
$>python3 generate --mapping=ALL
$>python3 generate --mapping=odrl
Release
To produce the release, there should be a single command that does all the work (which is currently handled by a custom script). The release should be based on the currently configured DPV version in the configuration.
$>python3 generate --release
Config
There should be two distinct sets of configurations – one related to the overall status of the DPV vocabularies with stuff like version number, release/draft status, etc., and the other should be for each vocabulary to specify its own configurations and metadata. This can be done by having a folder structure like below which has the main configuration files at the folder level and then further subfolders for vocabulary metadata. In the scripts, we can then use them as config.vocab.dpv.csv['filename'] and config.vocab.dpv.metadata. This also allows anyone wanting to update a small thing (e.g. fix a typo in a title) for a single vocabulary to do so by going to the specific file, or if we want to add new vocabularies then copy the existing structure to a new file and update it.
.
├── collations.py
├── filepaths.py
├── guides
│ ├── consent-27560.py
│ ├── __init__.py
│ └── primer.py
├── __init__.py
├── mappings
│ ├── __init__.py
│ └── odrl.py
├── namespaces.py
├── serialisations.py
├── sparql_hooks.py
├── term-management.py
├── version.py
└── vocab
├── __init__.py
├── dpv.py
├── risk.py
└── tech.py
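A sketch of how such a package could expose the config.vocab.dpv.* access pattern (the attribute-style access falls out of plain python modules; the re-exports below are assumptions, nothing here is implemented yet):
# config/__init__.py (sketch)
from . import vocab, version, namespaces

# config/vocab/__init__.py (sketch)
from . import dpv, risk, tech

# usage elsewhere in the scripts:
#   import config
#   title = config.vocab.dpv.metadata['dct:title']
#   sheet = config.vocab.dpv.modules['m1']['source']['gsheet_id']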
The choice to use python for this over something more specific and widely used for configurations, such as TOML (we discard JSON as it is a data format, not a config format), is that the rest of the code is heavily python, and since the configurations are effectively dictating how that python code should behave, there is no added advantage to having them in a non-pythonic format. Further, using something other than python increases the complexity of the codebase and requires the person using it to know and understand one more thing - which they might not know until working on this code. The python variables, on the other hand, can be directly used, manipulated, and even inspected for debugging using the same methods as the rest of the code. So until there is a significant advantage to gain from switching to TOML or something else, such as if the code is being migrated to Go or another faster language, the python files should be sufficient.
Within each of the vocabulary files, all relevant information and configurations should be present without having to visit other files to change how a single vocabulary behaves. This means all CSV/RDF/HTML configurations should reside here. For example, below is how the config for DPV could look. In these, it is important to consolidate as much information as possible so that the total amount of information is smaller and more manageable, and so that information which can be derived from a core set is not duplicated just because it deals with a different part of the process (e.g. CSV/RDF/HTML all use the same module names but currently declare them in three different places).
# config/vocab/dpv.py
modules = ['m1', 'm2', 'm3']
# in CSV script, this is used to generate dpv_m1.csv, dpv_m2.csv, dpv_m3.csv
# in RDF script, this is used to generate dpv/modules/m1 ...
# in HTML script, this is used to read dpv/modules/m1 and generate m1 section in HTML
# however, there is more metadata for modules, so we use a dict
modules = {
    'm1': {
        'title': 'M1',
        'parser': 'taxonomy', # method to parse the CSV to generate RDF
        'source': {
            'gsheet_id': '...', # ID of the Google Sheet
            # not all vocab modules have both classes and properties
            'classes': 'tab name', # will generate CSV dpv_m1_classes.csv
            'properties': 'tab name', # will generate CSV dpv_m1_properties.csv
        },
        "html_template": "path...", # optional - will generate HTML output for module
    }
}

folder = "/dpv"
name = "dpv"
# RDF path is <DPV_VERSION>/<folder>/<name>.ttl
html_template = f"{TEMPLATE_PATH}/template_dpv.jinja2"

metadata = {
    "dct:title": "Data Privacy Vocabulary (DPV)",
    "dct:description": "The Data Privacy Vocabulary (DPV) provides terms (classes and properties) to represent information about processing of personal data, for example - purposes, processing operations, personal data, technical and organisational measures.",
    "dct:created": "2022-08-18",
    "dct:modified": DPV_PUBLISH_DATE,
    "dct:creator": "Harshvardhan J. Pandit, Beatriz Esteves, Georg P. Krog, Paul Ryan, Delaram Golpayegani, Julian Flake",
    "schema:version": DPV_VERSION,
    "profile:isProfileOf": "",
    "bibo:status": "published",
}
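To illustrate the consolidation point, here is a hedged sketch of how each script could derive its paths from the single modules dict above (the loops and path strings are illustrative, not implemented; DPV_VERSION is assumed to come from version.py):
import config.vocab.dpv as dpv
from config.version import DPV_VERSION  # assumed location

# CSV step: derive the download targets from the modules dict
for module, info in dpv.modules.items():
    for kind in ('classes', 'properties'):
        if kind in info['source']:
            print(f"{dpv.name}_{module}_{kind}.csv")  # e.g. dpv_m1_classes.csv

# RDF step: the output path comes from the same config
rdf_path = f"{DPV_VERSION}{dpv.folder}/{dpv.name}.ttl"  # <DPV_VERSION>/dpv/dpv.ttl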