DPCat is a Specification for an interoperable and machine-readable data processing catalogue based on [[[GDPR]]] Requirements and EU DPA guidelines. It extends [[[DCAT]]] and [[[DCAT-AP]]] standards and reuses [[[DPV]]] to enable data governance of ROPA and related information across a wide variety of use-cases.

To demonstrate how DPCat can be useful when applied, we utilised the [[[EDPS-ROPA]]] by analysing its contents and representing them in RDF using DPCat. We then performed information validation, querying, and exporting over it. This page summarises the approach and the results. The code and resources are available at https://github.com/coolharsh55/DPCat.


Application to EDPS ROPA

An example of how DPCat can be used is presented by applying it to [[[EDPS-ROPA]]]. See more at https://w3id.org/dpcat/demo/edps-ropa. The description below summarises the process and provides a snippet of how DPCat ROPA records and catalogs are represented.

EDPS is the DPA responsible for overseeing compliance by EU institutions, whose many employees across the various EU bodies carry out a correspondingly large number of personal data processing activities. The EDPS has published detailed ROPA documents, based on GDPR Art. 30 requirements, that provide transparency and accountability. As of March 2022, the EDPS has made available 58 ROPA document collections, each consisting of one or more PDF documents providing information in English regarding the processing operations. Collections are structured based on 'topics', which can be a department (e.g. Administrative and Human Resources, or IT), a process (e.g. Communication, or Public Events), or a specific measure (e.g. Access to documents, or Physical Security).

We analysed the EDPS ROPA documents and selected four (ids: 01, 05, 13, 55) that covered the U1-U4 use-cases for departments, processors, joint controllers, and data transfers. We did not include the other documents, despite their relevance, because of the substantial labour and analysis effort required, and because the selected documents sufficed to demonstrate DPCat's application. The documents were PDFs intended for human comprehension and lacked consistent semantics - e.g. the purpose field also contained legal basis information.

We interpreted these documents and their structure as follows: each document (i.e. PDF) represented a single ROPA instance, with the information contained within it structured using ROPARecord instances. Each ROPARecord was restricted to a single 'contextual entry', based on qualitative criteria regarding the complexity of information and the separation of concerns. For example, where a document specified two processors, we interpreted this as a separate ROPARecord instance for each processor to reflect the separation of concerns in the controller's communication and data governance. The entire collection of documents and RDF graphs was then expressed as part of a single ROPACatalog instance reflecting the published set of records on the EDPS' website.
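As an indicative sketch of this structure (prefixes abbreviated; the exact classes, properties, and catalog/record linking are defined in the DPCat specification, and the linking shown here via DCAT properties is illustrative), a document with two processors might be represented as:

```turtle
@prefix dpcat: <https://w3id.org/dpcat#> .
@prefix dcat:  <http://www.w3.org/ns/dcat#> .
@prefix dpv:   <https://w3id.org/dpv#> .
@prefix ex:    <https://example.org/edps-ropa#> .

# The published set of records, as a single catalog
ex:catalog a dpcat:ROPACatalog ;
    dcat:catalog ex:ropa-05 .

# One PDF document = one ROPA instance
ex:ropa-05 a dpcat:ROPA ;
    dcat:dataset ex:record-05-1, ex:record-05-2 .

# One 'contextual entry' per processor = one ROPARecord each
ex:record-05-1 a dpcat:ROPARecord ;
    dpv:hasDataProcessor ex:ProcessorA .

ex:record-05-2 a dpcat:ROPARecord ;
    dpv:hasDataProcessor ex:ProcessorB .
```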

The manually created RDF graphs were enhanced using the Apache Jena RDFS reasoner to create a 'complete graph' that simplifies querying and validation. The limited RDFS reasoning was sufficient here, since we only needed the expansion of subclasses and subproperties within the graph rather than the inferences generated by an OWL reasoner. For storing the information and offering a querying interface, we utilised the GraphDB Free Edition triple-store, as it is freely available, compliant with relevant standards (e.g. SPARQL), and offers several convenient features, e.g. a friendly interface, integrated reasoners, and SHACL validation.

To simulate typical tasks performed by a DPO or a DPA, we utilised SPARQL queries for two use-cases: (i) retrieval of information required by GDPR Art. 30; and (ii) overview of practices within an organisation in terms of various organisational units, purposes, legal bases, recipients, data transfers, etc. Here, query (i) relates to common compliance documentation procedures, and query (ii) shows the potential for DPCat to help create internal reports or dashboards based on ROPA information, e.g. for a DPO.
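For illustration, a query of the following general shape could retrieve part of the Art. 30 information; the IRIs shown are indicative of DPCat/DPV terms, and the actual queries used are available in the repo.

```sparql
PREFIX dpcat: <https://w3id.org/dpcat#>
PREFIX dpv:   <https://w3id.org/dpv#>

SELECT ?record ?purpose ?legalBasis ?recipient WHERE {
    ?record a dpcat:ROPARecord ;
            dpv:hasPurpose ?purpose .
    OPTIONAL { ?record dpv:hasLegalBasis ?legalBasis . }
    OPTIONAL { ?record dpv:hasRecipient ?recipient . }
}
```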

Code Walkthrough

This section provides instructions for the execution and inspection of the demo code in the GitHub repo.

  1. The EDPS folder contains the ROPA PDFs manually downloaded from [[[EDPS-ROPA]]].
  2. It also contains the RDF serialisations of these documents using DPCat, denoted with the same identifiers as the ROPA documents, e.g. `U05.ttl`.
  3. It also contains `vocab.ttl`, representing the EDPS' internal RDF vocabulary used in the ROPA serialisations.

The demo, represented by the `run.sh` script, works as follows:

  1. First, all the vocabularies utilised are loaded. This includes [[DCAT]], [[DCAT-AP]], [[DPV]], and DPCat.
    echo "collect all vocab files into a single file"
    riot ../dpcat.ttl vocab/* > vocab.ttl
  2. Then, RDFS inferencing is performed by utilising a reasoner that applies the collected vocabularies over a data graph.
    echo "run RDFS inference over data using vocab"
    riot --rdfs vocab.ttl EDPS/*.ttl > data.ttl
  3. Next, the combined data graph, consisting of all the vocabularies as background knowledge, and the data graph, is constructed to simulate a unified knowledge graph for querying and validation.
    riot vocab.ttl data.ttl > data_combined.ttl
  4. Next, validations are performed in the order of vocabulary inheritance, i.e. DCAT, then DCAT-AP, then DPV, then DPCat ...
    echo "validate data file using DCAT-AP shacl shapes"
    shaclvalidate.sh \
         -shapesfile ../shapes/dcat-ap_2.0.1_shacl_shapes.ttl \
         -datafile data_combined.ttl
    echo "validate data file using DPCat shacl shapes"
    shaclvalidate.sh \
         -shapesfile ../shapes/dpcat_shapes_mandatory.ttl \
         -datafile data_combined.ttl
  5. Finally, data is exported by using SPARQL queries to generate CSV (with `<query file>` below standing for the respective SPARQL query file in the repo).
    arq --data=./data_combined.ttl \
         --query=<query file> \
         --results=CSV > EDPS/GDPR_A30.csv
    arq --data=./data_combined.ttl \
         --query=<query file> \
         --results=CSV > EDPS/GDPR_A30_org_overview.csv

Note that, in the above, we utilised the Apache Jena tools, which include the `riot` tool for combining RDF graphs and performing RDFS inferencing, and the `arq` query engine for executing SPARQL queries and storing the output as CSV. We utilised the TopQuadrant SHACL tool for performing SHACL validations. These tools and utilities can be substituted with any others provided they are functionally equivalent and standards-conformant.

For more convenience, it is possible to create a Python script that uses RDFLib to interact with RDF graphs and execute SPARQL queries, pySHACL for validation using SHACL constraints, and OpenPyXL to export the required data as MS-Excel spreadsheets intended for human comprehension.
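Such a script is not part of the repo, but an indicative sketch is shown below; it assumes the file layout produced by `run.sh` above, and the class/property IRIs in the query are indicative of DPCat/DPV terms.

```python
# Illustrative sketch: an equivalent pipeline in Python using
# RDFLib, pySHACL, and OpenPyXL (all third-party packages).
from rdflib import Graph
from pyshacl import validate
from openpyxl import Workbook

# Load the combined graph (vocabularies + RDFS-inferred data)
data = Graph().parse("data_combined.ttl", format="turtle")

# Validate against the DPCat mandatory shapes
shapes = Graph().parse("../shapes/dpcat_shapes_mandatory.ttl", format="turtle")
conforms, _, report_text = validate(data, shacl_graph=shapes)
print(report_text)

# Run a SPARQL query and export the results to an Excel sheet
rows = data.query("""
    PREFIX dpcat: <https://w3id.org/dpcat#>
    PREFIX dpv:   <https://w3id.org/dpv#>
    SELECT ?record ?purpose WHERE {
        ?record a dpcat:ROPARecord ;
                dpv:hasPurpose ?purpose .
    }""")
wb = Workbook()
ws = wb.active
ws.append(["record", "purpose"])
for record, purpose in rows:
    ws.append([str(record), str(purpose)])
wb.save("GDPR_A30.xlsx")
```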

The SHACL validations will produce errors - this is intentional and left unresolved to demonstrate the challenges of converting the unstructured, human-oriented text of the EDPS ROPA into the formally structured representation expected by DPCat.
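For instance, a mandatory-property shape of the general form below (illustrative; not the actual DPCat shapes) would report a violation for any record whose purpose could not be cleanly extracted from the source PDFs:

```turtle
@prefix sh:    <http://www.w3.org/ns/shacl#> .
@prefix dpcat: <https://w3id.org/dpcat#> .
@prefix dpv:   <https://w3id.org/dpv#> .
@prefix ex:    <https://example.org/shapes#> .

ex:ROPARecordShape a sh:NodeShape ;
    sh:targetClass dpcat:ROPARecord ;
    sh:property [
        sh:path dpv:hasPurpose ;   # every record must declare a purpose
        sh:minCount 1 ;
        sh:message "ROPARecord is missing a mandatory purpose" ;
    ] .
```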


Funding: This research has received funding from Uniphar PLC, and the ADAPT Centre for Digital Content Technology which is funded under the SFI Research Centres Programme (Grant 13/RC/2106_P2) and co-funded by the European Regional Development Fund. Harshvardhan J. Pandit has received funding under the Irish Research Council’s Government of Ireland Postdoctoral Fellowship Grant#GOIPD/2020/790.