dpv-x 2021-08-15

Some ideas for extending and improving DPV and documentation
by Harshvardhan J. Pandit
is part of: Data Privacy Vocabulary (DPV)
DPV DPVCG semantic-web

I'm in a community called the Data Privacy Vocabulary and Controls Community Group (DPVCG) where we develop vocabularies for describing how personal data is collected, used, stored, shared, and the associated justifications in law, use of technology and measures, and so on. I got into the group when I was doing my PhD, given its relation to my research. The group itself started out from the SPECIAL project - and I joined as a member some 6 months after its inception. Through continued involvement, I volunteered to chair the group.

The primary and chief outcome of the DPVCG is the Data Privacy Vocabulary (DPV) - a taxonomy or vocabulary or ontology for describing what is happening with the (personal) data in terms of its purposes, processing, stakeholders, legal justifications, technical measures, rights, and so on.

As I continue to handle the progress of the group and its outcomes, I decided to write this post as a reflection on things I want to see happen within the DPVCG or with the DPV. I titled my 'experimental' thoughts and activities dpv-x so as to keep the code and approaches separate from the primary work on DPV.

Documentation

Specifying the Specification

The DPV spec, currently at http://w3.org/ns/dpv, is a monolithic document that tries to do everything in one place. It aims to introduce the DPV, explain its rationale, describe the concepts and their motivation, formally describe the concepts and the relations between them, and explain how they should be used. As a result, it is intimidating and exhausting for a newcomer to grasp and comprehend the DPV in its entirety when pointed directly to the spec.

Instead, the spec should be broken down into layered documents, each explaining part of what is currently implicit, or intended for specific audiences e.g. newcomers, practitioners, or adopters in specific domains or using specific technologies. The spec itself should contain the overview of concepts, their formal definitions, relations, sources, and so on - providing consolidated authoritative documentation without being burdened with their motivation, usage, or issues.

DPV is a vocabulary, there are no arguments about that. But is it or does it want to be more? For example, does it want to provide specific applications or controls or tooling or frameworks that can be used? Maybe. But regardless of these, the DPV spec, as it stands, is about the DPV as a vocabulary, and so how those concepts are to be used is beyond its scope.

At the same time, how to use a concept or where it is applicable are genuine questions for the reader and/or adopter, and for this, it is essential to provide some guidance on the 'use-cases' where that concept can be applied. For example, consider the following scenario: when reading the section for 'access control', the use-cases might detail how this is used as a technical measure to protect data by the controller, by the processor, or how some access control measure is specifically associated with some sensitive/special categories of personal data. Here, it is not sufficient to provide the concept 'access control' with just a definition and its existence as a 'technical measure'; it must be demonstrated through use-cases that show where it occurs and how its use is dependent on the context.

Providing use-cases for concepts, and ensuring they are consistent and applicable, is important because the DPV is an extremely high-level functional vocabulary: its application can vary greatly depending on the use-cases it is applied within, as well as on restrictions in the technical or semantic use (e.g. use of specific data modelling methods such as OWL2, or integration with ODRL, or within Java/Python tooling), which are not universal.

Priming using a Primer

A primer is a good guide for newcomers and offers an overview of the concepts without overwhelming the reader. It's a gentle introduction that explains why the concepts and models exist as they do without delving into the complexities of design and the intricacies of applications. Imagine a car: a primer explains the 'common sense' components of the chassis, wheels, engine, fuel, steering wheel, driver, passengers, and how they work together. It does not explain how fuel is combusted in the engine chambers, nor how the gears (manual or automatic) are used to control the speeds and directions. The specification is the exhaustive and authoritative document for the car. The 'manual' is distinct from either of these, in being documentation for the 'user' of the car in a specific scenario.

In this way, the primer is the first source of information for someone who doesn't know what DPV is, or what it provides, for what reasons, and what are the concepts one must be aware of before delving further/deeper into it. A primer is also the first source of information intended for someone who wants to know what they can do with DPV, whether they have to follow specific formally specified requirements, or what aspects of their use and application they need to figure out themselves.

From this, the Primer should contain a brief description of the aims/goals of the DPV(CG) in terms of what information is being modelled. It should provide an overview of the 'base' or 'top' concepts that give a good idea for what the vocabulary consists of. Everything else then is either specialisation or expansion of these concepts, or exists to supplement them with additional information. As it currently stands, the Primer should explain: Purpose, Processing, Personal Data Categories, Entities (Controller, Processor, Third Party, Data Subject), Technical and Organisational Measures, Legal Bases, Rights, and Risks.

The Primer should also clarify whether these concepts must be utilised within a semantic web environment, or they can be adopted and used outside in other tools and technologies (the latter is desired). It should explain how the DPV can be applied to specific common situations and use-cases (e.g. compliance documentation, privacy notices, compliance checking) and what are the available avenues for doing so. More importantly, it should point to other related vocabularies (e.g. ODRL, PROV) and resources (e.g. publications, projects, tools) that can assist the reader in using DPV in some environment.

Adopters Guide

A guide for how the DPV can be used in different ways, and what each way offers in terms of benefits and limitations. Also explains where the DPV needs to be extended, changed, or modified for use in specific scenarios and use-cases.

Should explain what to do if DPV is needed in specific shapes and forms: e.g. RDFS, OWL2, SKOS, and so on. Should also explain how to utilise DPV if all that is needed is a part of the taxonomy e.g. list of purposes. And how to utilise or rather extract/convert DPV to something non-sem-web, e.g. JSON list.
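As a rough illustration of the last point, the following Python sketch extracts one branch of a taxonomy into a plain JSON list. It assumes the DPV has already been parsed into (subject, predicate, object) tuples; the TRIPLES list, the term names in it, and the function names are illustrative placeholders rather than an actual DPV export.

```python
import json

# Hypothetical (subject, predicate, object) triples standing in for a parsed DPV file.
TRIPLES = [
    ("dpv:ResearchAndDevelopment", "rdfs:subClassOf", "dpv:Purpose"),
    ("dpv:AcademicResearch", "rdfs:subClassOf", "dpv:ResearchAndDevelopment"),
    ("dpv:CommercialResearch", "rdfs:subClassOf", "dpv:ResearchAndDevelopment"),
    ("dpv:AccessControl", "rdfs:subClassOf", "dpv:TechnicalMeasure"),
]


def descendants(root, triples):
    """Return every term that is a direct or indirect subclass of root, sorted."""
    children = {}
    for subject, predicate, obj in triples:
        if predicate == "rdfs:subClassOf":
            children.setdefault(obj, []).append(subject)
    found, stack = set(), [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return sorted(found)


def purposes_as_json(triples):
    """Serialise the purpose branch of the taxonomy as a flat JSON list of names."""
    return json.dumps(descendants("dpv:Purpose", triples))


if __name__ == "__main__":
    print(purposes_as_json(TRIPLES))
```

A flat list like this loses the hierarchy, which is probably acceptable for something like a dropdown but not for anything that needs to reason over parent concepts.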

One thing I often get asked is how to use the DPV concepts in assessing compliance states. Since DPV doesn't model the norms and obligations, it is left up to the adopter to specify what they want to use and that will then tell them how DPV is applicable. The guide can explain this, and point to existing resources in terms of projects that use DPV or offer ways to do compliance related tasks, e.g. SPECIAL, MIREL, BPR4GDPR, TRAPEZE, MOSAICROWN.

Programmatic Generation of Diagrams

Diagrams are a nice way to visually comprehend information, and in the case of DPV, especially the overview of a taxonomy (or specific sub-group) as well as the modelling and relations for a specific concept. However, creating diagrams manually is time-consuming, laborious to keep updated, and difficult to get correct when unfamiliar with the progress of the work.

As a solution, programmatic generation of diagrams automatically from the concepts can assist in easing some if not all of the work. Diagrams of this form are of two types: (i) overview diagrams for presenting all concepts associated in a particular taxonomy (e.g. categories of purposes); and (ii) overview diagrams for presenting relationships associated with a given concept (e.g. properties and parent/children for both the concept and its ancestors).

As an experiment, I tried using GraphViz and a Python script to see if I could create diagrams programmatically. While the code and result (see below) need more work, the approach does hold promise. Once set up, such a system can continue functioning without requiring much effort as more concepts are added and the DPV continues to evolve and expand, including through extensions.

[Diagram: relations of PersonalDataHandling. PersonalDataHandling -> DataController (dpv:hasDataController), DataSubject (dpv:hasDataSubject), PersonalDataCategory (dpv:hasPersonalDataCategory), LegalBasis (dpv:hasLegalBasis), Processing (dpv:hasProcessing), Purpose (dpv:hasPurpose), Recipient (dpv:hasRecipient), Risk (dpv:hasRisk), TechnicalOrganisationalMeasure (dpv:hasTechnicalOrganisationalMeasure).]

[Diagram: taxonomy of ResearchAndDevelopment. ResearchAndDevelopment subclassOf Purpose; Non-Commercial Research, Academic Research, and Commercial Research are each subclassOf ResearchAndDevelopment.]
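To give an idea of what such a script involves, here is a minimal Python sketch that turns a list of edges into GraphViz DOT source. It is not the script I actually used; the to_dot function, the EDGES list, and the styling choices are assumptions made for illustration, with a few edges copied from the diagrams in this post.

```python
def to_dot(edges, name="dpv"):
    """Build GraphViz DOT source from (source, label, target) edges."""
    lines = [f"digraph {name} {{", "  rankdir=LR;", "  node [shape=box];"]
    for source, label, target in edges:
        # Quote node names so that names containing spaces stay valid DOT.
        lines.append(f'  "{source}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# A few illustrative edges: two properties and two subclass relations.
EDGES = [
    ("PersonalDataHandling", "dpv:hasPurpose", "Purpose"),
    ("PersonalDataHandling", "dpv:hasProcessing", "Processing"),
    ("ResearchAndDevelopment", "subclassOf", "Purpose"),
    ("Academic Research", "subclassOf", "ResearchAndDevelopment"),
]

if __name__ == "__main__":
    print(to_dot(EDGES))
```

Saving the printed output to a file such as dpv.dot and running `dot -Tsvg dpv.dot -o dpv.svg` should render it. In a real setup the edges would come from the vocabulary files rather than a hand-written list.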

Extensions for specific jurisdictions and domains

GDPR specification: The GDPR iteration of DPV. An extension based on equating or extending the concepts within DPV so they match the expected vocabulary required for working with GDPR and its compliance. Currently, this is the DPV-GDPR specification.

ISO and CCPA specifications: More iterations of DPV regarding ISO and CCPA applications respectively. An extension is where the concepts are applied in a specific setting or use-case, and which cannot be modelled in the base or primary iteration of DPV without forcing everyone else to limit themselves to the defined concept. For example, ISO has 'PII' and CCPA has 'sell', where these two concepts are often equated to 'personal data' and 'sharing' at a broad level for global understanding, but where their use implies the specific notions within their respective scopes (ISO activities and California respectively). Here, when I say CCPA, I also include CPRA as its successor: there is no point in modelling these two as separate vocabularies given that they target the same jurisdiction and stakeholders. Whether there should be a separate extension for each jurisdiction or use-case is dependent on how much control and change is expected (e.g. if each USA state creates new terms that are only applicable in their respective jurisdictions, the DPV should not include these).

DUO Alignment: DUO is a good vocabulary for sharing and management of health/medical datasets - specifically genetics, but it has broader applications. In comparison with DPV, DUO is much more focused in its scope, and as a result has a richer set of concepts that demonstrate practicality of use-cases. It encompasses how datasets can be acquired, collected, and shared along with specific permissions and prohibitions. It is difficult to model such permissions/prohibitions within the DPV because of the much higher and broader scope of both the concepts and legal compliance. That said, DUO and DPV can be aligned through either a) creating additional concepts for permission expression; or b) using existing vocabularies such as ODRL to express the combination. The end (or resulting) semantics permit the same type of usage - where each concept (or code) is intended to convey a specific set of applicable conditions.

Extensions for domains: The DPV aims to be domain agnostic and doesn't care where the purpose is applied or interpreted. This is by design so that it can be utilised in as wide a range of areas as possible. However, some use-cases or domains, like jurisdictions, require specific interpretation and definitions for applications. Often, these domains also have separate and specialised laws that introduce terms or define processes which must be incorporated alongside other privacy and data protection concerns. Extensions are a good mechanism for making DPV useful to these domains without them affecting other 'normal' applications and use-cases. Examples of such domains are: healthcare (as in medical, hospital), finance (as in banks, insurance), government (as in governing bodies, authorities, police). Creating such extensions requires (multi-disciplinary) experts with knowledge of the domain requirements as well as of how they apply to the more generalised areas targeted by DPV.

Extensions for Convenience: There are terms and concepts used in several instances which are not formalised in any domain or law, but are ad hoc creations of the industry or common usage across media and users. Examples of these are: app (as in smartphone, but also desktop), app (separately, as a possible synonym of both app and company together), service, product, company, and so on. These terms can be loosely defined in a separate extension so that they do not have a large impact on DPV or require tight integration with concepts within the core vocabulary - such as between Data Controller and app. As things mature or are made more clear, concepts can be moved into the core vocabulary (primary, as in DPV).

Technologies and Tools: The DPV, in its current state, does not model specific technologies - such as devices, storage (hard disk, cloud), specific databases, or even cookies. Whether these should be provided as an ad-hoc list within the DPV or as an extension is for discussion. My opinion is that by starting this as an extension, we can quickly provide these terms without worrying too much about implications for other DPV terms. Then when we have had time to discuss their relation and usefulness, they can be integrated as necessary, or only as top-level abstract concepts within the DPV. One of the challenges in directly including them is the association between technology and technical measure. For example, consider using a database with replication across locations together with specific access control and encryption. It is difficult to find an ideal way to represent all this information, and at the moment all we can do is state, trivially, that some technical measures are associated with the processing and/or purpose.

SKOS Modelling

Currently, the DPV is exclusively modelled using RDFS semantics. Even though it contains statements based in OWL, these are trivial in terms of the impact they have on the model and reusability in non-OWL environments. A common issue I hear about is the problematic use of subclass as a relation in tooling that expects to have instances of classes - such as in dropdowns; this is difficult to provide while staying within the current set of logic constraints we have put ourselves into. A SKOS iteration of DPV can vastly improve the quality of application when the DPV terms do not need to be instantiated (e.g. to say what specific type of access control is being used), but are only used as a vocabulary (e.g. to select access control from available technical measures).

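@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix dpv: <http://www.w3.org/ns/dpv#> .

# existing model: every term is a class, related by subclassing
dpv:Concept a rdfs:Class .
dpv:ParentConcept a rdfs:Class .
dpv:Concept rdfs:subClassOf dpv:ParentConcept .
dpv:ParentConcept rdfs:subClassOf dpv:TopConcept .

# SKOS model: every term is an instance of dpv:TopConcept, related by skos:broader
dpv:TopConcept a skos:Concept .
dpv:Concept a skos:Concept, dpv:TopConcept .
dpv:ParentConcept a skos:Concept, dpv:TopConcept .
dpv:Concept skos:broader dpv:ParentConcept .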

Arguably a complicated way to express the same relations, but here any instance of dpv:TopConcept can be used, while still having relations between them expressed using SKOS.
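The rewriting from the existing model to the SKOS one is mechanical, so it could plausibly be generated rather than maintained by hand. The Python sketch below shows the idea on plain tuples. The rdfs_to_skos function is hypothetical, and it assumes that every rdfs:subClassOf triple passed in belongs to the taxonomy under the given top concept.

```python
def rdfs_to_skos(triples, top):
    """Rewrite rdfs:subClassOf triples under `top` into the SKOS pattern.

    Each term other than `top` becomes a skos:Concept and an instance of `top`.
    Subclass links become skos:broader, except links that point directly at
    `top`, which are already covered by the instance relation.
    """
    out = [(top, "rdf:type", "skos:Concept")]
    seen = set()
    for subject, predicate, obj in triples:
        if predicate != "rdfs:subClassOf":
            continue
        for term in (subject, obj):
            if term != top and term not in seen:
                seen.add(term)
                out.append((term, "rdf:type", "skos:Concept"))
                out.append((term, "rdf:type", top))
        if obj != top:
            out.append((subject, "skos:broader", obj))
    return out


if __name__ == "__main__":
    existing = [
        ("dpv:Concept", "rdfs:subClassOf", "dpv:ParentConcept"),
        ("dpv:ParentConcept", "rdfs:subClassOf", "dpv:TopConcept"),
    ]
    for triple in rdfs_to_skos(existing, "dpv:TopConcept"):
        print(triple)
```

A dropdown can then simply list everything typed as dpv:TopConcept, and use skos:broader only if it wants to show nesting.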

I'm not an expert in (or experienced with) the implications of this type of semantic alignment between RDFS, OWL, and SKOS, especially when reasoning is involved. Still, as long as the reasoning for RDFS and SKOS is satisfied and the OWL semantics are not complicated (or involved), I think this is valuable to work with in terms of practical tools and developing frameworks.

OWL2 model

In contrast to the application of SKOS, this is all about going as deep into OWL2 as possible. The origins of DPV are in SPECIAL, which had a nice minimal layer of OWL subsumption upon which a modified reasoning engine checked compliance by comparing two policies (preference and request), in an optimised fashion that took only a few microseconds. In the interest of continuing this effort, DPV in its current state can benefit from a native OWL iteration that can be readily plugged into and used alongside tooling from SPECIAL and its successor TRAPEZE projects.
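The general idea of comparing a request against a preference through subsumption can be illustrated with a toy Python sketch. This is not SPECIAL's algorithm or policy format: the PARENT taxonomy, the policy fields, and both functions are assumptions invented for illustration, and a real implementation would rely on an OWL reasoner.

```python
# Hypothetical taxonomy, expressed as child -> parent.
PARENT = {
    "AcademicResearch": "ResearchAndDevelopment",
    "ResearchAndDevelopment": "Purpose",
    "Collect": "Processing",
    "Share": "Processing",
}


def is_subsumed(term, ancestor, parent=PARENT):
    """True if term equals ancestor or lies below it in the taxonomy."""
    while term is not None:
        if term == ancestor:
            return True
        term = parent.get(term)
    return False


def complies(request, preference):
    """True if every field the preference constrains is present in the
    request and its value is subsumed by the term the preference permits."""
    return all(
        field in request and is_subsumed(request[field], allowed)
        for field, allowed in preference.items()
    )


if __name__ == "__main__":
    preference = {"purpose": "ResearchAndDevelopment", "processing": "Processing"}
    request = {"purpose": "AcademicResearch", "processing": "Collect"}
    print(complies(request, preference))
```

Here the request is accepted because academic research is a kind of research and development, and collecting is a kind of processing; a request for the broader 'Purpose' would be rejected.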

My hope is that the SPECIAL compliance checker is translated to be capable of running in the browser with minimal performance penalties or requirements, which can open up an entire new world of online preference matching and compliance tooling to be developed. Putting semantics back on the web. The TRAPEZE project is working on something related (I think), given that they are working on a minimal JSON specification for expressing policies which can be converted to formal OWL2 ones.

Use-Cases and Examples

Without use-cases, it is getting more and more difficult (and absurd) to develop the DPV. It's a vacuum of what concepts might be useful, with lots of implicit references and requirements (mostly from GDPR), and no indication of how the modelling of terms will impact specific areas of application. For example, when developing purposes as a taxonomy, it helps to have a clear picture of where the purposes are defined, or what use-cases require specific notions of purpose (such as where commercial profit making is intended as opposed to non-commercial social benefit). These have an impact on how the purposes should be designed.

A list of use-cases, and a way to identify them as being applicable for a given concept would be a massive help for both the creators and adopters of the DPV. However, doing this is not an easy task, because each use-case must be documented along with information on what concepts (from DPV) are associated with it. This can be done by giving the use-case a specific identifier, creating a file for each use-case, storing the RDF representation of concepts as being applicable, and then integrating this with the documentation system for finding and listing related use-cases (trivially done using a SPARQL query). More complex iterations can identify use-cases for a given concept where parent and ancestor concepts are relevant. Thus a use-case for a top-concept can easily provide guidance for more specialised and narrower concepts down the taxonomy.
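The 'more complex iteration' that walks up to parent and ancestor concepts can be sketched in a few lines of Python. Everything here is hypothetical: the BROADER and USE_CASES tables, the use-case identifiers, and the assumption that each concept has a single parent. In SPARQL, the same walk would likely be a property path such as rdfs:subClassOf* over the stored use-case annotations.

```python
# Hypothetical data: parent links from the taxonomy and use-case annotations.
BROADER = {
    "dpv:AcademicResearch": "dpv:ResearchAndDevelopment",
    "dpv:ResearchAndDevelopment": "dpv:Purpose",
}
USE_CASES = {
    "dpv:Purpose": ["UC-001"],
    "dpv:ResearchAndDevelopment": ["UC-002"],
}


def use_cases_for(concept, broader=BROADER, use_cases=USE_CASES):
    """Collect use-cases for a concept, then for each of its ancestors,
    ordered from the most specific concept to the most general one."""
    found = []
    seen = set()  # guards against accidental cycles in the parent links
    while concept is not None and concept not in seen:
        seen.add(concept)
        found.extend(use_cases.get(concept, []))
        concept = broader.get(concept)
    return found


if __name__ == "__main__":
    print(use_cases_for("dpv:AcademicResearch"))
```

Ordering the results from specific to general means the documentation can show the most relevant use-cases first and fall back to those of the top concept.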

In a way, each use-case can also be made into an example of where the given concept is to be defined and/or applied. The difference is that the example should be declared with some specific formal representation e.g. RDFS, OWL2, or SKOS. Like the use-case, the example can also be given an identifier and stored in a separate file. However, the application of a parent concept can be tricky in some serialisations such as OWL2 or SKOS if there are other assertions at play (e.g. disjoint sets for OWL2 or non-transitive relations for SKOS). Still, the idea is nice and simple, and can help cover a large number of concepts with use-cases and examples from a small carefully chosen set targeted towards the top concepts.

Datasets

While the DPV is making a good effort at providing a vocabulary, it would be nice to have some collection of real-world data reflecting the practices of apps, companies, privacy-policies, consent dialogues, and so on. Currently, AFAIK, no such dataset exists, though there is significant technological capability, demonstrated by projects (e.g. UsablePrivacy and PriBot) that have shown the use of ML and NLP to analyse privacy policies and summarise personal data use. Sadly, none of the researchers involved have released any of this data or their models or any resources, other than well crafted websites.

A lot of good can be created from such datasets, especially when paired with exhaustive and interoperable vocabularies such as the DPV. For example, NOYB launched hundreds of complaints using an automated system. If real-world data is available, researchers can initiate compliance assessments and documentation by linking it with legal obligations and requirements, and assist practitioners in developing such tools for social good.