A Design Pattern Describing Use of Personal Data in Privacy Policies
Advances in Pattern-Based Ontology Engineering
Harshvardhan J. Pandit*, Declan O'Sullivan, Dave Lewis
Information in Privacy Policies
Automated approaches rely on interpreting textual information into a structured, machine-readable form that can be stored and analysed programmatically. To date, each automated approach has developed and utilised its own method and vocabulary to persist and analyse the information within privacy policies. This has resulted in duplication of both information and analysis, primarily due to a lack of annotated datasets, but also due to a lack of shared semantics and vocabulary. Sharing information extracted from privacy policies enables the creation of datasets that can be used to evaluate the effectiveness of approaches. Similarly, a common vocabulary would aid in developing such datasets and in comparing results across approaches so as to improve them. Such a common vocabulary also has applications in related domains such as legal compliance, for expressing and evaluating information against legal rules and obligations.
Competency Questions for information regarding personal data
What personal data is collected? e.g. email
Does the data have a category? e.g. contact information
What was its source? e.g. user
How is it collected? e.g. given by user, automated
What is it used for? e.g. creating an account, authentication and verification
How long is it retained for? e.g. 90 days after account deletion
Who is it shared with? e.g. name of partner organisation(s)
What is the legal basis? e.g. given consent, legitimate interest
What processes/purposes was the data shared for? e.g. analytics, marketing
What is the legal type of third party? e.g. processor, controller, authority
How can personal data be rectified or corrected?
How can personal data be deleted or removed?
How can a copy of the personal data be obtained?
How can personal data be transferred to another party?
How can information about the personal data be obtained?
What measures exist to safeguard personal data?
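Several of these questions map directly onto triples in the pattern described in the following sections. A minimal hypothetical sketch in Turtle, assuming the GDPRov and GDPRtEXT namespaces resolve at their w3id.org locations; all instance names under ex: are illustrative, not part of the pattern:

```turtle
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix gdprov: <https://w3id.org/GDPRov#> .
@prefix ex:     <http://example.com/policy#> .

# Does the data have a category? -> the email address is an instance
# of a Contact Information subclass of personal data
ex:ContactInformation rdfs:subClassOf gdprov:PersonalData .
ex:Email a ex:ContactInformation .

# Who is it shared with, and for what process?
ex:AnalyticsSharing a gdprov:DataSharingStep ;
    gdprov:sharesData ex:Email ;
    gdprov:sharesDataWith ex:PartnerOrg ;
    gdprov:sharesDataForProcess ex:Analytics .
```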
Representing Concepts & Relationships
While a design pattern usually does not use external vocabularies (except standardised ones) in order to provide an ‘abstract implementation’, we describe our pattern using the GDPRtEXT and GDPRov ontologies, which provide concepts relevant to the GDPR. GDPRtEXT provides definitions of concepts and terms used within the text of the GDPR using SKOS. GDPRov is an ontology for describing the provenance of consent and personal data life-cycles using GDPR-relevant terminology, and is an extension of PROV-O and P-Plan. Though it is possible to define the pattern using new abstract concepts, we reuse existing concepts and properties where applicable in the interest of practical applicability. The abstract pattern can be created as an ontology- and legislation-independent implementation by extracting the concepts and relationships into an empty namespace.
The pattern is described here in terms of its concepts and relationships. A visualisation of the pattern is presented in [fig:pandit_personaldata_fullpattern], created using yEd following the Graffoo specification for diagrammatic representation of ontologies. The pattern is available online along with its documentation and has been submitted to the ontology design patterns collaborative wiki.
Source of Data
The source of personal data indicates where the data is obtained or collected from. This is described using the dct:source property, with the range defined as an instance of gdprtext:Entity or one of its subclasses such as User or ThirdParty. Since every item of personal data must have at least one source, this provides the axiom: PersonalData ⊑ ≥1 source.Entity
Data is collected through a gdprov:DataCollectionStep, and is represented using the property gdprov:collectsData. The data provider is represented using prov:Agent through the property gdprov:collectsDataFromAgent.
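The source and collection relationships above can be sketched as follows. This is a minimal hypothetical example; the GDPRov and GDPRtEXT namespace IRIs are assumed to be their w3id.org locations, and the ex: instance names are illustrative:

```turtle
@prefix dct:      <http://purl.org/dc/terms/> .
@prefix gdprov:   <https://w3id.org/GDPRov#> .
@prefix gdprtext: <https://w3id.org/GDPRtEXT#> .
@prefix ex:       <http://example.com/policy#> .

# every item of personal data has at least one source
ex:Email dct:source ex:User .

# the collection step records what is collected, from whom, and how
ex:Signup a gdprov:DataCollectionStep ;
    gdprov:collectsData ex:Email ;
    gdprov:collectsDataFromAgent ex:User ;
    gdprov:hasCollectionMechanism gdprtext:GivenByUser .
```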
Data Usage & Processing
Legal Basis for Data Usage
Every use of personal data within a process must have a legal basis under the GDPR. Examples of such legal bases defined within GDPRtEXT include consent, legitimate interest, compliance with the law, and performance of a contract. To represent this, the pattern uses the property gdprov:hasLegalBasis with the range gdprtext:LawfulBasisForProcessing. Since every data use must have at least one legal basis, this provides the axiom: Process ⊑ ≥1 hasLegalBasis.LawfulBasisForProcessing
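As a brief hypothetical sketch of a process declaring its lawful basis (namespace IRIs assumed to be the ontologies' w3id.org locations; ex: names are illustrative):

```turtle
@prefix gdprov:   <https://w3id.org/GDPRov#> .
@prefix gdprtext: <https://w3id.org/GDPRtEXT#> .
@prefix ex:       <http://example.com/policy#> .

# a process using personal data under the legal basis of consent
ex:Newsletter a gdprov:Process ;
    gdprov:usesData ex:Email ;
    gdprov:hasLegalBasis gdprtext:Consent .
```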
The sharing of data involves the entity the data is shared with, the purposes for sharing, and their legal basis. This is represented within the pattern through the use of gdprov:DataSharingStep and the property gdprov:sharesData. The entity the data is shared with is represented using the gdprov:sharesDataWith property, with the domain as gdprov:DataSharingStep and the range as a type of gdprov:Agent, such as another Data Controller, Data Processor, or an Authority. The purpose of sharing is represented using gdprov:Process and the property gdprov:sharesDataForProcess to model the data being used in that process after sharing. The legal basis of processes for which the data is shared is represented using gdprov:hasLegalBasis as specified earlier. Since it is mandatory to inform who the data is being shared with, along with its intended purposes, and the specific legal obligation, we have the following axioms:
DataSharingStep ⊑ ≥1 sharesData.PersonalData
DataSharingStep ⊑ ≥1 sharesDataWith.Agent
DataSharingStep ⊑ ≥1 sharesDataForProcess.Process
The example use-case is illustrated in Fig. [fig:pandit_personaldata_use-case] using Graffoo and shows the classes, properties, and instances. The corresponding code is presented in the following listing using the Turtle notation for RDF.
@prefix dct: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# GDPRov and GDPRtEXT namespaces (published via w3id.org)
@prefix gdprov: <https://w3id.org/GDPRov#> .
@prefix gdprtext: <https://w3id.org/GDPRtEXT#> .
@prefix : <http://example.com/personaldata#> .
:PaymentProcess a gdprov:DataSharingStep ;
    rdfs:label "Payment Process"^^xsd:string ;
    gdprov:sharesData :EmailAddress ;
    gdprov:sharesDataForProcess :IdentityVerification ;
    gdprov:sharesDataWith :PaymentsController .

:PlatformServices a gdprov:Process ;
    rdfs:label "Provide, Improve, and Develop Platform"^^xsd:string ;
    gdprov:hasLegalBasis gdprtext:LegitimateInterest ;
    gdprov:usesData :EmailAddress .

:Registration a gdprov:DataCollectionStep ;
    rdfs:label "Registration for new users"^^xsd:string ;
    gdprov:collectsData :EmailAddress ;
    gdprov:collectsDataFromAgent :User ;
    gdprov:hasCollectionMechanism gdprtext:GivenByUser .

:AccountInformation a rdfs:Class, owl:Class ;
    rdfs:label "Account Information of a User"^^xsd:string ;
    rdfs:subClassOf gdprov:PersonalData .

:IdentityVerification a gdprov:Process ;
    rdfs:label "Identity Verification"^^xsd:string ;
    gdprov:hasLegalBasis gdprtext:Contract ;
    gdprov:usesData :EmailAddress .

:PaymentsController a gdprov:Controller ;
    rdfs:label "Payments Controller"^^xsd:string .

:User a gdprov:DataSubject ;
    rdfs:label "User of Service"^^xsd:string .

:EmailAddress a :AccountInformation ;
    rdfs:label "Email Address"^^xsd:string .
The answers to the competency questions corresponding to the use-case are provided below. The RDF code has been truncated by removing annotation properties to save space; the example in its entirety is available online.
What personal data is collected: Email Address
Does the data have a category: Account Information
What was its source: User
How is it collected: Given by user
What is it used for: Platform Services, Payments
How long is it retained for: indefinitely (no end duration)
Who is it shared with: Payments Controller
What is the legal basis: Legitimate Interest, Contract
What processes/purposes was the data shared for: Identity Verification
What is the legal type of third party: Data Controller
Discussion: Application to State of the Art
CLAUDETTE discusses the creation of a gold standard based on information presence and requirements, and highlights challenges regarding the context of information (e.g. at sentence level) and the omission of information. Degeling et al. also analyse privacy policies and reiterate multilingualism as a challenge. Amos et al. have recently published a corpus consisting of over a million policies and their iterations over two decades. In addition to the data being available for reuse, the corpus is the largest dataset of policies currently available.
Conversely, other approaches that do model the relevant semantics do not apply it to privacy policies. These include approaches that model privacy and/or data protection concepts, such as GDPRtEXT and GDPRov, as well as approaches that model preferences, such as the Privacy Preference Ontology. A notable addition to these is the Data Privacy Vocabulary (DPV), a vocabulary providing a taxonomy of concepts associated with the use and processing of personal data, and an outcome of the W3C Data Privacy Vocabularies and Controls Community Group.
With these, it is clear that on one hand are approaches for automation utilising machine-learning and natural-language processing to extract and analyse information from privacy policies, and on the other hand are approaches that model information relevant to privacy and data protection. Through this ODP, we hope to encourage the two approaches to interleave and produce an annotated corpus that utilises semantics to share and reuse information. In this, the ODP is useful in automation by representing the ‘core’ information currently utilised in analysis of privacy policies, while also encouraging the ontological approaches to develop a more comprehensive representation of privacy policies.
We consider our work on this ODP as an initial effort towards consolidating information within privacy policies. Using the pattern to reflect information from several distinct real-world privacy policies will demonstrate its feasibility and applicability in real-world scenarios. This presents a challenge as the pattern currently assumes the presence of all required information which may not be the case for some use-cases, particularly where interpretations of information are ambiguous. However, capturing such ambiguities through a meta-pattern can possibly aid in flagging them for review by legal experts.
In addition to the above, the pattern faces other challenges for the modelling of information it aims to represent. For example, it is not clear what level of abstraction should be represented in the pattern regarding concepts such as storage and sharing. Should there be a DataStorageStep which can be further annotated to represent various pieces of information relating to the storage of personal data? Abstractions can help to represent different storage durations and formats for the same instance of personal data, such as storing the actual data for 6 months while a (pseudo-)anonymised copy is stored for 2 years. However, tacking such abstractions onto the pattern can make it rigid (in terms of modelling) and complex. More work needs to be undertaken to evaluate whether such abstractions are necessary in the pattern, and how they should be represented.
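The 6-month/2-year example could be sketched along these lines. Note that ex:DataStorageStep, ex:storesData, and ex:hasStorageDuration are hypothetical terms invented for illustration; they are not part of GDPRov or the pattern:

```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.com/policy#> .

# hypothetical: the actual data is stored for 6 months
ex:RawStorage a ex:DataStorageStep ;
    ex:storesData ex:Email ;
    ex:hasStorageDuration "P6M"^^xsd:duration .

# hypothetical: a (pseudo-)anonymised copy is stored for 2 years
ex:AnonymisedStorage a ex:DataStorageStep ;
    ex:storesData ex:AnonymisedEmail ;
    ex:hasStorageDuration "P2Y"^^xsd:duration .
```

Separating the two storage activities lets the same data item carry different durations without overloading a single retention property.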
Another challenge is the representation of storage duration (or retention period). Concrete values such as 6 months or 2 years can be represented using appropriate ontologies, but ambiguous statements are difficult to represent using such ontologies. An example of this is the statement "data may be stored for as long as necessary..." in which there is no defined end to the duration of storage. Representing this as a time:Duration instance is problematic as there is no clear method to represent its end period. Not defining an end period is also not a solution due to the open world assumption. Our approach towards solving this issue is to abstract the storage activity as described earlier. However, we are open to other approaches and solutions to this problem.
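One hypothetical way to make such open-ended retention explicit rather than merely absent (ex:DataStorageStep and ex:hasIndefiniteRetention are illustrative terms, not part of GDPRov or the pattern):

```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.com/policy#> .

# "stored for as long as necessary": no duration literal is asserted;
# instead the open-ended nature is stated explicitly, so a missing end
# date cannot be confused with an unknown one under the open world
# assumption
ex:EmailStorage a ex:DataStorageStep ;
    ex:storesData ex:Email ;
    ex:hasIndefiniteRetention "true"^^xsd:boolean .
```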
Capturing this information is essential for turning privacy policies into machine-readable data, and the paper demonstrates the suitability of ODPs for this task. This motivates the creation of granular, context-specific ODPs for privacy policies that can represent information regarding concepts such as rights and technical measures, and that can be used within existing implementations of automated analysis and information extraction approaches.
This work is supported by the ADAPT Centre for Digital Content Technology which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
The UsablePrivacy project provides a dataset of tagged privacy policies, which has been utilised by most of the other approaches. However, this dataset lacks a formal vocabulary describing its contents, and is out of date compared with the privacy policies of today.