Extracting Provenance Metadata from Privacy Policies
International Provenance and Annotation Workshop (IPAW)
✍ Harshvardhan J. Pandit*, Declan O'Sullivan, Dave Lewis
publication 🔓copies: TARA, zenodo
📦resources: poster, slides
Discussing how information about data provenance can be extracted from privacy policies and modelled using semantic web technologies
Extraction using Keyword-based entity recognition
Manual efforts to extract this provenance information do not scale well across a large number of policies, nor can they be automated. Entity extraction techniques can help in the identification and categorisation of methods. Identification and extraction can take place by searching for certain keywords known to refer to provenance information. For example, the word “collect” is almost always accompanied by the type of information collected. A starting point for GDPR-relevant keywords is the GDPRtEXT ontology, which defines GDPR terms and concepts using the SKOS vocabulary.
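The keyword search described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used in the paper: the `KEYWORDS` mapping, the activity labels, and the function names are hypothetical, and a real keyword list would be drawn from GDPRtEXT rather than hard-coded.

```python
import re

# Hypothetical keyword stems mapped to activity labels (assumption for this
# sketch); a real vocabulary would be derived from GDPRtEXT terms.
KEYWORDS = {
    "collect": "collection",
    "share": "sharing",
    "store": "storage",
}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_provenance_sentences(policy_text):
    """Return (activity, sentence) pairs for sentences that contain a keyword stem."""
    hits = []
    for sentence in split_sentences(policy_text):
        for stem, activity in KEYWORDS.items():
            # \w* lets the stem match inflections such as "collects" or "collected".
            if re.search(rf"\b{stem}\w*", sentence, flags=re.IGNORECASE):
                hits.append((activity, sentence))
    return hits
```

Each returned sentence is a candidate for further processing, for example to pick out the type of information that accompanies the word “collect”. Plain stem matching will miss paraphrases and produce false positives, so it serves only as a first filter.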
Extraction using Machine learning models
Provenance metadata expressed using PROV-O concepts are assertions about the past (execution) and should not be used to depict a ‘model’ or abstraction of how things are supposed to happen. To this end, we created GDPRov, an OWL2 ontology that extends PROV-O and P-Plan (an extension of PROV-O) for modelling data-flows involving consent and data using relevant GDPR terminology. An example representation of the use-case is depicted in Fig 1 with its representation as RDF triples.
:User a gdprov:DataSubject, prov:Agent .
:AccountInformation rdfs:subClassOf gdprov:PersonalData .
:FirstName a :AccountInformation .
:LastName a :AccountInformation .
:Email a :AccountInformation .
:DOB a :AccountInformation .
:AccountSignUp a gdprov:DataStep ;
    dct:source :User ;
    gdprov:collectsData :AccountInformation ;
    gdprov:hasLegalBasis gdprtext:LegitimateInterest .
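To illustrate how extracted information could be turned into triples of this shape, the following Python sketch assembles the same statements from plain values. The function name and its parameters are hypothetical and not part of GDPRov or the paper; the output mirrors the example above and omits prefix declarations, which a complete Turtle document would need.

```python
def data_step_to_turtle(step, agent, category, fields, legal_basis):
    """Build Turtle statements for one data-collection step.

    Hypothetical helper for illustration only; the property and class names
    are copied from the example triples, and no prefixes are declared.
    """
    lines = [
        f":{agent} a gdprov:DataSubject, prov:Agent .",
        f":{category} rdfs:subClassOf gdprov:PersonalData .",
    ]
    # One statement per extracted data field, typed by its category.
    for field in fields:
        lines.append(f":{field} a :{category} .")
    # The step itself: who the data comes from, what is collected, and why.
    lines.extend([
        f":{step} a gdprov:DataStep ;",
        f"    dct:source :{agent} ;",
        f"    gdprov:collectsData :{category} ;",
        f"    gdprov:hasLegalBasis gdprtext:{legal_basis} .",
    ])
    return "\n".join(lines)


if __name__ == "__main__":
    print(data_step_to_turtle(
        step="AccountSignUp",
        agent="User",
        category="AccountInformation",
        fields=["FirstName", "LastName", "Email", "DOB"],
        legal_basis="LegitimateInterest",
    ))
```

String templating keeps the sketch free of dependencies; an RDF library would be the more robust choice in practice, since it handles prefixes, escaping, and validation.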
Easier representation of privacy policies
Approaches related to privacy preferences
In this paper, we presented our early-stage work on the identification, extraction, and representation of provenance metadata present in privacy policies. We describe our approach, which uses keyword-based entity extraction based on GDPR terms and concepts provided by the GDPRtEXT resource. This approach adopts the machine-learning model used by the UsablePrivacy project to create annotated privacy policies. We represent the extracted provenance metadata using GDPRov, which extends PROV-O and P-Plan, and allows for an abstract model of the policy to be represented. We describe the potential application of this work to augment several important topics related to privacy and data practices.
This work is supported by the ADAPT Centre for Digital Content Technology which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.