Modelling AI Training in DPV - Part 2
by Harshvardhan J. Pandit
is part of: Data Privacy Vocabulary (DPV)
DPV DPVCG semantic-web Working Note
In the previous post, I proposed AI training concepts categorised by technical details. In this post, I continue that discussion by exploring the feasibility of those concepts with specific use-cases and examples, and by examining the privacy / data protection implications of AI training.
Process or Processing?
In DPV, dpv:Process is a broad concept that allows combining different concepts into cohesive groups, whereas dpv:Processing represents specific operations over data (e.g. collect, store). This allows expressing information such as who performed the collection of data, for what purposes, for which data, etc. without focusing only on 'processing' or 'purpose' or 'data' -- but over their combination. Similarly, for ai:Training it would be advantageous if a similar arrangement were identified so that it can be indicated that the 'process' involved training, using specified data, for the stated purpose, etc. For dpv:Processing, the relation dpv:hasProcessing provides the link with dpv:Process. For training, no such relation currently exists.
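For context, here is a minimal sketch of the existing pattern, where dpv:hasProcessing associates a processing operation with a process (the ex: names are illustrative):

ex:SomeProcess a dpv:Process ;
    dpv:hasProcessing dpv:Collect ;
    dpv:hasPersonalData pd:Location ;
    dpv:hasPurpose dpv:ServiceProvision .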
If we consider training as operations over data, then it becomes a form of processing. At the broadest level, training is use of data. In practice, it may also involve other operations such as transformations to clean, anonymise, combine, etc., storing transformed data, organising and structuring it for training tasks, and separately obtaining or collecting this data. However, in our definition, we limit operations only to those that must always be present - which would be dpv:Use. Thus, if we define ai:Train as a processing operation that is a specific form of dpv:Use and dpv:Transform - then it means that AI training is an operation over data that uses it and transforms it. In this case, the transformation produces an ai:Model or more data.
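A sketch of what this proposed definition could look like in Turtle (the exact modelling is still under discussion):

ai:Train a skos:Concept, rdfs:Class ;
    rdfs:subClassOf dpv:Use, dpv:Transform ;
    skos:definition "Processing of data to train an AI model"@en .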
Based on the above, we have ai:Train as the top-level concept, and the taxonomy from the previous post then expands it with -Training concepts. The relation ai:hasTraining is used to associate these, and is a specialised version of dpv:hasProcessing, which means any time training is associated with a process, it implies data is being processed.
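Such a relation could be declared as follows - a sketch, where the dcam: domain and range annotations are assumptions for illustration, following the style used elsewhere in DPV:

ai:hasTraining a rdf:Property ;
    rdfs:subPropertyOf dpv:hasProcessing ;
    dcam:domainIncludes dpv:Process ;
    dcam:rangeIncludes ai:Training .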
The example below shows a process where supervised training takes place within the device for the purposes of personalised recommendations with the legal basis of consent. Note we don't explicitly describe the training itself, but the process that involves training. This also allows later swapping specific technologies with others without changing the abstract process.
ex:Process a dpv:Process ;
    dpv:hasPurpose dpv:ProvidePersonalisedRecommendations ;
    dpv:hasPersonalData pd:Location ;
    ai:hasTraining ai:SupervisedTraining ;
    dpv:hasLocation dpv:WithinDevice ;
    dpv:hasLegalBasis dpv:InformedConsent ;
    dpv:isImplementedUsingTechnology ex:SomeTech .
One could also directly specify this information with the training instance as the focal point, as shown in the example below -
ex:SomeTraining a ai:SupervisedTraining ;
    dpv:hasPurpose dpv:ProvidePersonalisedRecommendations ;
    dpv:hasPersonalData pd:Location ;
    dpv:hasLocation dpv:WithinDevice ;
    dpv:hasLegalBasis dpv:InformedConsent ;
    dpv:isImplementedUsingTechnology ex:SomeTech .
However, this design complicates use of information as it is not clear where personal data would be associated, and how to query it. With dpv:Process we have a consistent grouping of related concepts that is flexible to incorporate many relations and can be expanded and mixed in composite patterns as desired. Therefore, it is recommended to use dpv:Process, and it is not advised to directly annotate instances with information like this. The DPV Primer also lays out the case for this.
Data as Input of Training
Modelling training as processing in this manner also helps with legal compliance. For example, GDPR applies over the processing of personal data. The AI extension contains the relation ai:hasTrainingData which is a specialisation of tech:hasInputData and indicates the data is being used for training AI. This is a clearer indication of the role of data being for training than dpv:hasPersonalData. However, this relation does not directly state that the training uses personal data, therefore the data being used must be correctly categorised as personal data. This is shown in the example below.
ex:Process a dpv:Process ;
    dpv:hasPersonalData pd:Name ;           # clear category, unclear training role
    ai:hasTrainingData ex:Name ;            # clear training role, unclear category
    ai:hasTraining ai:SupervisedTraining .  # new prop

# Name must be declared as personal data to avoid above issue
ex:Name a dpv:PersonalData .
Therefore, whether personal data is involved in training can be indicated in two ways: using the explicit relation, or relying on the categorisation of personal data. For best practice, the categorisation takes precedence as it indicates the data is always personal data and does not limit its role to a specific process.
However, an important consideration in the use of ai:hasTrainingData is that its domain is declared as including ai:AI, which means either the dpv:Process is an instance of ai:AI or the domain scope should be expanded to also include dpv:Process. If a process is marked as 'AI' then it reflects the use of an 'AI based Process', which could be helpful in organising and discovering information. Whereas if the domain scope is expanded, then a process could specify involvement of training data but there would be no indication that AI is involved. Therefore, it is better to declare the process as using AI. Alternatively, we can also create the concept ai:AIProcess that says the same.
ex:Process a dpv:Process, ai:AI ;  # an AI process
    ai:hasTrainingData ex:Name ;
    ai:hasTraining ai:SupervisedTraining .

ai:AIProcess a skos:Concept, rdfs:Class ;
    rdfs:subClassOf dpv:Process, ai:AI .

ex:SomeAI a ai:AIProcess ;
    ai:hasTrainingData ex:Name ;
    ai:hasTraining ai:SupervisedTraining .

ex:SomeProcess a dpv:Process ;
    dpv:hasProcess ex:SomeAI ;  # valid - points to process
    ai:hasAI ex:SomeAI .        # also valid - points to 'AI'
This mixture of process and AI also has benefits in stating risks, involvement of entities, and other information while continuing the use of dpv:Process as before for cases that are describing abstract plans or processes. For specific concrete implementations, as ai:AI extends dpv:Technology, it can also be used with dpv:isImplementedUsingTechnology. Further, this also allows use of dpv:Process to describe an instance of ai:AI.
ex:SomeAI a ai:AI ;
    dpv:hasProcess [
        a ai:AIProcess ;
        ai:hasTrainingData pd:Name ;
        ai:hasTraining ai:SupervisedTraining ;
    ] .
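For the concrete-implementation case mentioned above, a minimal sketch could look like this (the ex: names are illustrative):

ex:SomeProcess a dpv:Process ;
    ai:hasTraining ai:SupervisedTraining ;
    dpv:isImplementedUsingTechnology ex:SomeAI .

ex:SomeAI a ai:AI .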
As before, it is tempting to directly use training instances to state this information, like so:
ex:SomeAI a ai:AI ;
    ai:hasTraining [
        a ai:SupervisedTraining ;
        ai:hasTrainingData pd:Name ;
    ] .
This should be okay if the goal is to model only the details about how the AI is structured. However, as mentioned before, this creates lots of different ways to model where personal data and other relevant information might be included, and can result in graphs that differ widely between use-cases. Therefore, the first method of using dpv:Process should be preferred, and if the second method is absolutely necessary, then the training instance should additionally be declared an instance of dpv:Process.
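A sketch of that fallback, with the training instance also typed as a process:

ex:SomeTraining a ai:SupervisedTraining, dpv:Process ;
    ai:hasTrainingData pd:Name ;
    dpv:hasPurpose dpv:ProvidePersonalisedRecommendations .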
Stating Limitations
To represent limitations or restrictions regarding AI training, there are two modes - explicit and implicit. In explicit mode, we state the restriction or limitation as a specific rule. In implicit mode, we state the restriction or limitation by stating the intended action without a rule. In the example below, the limitation of performing training only on the device is expressed in both forms, alongside the implicit use of only the name as personal data.
ex:SomeProcess a ai:AIProcess ;
    dpv:hasProcessing ai:Train ;
    ai:hasTrainingData pd:Name ;
    dpv:hasLocation dpv:WithinDevice ;  # implicit
    dpv:hasPermission [                 # explicit
        a dpv:Permission ;
        dpv:hasProcessing ai:Train ;
        dpv:hasLocation dpv:WithinDevice ;
    ] .
The implicit information is not enough to convey it as a limitation or a restriction as it only refers to the location of the process, and not a rule. By contrast, the rule clearly states the location as a permission (considering that only permitted activities can be carried out). Depending on the use-case and need for explicit limitations - the appropriate method can be used. DPV rules don't have the expressivity to state that processing outside the device is not permitted - which is where efforts such as ODRL should be explored and used.
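As a rough sketch of what such an ODRL rule could look like - the exact DPV-ODRL integration is an open question, and the action, asset, and operand choices here are assumptions:

ex:Policy a odrl:Set ;
    odrl:prohibition [
        a odrl:Prohibition ;
        odrl:action odrl:use ;            # assumed mapping for training use of data
        odrl:target ex:TrainingData ;     # hypothetical asset
        odrl:constraint [
            a odrl:Constraint ;
            odrl:leftOperand odrl:spatial ;
            odrl:operator odrl:neq ;      # prohibited anywhere other than...
            odrl:rightOperand dpv:WithinDevice ;
        ] ;
    ] .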
In the DPV example above, the use of ai:AIProcess and dpv:hasProcessing ai:Train also shows the utility of modelling AI training as a processing activity, as all governing mechanisms for how processing must take place (dpv:Technology), or where (dpv:Location) and when (dpv:Duration) it takes place, can be expressed within the process along with restrictions.
Training as an AI Technique
The AI extension already provides a taxonomy of techniques which includes ai:SupervisedLearning - which is a form of training. At the moment, it does not distinguish between ML techniques that are and are not training. Therefore, this needs to be improved. We should not directly put ai:Train or ai:Training as concepts under ai:MachineLearning because "training" can include other approaches beyond machine learning - such as adding facts to symbolic reasoning. So we have the option of adding ai:TrainingTechnique as a concept under the ai:Technique taxonomy, and then having a property ai:hasTrainingTechnique as a subproperty of ai:hasTechnique. This allows the use of the technique taxonomy and avoids the requirement to create a separate taxonomy just for training when the same concepts are used in both training and deployment phases.
ai:TrainingTechnique a skos:Concept, rdfs:Class ;
    rdfs:subClassOf ai:Technique .

ai:hasTrainingTechnique a rdf:Property ;
    rdfs:subPropertyOf ai:hasTechnique ;
    dcam:domainIncludes ai:AI ;
    dcam:rangeIncludes ai:Technique .

ex:SomeProcess a ai:AIProcess ;
    ai:hasTrainingTechnique ai:SupervisedLearning ;
    ai:hasTrainingData pd:Name .
However, in doing this, we lose the ability to indicate training as the processing of data. To rectify this, we could declare ai:TrainingTechnique as a subclass of dpv:Processing. Though it is unclear if ai:Technique itself should be a subclass -- as this would mean that all techniques require data processing, which is not true for example for symbolic techniques which operate without data. To avoid this kind of ambiguity, we can have the taxonomy as ai:DataProcessingTechnique which includes ML etc. which always require data, and which are defined as types of dpv:Processing, and ai:NonDataProcessingTechnique which include symbolic, etc. which do not always involve data -- they can, but not by definition. Then, when ai:hasTrainingTechnique is used, we can check whether the concept is part of the data processing or non-data processing taxonomy.
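A sketch of this split taxonomy (the placement of ai:MachineLearning is illustrative):

ai:DataProcessingTechnique a skos:Concept, rdfs:Class ;
    rdfs:subClassOf ai:Technique, dpv:Processing .

ai:NonDataProcessingTechnique a skos:Concept, rdfs:Class ;
    rdfs:subClassOf ai:Technique .

ai:MachineLearning rdfs:subClassOf ai:DataProcessingTechnique .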
Model is Output of Training
Training is a process through which an ai:Model is produced as the output. To express this, we can directly associate the model using ai:hasModel.
ex:SomeProcess a ai:AIProcess ;
    ai:hasTrainingData pd:Name ;
    ai:hasTrainingTechnique ai:SupervisedLearning ;
    ai:hasModel ex:SomeModel .
However, this approach may be ambiguous for cases where multiple models are involved - some as inputs, for example. To avoid this, we should explicitly use tech:hasInput and tech:hasOutput.
ex:SomeProcess a ai:AIProcess ;
    ai:hasTrainingData pd:Name ;
    ai:hasTrainingTechnique ai:SupervisedLearning ;
    tech:hasInput ex:EarlierModel ;  # model used as input
    tech:hasOutput ex:LaterModel .   # model produced by training
Technical Measures for Training
With the above exploration of dpv:Process and the creation of ai:AIProcess, this should be straightforward to express. The below example represents training done with anonymisation. If there are specific measures that apply to different parts (e.g. data, then process, then some oversight) -- these should be represented separately within subprocesses, as sketched after the example.
ex:SomeProcess a ai:AIProcess ;
    ai:hasTrainingData pd:Name ;
    ai:hasTrainingTechnique ai:SupervisedLearning ;
    dpv:hasTechnicalMeasure dpv:Anonymisation .
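For the subprocess case, a minimal sketch could look like the following - the choice of dpv:Anonymisation for the data subprocess and dpv:Audit for oversight is illustrative:

ex:SomeProcess a ai:AIProcess ;
    dpv:hasProcess [
        a dpv:Process ;
        ai:hasTrainingData pd:Name ;
        dpv:hasTechnicalMeasure dpv:Anonymisation ;  # applies to the data
    ] ;
    dpv:hasProcess [
        a dpv:Process ;
        ai:hasTraining ai:SupervisedTraining ;
        dpv:hasOrganisationalMeasure dpv:Audit ;     # oversight of the training
    ] .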
Notice for Training
A notice for training communicates that there will be training, stating what data will be involved, who will perform the training, etc. We have the concept dpv:AINotice for these kinds of scenarios. The below notice states that there will be training using names for providing personalised recommendations e.g. in spell checks, with the training occurring on device and based on informed consent.
ex:SomeNotice a dpv:AINotice ;
    dpv:hasProcess [
        a ai:AIProcess ;
        ai:hasTrainingData pd:Name ;
        dpv:hasPurpose dpv:ProvidePersonalisedRecommendations ;
        dpv:hasLocation dpv:WithinDevice ;
        dpv:hasLegalBasis dpv:InformedConsent ;
    ] .
The notice does not state that no data will leave the device (it also does not state that it will). This is not currently possible to express using DPV. There are a few ways to do this. First, we can create a concept called dpv:OutsideDevice and use a prohibition over it to say data cannot be transferred outside the device. However, this means we will need concepts for outside app, outside location, and so on. These concepts are useful only for expressing conditions such as this. The second approach is where we use rules, but create a new type called dpv:Restriction which means whatever the rule is expressing is restricted to only what is in the rule. For permissions and obligations, it means only their contents should be followed and anything outside of that is not permitted. For prohibitions, it means only the contents are prohibited and anything outside of them is not. This is part of a broader discussion around rule interpretation and applicability - and it will introduce complexity here. The third approach is to use the existing property dpv:hasProcessingCondition to state data is stored on device explicitly -- with the understanding that data won't go outside the device since that's not part of the condition. In terms of explicitness and ease of interpretation, the processing condition of method 3 should be preferred, and if it is to be expressed as a prohibition, then method 1 should be used in addition to it.
ex:SomeNotice a dpv:AINotice ;
    dpv:hasProcess [
        a ai:AIProcess ;
        ai:hasTrainingData pd:Name ;
        dpv:hasPurpose dpv:ProvidePersonalisedRecommendations ;
        dpv:hasLocation dpv:WithinDevice ;
        dpv:hasProhibition [                     # method 1
            a dpv:Prohibition ;
            dpv:hasProcessing dpv:Transfer ;
            dpv:hasLocation dpv:OutsideDevice ;  # new concept
        ] ;
        dpv:hasProcessingCondition [             # method 3
            dpv:hasProcessing dpv:Store ;
            dpv:hasLocation dpv:WithinDevice ;
        ] ;
        dpv:hasLegalBasis dpv:InformedConsent ;
    ] .
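For completeness, method 2 could look like the sketch below -- note that dpv:Restriction and dpv:hasRestriction do not exist in DPV and are purely hypothetical here:

ex:SomeNotice dpv:hasProcess [
    a ai:AIProcess ;
    dpv:hasRestriction [  # hypothetical rule type
        a dpv:Restriction ;
        dpv:hasProcessing dpv:Store ;
        dpv:hasLocation dpv:WithinDevice ;
    ] ;
] .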