Modelling AI Training in DPV
by Harshvardhan J. Pandit
is part of: Data Privacy Vocabulary (DPV)
Tags: DPV, DPVCG, semantic-web, Working Note
The DPV's AI extension provides concepts associated with AI technologies, including the use of data in processes such as training and validation, types of techniques such as machine learning, risks such as data bias, and AI technologies such as models and AI systems. From a practical perspective, there is a need to represent concepts associated with training, which are currently missing from this extension. A simple approach is to add training as a process through which a trained model is produced. However, the discussion and use-cases below show why training needs to be expressed through a richer taxonomy of concepts.
ISO/IEC 22989:2022 (which is free to access) defines training as a "process to determine or to improve the parameters of a machine learning model, based on a machine learning algorithm by using training data". This means the concept Training can be expressed as a dpv:Process which takes some tech:InputData and produces an ai:TrainedModel. Expressing training in this manner also provides a way to associate more contextual information, such as who performs the training (dpv:isImplementedByEntity), where it takes place (dpv:hasLocation), and whether it has a legal basis such as consent (dpv:hasLegalBasis).
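As a minimal sketch in Turtle, this could look as follows, where ex: is a placeholder namespace and the properties linking the process to its input data and resulting model (ex:hasInputData, ex:producesModel) are illustrative stand-ins, since the note does not fix which relations DPV would use:

```turtle
@prefix dpv:  <https://w3id.org/dpv#> .
@prefix ai:   <https://w3id.org/dpv/ai#> .
@prefix tech: <https://w3id.org/dpv/tech#> .
@prefix ex:   <https://example.com/ns#> .

# Training expressed as a process, with contextual information attached
ex:DiagnosisModelTraining a dpv:Process ;
    dpv:isImplementedByEntity ex:HospitalXYZ ;  # who performs the training
    dpv:hasLocation dpv:LocalLocation ;         # where it takes place
    dpv:hasLegalBasis dpv:Consent ;             # legal basis, e.g. consent
    ex:hasInputData [ a tech:InputData ] ;      # illustrative property
    ex:producesModel [ a ai:TrainedModel ] .    # illustrative property
```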
Training as a concept has implications for privacy as it uses data to create or enhance an AI model. These implications arise from the nature of the data (e.g. dpv:SensitiveData or new:ConfidentialData), where the training takes place (e.g. dpv:LocalLocation or dpv:RemoteLocation), under whose control it occurs (e.g. training is new:UserControlled), and whether the trained model is shared (e.g. new:ClosedWeightsModel or new:OpenWeightsModel). When creating policies, writing contracts, or recording audit logs, stating how the training takes place and what was permitted or prohibited is essential. Therefore, representing each of these cases as concepts (i.e. an interoperable vocabulary) is needed.
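As a sketch, such a context could be recorded as follows, reusing the prefixes from the earlier example; new: is a placeholder prefix for the concepts proposed in this note, and ex:producesModel is again an illustrative property:

```turtle
@prefix new: <https://example.com/proposed#> .  # placeholder for proposed concepts

# Recording the privacy-relevant context of a training run
ex:AssistantTraining a dpv:Process ;
    dpv:hasPersonalData dpv:SensitiveData ;         # nature of the data
    dpv:hasLocation dpv:RemoteLocation ;            # where training takes place
    dpv:isImplementedByEntity ex:CloudProvider ;    # under whose control
    ex:producesModel [ a new:ClosedWeightsModel ] . # whether the model is shared
```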
Further, training itself can be differentiated based on:
- new:TrainingByStrategy
  - new:SupervisedTraining that uses ai:SupervisedLearning with new:LabelledData - where contextual information involves the provenance of the labelled data, such as its source, who created the labels, and whether it is categorised as sensitive;
  - new:UnsupervisedTraining that uses ai:UnsupervisedLearning with new:UnlabelledData - where contextual information involves the provenance of the unlabelled data, such as its source;
  - new:ReinforcementTraining that uses ai:ReinforcementLearning with new:Feedback acting as new:Reward or new:Punishment - where contextual information involves the algorithm deciding the feedback;
  - new:SelfSupervisedTraining that uses new:UnlabelledData - where contextual information involves the provenance of the unlabelled data.
- new:TrainingByAdapting
  - new:TransferLearning where a trained model is reused for a new task in another model;
  - new:FineTuning where a trained model is refined using new data, in particular for a specific domain or use-case;
  - new:FewShotTraining where a trained model is given a few labelled data points to learn from - the sample being too small and not specific enough to be considered fine-tuning.
- new:TrainingByFrequency
  - new:StaticTraining where the model is trained once;
  - new:PeriodicTraining where the model is trained periodically;
  - new:ContinuousTraining where the model is trained continuously, e.g. as new data arrives;
  - new:IncrementalTraining where the model is trained in increments that are small and do not cause a full or significant retraining;
  - new:FederatedTraining where the model is trained in a federated manner, e.g. locally on devices.
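If adopted, these distinctions could be declared as a taxonomy in the usual DPV style, where narrower concepts point to their parent via skos:broader. The sketch below assumes a root concept new:Training, which the note does not explicitly name:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# Assumed root concept for the proposed training taxonomy
new:Training a skos:Concept ; skos:broader dpv:Process .

new:TrainingByStrategy    skos:broader new:Training .
new:SupervisedTraining    skos:broader new:TrainingByStrategy .
new:ReinforcementTraining skos:broader new:TrainingByStrategy .

new:TrainingByAdapting    skos:broader new:Training .
new:FineTuning            skos:broader new:TrainingByAdapting .

new:TrainingByFrequency   skos:broader new:Training .
new:FederatedTraining     skos:broader new:TrainingByFrequency .
```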
Other than training, AI models and systems are also augmented with other techniques which require data. For example, new:RAG, i.e. Retrieval-Augmented Generation, uses ai:InformationRetrieval to provide context for generating outputs. These techniques are also important to consider, as the information retrieved or generated as part of such systems can be transmitted to another entity as part of the process.
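A sketch of such a set-up, where the retrieved information leaves the organisation, could look as follows; ex:retrievesFrom and ex:usesContextFrom are illustrative properties:

```turtle
# Retrieval feeds context into generation; the retrieved text is then
# sent to an external model provider as part of the process
ex:DocumentRetrieval a dpv:Process, ai:InformationRetrieval ;
    ex:retrievesFrom ex:InternalWiki .          # illustrative property

ex:AnswerGeneration a dpv:Process, new:RAG ;
    ex:usesContextFrom ex:DocumentRetrieval ;   # illustrative property
    dpv:hasRecipient ex:ExternalModelProvider . # information is transmitted
```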
Information about a model's "open-ness" is also relevant for making decisions and for understanding the nature of the model and its implications for privacy - for example, whether the trained model will be open and accessible to others or will remain proprietary. An enumeration of these is as follows:
- new:OpenWeights - weights are publicly available;
- new:ClosedWeights - weights are not publicly available;
- new:PartiallyOpenWeights - weights are not publicly available in general but can be accessed in specific contexts;
- new:OpenSource - the model can be used, studied, modified, and shared as defined by the OSI, which includes making available information on the data used to train the model, the source code, and the parameters/weights;
- new:ClosedSource - the model is not open source.
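Such an enumeration could then be attached to a model so that policies and decisions can refer to it; ex:hasOpenness is an illustrative property:

```turtle
# Annotating a model's openness for use in policies and decisions
ex:DiagnosisModel a ai:TrainedModel ;
    ex:hasOpenness new:PartiallyOpenWeights . # accessible only in specific contexts
```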
Based on the above, the following potential use-cases can be envisioned:
- A patient agrees to let their medical records be used by a hospital for training of AI models, including any retraining and fine-tuning, but only if the models are developed by the hospital and released as open source so that they benefit all researchers.
- A user agrees for the contacts on their phone to be used to fine-tune a model, but only if the training occurs locally (so no contacts are shared outside the device) and only if the model will be executed locally (on device).
- A customer allows their purchase history to be used to train a model and for it to be used to recommend products, but only if they can indicate whether the recommendations are useful or not (via reinforcement learning).
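As a sketch, the first use-case could be expressed using DPV's rule concepts as a permission over the training process; ex:appliesToProcess and ex:producesModel remain illustrative properties:

```turtle
# The patient's consent permits fine-tuning by the hospital only if
# the resulting model is released as open source
ex:PatientConsent a dpv:Consent ;
    dpv:hasPermission [
        a dpv:Permission ;
        ex:appliesToProcess [                   # illustrative property
            a new:FineTuning ;
            dpv:hasPersonalData ex:MedicalRecords ;
            dpv:isImplementedByEntity ex:Hospital ;
            ex:producesModel [ a new:OpenSource ]
        ]
    ] .
```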