Modelling AI Training in DPV

Notes analysing AI training concepts for use with DPV
published:
by Harshvardhan J. Pandit
is part of: Data Privacy Vocabulary (DPV)
DPV DPVCG semantic-web Working Note

The DPV's AI extension provides concepts associated with AI technologies, including the use of data in processes such as training and validation, types of techniques such as machine learning, risks such as data bias, and AI technologies such as models and AI systems. From a practical perspective, there is a need to represent concepts associated with training, which are currently missing from this extension. A simple approach is to add training as a process through which a trained model is produced. However, consider the following use-cases to understand why training as a concept needs to be expressed with a greater range of concepts in a taxonomy.

ISO/IEC 22989:2022 (which is free to access) defines training as a "process to determine or to improve the parameters of a machine learning model, based on a machine learning algorithm by using training data". This means the concept Training can be expressed as a dpv:Process which takes some tech:InputData and produces a ai:TrainedModel. Expressing training in this manner also provides a way to associate more contextual information such as who performs the training (dpv:isImplementedByEntity), where it takes place (dpv:hasLocation), and whether it has a legal basis such as consent (dpv:hasLegalBasis).
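As a sketch, this could be expressed in Turtle as follows. The ex: terms are illustrative placeholders, and the properties ex:hasInputData and ex:produces are assumptions standing in for whatever properties the tech extension provides to relate a process to its input data and output:

```turtle
@prefix dpv:  <https://w3id.org/dpv#> .
@prefix ai:   <https://w3id.org/dpv/ai#> .
@prefix tech: <https://w3id.org/dpv/tech#> .
@prefix ex:   <https://example.com/ns#> .

ex:ModelTraining a dpv:Process ;
    # ex:hasInputData and ex:produces are placeholders - check the
    # tech extension for the published properties
    ex:hasInputData ex:PatientRecords ;
    ex:produces ex:DiagnosisModel ;
    dpv:isImplementedByEntity ex:HospitalAILab ;
    dpv:hasLocation ex:HospitalDataCentre ;
    dpv:hasLegalBasis dpv:Consent .

ex:PatientRecords a tech:InputData .
ex:DiagnosisModel a ai:TrainedModel .
```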

Training as a concept has implications in terms of privacy as it uses data to create or enhance an AI model. These implications depend on the nature of the data (e.g. dpv:SensitiveData or new:ConfidentialData), where the training takes place (e.g. dpv:LocalLocation or dpv:RemoteLocation), under whose control it occurs (e.g. training is new:UserControlled), and whether the trained model is shared (e.g. new:ClosedWeightsModel or new:OpenWeightsModel). When creating policies, writing contracts, or recording audit logs, stating how the training takes place and what was permitted or prohibited is essential. Therefore, representing each of these cases as concepts (i.e. an interoperable vocabulary) is needed.
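A minimal audit-log sketch combining these contextual concepts might look as follows. Here dpv:hasPersonalData is assumed as the property linking the process to the data involved, and ex:hasController and ex:produces are hypothetical placeholders:

```turtle
@prefix dpv: <https://w3id.org/dpv#> .
@prefix new: <https://example.com/new#> .
@prefix ex:  <https://example.com/ns#> .

# audit-log entry: training used sensitive data, happened locally,
# under user control, and the resulting model is not shared
ex:TrainingRecord-42 a dpv:Process ;
    dpv:hasPersonalData dpv:SensitiveData ;
    dpv:hasLocation dpv:LocalLocation ;
    ex:hasController new:UserControlled ;   # placeholder property
    ex:produces ex:Model-42 .               # placeholder property

ex:Model-42 a new:ClosedWeightsModel .
```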

Further, training itself can be differentiated based on:

  1. new:TrainingByStrategy
    1. new:SupervisedTraining that uses ai:SupervisedLearning with new:LabelledData - where contextual information involves the provenance of the labelled data, such as its source, who created the labels, and whether it is categorised as sensitive;
    2. new:UnsupervisedTraining that uses ai:UnsupervisedLearning with new:UnlabelledData - where contextual information involves provenance of unlabelled data such as its source;
    3. new:ReinforcementTraining that uses ai:ReinforcementLearning with new:Feedback that acts as new:Reward or new:Punishment - where contextual information involves the algorithm that decides the feedback;
    4. new:SelfSupervisedLearning that uses new:UnlabelledData - where contextual information involves provenance of unlabelled data.
  2. new:TrainingByAdapting
    1. new:TransferLearning where a trained model is reused for a new task in another model;
    2. new:FineTuning where a trained model is refined using new data - in particular for a specific domain or use-case;
    3. new:FewShotTraining where a trained model is given a few labelled data points to learn from - where the sample is small and not specific enough to be considered fine-tuning.
  3. new:TrainingByFrequency
    1. new:StaticTraining where the model is trained once;
    2. new:PeriodicTraining where the model is trained periodically;
    3. new:ContinuousTraining where the model is trained continuously, e.g. as new data arrives;
    4. new:IncrementalTraining where the model is trained in increments that are small and do not cause a full or significant retraining;
    5. new:FederatedTraining where the model is trained in a federated manner, e.g. locally on device.
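The taxonomy above could be declared as in the following sketch, using rdfs:subClassOf for brevity (DPV itself also expresses its hierarchies with skos:broader). Note that the three dimensions are orthogonal, so a single activity can be typed along several of them at once:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dpv:  <https://w3id.org/dpv#> .
@prefix new:  <https://example.com/new#> .
@prefix ex:   <https://example.com/ns#> .

new:Training rdfs:subClassOf dpv:Process .

# three orthogonal dimensions for describing training
new:TrainingByStrategy  rdfs:subClassOf new:Training .
new:TrainingByAdapting  rdfs:subClassOf new:Training .
new:TrainingByFrequency rdfs:subClassOf new:Training .

new:SupervisedTraining rdfs:subClassOf new:TrainingByStrategy .
new:FineTuning         rdfs:subClassOf new:TrainingByAdapting .
new:FederatedTraining  rdfs:subClassOf new:TrainingByFrequency .

# dimensions combine: one activity, several types
ex:OnDeviceTuning a new:FineTuning, new:FederatedTraining .
```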

Other than training, AI models and systems are also augmented with other techniques that require data. For example, new:RAG, i.e. Retrieval-Augmented Generation, where ai:InformationRetrieval is used to provide context for generating outputs. These are also important to consider, as the information retrieved or generated as part of such systems can be transmitted to another entity as part of the process.
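A sketch of such a system, where the properties ex:usesTechnique and ex:retrievesFrom are hypothetical placeholders, could be:

```turtle
@prefix dpv: <https://w3id.org/dpv#> .
@prefix ai:  <https://w3id.org/dpv/ai#> .
@prefix new: <https://example.com/new#> .
@prefix ex:  <https://example.com/ns#> .

ex:SupportAssistant a new:RAG, dpv:Process ;
    ex:usesTechnique ai:InformationRetrieval ;   # placeholder property
    ex:retrievesFrom ex:InternalKnowledgeBase ;  # placeholder property
    # retrieved context may leave the organisation if generation
    # is delegated to an external provider
    dpv:hasRecipient ex:ExternalModelProvider .
```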

Information about the model in terms of its "open-ness" is also relevant for making decisions and for understanding the nature of the model and its implications for privacy. For example, whether the trained model will be open and accessible to others, or will be proprietary. An enumeration of these is as follows:

  1. new:OpenWeights - weights are publicly available
  2. new:ClosedWeights - weights are not publicly available
  3. new:PartiallyOpenWeights - weights are not publicly available but can be accessed in specific contexts
  4. new:OpenSource - it can be used, studied, modified, and shared as defined by the OSI which includes information on data used to train the model, source code, and parameters/weights
  5. new:ClosedSource - it is not open source
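Annotating a model with its open-ness could then be as simple as the following sketch, where ex:hasOpenness is a placeholder for whatever property the extension settles on:

```turtle
@prefix ai:  <https://w3id.org/dpv/ai#> .
@prefix new: <https://example.com/new#> .
@prefix ex:  <https://example.com/ns#> .

ex:DiagnosisModel a ai:TrainedModel ;
    ex:hasOpenness new:PartiallyOpenWeights .  # placeholder property
```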

Based on the above, the following potential use-cases can be envisioned:

  1. A patient agrees to let their medical records be used by a hospital for training AI models, including any retraining and fine-tuning, but only if the models are developed by the hospital and released as open source so that all researchers benefit.
  2. A user agrees to let the contacts on their phone be used to fine-tune a model, but only if the training occurs locally (so no contacts leave the device) and only if the model will be executed locally (on device).
  3. A customer allows their purchase history to be used to train a model and for that model to recommend products, but only if they can indicate whether the recommendations are useful (via reinforcement learning).
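The second use-case, for instance, could be sketched by typing the training activity along the taxonomy above and constraining both training and execution to the device. Here dpv:hasPersonalData is assumed as the linking property, and ex:PhoneContacts and ex:ModelExecution are illustrative:

```turtle
@prefix dpv: <https://w3id.org/dpv#> .
@prefix new: <https://example.com/new#> .
@prefix ex:  <https://example.com/ns#> .

# consent is conditional on training and execution staying on device
ex:ContactsTuning a new:FineTuning, dpv:Process ;
    dpv:hasPersonalData ex:PhoneContacts ;
    dpv:hasLocation dpv:LocalLocation ;
    dpv:hasLegalBasis dpv:Consent .

ex:ModelExecution a dpv:Process ;
    dpv:hasLocation dpv:LocalLocation .
```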