Student Project Ideas

Ideas for students to implement as projects
by Harshvardhan J. Pandit
academic projects students


Recording online consent via browser extension

#privacy #consent #javascript #GDPR

Description: The "I Agree" button has become inescapable while browsing the web. While it is present as a legal requirement for collecting consent, once we have clicked the button, we have no record of what we just agreed to. In this project, you will be creating a digital receipt to record the given consent and the information associated with it.

The goal is to create a browser extension that automatically captures the information in a consent dialogue box, and enables the user to later view it in a dashboard. It will use existing standards such as Consent Receipt [1] and Data Privacy Vocabulary [2] to record this information.

This project will provide exposure to front-end development on real-world websites, and an opportunity to increase transparency online regarding privacy. It will also provide a learning experience in the use of programming tools (e.g. git) and research-based workflows.

Pre-requisites: Good working knowledge of JavaScript/CSS and their use in web pages

[1] Consent Receipt
[2] DPV

Recording online agreement via browser extension

#privacy #privacy-policy #javascript #GDPR

Agreeing to a privacy policy legally presumes that the person has read it, but few take the time or opportunity to do so. Once the button is clicked, it conveys consent to the attached privacy policy, yet the person is offered neither the opportunity to record what they have agreed to nor any proof of it. In this project, you will be creating a digital receipt to record the acceptance of a privacy policy by capturing the context in which the button was clicked as well as the privacy policy associated with it.

The goal is to create a browser extension that can assist the person in recording their agreements and the related policies by saving them as a 'notice receipt' within the browser. The saved receipts can then be viewed and interacted with via a dashboard. The project will use existing vocabularies such as the DPV [1] and Notice Receipt Schema [2] to record the information.

This project will provide exposure to front-end development on real-world websites, as well as an opportunity to work towards increased transparency and accountability in online transactions involving privacy. The project will also provide a learning experience in the use of programming tools (e.g. git) and in being involved in a research project.

[1] DPV
[2] OPN: Open Notice Receipt Schema

Privacy Policy Generator

#privacy #privacy-policy #python #GDPR

Description: Privacy Policies are too long, difficult to read, and complex to understand. There is a lot of ongoing research for using AI to make privacy policies easier to comprehend by identifying relevant information. This project approaches the problem from the other end - generating privacy policies for given information. Such a tool is useful for researchers to investigate the effect of layouts, language used, as well as for organisations to create simpler and more effective policies.

The goal would be to accept information in a structured format (e.g. via CSV, JSON, or an old-fashioned form) and generate privacy policies for different use-cases (e.g. online website, shopping store). It will use existing vocabularies to represent the information, including Data Privacy Vocabulary [1], GDPRov [2], and GDPRtEXT [3].
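As a rough illustration of the idea, the following minimal sketch fills a per-use-case template from a JSON record using Python's standard `string.Template`. The field names (`controller`, `data_categories`, and so on), the two templates, and the function name `generate_policy` are hypothetical and chosen only for this example; they are not taken from DPV, GDPRov, or GDPRtEXT.

```python
import json
from string import Template

# Hypothetical templates, one per use-case. Real wording would need legal review.
TEMPLATES = {
    "website": Template(
        "$controller collects $data_categories for the purpose of $purpose. "
        "The legal basis is $legal_basis and data is kept for $storage_period."
    ),
    "shop": Template(
        "When you shop with $controller, we use $data_categories for $purpose. "
        "We rely on $legal_basis and keep this data for $storage_period."
    ),
}


def generate_policy(record_json: str, use_case: str) -> str:
    """Fill the template for `use_case` with the fields of a JSON record."""
    record = json.loads(record_json)
    # Lists (e.g. several data categories) are joined into readable text.
    fields = {
        key: ", ".join(value) if isinstance(value, list) else str(value)
        for key, value in record.items()
    }
    # substitute() raises KeyError if the record lacks a field the template needs.
    return TEMPLATES[use_case].substitute(fields)


# Made-up input record for illustration.
example = json.dumps({
    "controller": "Example Ltd",
    "data_categories": ["email address", "postal address"],
    "purpose": "order delivery",
    "legal_basis": "contract",
    "storage_period": "two years",
})

print(generate_policy(example, "website"))
```

Keeping the templates separate from the data is what would let a researcher swap layouts or wording while holding the underlying information constant.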

The project will provide exposure to text-based programming techniques such as templating engines, as well as to the state of privacy online. It will also provide a learning experience in the use of programming tools (e.g. git) and research-based workflows.

Pre-requisites: Good working knowledge of Python, Web Development (JavaScript/CSS)

[1] DPV
[2] GDPRov
[3] GDPRtEXT

Assisting with Ethical Clearance in Universities

#ethics #javascript

Description: Modern organisations must take GDPR concerns into account when collecting data, and in universities this role is taken by the Ethics committee and its sub-committees. This project will develop a new stand-alone web-based tool to help students complete Ethics/GDPR applications for their projects. The plan is to create a highly usable web interface that simplifies the current form and builds on known patterns of use, e.g. a simple survey, to guide students through the form generation process. The outcome will be a portable JSON object that expresses the student’s wishes in a machine-readable way for further stages in an approval workflow.

The project involves work on building a web dashboard for assisting researchers with documentation for data protection and ethical clearance, and will provide an opportunity to participate in real-world applications of technology in the areas of data protection and ethics. The chief task would be to build an information system that will allow users to receive suggestions and guidance for addressing risks regarding ethics and privacy.

Extracting Structured Metadata From Privacy Policies

#privacy #privacy-policy #NLP #ML

Privacy policies are notoriously difficult to read. One of the challenges is the use of legal and intentionally obfuscating language. Though the GDPR has made it a legal requirement to use clear language in policies, there is still a barrier to effective transparency regarding the information presented in them. This project aims to extract information from the policy - such as sources of data, their requirement in processes, legal basis, and storage periods - and express it as structured metadata for use in research that aims to simplify privacy policies via techniques such as summarisation and visualisation.

The project goal is to use NLP techniques to identify relevant information by using classifiers and ML [1][2][3], which would enable extraction of information from the text of privacy policies, and to represent it using vocabularies such as DPV [4] and GDPRov [5].

[1] Usable Privacy Project
[2] Pribot
[4] DPV
[5] GDPRov

Extracting Structured Metadata from Consent Dialogues

#privacy #consent #NLP #ML

Consent dialogue boxes are everywhere on the web, with the information geared towards making it easy for users to comprehend how their personal data is being used. However, this information is presented in a human-readable format with no way for machines to analyse it. The aim of this project is to extract this information and represent it as structured metadata to enable automation and analysis in privacy-based approaches. For example, the statement "We use your address to deliver goods you buy on our website" can be represented by: address - personal data, deliver goods - purpose, use - processing.

The project goal is to use NLP techniques to identify such categories by using classifiers and ML similar to existing work regarding privacy policies [1][2][3]. The extracted information would then be represented using vocabularies such as DPV [4].
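To make the target representation concrete, here is a minimal sketch that maps the example statement above to its three categories. It uses a hand-written keyword lexicon as a stand-in for the classifiers the project would actually train; the lexicon contents and the function name `annotate` are hypothetical, and the category labels are plain strings rather than DPV terms.

```python
import re

# Hypothetical keyword lexicon. A real system would learn these associations
# with NLP classifiers instead of listing them by hand.
LEXICON = {
    "personal data": ["address", "email", "phone number"],
    "purpose": ["deliver goods", "send newsletters", "personalise ads"],
    "processing": ["use", "collect", "share", "store"],
}


def annotate(statement: str) -> dict[str, list[str]]:
    """Return the lexicon terms found in `statement`, grouped by category."""
    text = statement.lower()
    found = {category: [] for category in LEXICON}
    for category, terms in LEXICON.items():
        for term in terms:
            # Whole-word match so that "use" does not match inside "user".
            if re.search(r"\b" + re.escape(term) + r"\b", text):
                found[category].append(term)
    return found


result = annotate("We use your address to deliver goods you buy on our website")
print(result)
# {'personal data': ['address'], 'purpose': ['deliver goods'], 'processing': ['use']}
```

A keyword lookup like this breaks down quickly on paraphrases ("we rely on your location"), which is the gap the ML-based approach is meant to close.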

[1] Usable Privacy Project
[2] Pribot
[4] DPV

Categorising News by Topic

#news #NLP #ML

When a new product is launched, such as the latest Apple iPhone, the news is dominated by articles targeting that one product. This project investigates combining such related news into one coherent summary or article for better presentation to the reader. It will use NLP to identify categories and topics in a news article, and combine related news articles together in a corpus. It will then extract novel details from different articles and combine the results into a single summary.

Finding Related News Articles For Issues on Privacy

#news #NLP #ML

Every issue raised about privacy has several related news articles which offer evidence or commentary on it. These are often isolated, and collecting them requires significant effort and time. The aim of this project is to assist in finding relevant news articles for a given privacy issue. For example, given the topic 'cloud data storage', news articles of relevance include those about 'data breaches', 'cloud security', and 'legal obligations in a jurisdiction'. The identification requires creating a taxonomy of privacy issues and using NLP to identify and categorise news articles based on concepts within the taxonomy.

Scoring Privacy Policies For Transparency and Readability

#privacy #privacy-policy #NLP #ML

Privacy policies are notoriously difficult to read and understand, chiefly because of the obfuscated legal language used and the confusing structure. Though the GDPR has striven to provide more transparency in the language used, there is no established way to measure or evaluate such policies. The aim of this project is to identify metrics for transparency and readability in a privacy policy and to score a given policy using them. An example metric could be categorisation of information, where the policy has separate sections explaining data collection, sharing, etc. The project will use NLP to identify relevant clauses in the text of the policy, and ML to classify the policy using the generated metrics.
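The categorisation-of-information metric mentioned above could be approximated, as a first rough sketch, by checking which expected topics have a dedicated heading in the policy. The list of expected topics, the keywords, and the function name `coverage_score` are assumptions made for this illustration; choosing and validating the real metrics is part of the project.

```python
# Hypothetical topics a transparent policy might address in separate sections,
# each with a few keywords that could appear in a section heading.
EXPECTED_SECTIONS = {
    "data collection": ["collect", "information we gather"],
    "data sharing": ["share", "third part"],
    "storage period": ["retain", "retention", "storage period"],
    "legal basis": ["legal basis", "legitimate interest", "consent"],
}


def coverage_score(headings: list[str]) -> float:
    """Fraction of expected topics that have a matching section heading."""
    lowered = [heading.lower() for heading in headings]
    covered = 0
    for keywords in EXPECTED_SECTIONS.values():
        if any(keyword in heading for heading in lowered for keyword in keywords):
            covered += 1
    return covered / len(EXPECTED_SECTIONS)


headings = ["What we collect", "Who we share data with", "Contact us"]
print(coverage_score(headings))  # 0.5
```

A score like this only captures structure; it would need to be combined with language-level measures before it says much about readability.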

autoDIXIT: Generating Clues for DIXIT from Image Analysis

#ML #NLP #ImageAnalysis #python

DIXIT is a game where cards containing images are displayed to the players along with a clue and the players have to identify the correct card matching the clue. The aim of the project is to generate such clues for a given set of images using image analysis. This involves identification of the contents of an image, matching it with popular trivia such as a song or a movie, and generating the clue using NLP.

Generating a Privacy Policy Corpus

#python #privacy #privacy-policy

The analysis of privacy on the web is highly dependent on the privacy policies made available on websites. However, such policies have neither a fixed address nor a fixed structure. Therefore, research involving them first needs to identify the privacy policies and collect them, which is a time-consuming task. This project will create a web crawler that identifies and saves privacy policies from websites. The crawler will generate a corpus of privacy policies from the web by identifying the URL of a privacy policy in a given website, extracting and cleaning its text, and saving it in an interoperable format for future use.
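The first step, identifying the URL of a privacy policy on a page, might look something like the sketch below. It uses only the standard library and works on an HTML string, so fetching pages, cleaning the text, and saving the corpus are left out. The heuristic (the word "privacy" in the link text or target) and the names `PolicyLinkFinder` and `find_policy_links` are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PolicyLinkFinder(HTMLParser):
    """Collects links whose text or target mentions 'privacy'."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.candidates: list[str] = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).lower()
            if "privacy" in label or "privacy" in self._href.lower():
                # Resolve relative links against the page URL.
                self.candidates.append(urljoin(self.base_url, self._href))
            self._href = None


def find_policy_links(html: str, base_url: str) -> list[str]:
    finder = PolicyLinkFinder(base_url)
    finder.feed(html)
    return finder.candidates


# Made-up page fragment for illustration.
page = (
    '<footer><a href="/about">About</a> '
    '<a href="/legal/privacy-policy">Privacy Policy</a></footer>'
)
print(find_policy_links(page, "https://example.com/"))
# ['https://example.com/legal/privacy-policy']
```

A real crawler would also have to handle policies in other languages and links labelled differently (e.g. "Data Protection"), which this heuristic misses.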

Summarising Event Coverage from Tweets

#twitter #NLP #ML

It has become nearly customary to tweet when an event is going on. Such tweets provide coverage of the event and offer valuable crowd-sourced insights. The aim of the project is to collect and analyse tweets for a given event. An example could be a conference, where attendees tweet about the speakers, venue, food, coffee, and also their opinions. Therefore, the project will also involve identification and categorisation of the tweets along these topics. These tweets will then be summarised in a single article offering an overview of the event.

Evaluating TRI over Tweets

#twitter #NLP #ML

Temporal Random Indexing (TRI) [1] is a technique that maps words into vectors (word spaces) along different time periods to enable analysis of how the meaning of words changes over time. This project will apply TRI to a selective corpus of tweets to identify how words and concepts change in use over a popular social platform. An application is identification of culturally relevant words or topics which spike in usage over a period by being associated with different contexts, such as the use of words in Twitter trends [2][3].

[1] "Analysing Word Meaning over Time by Exploiting Temporal Random Indexing" P. Basile, A. Caputo, and G. Semeraro. CLiC-it 2014
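The following is a simplified sketch of the random indexing idea behind TRI, not the implementation from [1]. Each word receives a fixed sparse random vector; a word's vector for a time period is the sum of the random vectors of the words that co-occur with it in that period. Because the same random vectors are reused in every period, the resulting word spaces can be compared directly. The dimensionality, window size, tokenisation, and the toy tweets are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict

DIM = 200      # dimensionality of the word space (assumed value)
NONZERO = 10   # number of +1/-1 entries in each random vector (assumed value)


def random_index_vector(word: str) -> list[int]:
    """Sparse ternary vector, seeded by the word so it is the same in every period."""
    rng = random.Random(word)
    vector = [0] * DIM
    for position in rng.sample(range(DIM), NONZERO):
        vector[position] = rng.choice((-1, 1))
    return vector


def build_space(tweets: list[str], window: int = 2) -> dict[str, list[int]]:
    """Build one word space from the tweets of a single time period."""
    space = defaultdict(lambda: [0] * DIM)
    for tweet in tweets:
        tokens = tweet.lower().split()
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j == i:
                    continue
                # Add the random vector of each neighbouring word.
                for k, value in enumerate(random_index_vector(tokens[j])):
                    space[word][k] += value
    return space


def cosine(a: list[int], b: list[int]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up tweets for two periods, purely for illustration.
period_a = ["corona beer on the beach", "cold corona beer tonight"]
period_b = ["corona virus cases rising", "corona virus lockdown extended"]

space_a = build_space(period_a)
space_b = build_space(period_b)
similarity = cosine(space_a["corona"], space_b["corona"])
print(f"similarity of 'corona' across periods: {similarity:.2f}")
```

A low similarity between a word's vectors in two periods suggests that its typical contexts have shifted, which is the kind of signal the project would look for in trending words.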

Unfolding summarised text to generate articles

#NLP #ML #python

Text analysis techniques can now summarise an article with satisfactory quality. The advantages of summarised text are its short reading time and its concentration of important details. This project investigates the creation of larger articles from a given summary by identifying the context of the article and retrieving additional information about it from the web. This requires the use of NLP to identify topics of relevance within the summary, and ML to retrieve relevant information from an existing corpus or source such as news articles.

Crumple: Folding Privacy Policies via Summaries

#privacy #privacy-policy #python #NLP

Privacy policies can be made easier to digest if they are provided as concise summaries which users can read and understand quickly. This project will attempt to assist in the understanding of a privacy policy by abstracting or folding larger sections into shorter summaries. This will be done by analysing the text using NLP and identifying relevant information to provide a summary.
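One simple way to fold a section, shown here only as a baseline sketch, is frequency-based extractive summarisation: score each sentence by how often its words occur in the section and keep the top-scoring ones. This is a stand-in technique chosen for illustration, not the method the project prescribes; the stopword list, the sentence splitter, and the function name `summarise` are assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "we", "you", "your",
             "our", "with", "by", "can", "in", "is", "for"}


def content_words(text: str) -> list[str]:
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarise(text: str, max_sentences: int = 1) -> str:
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(content_words(text))

    def score(sentence: str) -> float:
        words = content_words(sentence)
        # Average word frequency, so long sentences are not favoured by length alone.
        return sum(frequency[w] for w in words) / len(words) if words else 0.0

    best = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in best)


# Made-up policy section for illustration.
policy_text = (
    "We collect your email address. "
    "We share your email address with advertising partners. "
    "You can contact us by post."
)
summary = summarise(policy_text, max_sentences=1)
print(summary)
```

In a dashboard, each folded summary could expand back to the original section on demand, so no information is hidden from the reader.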

Using AI to detect AI biases

#ethics #ML #python

Bias in the use of AI is a rising cause of concern and one of the major ethical hurdles in adoption of new technologies. While bias can occur in many forms, one specific form is skewing the outcome with a positive or negative bias - such as reducing the likelihood of women getting jobs because of their sex. While such biases are difficult to detect and address due to their invisibility and the complexities of a system, one possible answer would be to represent a system as a black box which takes an input and produces some output. This project investigates the approach of detecting biases in an algorithm by providing inputs and measuring the statistical likelihood of outcomes. The outcome of the project will be a tool for analysing system logs consisting of inputs and outputs and detecting potential biases encoded within them.
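A minimal sketch of the black-box idea follows: given a log of inputs and outputs, compute the rate of positive outcomes for each value of an attribute and compare the rates. The log format, the field names, and the functions `outcome_rates` and `disparity_ratio` are hypothetical, and the log entries are made up for illustration.

```python
from collections import defaultdict


def outcome_rates(log: list[dict], attribute: str) -> dict[str, float]:
    """Rate of positive outputs for each value of `attribute` in the inputs."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for entry in log:
        group = entry["input"][attribute]
        totals[group] += 1
        if entry["output"]:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def disparity_ratio(rates: dict[str, float]) -> float:
    """Lowest group rate divided by the highest; 1.0 means equal rates."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest else 1.0


# Made-up system log: each entry records an input and a yes/no decision.
log = [
    {"input": {"sex": "F", "experience": 5}, "output": False},
    {"input": {"sex": "F", "experience": 7}, "output": True},
    {"input": {"sex": "F", "experience": 3}, "output": False},
    {"input": {"sex": "F", "experience": 6}, "output": False},
    {"input": {"sex": "M", "experience": 5}, "output": True},
    {"input": {"sex": "M", "experience": 7}, "output": True},
    {"input": {"sex": "M", "experience": 3}, "output": False},
    {"input": {"sex": "M", "experience": 6}, "output": True},
]

rates = outcome_rates(log, "sex")
print(rates)                   # {'F': 0.25, 'M': 0.75}
print(disparity_ratio(rates))  # about 0.33
```

A ratio well below 1.0 only flags a difference in outcomes. Deciding whether it reflects bias would also require controlling for other inputs (here, `experience`) and testing whether the difference is statistically significant for the sample size.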

Cookie Monster: Detecting Privacy-invasive cookies in browser

#privacy #javascript

Cookies are tiny packets of data stored in the browser for a variety of reasons - from preserving your login session to saving your form data and shopping cart. However, they have also been used nefariously to track users across the web, which has given them a negative reputation. Laws such as the ePrivacy Directive and GDPR require such cookies to be transparent about their purpose, mainly through the cookie notice. However, the user has neither the technological capabilities nor the expertise to inspect whether such notices are consistent with the actual use of cookies. This project will target user privacy in a browser by providing a way for users to analyse the different cookies stored and investigate their applications and purposes. This will involve building a browser extension that collects cookies, analyses them to identify known trackers and bad actors, and produces a report in a dashboard for the user.

Attaching Risk Factors to Consent Dialogues

#privacy #ethics #consent #javascript

The notion of informed consent in GDPR requires the individual to also be notified about potential risks associated with the processing of their personal data. However, consent dialogues contain only information associated with the use of personal data and do not provide any information about the risks of sharing that information. The aim of this project is to attach risks to the appropriate information within a consent dialogue to enable the user to make a balanced judgement about their consent and the use of their personal data. The project involves detecting categories of information within a consent dialogue and associating them with an existing corpus of privacy risks by visually annotating the consent dialogue.

Python Engine for Prose Objects

Prose Objects [1], part of the Common Accord project [2], are a form of template for legal documents that provides a consistent and machine-readable representation for policies and contracts. The aim is to assist in the creation of legal documents in a consistent manner. In this approach, a legal document is represented as a form, which can be populated using metadata defined externally, such as in a JSON file. This project involves the construction of a Python engine for generating prose documents. This will involve reading metadata from files in JSON and Markdown format, and applying them to legal documents represented using template systems such as Jinja2.
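As a rough sketch of what the core of such an engine might do, the code below reads metadata from JSON and substitutes it into a document, expanding placeholders that themselves contain placeholders. The `{Key}` placeholder syntax and the dotted key names are illustrative assumptions rather than the actual Prose Objects format, and plain regular-expression substitution stands in for a template system such as Jinja2.

```python
import json
import re

# Hypothetical placeholder syntax: {Key}, where Key may contain dots.
PLACEHOLDER = re.compile(r"\{([A-Za-z0-9_.]+)\}")


def render(template: str, metadata: dict, max_depth: int = 10) -> str:
    """Replace {Key} placeholders, expanding nested ones up to `max_depth` passes."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        # Unknown keys are left in place so that gaps stay visible in the output.
        return str(metadata.get(key, match.group(0)))

    for _ in range(max_depth):
        expanded = PLACEHOLDER.sub(substitute, template)
        if expanded == template:
            break
        template = expanded
    return template


# Made-up metadata; in the project this would be read from a JSON file.
metadata = json.loads("""
{
  "Party.Name": "Example Ltd",
  "Party.Address": "1 Main Street",
  "Party.Block": "{Party.Name}, of {Party.Address}"
}
""")

document = "This agreement is made by {Party.Block}. {Missing.Key} stays unresolved."
print(render(document, metadata))
# This agreement is made by Example Ltd, of 1 Main Street. {Missing.Key} stays unresolved.
```

The depth limit guards against metadata that refers to itself, which would otherwise expand forever.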

[1] Prose Objects
[2] Common Accord