Adding ISO 3166-2 subdivisions to LOC in DPV 2.1-dev

Adding ISO 3166-2 subdivisions to DPV's LOC extension
published: Sat Aug 03 2024
by Harshvardhan J. Pandit
is part of: Data Privacy Vocabulary (DPV)
DPV DPVCG semantic-web

Collect the data

This was difficult: there are many datasets on GitHub and elsewhere, but they have issues, e.g. labels are sometimes in French rather than consistently in English, or some concepts are missing.
After searching for approx. 3 hours, I went with the data that exists on Wikidata (no, DBpedia doesn’t have this data either), using the SPARQL query:

SELECT ?s ?code ?label
WHERE {
  ?s wdt:P300 ?code .
  ?s wdt:P373 ?label .
} ORDER BY ?code
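
The query above can be run interactively, but as a minimal sketch it could also be scripted against the public Wikidata SPARQL endpoint. The example assumes the endpoint returns CSV when sent an `Accept: text/csv` header; the function names, the `User-Agent` string, and the `query.csv` output path are illustrative choices, not part of the original workflow.

```python
import csv
import io
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"

# Same query as in the text.
QUERY = """
SELECT ?s ?code ?label
WHERE {
  ?s wdt:P300 ?code .
  ?s wdt:P373 ?label .
} ORDER BY ?code
"""


def build_request(query: str) -> urllib.request.Request:
    """Build a GET request that asks the endpoint for CSV results."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "text/csv",
        # Hypothetical identifier; a descriptive User-Agent is good practice.
        "User-Agent": "loc-subdivisions-example/0.1",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_rows(query: str = QUERY) -> list[list[str]]:
    """Run the query over the network and return the CSV rows, header included."""
    with urllib.request.urlopen(build_request(query)) as response:
        text = response.read().decode("utf-8")
    return list(csv.reader(io.StringIO(text)))


def save_rows(rows: list[list[str]], path: str = "query.csv") -> None:
    """Write the rows to a UTF-8 encoded CSV file."""
    with open(path, "w", encoding="utf-8", newline="") as fd:
        csv.writer(fd).writerows(rows)
```

Calling `save_rows(fetch_rows())` would then produce a `query.csv` with a header row followed by one row per result.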

Clean the data

The data had some issues, e.g. there was an entry for “IN.AP.VZ”, which is not in the correct ISO format (I have suggested removing the entry on Wikidata).
The Wikidata query only retrieves the subdivision code. We also need the country code to associate it as the broader/parent concept. Instead of using a more complex SPARQL query to identify the country and then its ISO 3166 code, we can trivially derive it by splitting the subdivision code, which contains the country code as its prefix, e.g. “AB-CD” means country “AB” and subdivision “CD”.
I removed “KY”, which is a country and not a subdivision. Leaving it in caused a maximum recursion depth error, as the parent of KY was KY itself and the hierarchy was therefore infinite. The same error also occurred for VI (which had a duplicate entry) and VG.
To handle the amount of data, a simple Python script was used to generate the output. The encoding of the CSV data is important as there are non-ASCII characters in the names. I had issues opening the output in Excel and LibreOffice Calc, but Numbers opened it correctly.

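import csv

# Map each country code to its label.
with open('countries.csv', encoding='utf-8', newline='') as fd:
    reader = csv.reader(fd)
    countries = {}
    for term, label, *_ in reader:
        countries[term] = label

# Read the Wikidata query results and write one row per subdivision.
with open('query.csv', encoding='utf-8', newline='') as fd:
    reader = csv.reader(fd)
    next(reader)  # skip the header row
    with open('data.csv', 'w', encoding='utf-8', newline='') as fd2:
        writer = csv.writer(fd2)
        for wdt, code, name in reader:
            print(code, name)
            # The country code is the prefix before the hyphen, e.g. "AB" in "AB-CD".
            country = code.split('-')[0]
            # Skip malformed codes such as "IN.AP.VZ".
            if '.' in country:
                continue
            writer.writerow((
                code, name,
                f'Concept representing region {name} in country {countries[country]}',
                f'loc:{country}', 'dpv:Region', code,))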
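
The script above relies on the problematic entries (KY, VI, VG, the duplicate) having been removed by hand. As a hedged sketch of a more defensive variant, the checks could be done in code: validate each code against the ISO 3166-2 shape (two letters, a hyphen, one to three alphanumeric characters), reject duplicates and unknown countries, and write the result with a UTF-8 byte order mark, which may help Excel detect the encoding. The function names and the sample rows are hypothetical and only illustrate the idea.

```python
import csv
import re

# ISO 3166-2 shape: two-letter country code, hyphen, one to three alphanumerics.
CODE_PATTERN = re.compile(r"([A-Z]{2})-([A-Z0-9]{1,3})")


def parent_country(code):
    """Return the country prefix of a subdivision code, or None if malformed."""
    match = CODE_PATTERN.fullmatch(code)
    return match.group(1) if match else None


def clean_rows(rows, countries):
    """Split (code, name) rows into kept and rejected entries.

    A row is kept only if its code is well formed, its country is known,
    and the code has not been seen before. A bare country code such as "KY"
    has no hyphen, so it can never become its own parent.
    """
    kept, rejected, seen = [], [], set()
    for code, name in rows:
        country = parent_country(code)
        if country is None or country not in countries or code in seen:
            rejected.append((code, name))
            continue
        seen.add(code)
        kept.append((code, name, country))
    return kept, rejected


def write_rows(rows, path):
    """Write rows as UTF-8 with a byte order mark (the 'utf-8-sig' codec)."""
    with open(path, "w", encoding="utf-8-sig", newline="") as fd:
        csv.writer(fd).writerows(rows)


# Hypothetical sample input; the placeholder names are not real labels.
countries = {"IN": "India", "KY": "Cayman Islands"}
rows = [
    ("IN-AP", "Andhra Pradesh"),
    ("IN.AP.VZ", "malformed entry"),
    ("KY", "country, not a subdivision"),
    ("IN-AP", "Andhra Pradesh"),  # duplicate
]
kept, rejected = clean_rows(rows, countries)
print(kept)      # one valid row with its derived parent "IN"
print(rejected)  # the malformed, country-only, and duplicate rows
```

Returning the rejected rows, rather than silently dropping them, keeps them available for manual review or for reporting back to Wikidata.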

Copy the data into the spreadsheets

Create a new tab for location subdivisions and copy over the data.
Add annotations for status (accepted), date of creation, and contributor.
After consideration, I moved the data into the same spreadsheet tab as the countries, as this makes it easier to load the data without requiring a separate entry and code for handling the new spreadsheet.

Download data using 100.py

Add the new spreadsheet tab to DPV_FILES.
Download and extract the concepts using --ds=location_jurisdiction.

Generate RDF
Generate HTML

Modify the locations template, template_locations.jinja, to display the subdivisions for each country.
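
To list subdivisions under each country, the template needs the concepts grouped by their parent. The actual data structures used by the DPV generator are not shown here, so the following is only a sketch of that grouping step, assuming rows of (code, label, parent) in the shape produced by the cleaning script, with the parent written as `loc:<country>`. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_subdivisions(rows):
    """Group (code, label, parent) rows by parent, sorted by parent and code."""
    groups = defaultdict(list)
    for code, label, parent in rows:
        groups[parent].append((code, label))
    # Sort parents and their subdivisions so the rendered page is stable.
    return {parent: sorted(items) for parent, items in sorted(groups.items())}


# Hypothetical sample rows using placeholder codes.
rows = [
    ("AB-ZZ", "Region ZZ", "loc:AB"),
    ("XY-01", "Region 01", "loc:XY"),
    ("AB-CD", "Region CD", "loc:AB"),
]
grouped = group_subdivisions(rows)
for parent, subdivisions in grouped.items():
    print(parent, subdivisions)
```

A template could then loop over each country and, inside that loop, over its list of subdivisions.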