SEMANTiCS 2019

published: 2019-09-10 20:42, updated: 2019-09-10 20:42
academic; conference; Germany;
SEMANTiCS 2019 conference in Karlsruhe, Germany

summary: SEMANTiCS took place in Karlsruhe this year over two days (10-11 September), with workshops on the 9th and DBpedia Day on the 12th. There was a mix of industry and academic talks about the use of semantics (~50% industry), which is expected from the conference. This year, there were special tracks for LegalTech (10th) and Cultural Heritage (11th), which consisted of academic research papers as well as invited talks from industry. There were interesting keynote talks: from Oracle on the use of semantics and KG, from Michel Dumontier regarding FAIR data, and from Valentina Presutti regarding representation of common sense using sem-web.

There was quite a variance in the topics presented (though they had the commonality of semantics). There was research regarding Wikipedia/Wikidata/DBpedia, Cultural Heritage, Building/Transport data, and Query Processing, though there were almost no papers based on core ML or NLP tasks, which was a good sign of the focus moving back to semantics.

In terms of papers, the conference had 20 full and 8 short papers with an acceptance rate of 27%, and 31 posters with an acceptance rate of 66%. Awards were given to "RSP-QL* - Statement level annotations in RDF streams" for Best Paper and "Transfer Learning for Biomedical NER with BioBERT" for Best Poster/Demo.

I presented a paper titled "Test-driven approach towards GDPR compliance" in the LegalTech track based on PhD work in the validation and linking of GDPR compliance using sem-web (paper and resources: https://w3id.org/GDPRep/semantic-tests) as well as a poster titled "OPN: Open Notice Network" based on work done on incorporating semantics to create an open notice for transparency with Mark Lizar (industry partner). Additionally, a poster advertising the Data Privacy Vocabulary was also on display at the conference, which was attended by DPVCG members Sabrina, Fajar, and Javier. Some interesting networking opportunities took place based on my PhD work (with Sabrina of SPECIAL for GDPR compliance, and Heiko Paulheim of Uni. Mannheim and Jan of SAP regarding domain specific ontology matching), as well as that of the DPVCG (Maria Pieper from FZI who also works in LegalTech).

I also had the opportunity to chair Session 4.4 on Knowledge Graphs on the 11th. There were two talks on the day. The first was an industry talk by i-views on the use of KG for exploration of experience in projects. The second was from the research team at Uni. Bonn, presented by Shimaa Ibrahim, and featured their work on multi-lingual ontology enrichment. For me, the multi-lingual talk was of particular interest based on past experience with trying to generate multi-lingual thesauri of GDPR concepts, and the eventual frustration of not being able to use existing techniques due to their ineffectiveness on domain-specific ontologies.

Wi-Fi password KA2019FIZ

Day 1

2019-09-10

Plenary - introduction

  • Semantics 2020
    • April 21-23 - Austin, TX, USA
    • September 07-10 - Amsterdam, NL
  • Papers: 88 submissions (20 full, 8 short accepted, 27% acceptance rate); Posters: 47 submitted (31 accepted, 66% acceptance rate)
  • 28 papers, 37 industry presentations, 7 workshops, 2 tutorials, 31 posters

Keynote: "Making sense of and taking control of enterprise silos" - Michael J. Sullivan, Oracle

  • large-scale print documentation, Linotype to generate, focus on creating general pattern language for technical documentation (inspired by Tufte)
  • investigate implementing "taxonomy as a service" based on Oracle OCI for use in Oracle CX apps
  • Oracle DB one of the first to implement RDF, but not on anyone's radar from KG perspective
  • <50% of structured data used to make decisions, <1% of unstructured data analysed - Harvard Business Review (2017)
  • Why graphs / RDF / semantics are a solution:
    • RDF requires URIs not strings for resources - makes integration easier e.g. no duplicates
    • SPARQL/SHACL has reasoning/reasoners that can make semantic sense out of disparate data e.g. owl:sameAs, differentFrom, inverseOf
    • RDF middleware can hide complexity
    • Oracle's implementation of RDF/Semantics reuses database features (materialised views, RMAN, RAC, etc.)
  • RDF solving data warehouse challenges
    • contexts - schema on read
    • conformed dimensions - sameAs inference
    • slowly changing dimensions - forward chaining
    • time series queries - events, dataWeb, Class/subClass inference, multiple inheritance, forward chaining
  • Amazon's solution for big data warehouse is complex and requires a lot of tools
  • Using a semantic data warehouse provides a semantic warehouse that can span across dimensions/silos
  • Problems: reconciling common URIs is problematic - mapping is active area of research
  • methodology to solve semantic heterogeneity
    • collect set of use-cases / queries to be answered across silos
    • create top-level schema (T) to answer use-cases (just enough information)
    • map each silo schema to top-level schema using OWL/SHACL axioms (A) just for that silo
    • create entailment E using A over T for silos
    • create virtual model V for E + T for silos
    • query V to answer use-cases
    • repeat as use-cases come in
    • won't work if one attempts to map all silos at the very start
  • pattern for reconciling known issues
    • create SPARQL endpoint for each silo
    • expose knowledge as schemas/data streams, aggregates, analytics - APIs don't work
    • use A+T --> E to create read only master views over all silos
    • could create multiple virtual models
  • instead of named graphs, have multiple instances (as silos) for scalability
  • Oracle DB 19c supports these features / methods
  • final thoughts: should have people in graphs (knowledge) for serendipity
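
The silo-mapping methodology from the keynote can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Oracle's implementation: triples are plain tuples, the axioms A are a hypothetical dictionary from silo predicates to top-level schema (T) predicates, the entailment E is one pass of forward chaining, and all prefixes (crm:, erp:, top:, cust:) and data are made up.

```python
# Toy sketch: map two silos onto a top-level schema and query the virtual model.
# All prefixes, predicates, and data below are hypothetical examples.

# Instance data held in two separate silos, as (subject, predicate, object) triples.
SILOS = {
    "crm": [("cust:1", "crm:fullName", "Ada Lovelace")],
    "erp": [("cust:1", "erp:invoiceTotal", 120), ("cust:2", "erp:invoiceTotal", 80)],
}

# Axioms A: per-silo mappings from silo predicates to the top-level schema T.
AXIOMS = {
    "crm": {"crm:fullName": "top:name"},
    "erp": {"erp:invoiceTotal": "top:amountBilled"},
}


def entail(silo_name):
    """Apply the silo's axioms once to derive top-level triples (E)."""
    mapping = AXIOMS[silo_name]
    return [
        (s, mapping[p], o)
        for (s, p, o) in SILOS[silo_name]
        if p in mapping
    ]


def virtual_model():
    """Read-only view V: union of the entailments of every mapped silo."""
    triples = []
    for name in SILOS:
        triples.extend(entail(name))
    return triples


def query(predicate):
    """Answer a use-case query against V using only top-level vocabulary."""
    return {s: o for (s, p, o) in virtual_model() if p == predicate}


if __name__ == "__main__":
    print(query("top:amountBilled"))
```

Adding a new silo only means adding its data and its own axioms, which mirrors the "repeat as use-cases come in" step rather than mapping everything up front.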

LegalTech, "To whom it may concern" - Christian Dirschl, Wolters Kluwer

  • Wolters Kluwer slide: Legal is very local - dependent on language and jurisdiction
  • Law firms sell advice, consultancy; LegalTech firms sell (digital) service
  • Global survey about future of law: two outcomes - significant transformation (disruption), rapid acceleration (in next 3 years)
    • independently conducted, 700 professionals across US and 10 countries across Europe in law firms, corporate legal departments
  • LegalTech companies index provided by Stanford techindex.law.stanford.edu
  • legalcomplex.com collect (financial) data about companies and startups, show analytics of money by sector, domain - sector 8 legaltech, sector is making money (income flow, not profit)
  • most of the money is going towards AI - search, IR, legal analysis, blockchain
  • survey outcomes: >60% lawyers expect impact from LegalTech, only fraction think they can cope with them
  • reasons new technology is resisted
    • 36% lack of technology knowledge, understanding of skills
    • 34% existing organisational are efficient
    • 30% financial constraints
  • >50% lawyers expect transformational technologies, <24% have a good understanding of them
    • big data and predictive analytics, machine learning, AI, robotic process automation, blockchain
  • smartlaw.de forms, resources for German legaltech
  • study download: info.wolterskluwer.de/studie-future-ready-lawyer
  • Q&A: domain knowledge differentiates (provides advantage) from big players who have more data, more resources

LegalTech, Ensuring GDPR Compliance with KG at large German powertool manufacturer - Magnus Knuth, Eccenca

  • GDPR data is interdependent
    • data (ID, name) → processes (CRM, accounting) → purposes (marketing, billing) → legal validation (consent, contract) → legal framework (GDPR, trade law)
    • data is not centrally managed, has separate data owners, for their use-case >200 separate data silos
    • there is a directory of processes and applications which lists all data being handled by processes within company or sub-contractor
  • company gets data subject request to DPO (A15)
    • first challenge is to identify applications that contain personal data of requesting data subject
    • same for personal data categories
    • not practical to ask every data owner
    • information about processing is stored in directory
    • legal basis can be disparate for different silos
  • eccenca solution - connect everything via a middleman/middle layer
    • data from heterogeneous sources is integrated into one knowledge graph, and GDPR team has access via dedicated interfaces
    • personal data search, meta data catalog, compliance dashboards, data import → only import metadata
    • PII is stored in search index to identify data subject
  • ontologies to represent domain knowledge e.g. GDPR
  • summary
    • data is linked with requirements of GDPR (consent, purpose, processing)
    • record of processing activities, legal bases, and retention periods
    • application/metadata discovery
  • SAR
    • rights: access A15, rectify A16, erasure A17, restriction of processing A18, data portability A20
    • upon request, JIRA ticket is created
    • identify data subject using search index in different applications / data silos
    • level 1 report: meta data about data subject
    • level 2 report: data export / deletion / update confirmation
  • search interface
    • personal data categories
    • search results returning personal data with source
  • sub-tickets in JIRA for tasks and targeted applications
  • Auditing and Reporting
    • proof of compliance to stakeholders and authorities
    • compliance for all internal data processing operations
    • personal note: compliance dashboard is handling compliance requests
  • For internal stakeholders, JIRA is used to track internal issue status, trackers, analytics
  • explore views and complex queries e.g. instance x data object x consent (personalised offers)
    • identify data for future processing activities - compatibility
  • integrate Power BI for GDPR compliance dashboard - consent missing by data object for data categories
  • summary
    • deliver metadata to external applications (BI, analytics, dashboards)
    • existing IT infrastructure is not affected, does not replace legacy system
    • DPO: identification of data subjects across applications, control and transparency of compliance, no duplication of data through separating instance data from metadata
    • application owners: efficient processing of SAR without interfacing legacy system, operational processes and established workflows are not disrupted
  • booth at SEMANTiCS
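
The request-handling flow described above (search index to find the data subject, metadata-only catalogue, level 1 report, one sub-ticket per application) can be sketched roughly as follows. This is a minimal illustration, not eccenca's product: the catalogue entries, the search index, and the function names `level1_report` and `sub_tickets` are all hypothetical.

```python
# Hypothetical metadata-only catalogue: which application holds which
# personal data categories, for which purpose, on which legal basis.
CATALOGUE = [
    {"application": "CRM", "categories": {"name", "email"},
     "purpose": "marketing", "legal_basis": "consent"},
    {"application": "Accounting", "categories": {"name", "bank_account"},
     "purpose": "billing", "legal_basis": "contract"},
    {"application": "Telemetry", "categories": {"device_id"},
     "purpose": "product improvement", "legal_basis": "legitimate interest"},
]

# Hypothetical search index: data subject identifier -> applications holding a record.
SEARCH_INDEX = {
    "alice@example.org": {"CRM", "Accounting"},
    "bob@example.org": {"CRM"},
}


def level1_report(subject_id):
    """Return metadata about where and why a data subject's data is processed."""
    applications = SEARCH_INDEX.get(subject_id, set())
    return [
        {
            "application": entry["application"],
            "categories": sorted(entry["categories"]),
            "purpose": entry["purpose"],
            "legal_basis": entry["legal_basis"],
        }
        for entry in CATALOGUE
        if entry["application"] in applications
    ]


def sub_tickets(subject_id, right):
    """One task per affected application, mirroring the sub-ticket idea."""
    return [f"{right}: {row['application']}" for row in level1_report(subject_id)]


if __name__ == "__main__":
    for ticket in sub_tickets("alice@example.org", "A15 access"):
        print(ticket)
```

Only metadata lives in the catalogue here; the instance data stays in the source applications, which matches the "only import metadata" point in the notes.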

Open Government and Semantic Web: A Field Report - Guido van der Wolk, Taxonic

  • new dutch legislation (1/1/2021)
    • every government org must publish their notices in official gazette online
    • every 18+ citizen will receive customisable IDs
  • official gazette of NL
    • 300,000+ publications a year, official source since July 1, 2009, centralised government (states, parliament) and decentralised government (provinces, municipalities, water boards) - bulk
    • XML, HTML, PDF, ODT, Metadata (XML)
    • search platform, geo-based email subscription
    • open data (CC0)
    • officielebekendmakingen.nl
  • currently the data is not usable
  • working on data hub using FAIR - retrieve data and metadata, deploy enriched data
    • data wrangling, define semantic model, make data linkable, data is represented using XML
    • combine with other data, query combined data, monitoring and visualisation
    • MarkLogic, Java, Python, vue.js - MarkLogic chosen because data is stored in MarkLogic document storage
    • data wrangling challenges - prefixes are not stored in same field, data cleaning required, fields are concatenated
    • semantic enrichment - dct, dcam, dcat, foaf, geo, prov, rdf/s, skos, legal domain dutch law: bwb, ecli, lido, internal references: oep, overheid, overheidop
  • combine with other data - registry of government organisations, geo and demographic information, judicial information
  • interactive queries using YASGUI
  • enriching publications with KG
    • enables data quality monitoring/analysis
    • enables custom open government
    • prepares for ML enrichment

Keynote: FAIR Data - Michel Dumontier

  • FAIR
    • unique identifiers to retrieve all forms of digital content and knowledge
    • high quality metadata to enhance discovery of digital resources
    • use of common vocab
    • establish community standards
    • detailed provenance
    • registered in appr. repos
    • social and technological commitments
    • simpler terms of use to clarify expectations and intensify innovation
  • FAIR != Open - open as possible, closed as necessary
    • document your data (with metadata) for potential findability and reuse, not necessarily make it open (publish)
  • why should I go FAIR?
    • easy to use my data for new purpose
    • easy for other people to find, use, cite my data, and understand what I expect in return
    • easy to verify my work
    • ensure data are available in future
    • satisfy expectations around management from institution, agencies, peers
  • semantic web provides ways for publishing data, metadata, frameworks, ecosystems
  • Bio2RDF - OSS uses sem-web for reusing biomed data
  • reproduce original research
    • reimplement PREDICT: inferring novel drug indications with application to personalised medicine
    • original result: AUC 0.91, new result over new data: AUC 0.83
  • efficiently explore web of data: explore probabilistic drug (re-)use using a KG to identify potential applications of existing drugs and potential candidate drugs
  • FAIR metadata
    • metadata identifier
    • resource identifier
    • standardized, machine readable format
    • use of community vocabularies
    • license ???
    • provenance ???
  • W3C HCLS Community Profile w3.org/TR/hcls-dataset/
    • ShEx validator (github, convertable to SHACL)
  • In addition to FAIR, there are 15 guiding principles fairmetrics.org
    • 14 universal metrics covering FAIR sub-principles
    • Metrics demand evidence, not standards
    • machine-readable metadata, resource management plan, additional authorisation procedures
    • publicly registered, identifier schemes, access protocols, KR lang, licenses, provenance spec, community standards
    • evidence resource can be located in search results
  • automatically assess FAIRness of digital resource w3id.org/AmIFAIR
    • tests metrics
    • evaluating FAIR maturity through a scalable, automated, community-governed framework
    • each metric is registered as an API service, which can be executed automatically
  • mine distributed, access restricted FAIR datasets in a privacy preserving manner
    • privacy preserving machine learning
    • made available through FAIR data stations
  • semantics, coupled with AI, may enable humans, aided by intelligible machine agents, to exploit internet of shared data and services
  • Q&A: FAIR is a gradient of increasing competencies, not an absolute target. It will evolve and move as we churn through technologies.
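
The idea of automatically executable metrics can be illustrated with a small sketch. In the framework described in the talk each metric is registered as an API service; here, as a simplifying assumption, each metric is a plain Python function over a metadata dictionary. The field names (`metadata_id`, `resource_id`, `format`, `license`, `provenance`) and the list of accepted formats are hypothetical.

```python
# Formats treated as machine readable in this sketch (an assumption, not a standard list).
MACHINE_READABLE = {"text/turtle", "application/ld+json", "application/rdf+xml"}

# Each metric maps a metadata record to True/False, so all of them can be run automatically.
METRICS = {
    "metadata identifier": lambda r: bool(r.get("metadata_id")),
    "resource identifier": lambda r: bool(r.get("resource_id")),
    "machine readable format": lambda r: r.get("format") in MACHINE_READABLE,
    "license": lambda r: bool(r.get("license")),
    "provenance": lambda r: bool(r.get("provenance")),
}


def assess(record):
    """Run every metric and return per-metric results plus the fraction passed."""
    results = {name: check(record) for name, check in METRICS.items()}
    score = sum(results.values()) / len(results)
    return results, score


if __name__ == "__main__":
    example = {
        "metadata_id": "https://example.org/meta/1",
        "resource_id": "https://example.org/data/1",
        "format": "text/turtle",
        "license": "CC0",
    }
    results, score = assess(example)
    print(results, score)
```

Reporting a fraction rather than a pass/fail verdict fits the Q&A remark that FAIR is a gradient, not an absolute target.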

Talk: An innovative semantic solution to turn transport data in EU compliance - Marco Comerio

  • EU Reg 2017/1926
    • requirements
    • impact on transport stakeholders
    • challenge & opportunities
  • establish interop framework enabling EU players for interop business applications
    • barriers: insufficient accessibility of transport data, lack of service and data interop
    • key enablers: data sharing mechanism, data interop by means of common set of data exchange standards
  • Each EU member state is required to setup NAP by regulation
  • rely on in-house support for data conversion process, which may lack knowledge and skills related to regulation - or turn to external providers that provide custom and expensive solutions
  • impact on transport stakeholders:
    • obligations: provide datasets to NAP compliant to the requested data formats, provide metadata description of datasets
    • challenge: turn available data into requested formats, and enrich them with additional data sources
    • benefit: additional data sources
  • reference ontologies - unambiguously describe operational aspects of transport domain
    • metadata profiles - harmonise metadata description of datasets
    • data converters: turn available transport data into specific formats, and enrich with additional sources e.g. translate schedule, fare info
  • contributions
    • conceptualisation - acquire domain knowledge, data formats, standards → define reference ontology
    • sharing - asset types, asset descriptors
    • governance - identify actors, roles, tasks; define lifecycle
  • SNAP solution
    • uplift from source format into reference ontology → chimera provides options for RML (CSV,DB,etc), Java
    • downlift to target format → chimera provides two options Apache Velocity, Java annotations
      • Apache velocity template: beginning of template has SPARQL query binding variables to data required in template
      • Java annotations: annotations identify mappings
    • chimera converter: lifting, data enrichment, inference enrichment, lowering; based on Apache Camel
  • SWOT
    • strengths: flexibility, reusability
    • weaknesses: handmade mappings no tooling, semantic/logic skills required
    • opportunities: conceptualisation of domain, applicability to different domains, semantic NAPs with transmodel RDF data
    • threats: bad ontology and mapping
  • transmodel-cen.eu
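
The lifting and lowering steps can be sketched in plain Python. This is not chimera: as a stand-in, `csv` replaces the RML lifting and `string.Template` replaces the Apache Velocity template, and a small function plays the role of the SPARQL query that binds template variables. The ex: terms, the CSV columns, and the XML target format are hypothetical.

```python
import csv
import io
from string import Template

# Hypothetical source data: a stop list in CSV, as an operator might hold it.
SOURCE_CSV = "stop_id,stop_name\nS1,Central Station\nS2,Market Square\n"

# Hypothetical reference-ontology terms (not the real Transmodel vocabulary).
STOP_CLASS = "ex:StopPlace"
NAME_PROP = "ex:name"


def lift(csv_text):
    """Lifting: turn CSV rows into (subject, predicate, object) triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"ex:stop/{row['stop_id']}"
        triples.append((subject, "rdf:type", STOP_CLASS))
        triples.append((subject, NAME_PROP, row["stop_name"]))
    return triples


def select_stops(triples):
    """Stand-in for the query at the top of a template: bind id and name per stop."""
    names = {s: o for (s, p, o) in triples if p == NAME_PROP}
    return [{"id": s.rsplit("/", 1)[1], "name": n} for s, n in sorted(names.items())]


# Lowering template for a hypothetical XML target format (no XML escaping in this sketch).
STOP_TEMPLATE = Template('<Stop id="$id"><Name>$name</Name></Stop>')


def lower(triples):
    """Lowering: render the bound variables into the target format."""
    return "\n".join(STOP_TEMPLATE.substitute(b) for b in select_stops(triples))


if __name__ == "__main__":
    print(lower(lift(SOURCE_CSV)))
```

Because both directions go through the reference ontology, a new source or target format needs only one new mapping, which is the reusability listed under strengths in the SWOT.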

Talk: A legal knowledge graph for improved law accessibility - Erwin Filtz, WU

  • Legal data is expressed in natural language, most times it is heterogeneous, and has incomplete metadata
  • searched 'car' via EUR-Lex in datasets for Austria, Germany, Italy, EU
    • Austria: case number and dates
    • Germany: some additional data
    • Italy: no central database, heterogeneous data from court cases; 'auto' returned 1 result
  • problems:
    • court decisions have references to other laws, dates, documents
    • Austria linking law - problems with versioning as laws might change
    • mainly keyword based
    • need to filter
    • ambiguous terms
  • Solution:
    • central search interface, ideally across EU
    • using semantic search
    • linked documents to support better information lookup
    • add external sources
    • standardised document classification schema
  • desired
    • interlinked legal documents
    • use standardised identifiers (ECLI, ELI)
    • minimum set of metadata for legal documents
  • ECLI / ELI is not implemented / adopted by all countries in the EU
  • which sources can be used?
    • EU sources influence national law
    • EuroVoc and EUR-Lex
  • information extraction
    • patterns using regex
    • gazetteers - compare text to lists/trees (existing list)
  • EuroVoc for thesaurus
  • approaches
    • TF-IDF
    • Word2Vec, Doc2Vec + combine with TF-IDF
    • fast.ai deep learning
    • JRC-acquis V3, KE-Darmstadt corpus
  • ADORN - automatic document RDFa annotator
    • GUI
    • query/store documents in file
    • automatically annotate and classify documents
    • export in RDFa to display in HTML
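
The two extraction techniques mentioned in the talk, regex patterns and gazetteers, can be sketched as follows. This is only an illustration, not the ADORN implementation: the ECLI pattern is a simplified assumption about the identifier layout (ECLI:country:court:year:ordinal), and the gazetteer entries are hypothetical rather than taken from EuroVoc.

```python
import re

# Simplified, assumed pattern for ECLI identifiers: ECLI:<country>:<court>:<year>:<ordinal>.
# Note: an identifier directly followed by a full stop would capture that stop as well.
ECLI_PATTERN = re.compile(r"ECLI:[A-Z]{2}:[A-Z0-9]+:\d{4}:[A-Za-z0-9.]+")

# Hypothetical gazetteer: surface form -> thesaurus concept label.
GAZETTEER = {
    "motor vehicle": "vehicle",
    "car": "vehicle",
    "data protection": "data protection",
}


def extract_identifiers(text):
    """Return all substrings that look like ECLI identifiers."""
    return ECLI_PATTERN.findall(text)


def match_gazetteer(text):
    """Return the concepts whose surface forms occur as whole words in the text."""
    lowered = text.lower()
    found = set()
    for surface, concept in GAZETTEER.items():
        if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            found.add(concept)
    return sorted(found)


if __name__ == "__main__":
    sample = "See ECLI:NL:HR:2019:123 on damage to a car and data protection."
    print(extract_identifiers(sample))
    print(match_gazetteer(sample))
```

Mapping both "car" and "motor vehicle" to one concept is one way to address the ambiguity and keyword-only search problems listed above.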

Day 2

2019-09-11

talk: Industry proven AI applications based on Enterprise KG - Klaus, Jan i-views

  • why is semantics difficult to introduce in enterprise?
    • experimental applications dominate
    • not managed by domain experts
    • how do we escape the sandbox?
  • do knowledge graphs help? old wine in new bottle? hype cycle?
  • consultants are more interested in talking to other consultants about experience and seeing reference projects
    • e.g. project diet - diet consultant simulating real dietician
  • system and KG should not only close the gap (in knowledge, application) but also allow exploration
    • should be able to edit information (easily)
    • should be able to react to new vocabularies as they permeate
  • related terms and applications can be explored (semantics: related to, subtopics)
  • learn from user behaviour by prioritising edges that lead to desired result (in search)
    • personal note: perhaps this constructs a form of weighted graph for IR
  • Rather than worrying about RDF, SPARQL, worry about CSV and integration of structured sources
  • learning
    • analytics, text analytics, modelling
    • along with knowledge engineers we also need to expose end users to the KG otherwise it won't grow
  • services
    • authorisation: on objects/relations, on meta-data, via the graph itself
    • auditing: audit log for all data access
    • security: access control, integration, encryption
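
The personal note about a weighted graph can be made concrete with a small sketch. This is a guess at the general idea, not i-views' method: edges that lie on a path leading to a result the user accepted get their weight increased, and neighbours are then ranked by weight. The class name, the reward value, and the example terms are all hypothetical.

```python
from collections import defaultdict


class WeightedGraph:
    """Directed graph whose edge weights grow when users follow them successfully."""

    def __init__(self):
        self.weights = defaultdict(lambda: defaultdict(float))

    def add_edge(self, source, target, weight=1.0):
        self.weights[source][target] = weight

    def reinforce(self, path, reward=0.5):
        """Increase the weight of every edge on a path that ended in a desired result."""
        for source, target in zip(path, path[1:]):
            self.weights[source][target] += reward

    def ranked_neighbours(self, node):
        """Neighbours of node, highest weight first (ties broken alphabetically)."""
        edges = self.weights[node]
        return sorted(edges, key=lambda target: (-edges[target], target))


if __name__ == "__main__":
    graph = WeightedGraph()
    graph.add_edge("diet", "nutrition")
    graph.add_edge("diet", "allergy")
    graph.reinforce(["diet", "nutrition"])
    print(graph.ranked_neighbours("diet"))
```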

talk: From monolingual to multilingual ontologies: the role of cross-lingual ontology enrichment - Shimaa, Uni. Bonn

  • multi-lingual ontology
    • entities and relations are present in multiple natural languages
    • e.g. dbpedia - en, fr, de
  • processes for multi-lingual
    • cross-lingual matching - match source to target in different natural language
    • cross-lingual ontology enrichment
      • depends on matching
      • expand the target ontology with additional information extracted from external resources
  • motivation - 73.46% EN in LOV, 7.92% FR, 4.84% DE
    • manual enrichment is error-prone and difficult
    • monolingual ontologies are not easily understandable to other language speakers
  • previous work: OECM (ESWC 2019 Poster)
  • new approach
    • use semantic similarity
    • enrich by adding new classes in addition to related classes in hierarchy
    • automated
    • non EU languages
  • steps
    • extract concepts and translate using Google Translate, for multiple matches, select all
    • pre-process: use NLP tokenisation, POS-tagging, Stop words removal, lemmatization, true casing
    • identify potential match and select best match based on similarity score - Jaccard (string) & WordNet (semantic)
    • output is matched terms between source and target ontologies
    • triple-retrieval: takes matched terms and retrieves triples for matched terms and related classes
    • enrichment: retrieve triples and enriched terms and add to target ontology
    • validation: semantic (reasoners), syntactic (W3C online validation)
  • personal question:
    • will Word2Vec from source and target languages reveal similarity in labels???
    • labels might not be valid dictionary words, can we also utilise definition and other annotations
  • use-case: SEO Scientific Events Ontology
    • enrich SEO (49 classes) from another ontology Conference (DE, 60 classes)
    • new ontology had 20 new classes
    • enrich SEO from ConfOf (Arabic)
    • new ontology has 37 new classes
  • evaluation
    • MultiFarm benchmark - 7 ontologies, translation into 9 languages
    • evaluate effectiveness - compare with reference alignment
    • compare with SotA
    • evaluate enrichment process quality - manually enrich by expert to create gold standard, compare for evaluation, results 80%
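
The string-similarity part of the matching step can be sketched as below. This is an illustration under assumptions, not the authors' pipeline: labels are assumed to be already translated into the target language, pre-processing is reduced to lower-casing and stop-word removal, Jaccard is computed over token sets, and the WordNet-based semantic score is left out. The stop-word list, the 0.5 threshold, and the example labels are hypothetical.

```python
STOP_WORDS = {"of", "the", "a", "an", "and"}


def preprocess(label):
    """Lower-case, split on whitespace/underscores, and drop stop words."""
    tokens = label.lower().replace("_", " ").split()
    return {token for token in tokens if token not in STOP_WORDS}


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(source_labels, target_labels, threshold=0.5):
    """For each source label, keep the best-scoring target label if it meets the threshold."""
    matches = {}
    for src in source_labels:
        scored = [(jaccard(preprocess(src), preprocess(tgt)), tgt) for tgt in target_labels]
        if not scored:
            continue
        score, tgt = max(scored)
        if score >= threshold:
            matches[src] = (tgt, score)
    return matches


if __name__ == "__main__":
    source = ["Conference Paper", "Invited Speaker", "Venue"]
    target = ["Paper", "Keynote Speaker", "Location"]
    print(best_matches(source, target))
```

In this example "Venue" and "Location" share no tokens and so are not matched, which is the kind of case where a semantic measure such as the WordNet score mentioned in the talk would be needed.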

keynote: Looking for common sense in the Semantic Web - Valentina Presutti

  • adoption and usage of semantic web in industry
  • smart agents currently
    • do not reason
    • are not aware of surrounding context
    • do not have 'common sense'
    • have the answer built-in in the best case
    • borrow from Wikipedia or other sources
    • issue a query on Google
  • lack of common sense
    • "common sense" is knowledge which we share but do not explain explicitly
    • common sense facts/knowledge is needed for reasoning
    • existing knowledge graphs encode domain specific knowledge
  • Role of semantic web - to create a graph of common sense (Dagstuhl Report)
  • research on common sense in semantic web
    • search on scholarly data website → 2197 papers, only 3 papers excluding Wikipedia, only 1 unique paper
  • conceptnet.io
    • labeled graph (semantic network) targeting text processing
    • accumulated ~1M English facts
    • crowdsourced, reusing Wiktionary and WordNet and aligned partially to DBpedia
    • provides JSON-LD APIs
    • e.g. knife has three sub-graphs
      • knife as a noun
      • knife as a verb
      • knife as an object
    • we need to give it formal semantics (e.g. via OWL) in order to utilise this for reasoning/inferences
    • there is no information about situational semantics, validity, and applicability
  • NELL rtw.ml.cmu.edu
    • ML system that reads the web and extracts facts from textual web documents
    • running since 2010, ~50M candidate beliefs, ~2.8M high confidence beliefs
    • candidate beliefs encoded as KN of facts and ontology of categories and relations
    • available as LOD
    • no formal semantics no categorisation, no constraints or dependency
  • Atomic homes.cs.washington.edu/~msap/atomic/
    • textual descriptions of inferential knowledge (if-then clauses) based on 3 types of if-then associated with 9 dimensions of inferential and causal types
    • accumulated ~877k textual descriptions
    • crowdsourcing of blank placeholders put in 24k event phrases
    • no formal semantics
  • Human Know-How dataset datashare.is.ed.ac.uk/handle/10283/1985
    • dataset and ontology (PROHOW)
    • labeled with a sequence
    • nodes are labeled in text, not semantics
  • FrameNet
    • lexical resource, which describes frames which are situations
    • frame elements are nodes which add semantics to the frame
    • associated actions or situations can be related to 'evoke' the frame e.g. slice for cutting
  • Framester w3id.org/framester
    • LOD resource that connects linguistic data with factual and ontological data
    • encodes 50M links between 21 resources
      • resources: DBpedia, WordNet, DOLCE, FrameNet, SentiWordNet, ConceptNet etc
      • linking: skos closeMatch etc
  • FOX w3id.org/fox
    • do foundational distinctions match common sense?
    • are they present in LOD?
    • class vs instance e.g. is a building a class or an instance

Closing Session

  • SEMANTiCS 2019 had 426 participants
  • 28 papers, 37 industry presentations, 7 workshops, 2 tutorials, 31 posters
  • Best Paper RSP-QL* - Statement level annotations in RDF streams
  • Best Poster/Demo - Transfer Learning for Biomedical NER with BioBERT
  • Industry Innovation Award - Upstream - Managing Knowledge in the Oil & Gas Industry into the digital age
  • pre-proceedings https://cutt.ly/semantics2019
  • proceedings will be Open Access (coming soon)