How to Design a Successful SemTech PoC

Alex Popov, Andrey Tagarev and Nikola Tulechki - Ontotext

2019-09-26

0.1 Workshop Overview

Start - End     Topic                                         Duration
13:00 - 13:15   Introduction                                  15 min
13:15 - 14:00   Session 1 - Model creation                    45 min
14:00 - 14:10   Break                                         10 min
14:10 - 15:00   Session 2 - ETL process                       50 min
15:00 - 15:30   Break                                         30 min
15:30 - 16:30   Session 3 - Data integration & visualization  60 min
16:30 - 16:40   Break                                         10 min
16:40 - 17:30   Session 4 - Demonstrators (Alex) - Q&A        50 min

0.2 URL of this presentation

https://presentations.ontotext.com/semtech-poc.html

1 Typical Use Case

1.1 Steps in Typical Use Case

What is a typical use case for a PoC?

  • Start with a question we want answered
  • Have some messy data that partially answers the question
  • A piece of the picture is missing, but we can find it in LOD
  • Create an abstract model representing the ideal data
  • Transform the messy sources from tabular to graph form (ETL)
  • Merge the sources into a single dataset
  • Transform further to match the data to our ideal model
  • Use finalized dataset to get answers

2 Our Use Case

2.1 Our Use Case

2.1.1 Nepotism in Hollywood


nepotism /ˈnɛpətɪz(ə)m/

The practice among those with power or influence of favouring relatives or friends, especially by giving them jobs.


Mid 17th century: from French népotisme, from Italian nepotismo, from nipote ‘nephew’ (with reference to privileges bestowed on the ‘nephews’ of popes, who were in many cases their illegitimate sons).

2.2 Simplified IMDb dataset

  • Our dataset is a simplified version of the public IMDb dataset
    • information on actors, directors and movies
    • no information on the family relations of actors

2.3 Sources for Semantic Integration: LOD Cloud

2.4 Sources for Semantic Integration: Datahub

2.5 Sources for Semantic Integration: Google

https://toolbox.google.com/datasetsearch

  • Newest development
    • Not linked data but can easily be converted
    • Very rich
    • Growing very quickly
  • https://console.cloud.google.com/marketplace
    • Large and dynamic datasets
    • Need to learn BigQuery and use (and pay for) Google Cloud

2.6 Sources for Semantic Integration: DBpedia

3 Analysis

3.1 Analyse and document peculiarities of source data

  • number of records
  • coverage of features (i.e. how many feature values are missing)
  • range of numeric features
  • range of date features
  • repetitive values in text features

4 Auxiliary Resources

4.1 Select existing auxiliary resources relevant to use case

  • FOAF (Friend of a Friend ontology)
  • RDFS & OWL - Semantics for Classes, Properties and sameAs equivalence
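
As a minimal sketch of how these resources come together (all IRIs here are hypothetical), a FOAF-typed person with an owl:sameAs link to a second record of the same person in another source could be loaded like this:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

# Toy data: one person record, linked via owl:sameAs to a record
# of the same real-world person from another source.
INSERT DATA {
  <http://example.com/person/Jane_Doe> a foaf:Person ;
      foaf:name "Jane Doe" ;
      owl:sameAs <http://other-source.example/actor/jdoe> .
}
```

With owl:sameAs reasoning enabled in GraphDB, anything stated about one of the two IRIs then also holds for the other.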

5 Perform semantic conversion (ETL procedure)

5.1 ETL Basics

  • Extract
  • Transform
  • Load

5.2 Normalise values

OntoRefine text facets allow quick bulk-editing of values

“United States” is normalised to “USA” in 122 cells
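
In OntoRefine this is a point-and-click facet edit; as a rough post-load equivalent, the same normalisation could be done with a SPARQL update (ex:country is a hypothetical property for illustration):

```sparql
PREFIX ex: <http://example.com/property/>

# Rewrite every "United States" literal to the normalised form "USA".
DELETE { ?movie ex:country "United States" }
INSERT { ?movie ex:country "USA" }
WHERE  { ?movie ex:country "United States" }
```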

5.3 Create new columns

Split columns according to a separator character

5.4 Urlify

Edit the text in the cells

Remove whitespace so that the string can be used in a URL/IRI
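
In OntoRefine this is a cell transform; when minting IRIs directly in SPARQL, the same effect can be sketched with REPLACE and ENCODE_FOR_URI (namespace and property are hypothetical):

```sparql
PREFIX ex: <http://example.com/property/>

# Mint an IRI from a name literal: replace spaces with underscores,
# then percent-encode any remaining IRI-unsafe characters.
SELECT ?name ?iri
WHERE {
  ?person ex:name ?name .
  BIND(IRI(CONCAT("http://example.com/person/",
                  ENCODE_FOR_URI(REPLACE(?name, " ", "_")))) AS ?iri)
}
```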

5.5 Reconcile

Use a reconciliation service to match strings to real world objects.

Bulgaria → https://www.wikidata.org/wiki/Q219
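
OntoRefine delegates this matching to a reconciliation service; a crude SPARQL-only stand-in is a federated label lookup against the Wikidata endpoint:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Find Wikidata items whose English label is exactly "Bulgaria";
# a real reconciliation service also scores and ranks candidates.
SELECT ?item
WHERE {
  SERVICE <https://query.wikidata.org/sparql> {
    ?item rdfs:label "Bulgaria"@en .
  }
}
LIMIT 10
```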

5.6 Tabular to Linked Data

Moving from tabular data to linked data

5.7 Tabular to Linked Data

Here is what our cleaned-up table looks like…

5.8 Tabular to Linked Data

… but here it is transformed into RDF.
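
The slide shows the generated triples; they are not reproduced here, but a single cleaned-up row might come out roughly like this (IRIs and property names are illustrative, not the workshop's exact model):

```sparql
PREFIX movie:  <http://example.com/movie/>
PREFIX person: <http://example.com/person/>
PREFIX ex:     <http://example.com/ontology/>

# One table row (title, year, actor) expressed as RDF triples.
INSERT DATA {
  movie:Casablanca a ex:Movie ;
      ex:title    "Casablanca" ;
      ex:year     1942 ;
      ex:hasActor person:Ingrid_Bergman .

  person:Ingrid_Bergman a ex:Actor ;
      ex:name "Ingrid Bergman" .
}
```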

5.9 Tabular to Linked Data

6 Data Integration

6.1 Initial data model

Output from our ETL procedure

Does this model contain all the data we need?

6.2 Expanding the initial model

Incorporating data from an additional data source.

Can we simplify things?

6.3 Creating a new property

Single symmetric relation to use in a straightforward manner.
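
One way to realise such a relation (property name hypothetical) is to declare it symmetric and let GraphDB's OWL ruleset infer the reverse direction, or to materialise both directions with an update:

```sparql
PREFIX ex:  <http://example.com/ontology/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Option 1: declare the property symmetric; a reasoning-enabled
# repository infers ?b ex:relativeOf ?a from ?a ex:relativeOf ?b.
INSERT DATA {
  ex:relativeOf a owl:SymmetricProperty .
} ;

# Option 2: materialise the missing direction explicitly.
INSERT { ?b ex:relativeOf ?a }
WHERE  { ?a ex:relativeOf ?b }
```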

Can we simplify things further?

6.4 Creating a second new property

Three relations transformed into a single one.

But we are still working with two disconnected parts.

6.5 Connecting the dataset

Now we have everything we need to ask our question.

What if we want to ask a more complex question?

6.6 Changing the model

At later stages we can rework the model, which will then require corresponding changes to the ETL procedure.

7 Ontotext GraphDB Visualization

7.1 Google charts

Visualize data with Google Charts in GraphDB

7.2 Visual graph

Highly configurable network visualisation using SPARQL
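
The visual graph's advanced configuration is driven by SPARQL; as a sketch (assuming GraphDB's ?node convention for the clicked resource and a hypothetical property), an expansion query could be:

```sparql
PREFIX ex: <http://example.com/ontology/>

# Expand the clicked node with its family-relation edges only,
# keeping the visualisation focused on relatives.
CONSTRUCT { ?node ex:hasRelative ?other }
WHERE {
  ?node ex:hasRelative ?other .
}
```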

7.3 Visualize Family Relations

7.4 Visualize Family Relations 2

7.5 Is there nepotism in Hollywood?

7.6 Break

7.7 Building a Semantic PoC - Steps

  1. Start with a question we want answered
  2. Have some data that partially answers the question
  3. A piece of the picture is missing, but we can find it in LOD
  4. Create an abstract model representing the ideal data
  5. Transform the sources from tabular to graph form (ETL)
  6. Merge the sources into a single dataset
  7. Transform further to match the data to our ideal model

8 Part II: ETL Process with OntoRefine

8.1 Short Break

Load post-ETL repository:

https://presentations.ontotext.com/movieDB_ETL.trig

Download SPARQL queries for next section:

http://presentations.ontotext.com/queries.zip

9 Part III: Transformation and Visualization with GraphDB

10 Part IV: Demonstrators

10.1 Web application

10.2 FactForge

10.3 Family relations with AgRelOn

  • Context
    • EHRI EC Project
    • 5M Records of Holocaust survivors and victims (HSV)
      • Transcripts of lists of names
  • Data problem
    • Explicit family relations for 142K pairs (manually constructed by historians)
    • Many more in the data
      • Families referenced by common number
      • People listed with their address
      • Relationships between people present (in all European languages)

10.4 Input Data

  • Flat CSV format

10.5 Mapping to AgRelOn

10.6 Results

  • Much easier querying (agrelon:hasRelative)
  • Inference of 10K links (10%) between previously unconnected nodes, grandparents only; a sketch of the rule follows below
  • More relations possible (e.g. cousins)
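
A sketch of the kind of rule behind the grandparent inference, assuming AgRelOn-style hasParent/hasGrandParent property names (the namespace and exact property IRIs should be checked against the published ontology):

```sparql
PREFIX agrelon: <https://d-nb.info/standards/elementset/agrelon#>

# Materialise a grandparent link wherever two parent hops exist.
INSERT { ?child agrelon:hasGrandParent ?grandparent }
WHERE {
  ?child  agrelon:hasParent ?parent .
  ?parent agrelon:hasParent ?grandparent .
}
```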

10.7 Questions