How to Design a Successful SemTech PoC

Andrey Tagarev and Nikola Tulechki - Ontotext

2019-09-09

0.1 Preparation for Hands-On Exercises

0.2 Workshop Overview

  • Part I: Conceptual Walkthrough
  • Part II: ETL with OntoRefine
  • Part III: SPARQL and GraphDB Visualization
  • Part IV: Demonstrators

1 Typical Use Case

1.1 Steps in Typical Use Case

What is a typical use case for a PoC?

  • Start with a question we want answered
  • Have some messy data that partially answers the question
  • Piece of the picture is missing but we can find it in LOD
  • Create abstract model presenting the ideal data
  • Transform messy sources from tabular to graphical form (ETL)
  • Merge sources into a single dataset
  • Further transformation to match data to our ideal data model
  • Use finalized dataset to get answers

2 Our Use Case

2.1 Our Use Case

2.1.1 Nepotism in Hollywood

nepotism /ˈnɛpətɪz(ə)m/

The practice among those with power or influence of favouring relatives or friends, especially by giving them jobs.

Mid 17th century: from French népotisme, from Italian nepotismo, from nipote ‘nephew’ (with reference to privileges bestowed on the ‘nephews’ of popes, who were in many cases their illegitimate sons).

2.2 Simplified IMDB dataset

Our dataset is a simplified version of the public IMDB dataset

2.3 Sources for Semantic Integration: LOD Cloud

2.4 Sources for Semantic Integration: Datahub

2.6 Sources for Semantic Integration: Google Cloud Public Datasets

2.7 Sources for Semantic Integration: DBpedia

3 Analysis

3.1 Analyse and document peculiarities of source data

  • number of records
  • coverage of features (i.e. how many missing features)
  • range of numeric features
  • range of date features
  • repetitive values in text features
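The checks listed above can be sketched in a few lines of Python. This is only an illustration: the sample table and its column names (title, year, country) are made up for the example and are not taken from the workshop dataset. Date columns would need to be parsed and ranged in the same way as the numeric ones.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for one source table.
SAMPLE = """title,year,country
Movie A,2001,USA
Movie B,,United States
Movie C,1999,USA
"""

def profile(rows):
    """Return the record count plus per-column coverage, numeric range and top values."""
    report = {"records": len(rows), "columns": {}}
    if not rows:
        return report
    for col in rows[0].keys():
        values = [r[col] for r in rows if r[col] not in ("", None)]
        numbers = []
        for v in values:
            try:
                numbers.append(float(v))
            except ValueError:
                pass  # not a number, ignore for the range check
        report["columns"][col] = {
            "filled": len(values),
            "missing": len(rows) - len(values),
            "numeric_range": (min(numbers), max(numbers)) if numbers else None,
            "most_common": Counter(values).most_common(3),
        }
    return report

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = profile(rows)
```

Here the report flags one missing year and shows that the country column mixes "USA" and "United States", which is exactly the kind of peculiarity worth documenting before the ETL step.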

4 Auxiliary Resources

4.1 Select existing auxiliary resources relevant to use case

  • FOAF (Friend of a Friend)
  • OWL
  • Movie Ontology (not needed for this POC)
  • Others to consider

5 Perform semantic conversion (ETL procedure)

5.1 ETL Basics

5.2 Normalise values

OntoRefine text facets allow quick bulk-editing of values

United States is normalised to USA in 122 cells
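Outside OntoRefine, the same bulk edit amounts to a lookup table of variant spellings. The sketch below is an assumption about how one might script it, not what OntoRefine does internally; the variants other than "United States" are invented for illustration.

```python
# Variant spellings mapped to one canonical value (variants are illustrative).
CANONICAL = {"United States": "USA", "U.S.A.": "USA", "US": "USA"}

def normalise(values, mapping):
    """Replace known variants and count how many cells changed.

    Surrounding whitespace is trimmed, and a trimmed cell counts as changed.
    """
    changed = 0
    result = []
    for v in values:
        key = v.strip()
        new = mapping.get(key, key)
        if new != v:
            changed += 1
        result.append(new)
    return result, changed

cells, changed = normalise(["USA", "United States", " US "], CANONICAL)
```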

5.3 Create new columns

Split columns according to a separator character
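As a rough equivalent in plain Python, splitting a column could look like the following. The function name, the column names and the comma separator are hypothetical choices for the example.

```python
def split_column(rows, column, separator, new_names):
    """Split `column` on `separator` into the columns listed in `new_names`."""
    out = []
    for row in rows:
        # Split at most len(new_names) - 1 times so extra separators stay in the last part.
        parts = [p.strip() for p in row[column].split(separator, len(new_names) - 1)]
        parts += [""] * (len(new_names) - len(parts))  # pad rows with fewer parts
        new_row = {k: v for k, v in row.items() if k != column}
        new_row.update(zip(new_names, parts))
        out.append(new_row)
    return out

people = split_column(
    [{"name": "Smith, John"}, {"name": "Cher"}], "name", ",", ["family", "given"]
)
```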

5.4 Urlify

Edit the text in the cells

Remove whitespace so that the string can be used in a url/iri
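A minimal sketch of this step in Python, assuming the movieDB namespace from the SPARQL examples as the base IRI. It replaces each run of whitespace with an underscore, which is one common convention; deleting the whitespace outright would fit the description just as well.

```python
import re
from urllib.parse import quote

BASE = "http://www.example.org/movieDB/"  # namespace used in the SPARQL examples

def urlify(text, base=BASE):
    """Turn a cell value into an IRI: collapse whitespace, percent-encode the rest."""
    slug = re.sub(r"\s+", "_", text.strip())
    return base + quote(slug, safe="")

iri = urlify("  The  Godfather ")
```

Percent-encoding also takes care of characters such as slashes or accented letters that would otherwise produce an invalid or ambiguous IRI.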

5.5 Reconcile

Use a reconciliation service to match strings to real world objects.

Bulgaria > https://www.wikidata.org/wiki/Q219
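Conceptually, reconciliation is a lookup from a label to an identifier. The sketch below stands in for a real reconciliation service with a small local table, so it makes no network calls; the table contents and the function name are assumptions. It returns the Wikidata entity IRI (http://www.wikidata.org/entity/...), which is the form normally used in RDF data, rather than the wiki page URL.

```python
WIKIDATA_ENTITY = "http://www.wikidata.org/entity/"

# Tiny stand-in for a reconciliation service: normalised label -> Wikidata ID.
KNOWN = {
    "bulgaria": "Q219",
    "united states of america": "Q30",
    "france": "Q142",
}

def reconcile(label, table=KNOWN):
    """Return the entity IRI for a label, or None when there is no match."""
    key = " ".join(label.split()).casefold()  # trim, collapse spaces, ignore case
    qid = table.get(key)
    return WIKIDATA_ENTITY + qid if qid else None

match = reconcile(" Bulgaria ")
```

A real service also returns candidate matches with scores, so unmatched or ambiguous cells still need manual review.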

5.6 Tabular to Linked Data

Moving from tabular data to linked data

5.7 Tabular to Linked Data

Here is what our cleaned up table looks like…

5.8 Tabular to Linked Data

… but here it is transformed into RDF.
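To make the transformation concrete, here is a small sketch that turns one cleaned row into N-Triples lines. The row layout (movie_id, release_date, actor_ids) is hypothetical; the mdb:starring and mdb:releaseDate properties are the ones used in the SPARQL examples. It assumes the identifiers have already been made IRI-safe.

```python
MDB = "http://www.example.org/movieDB/"

def escape(text):
    """Escape characters that N-Triples string literals do not allow raw."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))

def row_to_ntriples(row):
    """Emit one release-date triple and one starring triple per actor."""
    movie = "<" + MDB + row["movie_id"] + ">"
    date = escape(row["release_date"])
    lines = [movie + " <" + MDB + 'releaseDate> "' + date + '" .']
    for actor_id in row["actor_ids"]:
        lines.append(movie + " <" + MDB + "starring> <" + MDB + actor_id + "> .")
    return lines

triples = row_to_ntriples(
    {"movie_id": "m1", "release_date": "2011-05-06", "actor_ids": ["a1", "a2"]}
)
```

Each table row fans out into several triples: the row itself becomes the subject, and every cell becomes either a literal or a link to another node.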

5.9 Tabular to Linked Data

6 Data Integration

6.1 Initial data model

Output from our ETL procedure

Does this model contain all the data we need?

6.2 Expanding the initial model

Incorporating data from an additional data source.

Can we simplify things?

6.3 Creating a new property

Single symmetric relation to use in a straightforward manner.

Can we simplify things further?

6.4 Creating a second new property

Three relations transformed into a single one.

But we are still working with two disconnected parts.
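The idea of folding several relations into one symmetric relation can be sketched over plain (subject, predicate, object) tuples. All property names below are hypothetical placeholders, since the actual ones are defined in the workshop model; in GraphDB this step would more likely be done with a SPARQL update.

```python
# Hypothetical source properties to be merged, and the hypothetical unified property.
FAMILY_PROPERTIES = {"hasParent", "hasChild", "hasSpouse"}
UNIFIED = "hasRelative"

def unify(triples):
    """Rewrite every family relation as one symmetric relation."""
    out = set()
    for s, p, o in triples:
        if p in FAMILY_PROPERTIES:
            out.add((s, UNIFIED, o))
            out.add((o, UNIFIED, s))  # symmetric: assert both directions
        else:
            out.add((s, p, o))  # leave all other triples untouched
    return out

merged = unify({("a", "hasParent", "b"), ("a", "starring", "m")})
```

The trade-off is that the direction and kind of the relationship are lost, which is acceptable only when the question being asked does not depend on them.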

6.5 Connecting the dataset

Now we have everything we need to ask our question.

What if we want to ask a more complex question?

6.6 Changing the model

At later stages we can rework the model, which will then require corresponding changes to the procedure.

7 Ontotext GraphDB Visualization

7.1 Google Charts

Visualize data with Google Charts in GraphDB

7.2 Visual graph

Highly configurable network visualisation using SPARQL

7.3 Visualize family Relations

7.4 Visualize family Relations 2

7.5 Is there nepotism in Hollywood?

8 Part II: ETL Process with OntoRefine

8.1 Break

Load post-ETL repository:

https://presentations.ontotext.com/movieDB_ETL.trig

Download SPARQL queries for next section:

http://presentations.ontotext.com/queries.zip

9 SPARQL Intro

9.1 What is SPARQL?

  • SQL-like query language for RDF data
  • SPARQL 1.0 only allowed accessing the data (query)
  • SPARQL 1.1 introduced:
    • Query extensions: Aggregates, Subqueries, Negation, …
    • Data management updates: Insert, Delete, Delete/Insert
    • Graph management updates: Create, Load, Clear, Drop, Copy, Move, Add

9.2 What is a SPARQL Query?

Main Idea: Pattern matching

  • Queries describe sub-graphs of the queried graph
  • Graph patterns are RDF graphs specified in Turtle syntax, which contain variables (prefixed by either “?” or “$”)
  • Sub-graphs that match the graph patterns yield a result
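To illustrate the pattern-matching idea, here is a toy matcher over a list of triples. It is a teaching sketch with a made-up three-triple graph, not how a SPARQL engine is implemented; terms are plain strings and anything starting with "?" is treated as a variable.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new

def match_bgp(patterns, graph):
    """Return all variable bindings that satisfy every triple pattern."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in graph:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions

GRAPH = [
    ("mdb:m1", "mdb:starring", "mdb:a1"),
    ("mdb:m1", "mdb:starring", "mdb:a2"),
    ("mdb:m2", "mdb:starring", "mdb:a1"),
]

single = match_bgp([("?movie", "mdb:starring", "?actor")], GRAPH)
shared = match_bgp(
    [("?m1", "mdb:starring", "?a"), ("?m2", "mdb:starring", "?a")], GRAPH
)
```

The single pattern yields one solution per starring triple. In the two-pattern case nothing forces ?m1 and ?m2 to differ, so a movie is also matched with itself; queries of this shape usually need an explicit filter if that is not wanted.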

9.3 Query Types

There are four types of queries in SPARQL

  • ASK – test whether a query pattern has a solution (yes/no)
  • SELECT – returns variables & their bindings
  • CONSTRUCT – returns an RDF graph specified by a graph template
  • DESCRIBE – returns an RDF graph containing (all) triples about one or more resources

9.4 Query Type ASK

Test whether a query pattern has a solution

ASK WHERE {?movie mdb:starring ?actor}

9.5 Query Type SELECT

Returns variables & their bindings

SELECT ?movie ?actor WHERE {?movie mdb:starring ?actor}

9.6 Query Type CONSTRUCT

Returns an RDF graph specified by a graph template

CONSTRUCT {?movie1 mdb:hasCommonActorWith ?movie2}
WHERE {
    ?movie1 mdb:starring ?commonActor .
    ?movie2 mdb:starring ?commonActor .
}

9.7 Query Type DESCRIBE

Returns an RDF graph containing all triples about one or more resources

DESCRIBE ?movie WHERE {?movie mdb:releaseDate "2011-05-06"}

9.8 Components of a SPARQL Query

  • List of namespace definitions
    • PREFIX
  • Query form + variables
    • SELECT, CONSTRUCT, ASK, DESCRIBE
  • List of data sources (optional)
    • FROM, FROM NAMED
  • Query patterns and filters
    • WHERE: subqueries, expressions, BIND, VALUES, FILTER
  • Solution modifiers
    • ORDER BY, LIMIT, OFFSET

9.9 Components of a SPARQL Query Examples

PREFIX mdb: <http://www.example.org/movieDB/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?movie ?actor
FROM <http://www.example.org/movieDB/>
WHERE {
    ?movie mdb:starring ?actor .
} 
ORDER BY ASC(?actor)

9.10 Graph Patterns

  • Basic graph patterns
    • A conjunction of triple patterns
  • Optional graph pattern
    • Specifies optional parts of a pattern (similar to an “outer join” in SQL)
  • Union graph patterns
    • Specifies disjunctions (alternatives)
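The difference between these patterns is easiest to see on the solution sets themselves. The sketch below treats each solution as a dictionary of variable bindings and is a simplification of the SPARQL algebra (no filters); the sample bindings are invented.

```python
def compatible(a, b):
    """Two solutions are compatible when they agree on every shared variable."""
    return all(b[k] == v for k, v in a.items() if k in b)

def optional(left, right):
    """Left outer join: keep every left solution, extended when a match exists."""
    out = []
    for l in left:
        extended = [{**l, **r} for r in right if compatible(l, r)]
        out.extend(extended or [l])  # no match: keep the left solution as is
    return out

def union(left, right):
    """Disjunction: solutions of either pattern, duplicates kept."""
    return left + right

movies = [{"?movie": "m1"}, {"?movie": "m2"}]
dates = [{"?movie": "m1", "?date": "2011-05-06"}]

with_dates = optional(movies, dates)
either = union(movies, dates)
```

With OPTIONAL, the second movie still appears even though it has no date; a plain conjunction would have dropped it.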

9.11 SPARQL 1.1 Query Extensions

  • Aggregates
  • Sub-queries
  • Negation (MINUS, NOT EXISTS)
  • Expressions in the SELECT clause
  • Property Paths
  • Assignment (BIND)
  • Enumeration (VALUES)
  • A short form for CONSTRUCT
  • An expanded set of functions and operators

9.12 Building a Semantic POC - Steps

  1. Start with a question we want answered
  2. Have some data that partially answers the question
  3. Piece of the picture is missing but we can find it in LOD
  4. Create abstract model presenting the ideal data
  5. Transform sources from tabular to graphical form (ETL)
  6. Merge sources into a single dataset
  7. Further transformation to match data to our ideal data model

10 Part III: Transformation and Visualization with GraphDB

11 Part IV: Demonstrators

11.1 TAG

Named Entity Recognition and Linking

11.2 NOW

News on the Web Demonstrator

11.3 FactForge

Linked Data Hub

11.4 Questions