Linked Open Data
John O’Gorman (john@og.co.nz)
10 May 2015
1 Intro
The internet was created in the early 1970s, and Tim Berners-Lee invented the World Wide Web in 1990, linking hypertext pages over the internet. The hypertext pages depended on HTML (HyperText Markup Language) and the Hypertext Transfer Protocol (HTTP). Browsers came very soon after (Mosaic, then later Firefox, Safari, Chrome, and others), allowing users access to linked pages containing text content, images, and links to other pages.
Now Berners-Lee has proposed extending the Web to include Linked Open Data. Already projects such as Wikipedia have been embedding structured data into their web pages as infoboxes, and that data can be published using a W3C data model called RDF (Resource Description Framework).
The Linked Open Data (LOD) project makes the World Wide Web into a global database where literally any knowledge can be shared and combined.
Linked Data is a set of techniques for publishing and accessing data using standard formats and interfaces defined by the World Wide Web Consortium (W3C), such as OWL, RDF, RDFS, and RDFa.
The original HTML standard has now been extended to a new standard called HTML5, which can carry RDF data embedded as RDFa and can also be written in an XML syntax (XHTML). Current browsers are being enhanced to support HTML5.
To test your browser, point it to https://html5test.com. It will give your browser a score (e.g. 396 out of 555 points) indicating which specs it can and cannot support. The HTML5 family of specifications is still evolving, so browsers cannot be expected to pass the test with a 100% score. HTML5 tag names are conventionally written in lower case, even where older HTML pages used upper case, e.g. <br /> instead of <BR>.
A further problem is that the XML syntax of HTML5 (XHTML) is stricter than the SGML-based syntax of earlier HTML. Many older web pages will not conform to it, and XML parsers will struggle to cope with tags written in the more permissive HTML style, e.g.:
- <br> instead of <br />
- <p> before paragraphs instead of <p> ... </p> surrounding a paragraph
The new HTML5 standard also specifies the DOM (Document Object Model), which together with the JavaScript language allows web pages to be animated.
To view the HTML page source, do the following:
- Firefox: Right click on the page and choose View Page Source.
- Safari: Enable “Show Develop menu in menu bar” in the Advanced tab of the Safari Preferences, then use Develop -> Show Page Source.
1.2 The Basics
- Everything that exists is defined as an entity. The whole system of entities is called an ontology.
- Knowledge is represented by triples (subject, predicate, object), where the predicate identifies a relationship between the subject and the object.
- The triples and the ontologies can be stored in XML (eXtensible Markup Language) in a format called RDF (Resource Description Framework), and free tools are available to display them in browsers.
- In order for the data to be linked, the subjects and predicates must be in the form of URIs (Uniform Resource Identifiers), while the objects may be quoted strings or URIs.
- The RDF family of standards can be embedded in pages written to the new HTML5 standard (for example as RDFa), which current browsers are beginning to support.
The Web Ontology Language was an early W3C standard. It is whimsically named OWL after the Winnie the Pooh character Owl, who misspelled his name as WOL. Ontology (from the Greek word ontos, meaning being) is borrowed from metaphysics and means the formal definition of the types, properties, and relationships of entities.
RDFS (RDF Schema) is now the favoured standard for defining vocabularies of entities used in Linked Data and is compatible with pre-existing OWL constructs.
Resource Description Framework (RDF) is a W3C specification for describing entities as triples: subject, predicate, object.
A single RDF statement describes two things and a relationship between them. Technically this is called an Entity-Attribute-Value model, but Linked Data people usually call the three elements the subject, the predicate, and the object. For example, cats eat mice is a triple where cats is the subject, eat is the predicate, and mice is the object. The simplest representation of this triple, written informally in the style of Turtle, is:
"cats" "eat" "mice" .
or, schematically, in RDF/XML syntax (namespace declarations omitted):
<rdf:Description rdf:about="cats">
  <eat rdf:resource="mice" />
</rdf:Description>
The example above shows data but not Linked Data. In order for it to be linked, cats and mice need to be stored and retrieved as URIs, and eat needs to be a term with a standard, published definition. So a more realistic Turtle representation might be:
@prefix dbpedia: <http://dbpedia.org/resource/> .
dbpedia:cats dbpedia:eat dbpedia:mice .
assuming that cats, eat, and mice have been defined in DBpedia. If they have, then you can trace linkages from cats and mice to felines, rodents, mammals, etc. and link into the DBpedia world of knowledge.
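The idea of tracing linkages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the triples below, including the subClassOf predicate and the felines, rodents, and mammals entries, are invented for the example and are not taken from DBpedia.

```python
# Each triple is a (subject, predicate, object) tuple.
# The data is invented for illustration; it is not real DBpedia content.
TRIPLES = [
    ("cats", "eat", "mice"),
    ("cats", "subClassOf", "felines"),
    ("felines", "subClassOf", "mammals"),
    ("mice", "subClassOf", "rodents"),
    ("rodents", "subClassOf", "mammals"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def ancestors(subject, predicate="subClassOf", triples=TRIPLES):
    """Follow `predicate` links repeatedly and collect everything reached."""
    found = []
    queue = [subject]
    while queue:
        current = queue.pop(0)
        for parent in objects(current, predicate, triples):
            if parent not in found:
                found.append(parent)
                queue.append(parent)
    return found


print(objects("cats", "eat"))
print(ancestors("cats"))
print(ancestors("mice"))
```

Because every statement has the same three-part shape, following a chain of links needs no knowledge of what the data is about.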
1.3 DBpedia
The Wikipedia project is a free collaborative encyclopedia with over 4 million entries in its English edition and has been going since 2001. It has become the custom to put structured data entries called infoboxes at or near the top of each entry. DBpedia is also a free collaborative enterprise, based at the University of Leipzig, which extracts what it can from the infoboxes and builds a linked open database from them. DBpedia has been available since 2007, and Tim Berners-Lee has described it as one of the more famous parts of the Linked Data effort.
1.4 Tools for finding LOD
There are some sites which offer access to Linked Open Data and provide tools for gathering and using the information available.
Point your browser to any of the following:
| Site | URL | Comment |
| LOD Cloud | | Linked Open Data Cloud |
| DBpedia | | Extracts from Wikipedia |
| Sindice | | Italian for Semantic Index (pronounced sin-dee-chey) |
| SameAs.org | | Identifies equivalent URIs |
| Data Hub | | Community run catalogue |
1.5 History
The BBC faced the challenge of producing web pages for 1500 television and radio programmes with only a handful of staff. They also needed to publish web content for every band and the songs it records, updated each day, and web pages for each animal species and its habitat, information that the organisation did not itself hold. They met this challenge, during a period of staff cuts, using Linked Data.
The BBC collects, filters, and reuses Linked Data from various sources, including the World Wildlife Fund, MusicBrainz, and the DBpedia project.
Wikipedia embeds data in its web pages in tables called infoboxes, usually at the top right of the page. This data can be accessed as Linked Data from http://dbpedia.org, whence it can be used by others.
2 RDF
Resource Description Framework (RDF) is a W3C specification for describing entities as triples: subject, predicate, object.
A single RDF statement describes two things and a relationship between them. Technically this is called an Entity-Attribute-Value model, but Linked Data people usually call the three elements the subject, the predicate, and the object. For example, consider the following Turtle representation:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://en.wikipedia.org/wiki/Tony_Benn>
    dc:publisher "Wikipedia" .
<http://en.wikipedia.org/wiki/Tony_Benn>
    dc:title "Tony Benn" .
<http://en.wikipedia.org/wiki/Tony_Benn>
    foaf:primaryTopic [
        a foaf:Person ;
        foaf:name "Tony Benn"
    ] .
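The above Turtle listing makes three statements about the same subject (the third has as its object an unnamed node, which is itself described as a foaf:Person with a foaf:name):
| Subject | Predicate | Object |
| <http://en.wikipedia.org/wiki/Tony_Benn> | publisher | "Wikipedia" |
| <http://en.wikipedia.org/wiki/Tony_Benn> | title | "Tony Benn" |
| <http://en.wikipedia.org/wiki/Tony_Benn> | primaryTopic | a Person whose name is "Tony Benn" |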
Turtle allows the redundant repetition of the subject to be eliminated by using semicolons between statements that share a subject:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://en.wikipedia.org/wiki/Tony_Benn>
    dc:publisher "Wikipedia" ;
    dc:title "Tony Benn" ;
    foaf:primaryTopic [
        a foaf:Person ;
        foaf:name "Tony Benn"
    ] .
Similarly, where several triples share the same subject and predicate, you can separate the objects with commas.
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
dbpedia:Bonobo
    rdf:type dbpedia-owl:Eukaryote , dbpedia-owl:Mammal , dbpedia-owl:Animal .
The @prefix statements provide a means of reducing verbiage. For example, the dc prefix allows you to write dc: instead of the full URI.
The above format is called Turtle and is the easiest and most readable format of RDF. Each Turtle RDF statement is terminated by a full stop.
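What the semicolon and comma abbreviations stand for can be sketched in Python. This is a minimal illustration, not a Turtle parser: the expand function and the dictionary layout are assumptions made for the example, and the prefixed names are kept as plain strings.

```python
def expand(subject, predicate_objects):
    """Expand a {predicate: [objects]} mapping into explicit triples.

    This mirrors what the ';' (same subject) and ',' (same subject and
    predicate) abbreviations stand for in Turtle.
    """
    triples = []
    for predicate, objs in predicate_objects.items():
        for obj in objs:
            triples.append((subject, predicate, obj))
    return triples


bonobo = expand(
    "dbpedia:Bonobo",
    {
        "rdf:type": [
            "dbpedia-owl:Eukaryote",
            "dbpedia-owl:Mammal",
            "dbpedia-owl:Animal",
        ]
    },
)
for triple in bonobo:
    print(triple)
```

The one abbreviated Bonobo statement therefore carries three separate triples, each with the same subject and predicate.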
The use of RDF should keep in mind four principles:
- Use URIs as names for things.
- Use HTTP URIs so that people can look up those names.
- When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
- Use links to other URIs, so people can discover more things.
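Looking up a name over HTTP, as the second and third principles describe, might be sketched as follows using only the Python standard library. The resource URI is built from the dbpedia: prefix used earlier; that the server will answer a request for text/turtle with RDF is an assumption, and the function names are made up for the example. The sketch only prepares the request, so it runs without network access.

```python
from urllib.request import Request, urlopen


def build_lookup(uri, media_type="text/turtle"):
    """Prepare an HTTP GET for `uri` that asks for RDF rather than HTML."""
    return Request(uri, headers={"Accept": media_type})


def look_up(uri):
    """Fetch the description of `uri` (needs network access; not called here)."""
    with urlopen(build_lookup(uri)) as response:
        return response.read().decode("utf-8")


request = build_lookup("http://dbpedia.org/resource/Bonobo")
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

The Accept header is how a client states which representation of the named thing it would like to receive.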
Uniform Resource Identifiers (URIs) used to name things in Linked Data are a generalised version of the Uniform Resource Locators (URLs) used to locate web pages.
URIs are often long and difficult to read in triples. So a shorthand is allowed in RDF whereby you can declare a short prefix to be equivalent to a long URI. When you refer to the URI within the document, you can substitute the prefix.
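The substitution of a prefix for a long URI can be sketched in Python as below. The PREFIXES table holds just two of the namespaces used in this document, and expand_name is a hypothetical helper written for the example.

```python
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand_name(name, prefixes=PREFIXES):
    """Turn a prefixed name such as 'dc:title' into a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {name!r}")
    return prefixes[prefix] + local


print(expand_name("dc:title"))
print(expand_name("foaf:name"))
```

Note that the expansion is plain string concatenation, which is why the trailing / or # of a namespace URI matters.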
The following prefixes are in common use:
| Prefix | URI | Name | Describes |
| air: | http://www.daml.org/2001/10/html/airport-on | Airport Ontology | Nearest Airport |
| bibo: | http://purl.org/ontology/bibo/ | BIBO | Bibliographies |
| bio: | http://purl.org/vocab/bio/0.1/ | Bio | Biographical info |
| cc: | http://creativecommons.org/ns# | CC rights expression | Software Licences |
| doap: | http://usefulinc.com/ns/doap# | DOAP (Description of a Project) | Projects |
| dc: | http://purl.org/dc/elements/1.1/ | Dublin Core Elements | Publications |
| dct: | http://purl.org/dc/terms/ | Dublin Core Terms | Publications |
| foaf: | http://xmlns.com/foaf/0.1/ | FOAF (Friend of a Friend) | People |
| pos: | http://www.w3.org/2003/01/geo/wgs84_pos# | Geo | Positions |
| gn: | http://www.geonames.org/ontology# | GeoNames | Locations |
| gr: | http://purl.org/goodrelations/v1# | Good Relations | Products |
| ore: | http://www.openarchives.org/ore/terms/ | Object Reuse and Exchange | Resource Maps |
| rdf: | http://www.w3.org/1999/02/22-rdf-syntax-ns# | RDF | Core Framework |
| rdfs: | http://www.w3.org/2000/01/rdf-schema# | RDFS | RDF Vocabularies |
| sioc: | http://rdfs.org/sioc/ns# | SIOC | Online Communities |
| skos: | http://www.w3.org/2004/02/skos/core# | SKOS | Controlled vocabularies |
| vcard: | http://www.w3.org/2006/vcard/ns# | vCard | Business Cards |
| void: | http://rdfs.org/ns/void# | VoID | Vocabularies |
| owl: | http://www.w3.org/2002/07/owl# | Web Ontology Language | Ontologies |
| wn: | http://xmlns.com/wordnet/1.6/ | WordNet | English Words |
| xsd: | http://www.w3.org/2001/XMLSchema# | XML Schema Datatypes | Data Types |
2.2 RDF Formats
Files containing RDF statements originally were stored in a variant of XML. But more recently other formats have arisen and been standardised, in particular Turtle.
- Turtle (Terse RDF Triple Language): human-readable RDF with filename suffix .ttl or .n3
- RDF/XML: the original RDF format, in XML
- RDFa: RDF embedded in HTML attributes
- JSON-LD (JavaScript Object Notation for Linked Data): a newer format aimed at web developers
The above formats can be used for creating, storing, and translating Linked Data.
Turtle (Terse RDF Triple Language) is a format for expressing data in the Resource Description Framework (RDF) data model, with a syntax similar to that of SPARQL. RDF, in turn, represents information using "triples", each of which consists of a subject, a predicate, and an object. Each of those items can be expressed as a Web URI, although an object may instead be a literal value such as a quoted string.
Turtle provides a way to group three URIs to make a triple, and provides ways to abbreviate such information, for example by factoring out common portions of URIs. For example:
<http://example.org/person/Mark_Twain>
<http://example.org/relation/author>
<http://example.org/books/Huckleberry_Finn> .
Turtle is an alternative to RDF/XML, which was originally the only standard syntax for writing RDF. Unlike RDF/XML, Turtle does not rely on XML and is generally recognized as being more readable and easier to edit manually than its XML counterpart.
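Factoring out the common portions of URIs, as Turtle does with @prefix, can be sketched in Python as follows. The prefix names person, rel, and book are hypothetical choices for the example.org URIs of the Mark Twain triple, and abbreviate is an illustrative helper rather than part of any RDF library.

```python
# Hypothetical prefixes for the example.org URIs used in the triple above.
PREFIXES = {
    "person": "http://example.org/person/",
    "rel": "http://example.org/relation/",
    "book": "http://example.org/books/",
}

TRIPLE = (
    "http://example.org/person/Mark_Twain",
    "http://example.org/relation/author",
    "http://example.org/books/Huckleberry_Finn",
)


def abbreviate(uri, prefixes=PREFIXES):
    """Return 'prefix:local' for the longest matching namespace, else '<uri>'."""
    best = None
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            if best is None or len(namespace) > len(prefixes[best]):
                best = prefix
    if best is None:
        return f"<{uri}>"
    return f"{best}:{uri[len(prefixes[best]):]}"


# Emit the prefix declarations followed by the abbreviated statement.
for prefix, namespace in PREFIXES.items():
    print(f"@prefix {prefix}: <{namespace}> .")
statement = " ".join(abbreviate(uri) for uri in TRIPLE) + " ."
print(statement)
```

A URI that matches no declared prefix is left in its full angle-bracket form, which Turtle also accepts.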
RDF is a standard model for data interchange on the Web. RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed. RDF extends the linking structure of the Web to use URIs to name the relationship between things as well as the two ends of the link (this is usually referred to as a “triple”). Using this simple model, it allows structured and semi-structured data to be mixed, exposed, and shared across different applications. This linking structure forms a directed, labeled graph, where the edges represent the named link between two resources, represented by the graph nodes. This graph view is the easiest possible mental model for RDF and is often used in easy-to-understand visual explanations.
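The graph view can be made concrete with a short Python sketch that turns triples into a directed, labeled graph. The first triple is the Mark Twain example from above; the second is a hypothetical statement added only so that the node has two outgoing edges.

```python
from collections import defaultdict

TRIPLES = [
    ("http://example.org/person/Mark_Twain",
     "http://example.org/relation/author",
     "http://example.org/books/Huckleberry_Finn"),
    # Hypothetical second statement, added only to give the node two edges.
    ("http://example.org/person/Mark_Twain",
     "http://example.org/relation/author",
     "http://example.org/books/Tom_Sawyer"),
]


def build_graph(triples):
    """Map each subject node to a list of (edge label, target node) pairs."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


graph = build_graph(TRIPLES)
for node, edges in graph.items():
    for label, target in edges:
        print(f"{node} --[{label}]--> {target}")
```

Merging data from two sources then amounts to concatenating their triple lists before building the graph.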
RDFa (Resource Description Framework in Attributes) is a W3C Recommendation that adds a set of attribute-level extensions to HTML, XHTML and various XML-based document types for embedding rich metadata within Web documents. The RDF data-model mapping enables its use for embedding RDF subject-predicate-object expressions within XHTML documents. It also enables the extraction of RDF model triples by compliant user agents.
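How a user agent might pull triples out of attributes can be sketched with Python's standard html.parser module. This is a deliberately simplified illustration and not a conforming RDFa processor: it looks only at the about and property attributes, and the HTML fragment is hypothetical markup written for the example.

```python
from html.parser import HTMLParser


class TinyRDFaParser(HTMLParser):
    """Collect (subject, property, text) from `about` and `property` attributes.

    Deliberately simplified: real RDFa processing also handles prefixes,
    `resource`, `typeof`, nesting rules, and much more.
    """

    def __init__(self):
        super().__init__()
        self.subject = None   # most recent `about` value seen
        self.pending = None   # property waiting for its text content
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self.subject = attrs["about"]
        if "property" in attrs:
            self.pending = attrs["property"]

    def handle_data(self, data):
        if self.pending and data.strip():
            self.triples.append((self.subject, self.pending, data.strip()))
            self.pending = None


# Hypothetical page fragment, written for this example.
PAGE = """
<div about="http://en.wikipedia.org/wiki/Tony_Benn">
  <span property="dc:title">Tony Benn</span>
  <span property="dc:publisher">Wikipedia</span>
</div>
"""

parser = TinyRDFaParser()
parser.feed(PAGE)
for triple in parser.triples:
    print(triple)
```

The same page remains ordinary HTML for a browser; the attributes are simply ignored by software that does not understand them.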
JavaScript Object Notation for Linked Data (JSON-LD) is a method of transporting Linked Data using JSON. A design goal was to require as little effort as possible from developers to transform their existing JSON to JSON-LD, so data can be serialized in a way that is similar to traditional JSON. It is a World Wide Web Consortium Recommendation that was developed by the JSON for Linking Data Community Group before being transferred to the RDF Working Group for review, improvement, and standardization.
JSON-LD is designed around the concept of a "context" to provide additional mappings from JSON to an RDF model. The context links object properties in a JSON document to concepts in an ontology. In order to map the JSON-LD syntax to RDF, JSON-LD allows values to be coerced to a specified type or to be tagged with a language. A context can be embedded directly in a JSON-LD document or put into a separate file and referenced from different documents (from traditional JSON documents via an HTTP Link header).
Sample:
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": {
      "@id": "http://xmlns.com/foaf/0.1/workplaceHomepage",
      "@type": "@id"
    },
    "Person": "http://xmlns.com/foaf/0.1/Person"
  },
  "@id": "http://me.markus-lanthaler.com",
  "@type": "Person",
  "name": "Markus Lanthaler",
  "homepage": "http://www.tugraz.at/"
}
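The role of the context can be sketched in Python by using it to turn the sample above into triples. This is a rough illustration that handles only a flat document with an embedded context; it is not the JSON-LD expansion algorithm, and to_triples is a hypothetical helper.

```python
import json

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

SAMPLE = """
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": {
      "@id": "http://xmlns.com/foaf/0.1/workplaceHomepage",
      "@type": "@id"
    },
    "Person": "http://xmlns.com/foaf/0.1/Person"
  },
  "@id": "http://me.markus-lanthaler.com",
  "@type": "Person",
  "name": "Markus Lanthaler",
  "homepage": "http://www.tugraz.at/"
}
"""


def to_triples(doc):
    """Map a flat JSON-LD object to triples using its embedded @context."""
    context = doc["@context"]

    def term_uri(term):
        definition = context[term]
        # A term maps either straight to a URI or to a dict holding "@id".
        return definition["@id"] if isinstance(definition, dict) else definition

    subject = doc["@id"]
    triples = []
    if "@type" in doc:
        triples.append((subject, RDF_TYPE, term_uri(doc["@type"])))
    for key, value in doc.items():
        if not key.startswith("@"):
            triples.append((subject, term_uri(key), value))
    return triples


triples = to_triples(json.loads(SAMPLE))
for triple in triples:
    print(triple)
```

The short keys name and homepage stay convenient for JSON developers, while the context supplies the full URIs that make the data linkable.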
3 Using Linked Data
The project Linking Open Data (LOD pronounced ell-oh-dee) is an ambitious project whose aim is to make data available to everyone.
4 DBpedia
DBpedia is a community project devoted to extracting structured data embedded in Wikipedia pages and publishing it as RDF. It was originally implemented in PHP but has been redeveloped in Scala, an object-functional language that compiles to Java bytecode.
4.1 Prerequisites
- Java JDK (version 1.7 or later).
- Scala for development programming. Scala is an elegant and concise programming language that integrates both the functional and object-oriented paradigms. Scala runs on the JVM, so Java and Scala stacks can be mixed for seamless integration.
- Maven for project management.
5 FOAF
Friend of a Friend (FOAF) is a project devoted to linking people and information using the Web. Regardless of whether information is in people’s heads, in physical or digital documents, or in the form of factual data, it can be linked.
6 SPARQL
SPARQL is a recursive acronym for SPARQL Protocol And RDF Query Language. The intention is that it can be used with Linked Data analogously to SQL with relational databases.
6.1 The ARQ tool
export ARQROOT='/Applications/ARQ-2.8.8'
$ARQROOT/bin/arq -h
SPARQL queries can also be run remotely against endpoints, e.g. if you point your browser to http://dbpedia.org/sparql you will get an HTML query form which will accept a SPARQL query.
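A program can talk to the same endpoint without the HTML form. The sketch below, using only the Python standard library, prepares such a request: the query parameter name comes from the SPARQL protocol, while the sample query, the requested JSON results format, and the function name are assumptions made for the example. It does not contact the server unless the commented lines are enabled.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://dbpedia.org/sparql"


def build_query_request(query, endpoint=ENDPOINT):
    """Prepare a GET request that sends `query` to a SPARQL endpoint."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for results as JSON; whether a given endpoint honours this
    # media type is an assumption of this sketch.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_query_request("SELECT * WHERE { ?s ?p ?o } LIMIT 5")
print(request.full_url)

# To actually run the query (network access needed):
#   from urllib.request import urlopen
#   print(urlopen(request).read().decode("utf-8"))
```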
6.2 Sample SPARQL Query
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
  ?person a foaf:Person .
  ?person foaf:name ?name .
  ?person foaf:mbox ?email .
}
The ? prefix is used for variables which will be instantiated when the SPARQL query runs.
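How the variables get instantiated can be illustrated with a small Python sketch that evaluates the three patterns of the sample query against an in-memory list of triples. The data (ex:alice, ex:bob, and their details) is invented, and the matching code is a toy illustration of the idea rather than how any real SPARQL engine is implemented.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Invented sample data; the people and addresses are hypothetical.
DATA = [
    ("ex:alice", RDF_TYPE, FOAF + "Person"),
    ("ex:alice", FOAF + "name", "Alice"),
    ("ex:alice", FOAF + "mbox", "mailto:alice@example.org"),
    ("ex:bob", RDF_TYPE, FOAF + "Person"),
    ("ex:bob", FOAF + "name", "Bob"),  # no mbox, so Bob will not match
]

# The WHERE clause of the sample query, one tuple per triple pattern.
PATTERNS = [
    ("?person", RDF_TYPE, FOAF + "Person"),
    ("?person", FOAF + "name", "?name"),
    ("?person", FOAF + "mbox", "?email"),
]


def match(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` equals `triple`."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def select(patterns, data):
    """Return every variable binding that satisfies all the patterns."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in data:
                result = match(pattern, triple, binding)
                if result is not None:
                    extended.append(result)
        bindings = extended
    return bindings


rows = select(PATTERNS, DATA)
for row in rows:
    print(row["?name"], row["?email"])
```

Only people for whom all three patterns can be satisfied with the same value of ?person appear in the result, which is why the entry without a mailbox is dropped.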
7 Sources
8 Callimachus
Callimachus is a Linked Data management system. It is named after Callimachus of Cyrene who was an ancient Greek researcher in the library of Alexandria and first demonstrated the need for graph data structures in his attempts at classifying books. Download Callimachus from
http://callimachusproject.org
Callimachus 1.2 requires Java Development Kit (JDK) 1.7 or later. JRE is not sufficient. Unfortunately it does not run with JDK version 8 (the current version).
9 Virtuoso