Linked Data Fragments

Linked Data means connecting individual pieces of data on the Web, so that automated clients can interpret them more easily. Servers can offer access to such data through different standardized and non-standardized interfaces, the properties of which profoundly influence the characteristics of clients and servers during interactions. This document defines Linked Data Fragments, a uniform view on all possible interfaces to publish Linked Data. This view allows us to analyze the properties of existing interfaces, and to define new interfaces with different combinations of properties. Additionally, this document explains how existing interfaces fit into this uniform view.

Introduction

Interfaces to Linked Data

What Linked Data is

A gigantic amount of digital information exists, and new documents are created every day. Most of them are written in natural languages, which machines cannot fully interpret yet. And even if a document contains machine-interpretable information, the appropriate context is often missing. For instance, what do thousands of numbers in a comma-separated file mean?

Machines prefer structured data using unambiguous identifiers. Linked Data [[LINKED-DATA]] combines both to make it easier for machines to process and integrate data from different sources. URLs, the unambiguous identifiers of the Web, not only identify a resource but also allow clients to retrieve a representation of it. Machine-interpretable structured data is possible using the triple-based model of RDF [[RDF11-CONCEPTS]].

All RDF triples have a subject, predicate, and object, and in the case of Linked Data, these components are dereferenceable URLs.
For example, the following triple expresses that Walt Disney is a person:

<http://dbpedia.org/resource/Walt_Disney>
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
        <http://xmlns.com/foaf/0.1/Person>.

Linked Data is thus linked on two levels: on one level, we link “Walt Disney” and “Person” together with the “type” relation; on another level, each of those three components is a link toward more information. This combination of structure and URLs is the essence of Linked Data: if you don't know what http://dbpedia.org/resource/Walt_Disney or http://xmlns.com/foaf/0.1/Person mean, you can look up information about those topics through their URL.

You can convey Linked Data in the RDF model through various concrete forms:

JSON-LD [[JSON-LD]] expresses Linked Data in the widely used JSON format.
RDF triple formats such as Turtle [[TURTLE]] use a triple-based syntax.
RDF triples can be embedded in HTML through RDFa [[HTML-RDFA]].

How Linked Data can be accessed

The most straightforward way to access Linked Data is to follow the URL of a Linked Data document. In other words, we use the HTTP protocol [[RFC7230]] to retrieve a representation of the resource identified by that URL. This process is called dereferencing. For example, you can copy and paste the URL http://dbpedia.org/resource/Walt_Disney into your browser, which will lead to an HTML document with triples in RDFa. Automated clients might ask for other representations of this resource, for instance, in JSON-LD or Turtle.
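
As a minimal sketch of how an automated client might ask for a Turtle representation, the following Python fragment prepares (but does not send) an HTTP request with an Accept header. The function name is hypothetical, and whether a given server honors the requested media type is an assumption that has to be checked per server.

```python
from urllib.request import Request


def build_dereference_request(url: str, media_type: str = "text/turtle") -> Request:
    """Prepare an HTTP GET request that asks for a specific RDF representation.

    The request is only constructed here; passing it to
    urllib.request.urlopen would actually send it.
    """
    return Request(url, headers={"Accept": media_type}, method="GET")


# Ask for Turtle; "application/ld+json" would ask for JSON-LD instead.
request = build_dereference_request("http://dbpedia.org/resource/Walt_Disney")
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

The same URL thus identifies one resource, while the Accept header selects which representation of it the client prefers.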

However, such an interface based on Linked Data documents and dereferencing has its limitations. For example, while the URL http://xmlns.com/foaf/0.1/Person describes the notion of “a person”, it does not give access to a list of all persons. This would in fact be impossible: the Web server at xmlns.com is not supposed to know which resources from dbpedia.org use this type. The alternative, scanning all documents on dbpedia.org to extract this information, would be highly impractical. Therefore, if we want to retrieve the members of this list efficiently, we need another interface.

An alternative interface is a data dump, which is a typically large file that contains all triples from a certain dataset. Using a data dump of dbpedia.org, we could find the list of all people. Unfortunately, this would involve downloading a lot of information, even though we are only interested in a small fraction. SPARQL endpoints [[SPARQL11-PROTOCOL]] offer an interface that allows clients to select data at a much finer granularity. This is more convenient for clients, but individual requests are considerably more expensive for servers.

The above indicates that each type of interface to Linked Data comes with its own characteristics, which can lead to advantages or disadvantages in particular situations.

Aim, scope, and intended audience

The goal of Linked Data Fragments is to provide a uniform view on all possible interfaces to Linked Data. Thereby, we want to provide a conceptual framework to characterize all Linked Data interfaces in order to enable qualitative and quantitative comparisons. Furthermore, we want to stimulate the development of new kinds of interfaces that address the current and emerging needs of the Semantic Web.

This document defines Linked Data Fragments, and specifies what clients and servers of Linked Data Fragments are. It does not redefine existing interfaces or introduce new ones. Instead, it explains how these interfaces can be seen from the Linked Data Fragments perspective.

If you want to analyze existing Linked Data interfaces or define new interfaces, we encourage you to read this document. If instead you want to implement one of the discussed interfaces, the individual specifications (which are linked from this document) will serve you better.

Document conventions

We write triples in this document in the Turtle RDF syntax [[!TURTLE]] using the following namespace prefixes:

PREFIX rdf:         <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:        <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia:     <http://dbpedia.org/resource/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX foaf:        <http://xmlns.com/foaf/0.1/>

Concepts

Datasets

Any piece of data always occurs in a certain context; it never stands on its own. Unsurprisingly, this also applies to data structured as RDF triples [[!RDF11-CONCEPTS]]. In order to refer to collections of RDF triples, we introduce the following definition, derived from the VoID Vocabulary [[VOID]]:

A Linked Data dataset is a collection of RDF triples that are published, maintained or aggregated by a single provider.

Selectors

Often, we are interested in specific parts of a dataset. Such parts can be of any size, ranging from an empty part to the whole dataset. To be able to define what a specific part looks like, we introduce the following concept.

A selector is a boolean function that decides whether or not a certain triple (or graph of triples) belongs to a part of a dataset.

All of the following are examples of selectors:

The triple pattern dbpedia:Germany ?predicate ?object.
The graph pattern { dbpedia:Germany rdf:type ?class. ?class rdfs:subClassOf ?other. }.
The function f(triple) = true (universal selector, matches the entire dataset).
The function f(triple) = false (empty selector, always results in an empty part).
The criterion “true if the triple describes a country; false otherwise”.
A specific SPARQL query.

Basically, every criterion that unambiguously leads to part of a dataset acts as a selector.
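
The notion of a selector as a boolean function can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: triples are modeled as tuples of prefixed-name strings, None stands in for a variable, and the three sample triples are made up for the example rather than taken from DBpedia.

```python
from typing import Callable, List, Optional, Tuple

Triple = Tuple[str, str, str]
Selector = Callable[[Triple], bool]


def triple_pattern(subject: Optional[str] = None,
                   predicate: Optional[str] = None,
                   obj: Optional[str] = None) -> Selector:
    """Build a selector from a triple pattern; None acts as a variable."""
    pattern = (subject, predicate, obj)

    def selector(triple: Triple) -> bool:
        # Every constant component must equal the triple's component.
        return all(p is None or p == v for p, v in zip(pattern, triple))

    return selector


def universal(triple: Triple) -> bool:
    """Universal selector: matches the entire dataset."""
    return True


def empty(triple: Triple) -> bool:
    """Empty selector: always results in an empty part."""
    return False


def select(dataset: List[Triple], selector: Selector) -> List[Triple]:
    """Return the part of the dataset that the selector accepts."""
    return [triple for triple in dataset if selector(triple)]


# Illustrative sample data (an assumption of this sketch).
dataset = [
    ("dbpedia:Germany", "rdf:type", "dbpedia-owl:Country"),
    ("dbpedia:Germany", "rdfs:label", '"Germany"@en'),
    ("dbpedia:Berlin", "dbpedia-owl:country", "dbpedia:Germany"),
]

about_germany = triple_pattern(subject="dbpedia:Germany")
print(select(dataset, about_germany))
```

Selectors that take a whole graph rather than a single triple, such as graph patterns or SPARQL queries, do not fit this per-triple signature and would need a function over sets of triples instead.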

Some selectors are more closely related than others. For instance, a group of selectors might have a similar structure or computational complexity. The following definition allows us to talk about them collectively.

A selector type is a class of selectors with similar structural characteristics.

Metadata

Apart from the triples that describe data in a (part of a) dataset, some triples capture data about it. They do not belong to the dataset as such, but they can nonetheless be helpful to understand properties of this dataset.

Metadata of a dataset, or a part thereof, consists of RDF triples that describe data about that dataset or part, but that do not belong to the dataset itself.

Hypermedia controls

Pieces of data and information on the Web can be connected to each other. This is because the Web is filled with hypermedia controls: most HTML pages contain several hyperlinks, and some pages also contain forms with text fields and buttons. HTML is not the only format with hypermedia support; specific RDF vocabularies can be used to express hypermedia controls as well. Regardless of format, what all hypermedia controls on the Web have in common is that they somehow lead to a URL a client can visit. The following definition generalizes this notion.

A hypermedia control is a function that generates an IRI [[!RFC3987]] based on zero or more arguments. In particular, a hyperlink is a zero-argument function (i.e., an IRI), and a hypermedia form is a multi-argument function.
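
The functional view of hypermedia controls can be made concrete with a small Python sketch. The base IRI and the form field names (subject, predicate, object) are hypothetical and chosen only for illustration; real interfaces declare their own.

```python
from urllib.parse import urlencode

# Hypothetical base IRI of some interface (an assumption of this sketch).
BASE_IRI = "http://example.org/fragments"


def hyperlink() -> str:
    """A hyperlink: a zero-argument function, i.e., simply an IRI."""
    return BASE_IRI


def triple_pattern_form(subject: str = "", predicate: str = "", obj: str = "") -> str:
    """A hypermedia form: a multi-argument function that generates an IRI."""
    fields = {"subject": subject, "predicate": predicate, "object": obj}
    # Only filled-in fields end up in the query string.
    filled = {name: value for name, value in fields.items() if value}
    return BASE_IRI + ("?" + urlencode(filled) if filled else "")


print(hyperlink())
print(triple_pattern_form(subject="http://dbpedia.org/resource/Germany"))
```

In both cases the client ends up with an IRI it can visit; the form merely lets the client supply arguments first.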

Linked Data Fragments

The read aspect of each interface to Linked Data is characterized by its possible set of responses. We therefore introduce a concept to capture such responses.

A Linked Data Fragment of a Linked Data dataset is a set of RDF triples that consists of three parts:

data: all triples of this dataset that match a specific selector;
metadata: triples that describe the dataset and/or the Linked Data Fragment;
controls: hypermedia links and/or forms that lead to other Linked Data Fragments.

The selector, elements of the metadata set, and elements of the control sets are specific to each Linked Data Fragment. Each of the three parts is allowed to be empty. Any (proper or improper) subset of a Linked Data dataset, regardless of how this subset was created, is thus by definition a Linked Data Fragment.
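
The three-part structure of a fragment could be modeled as follows. This is a sketch under stated assumptions, not a normative data model: the class and function names are hypothetical, and triples are again plain string tuples.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Fragment:
    """A Linked Data Fragment: data, metadata, and controls (each may be empty)."""
    data: List[Triple] = field(default_factory=list)
    metadata: List[Triple] = field(default_factory=list)
    controls: List[Triple] = field(default_factory=list)

    def triples(self) -> List[Triple]:
        """All triples of the fragment, across its three parts."""
        return self.data + self.metadata + self.controls


def build_fragment(dataset: Iterable[Triple],
                   selector: Callable[[Triple], bool],
                   metadata: Iterable[Triple] = (),
                   controls: Iterable[Triple] = ()) -> Fragment:
    """Collect all dataset triples that match the selector into a fragment."""
    return Fragment(
        data=[triple for triple in dataset if selector(triple)],
        metadata=list(metadata),
        controls=list(controls),
    )


# Illustrative sample data (an assumption of this sketch).
dataset = [
    ("dbpedia:Germany", "rdf:type", "dbpedia-owl:Country"),
    ("dbpedia:Berlin", "dbpedia-owl:country", "dbpedia:Germany"),
]

# A subset of the dataset with empty metadata and controls is still a fragment.
fragment = build_fragment(dataset, lambda triple: triple[0] == "dbpedia:Germany")
print(fragment)
```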

In the general definition of a Linked Data Fragment, there are no restrictions on what selectors should look like. They could be triple patterns, basic graph patterns, SPARQL queries, or even natural language queries. Like selectors, Linked Data Fragments can be organized in types.

A Linked Data Fragment type is a class of Linked Data Fragments with the same selector type and metadata and control sets with similar characteristics.

We can analyze existing and new Linked Data interfaces by characterizing their responses as a specific Linked Data Fragment type.

Linked Data Fragment types of existing interfaces are listed in the next section.

The data part of some Linked Data Fragments can become quite large. For instance, the fragment that contains all triples of a dataset can contain millions of triples. To make such large fragments more manageable, their data can be split across multiple pages.

A Linked Data Fragments page contains a subset of all data triples of a Linked Data Fragment, together with all of its metadata and control triples.

Conceptually speaking, each fragment remains one whole, but its data can be retrieved through several requests. Paging additionally makes it possible to retrieve the metadata and control set without having to download a disproportionately large part of the dataset. Not all fragments support paging.
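
Paging as defined above can be sketched as slicing the data part while repeating the metadata and controls on every page. The function name, the default page size, and the ex: identifiers are assumptions made for this illustration.

```python
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]


def fragment_page(data: Sequence[Triple],
                  metadata: Sequence[Triple],
                  controls: Sequence[Triple],
                  page_number: int,
                  page_size: int = 100) -> Dict[str, List[Triple]]:
    """Return one page: a slice of the data plus all metadata and controls."""
    if page_number < 1 or page_size < 1:
        raise ValueError("page_number and page_size must be positive")
    start = (page_number - 1) * page_size
    return {
        "data": list(data[start:start + page_size]),
        "metadata": list(metadata),   # repeated in full on every page
        "controls": list(controls),   # repeated in full on every page
    }


# Hypothetical fragment with five data triples.
data = [("ex:s", "ex:p", f"ex:o{i}") for i in range(5)]
metadata = [("ex:fragment", "ex:tripleCount", "5")]
controls = [("ex:fragment", "ex:firstPage", "ex:fragment?page=1")]

pages = [fragment_page(data, metadata, controls, n, page_size=2) for n in (1, 2, 3)]
print([len(page["data"]) for page in pages])
```

A client that only needs the metadata or controls can stop after the first page, which is the benefit described in the paragraph above.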

Linked Data Fragments servers

A Linked Data Fragments server is a server that offers all possible Linked Data Fragments of one or more specific Linked Data Fragment types of one or more datasets. It MUST support at least one RDF-based representation for each fragment.

Servers can choose what types of Linked Data Fragments they offer, whether or not they support paging, and what representations they provide.

Linked Data Fragments clients

A Linked Data Fragments client is a client that can access Linked Data Fragments of at least one specific Linked Data Fragment type. It MUST be able to consume at least one RDF-based representation of the fragments it supports.

Existing Linked Data Fragments

Since the goal of Linked Data Fragments is to provide a uniform view on Linked Data interfaces on the Web, this section describes how existing types of Linked Data interfaces fit into the Linked Data Fragments definition. Basically, each interface offers Linked Data Fragments of a specific type, which is thus characterized by its data selector, metadata set, and control set.

Data dumps

A data dump of a dataset is an instance of a Linked Data Fragment type with the following characteristics:

data: The selector is the universal selector f(triple) = true for all triples. In other words, a data dump is an RDF representation of all triples of its dataset.
metadata: The metadata set can contain triples that express, for example, the file size of the data dump in a particular representation, the author and/or publisher of the dataset, and/or licensing information.
controls: The control set can contain links to other entities, either through their URLs (dereferencing) or through other means.

Many publishers of Linked Data offer such downloadable data dumps of their datasets. They can be used to set up a local triple store, but are not fit for live querying because of their typically large file size.

Linked Data documents

A Linked Data document of a dataset is an instance of a Linked Data Fragment type with the following characteristics:

data: The selector takes a single entity as argument, and matches triples that are related to this entity.
The precise functional definition of the selector is implementation-specific, but it usually contains those triples that match the triple pattern { <entity> ?predicate ?object. } and possibly triples matching { ?subject ?predicate <entity>. }.
metadata: The metadata set can contain triples about the author and/or publisher of the dataset, and/or licensing information.
controls: A Linked Data document should contain links to other Linked Data documents, in particular through the URLs of entities that are described in the document.
The URLs of all entities of the dataset should point to the Linked Data document about them.
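
Because the selector of a Linked Data document is implementation-specific, the following Python sketch shows only one plausible choice: it always matches triples with the entity as subject, and optionally those with the entity as object. The function name, the include_incoming flag, and the sample triples are assumptions of this illustration.

```python
from typing import Callable, Tuple

Triple = Tuple[str, str, str]


def document_selector(entity: str, include_incoming: bool = False) -> Callable[[Triple], bool]:
    """Selector for the Linked Data document about a single entity."""
    def selector(triple: Triple) -> bool:
        subject, _predicate, obj = triple
        # { <entity> ?predicate ?object. } and, optionally,
        # { ?subject ?predicate <entity>. }
        return subject == entity or (include_incoming and obj == entity)

    return selector


# Illustrative sample data (not copied from DBpedia).
dataset = [
    ("dbpedia:Walt_Disney", "rdf:type", "foaf:Person"),
    ("dbpedia:Walt_Disney", "rdfs:label", '"Walt Disney"@en'),
    ("dbpedia:Mickey_Mouse", "dbpedia-owl:creator", "dbpedia:Walt_Disney"),
]

outgoing = [t for t in dataset if document_selector("dbpedia:Walt_Disney")(t)]
both = [t for t in dataset if document_selector("dbpedia:Walt_Disney", include_incoming=True)(t)]
print(len(outgoing), len(both))
```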

Linked Data documents can be used to browse a dataset, or to execute queries using link-traversal-based query execution.

SPARQL query results

A SPARQL query result of a dataset is an instance of a Linked Data Fragment type with the following characteristics:

data: The selector is a SPARQL CONSTRUCT query; the data consists of those triples that result from executing this query.
metadata: The metadata set is empty.
controls: While SPARQL result representations usually do not contain hypermedia controls (apart from dereferenceable URLs of entities), the URL of the SPARQL endpoint can serve as a URI template to retrieve other SPARQL query results.

SPARQL query results allow clients to extract very specific fragments of a dataset.

The following is an example of a SPARQL query result: the Turtle representation of the result of executing this query on DBpedia 3.8:

CONSTRUCT {
  ?person rdfs:label ?name; dbpedia-owl:birthPlace ?city.
}
WHERE {
  ?person a foaf:Person;
          rdfs:label ?name; dbpedia-owl:birthPlace ?city.
  ?city dbpedia-owl:country dbpedia:Germany.
}

The following could be a SPARQL URI template to obtain the fragment corresponding to the above query:

http://dbpedia.org/sparql?query={query}
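
To illustrate how a client might fill in such a template, the sketch below expands the single {query} variable by percent-encoding the query text. It is a deliberately narrow stand-in for a full URI Template processor, and the function name and sample query are assumptions of this example.

```python
from urllib.parse import quote

SPARQL_TEMPLATE = "http://dbpedia.org/sparql?query={query}"


def expand_query_template(template: str, query: str) -> str:
    """Expand the {query} variable of the template.

    Everything outside the unreserved character set is percent-encoded,
    so spaces, braces, and question marks in the query are escaped.
    """
    return template.replace("{query}", quote(query, safe=""))


query = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 10"
fragment_url = expand_query_template(SPARQL_TEMPLATE, query)
print(fragment_url)
```

Seen this way, the template acts as a one-argument hypermedia control: given a query, it generates the URL of the corresponding fragment.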

The fact that a SPARQL query result is a Linked Data Fragment means that each SPARQL endpoint is, by definition, a Linked Data Fragments server.

Only results of CONSTRUCT (and thus not SELECT or ASK) SPARQL queries are considered Linked Data Fragments. This is because only the execution of CONSTRUCT queries results in data triples. However, the CONSTRUCT query can contain SELECT subqueries.

Triple pattern fragments

A triple pattern fragment (also known as basic Linked Data Fragment) is an instance of a Linked Data Fragment type with the following characteristics:

data: The selector is a triple pattern { ?subject ?predicate ?object. }, in which each of the three components can be variable or constant. The data consists of those triples that match the triple pattern.
metadata: The metadata set contains a triple that expresses the estimated total number of matches for the pattern.
Since fragments can be paged, this information cannot always be derived from the data itself.
controls: A triple pattern fragment contains hypermedia controls that allow clients to retrieve any other triple pattern fragment of the same dataset.
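
The interplay between paging and the count metadata can be sketched as follows. This is an illustration only: the function name, the fragment IRI, and the use of void:triples as the count predicate are assumptions of this sketch, and the count is exact here because the sample dataset is held in memory, whereas the definition above only requires an estimate.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def triple_pattern_fragment(dataset: List[Triple],
                            fragment_iri: str,
                            subject: Optional[str] = None,
                            predicate: Optional[str] = None,
                            obj: Optional[str] = None,
                            page_size: int = 2) -> Dict[str, List[Triple]]:
    """Return the first page of a triple pattern fragment with count metadata."""
    pattern = (subject, predicate, obj)
    matches = [
        triple for triple in dataset
        # None acts as a variable; constants must match exactly.
        if all(p is None or p == v for p, v in zip(pattern, triple))
    ]
    return {
        "data": matches[:page_size],
        # The total is stated explicitly because a page shows only part of it.
        "metadata": [(fragment_iri, "void:triples", str(len(matches)))],
    }


# Illustrative sample data (an assumption of this sketch).
dataset = [
    ("dbpedia:Berlin", "dbpedia-owl:country", "dbpedia:Germany"),
    ("dbpedia:Hamburg", "dbpedia-owl:country", "dbpedia:Germany"),
    ("dbpedia:Munich", "dbpedia-owl:country", "dbpedia:Germany"),
    ("dbpedia:Paris", "dbpedia-owl:country", "dbpedia:France"),
]

fragment = triple_pattern_fragment(
    dataset, "http://example.org/fragments?object=dbpedia%3AGermany",
    obj="dbpedia:Germany",
)
print(len(fragment["data"]), fragment["metadata"][0][2])
```

The page holds fewer triples than the stated total, which is why the count has to travel as metadata rather than being derived from the data.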

Triple pattern fragments can be used to browse a dataset with more flexibility than Linked Data documents, because they can also select based on predicates and objects (instead of only subjects).

Triple pattern fragments are described in detail in a separate document.