Greg Heartsfield home

Jena ARQ Query Performance

Apache Jena is a Java library for working with Semantic Web technologies, in particular, RDF, OWL, and SPARQL. ARQ is a component of Jena that implements SPARQL, an SQL-like query language for RDF. I’ll be discussing an important lesson I recently learned when struggling to make a query run quickly. That lesson is simply that the order of graph patterns in a SPARQL where clause can dramatically affect performance.

The query I was considering was a straightforward join, something that would be common in most data integration scenarios. I have data from two (or more) sources sharing some inverse-functional property, and I want to identify the resources that are actually the same across those sources. An inverse-functional property is just a fancy way of saying that some property uniquely identifies something. For example, a person may have multiple email addresses, but two people do not share the same email address. I can take advantage of this if I have a list of people from my local contacts database that includes email addresses, as well as an export from a social networking site that also contains email addresses. In theory, I should be able to convert those sources into RDF triples, import that information into a single triplestore, and link that information into a cohesive view of people based on those identifying properties.

In practice, I found that the performance of queries that do this type of integration can be extremely slow. I tested this with a simple script that generates four triples for each person, two from each source to be integrated. Each source produces a distinct URI for a person, but they share a common email address. That produces data like the following (only one person shown):
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
<http://a.example.com/1> a foaf:Person.
<http://a.example.com/1> foaf:mbox <mailto:1@example.com>.
<http://b.example.com/1> a foaf:Person.
<http://b.example.com/1> foaf:mbox <mailto:1@example.com>.

In order to discover which of these people are the same, a SPARQL query can find all Persons with different URIs, but the same email addresses:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?pa ?pb ?pm
WHERE {
  ?pa a foaf:Person.
  ?pb a foaf:Person.
  ?pa foaf:mbox ?pm.
  ?pb foaf:mbox ?pm.
  FILTER(?pa != ?pb)
}

This query gives us the results we want, but on my laptop it takes 18 seconds to resolve 1000 persons. This clearly isn’t satisfactory, and doesn’t scale. With ten times as many people, running time increases by two orders of magnitude.

To investigate why this is, a good place to start is the query plan. Query plans can be displayed with a command bundled with ARQ, qparse --explain --query query.arq The plan gives some indication of how things are going wrong:

(prefix ((foaf: <http://xmlns.com/foaf/0.1/>)
         (rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>))
  (project (?pa ?pb ?pm)
    (sequence
      (filter (!= ?pa ?pb)
        (bgp
          (triple ?pa rdf:type foaf:Person)
          (triple ?pb rdf:type foaf:Person)
        ))
      (bgp
        (triple ?pa foaf:mbox ?pm)
        (triple ?pb foaf:mbox ?pm)
      ))))

The sequence operator shows that ARQ is first finding all combinations of resources with type Person that have different URIs. That’s a big graph! For 1000 users, it will produce 4 million triples that the next step needs to pattern match on. Why isn’t it the square of 1,000? Because we aren’t doing duplicate elimination and we have 4 triples for each person. In a real dataset that might have dozens of other properties, you can see this could get much larger!

If we could simply switch the order so that we find all the email addresses pointed to from different people, and then verified that those resources were people, we could eliminate that huge intermediate step. It turns out to be quite easy to do, all that is required is to swap the order of patterns in the where clause so that the email patterns come first:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?pa ?pb ?pm
WHERE {
  ?pa foaf:mbox ?pm.
  ?pb foaf:mbox ?pm.
  ?pa a foaf:Person.
  ?pb a foaf:Person.
  FILTER(?pa != ?pb)
}

That generates a much different and more efficient query plan for our data. The output of the first sequence is exactly the number of triples that mention email addresses, so there is no combinatorial explosion.

(prefix ((foaf: <http://xmlns.com/foaf/0.1/>)
         (rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>))
  (project (?pa ?pb ?pm)
    (sequence
      (filter (!= ?pa ?pb)
        (bgp
          (triple ?pa foaf:mbox ?pm)
          (triple ?pb foaf:mbox ?pm)
        ))
      (bgp
        (triple ?pa rdf:type foaf:Person)
        (triple ?pb rdf:type foaf:Person)
      ))))

Running both the original and modified queries shows a dramatic improvement. These times are in seconds, and include starting the JVM and loading the dataset into the triplestore, as well as retrieving the full result set.

Persons       Original       Reordered
1,000 18.0 3.1
2,000 58.0 3.7
3,000 137.0 4.7
4,000 271.0 4.8
5,000 435.9 5.0
6,000 605.4 5.4
7,000 785.3 5.7
8,000 1012.6 6.1
9,000 1352.3 6.3
10,000 1680.0 6.8

I wasn’t able to significantly improve performance with the original query by switching to the more scalable TDB backend, or running database statistics. TDB does have a mechanism to hint it regarding inverse-functional properties, but I did not attempt to configure and test that since SDB was my preferred backend. On the other hand, moving graph patterns around is easy enough, and examining the query plan can confirm potential issues. So, my faith in Jena/ARQ is somewhat restored, and now I know that I can’t rely on the query planner to magically do the right thing. Hopefully this helps someone else who is struggling with poor query performance in Jena/ARQ.

Validate XHTML Validate CSS