[Originally posted by Jessica Dussault, October 9, 2015 on Github Pages]
In part II of this series, we learned how to run a basic query on some RDF. In this part, I’ll be explaining some of the queries that we are running behind the scenes to power the O Say Can You See website. This post needs a theme, and one of the better known individuals from the project is Francis Scott Key, so let’s all welcome him to the stage.
O Say Can You See: Early Washington, D.C., Law & Family
The O Say Can You See (OSCYS) project explores relationships between individuals involved in legal proceedings in Washington, D.C. from 1800 to 1862. Legal documents like summonses, minute books, verdicts, and depositions were located, digitized, and marked up. From those documents, relationships between individuals were gathered, described in a TTL file, and defined in an OWL file.
We use SPARQL on the TTL and OWL files to power a few features that show up for each person in OSCYS. On an individual’s page, we list all of their immediate relationships and link to that person’s network visualization. The visualization shows not only the links to other people but also the type of each link. If you are less interested in a specific person and more interested in a general type of relationship, then head to the “people” page, where the types of connections (spouse of, attorney for, etc.) between all the individuals in OSCYS are browsable. Additionally, there is a search feature that allows you to look at how many ways the attorneys are related. All of this information is powered by RDF. I’ve explained the queries that I used to get the information below, if you would like to jump straight to the explanations. However, a little background information about how our metadata expert, Laura Weakly, encoded the relationships is probably the best place to start!
Much of the information about each person is described in a TEI file (available here: caution, large file). However, information specific to the relationships between individuals is stored in a Turtle file with this type of setup:
person 1 > attorney of > person 2
person 2 > client of > person 1
More information about each relationship is stored in the OWL file. The OWL file says:
attorney of > inverse of > client of
attorney of > type > legal relationship
client of > inverse of > attorney of
client of > type > legal relationship
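To make the two files a little more concrete, here is a minimal Python sketch that holds both sets of statements as plain (subject, predicate, object) tuples. It is not how OSCYS stores anything: the ids per.A and per.B and the predicate spellings (attorneyOf, clientOf, inverseOf, type, legalRelationship) are placeholders standing in for whatever the real TTL and OWL files use.

```python
# Statements in the style of the Turtle file: who is related to whom.
# per.A and per.B are placeholder ids, not real OSCYS people.
ttl_triples = [
    ("per.A", "attorneyOf", "per.B"),
    ("per.B", "clientOf", "per.A"),
]

# Statements in the style of the OWL file: facts about the relationships.
owl_triples = [
    ("attorneyOf", "inverseOf", "clientOf"),
    ("attorneyOf", "type", "legalRelationship"),
    ("clientOf", "inverseOf", "attorneyOf"),
    ("clientOf", "type", "legalRelationship"),
]


def describe(relationship):
    """Collect what the OWL-style statements say about one relationship."""
    return {pred: obj for subj, pred, obj in owl_triples if subj == relationship}


def has_inverse_statement(subj, pred, obj):
    """Check that the inverse of a statement is also present in the data."""
    inverse = describe(pred).get("inverseOf")
    return (obj, inverse, subj) in ttl_triples


print(describe("attorneyOf"))
print(has_inverse_statement("per.A", "attorneyOf", "per.B"))
```

The point of the split is visible even at this size: the first list only knows that two people are connected, and the second list is the only place that knows what kind of connection that is.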
Using a combination of the two documents, we can now start to answer some questions.
- Who is a given person related to?
- Who is related to someone who is related to a given person?
- Which people have any type of family relationships?
- Which people are the attorneys of someone else?
This is what a typical query setup looks like for OSCYS:
This should look mostly familiar, if you were following along at home with part II, except for the PREFIX declarations at the top. The prefix is a tool to make things more human readable. It’s much nicer to get a result that says osrdf:per.000001 rather than one that spells out the full URI for that person.
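As a rough illustration of what a prefix does, the Python sketch below expands a prefixed name into a full URI and shortens it again. The osrdf base URI here is a made-up placeholder (the real OSCYS namespace is not reproduced in this post); the rdfs namespace is the standard RDF Schema one.

```python
# Hypothetical prefix table. The osrdf base URI is a placeholder, not the
# real OSCYS namespace. The rdfs URI is the standard RDF Schema namespace.
PREFIXES = {
    "osrdf": "http://example.org/oscys/rdf#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def expand(prefixed_name):
    """Turn 'osrdf:per.000001' into the full URI it abbreviates."""
    prefix, local = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local


def shorten(uri):
    """Turn a full URI back into its prefixed form when a prefix matches."""
    for prefix, base in PREFIXES.items():
        if uri.startswith(base):
            return prefix + ":" + uri[len(base):]
    return uri  # no known prefix, leave the URI as it is


full = expand("osrdf:per.000001")
print(full)
print(shorten(full))
```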
If you want to try any of these queries on your own, feel free to visit the sparqler and paste these queries in (ignore the target graph URI field). Except in a few cases, the above generic setup should be sufficient if you simply paste in the “WHERE” portion of the query.
Alright, let’s take a look at probably the simplest SPARQL query behind the scenes of the OSCYS site. Each person’s page has a list of their immediate relationships which is drawn from the TTL. Francis Scott Key’s page looks something like this:
This is the query used to generate the list on Francis Scott Key’s page.
This will return results that look something like the following, but in JSON, which is then manipulated into HTML.
By default, when you view Key’s page, you are presented with the results in HTML along with Key’s other information. However, on any of the people pages in OSCYS you can add an extension such as .json to the end of the URL to view the raw results; two different formats are available. This is how you would see some different formats for Francis Scott Key’s query results:
So how does the SPARQL query work? Let’s take a closer look at the query and walk through it.
In the first line we plug in Key’s id, per.000001, in such a way that SPARQL knows we are using an entity from our RDF document. Remember that we defined the prefix osrdf at the top of our query to allow that style of shorthand. The first line also asks SPARQL to find anything related to Key’s identity in the RDF file: ?rel1 is the relationship and ?per1 is the person (or thing) related to Key.
In the second line, we use that same variable, ?per1, again. Now we’re looking for the fullNames of any of the ?per1 items found on the first line of the query, and anything that matches is bound to a variable of its own. If ?per1 doesn’t have a fullName, then it is omitted from the results, which is probably okay: we’re mostly interested in people who can be identified at least with a name.
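The two-line walk-through can be imitated in a few lines of Python over an in-memory list of triples. This is a sketch of the matching logic only, not the query the site runs: the ids per.000001 and per.000738 and their names appear elsewhere in this post, but the attorneyOf links, the unnamed per.X, and the variable name ?name are invented for illustration.

```python
# Tiny in-memory stand-in for the RDF data. The relationships below are
# invented; only the two ids and their names are taken from the post.
triples = [
    ("osrdf:per.000001", "osrdf:attorneyOf", "osrdf:per.000738"),
    ("osrdf:per.000001", "osrdf:attorneyOf", "osrdf:per.X"),  # has no fullName
    ("osrdf:per.000001", "osrdf:fullName", "Key, Francis Scott"),
    ("osrdf:per.000738", "osrdf:fullName", "Brent, William Leigh"),
]


def direct_relationships(person):
    """Imitate the two patterns:
         <person> ?rel1 ?per1 .
         ?per1 osrdf:fullName ?name .
    """
    results = []
    # First pattern: anything at all that hangs off the person.
    for subj, rel1, per1 in triples:
        if subj != person:
            continue
        # Second pattern: keep ?per1 only if it has a fullName.
        for subj2, pred2, name in triples:
            if subj2 == per1 and pred2 == "osrdf:fullName":
                results.append((rel1, per1, name))
    return results


for row in direct_relationships("osrdf:per.000001"):
    print(row)
```

Only the link to per.000738 survives. The unnamed per.X is dropped, and so is Key’s own fullName triple, because the name string it points at is a “thing” that has no fullName of its own.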
In addition to the direct connections, each person has a visualization that looks at their immediate social network. A person knows a person who knows a person. That is to say, Mary Bell below is the spouse of Daniel Bell who is the client of Gilbert L Giberson. Mary knows several individuals who in turn seem to have quite a few connections. This is a screenshot of part of her visualization here.
For the visualizations we ask “find the people Bell knows and the people that THOSE people know. While you’re at it, get the type of the relationships and the names of all the people.” As with the direct relationships, you can add .json to the end of that URL to view the results used to create the visualization (xml) (json).
There’s a lot required from this query. We want the names and the people once removed from Bell. We also want to know the type of each relationship so that we can make family ties distinct from, say, legal ties.
The RDF file can tell you that person A is related to person B as “parentOf” but it does not have the right information to tell you that “parentOf” is a family relationship. That information is stored in the OWL file. So how do we get around that problem? The first clue is that our OWL file can be queried with SPARQL the same way that we have been querying the RDF. Let’s try grabbing all the relationships and their corresponding type. Note that there is an extra prefix at the top in order to get at “subPropertyOf”.
The results we get are probably somewhat predictable. On the left is a list of specific relationships, on the right is the category that they belong to.
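In Python terms, that lookup amounts to building a small table from relationship to category. The sketch below assumes category names (familyRelationship, legalRelationship) and a handful of relationships purely for illustration; the real OWL file may spell them differently.

```python
# Stand-in for the OWL statements. The category names are assumptions.
owl_triples = [
    ("osrdf:parentOf", "rdfs:subPropertyOf", "osrdf:familyRelationship"),
    ("osrdf:spouseOf", "rdfs:subPropertyOf", "osrdf:familyRelationship"),
    ("osrdf:attorneyOf", "rdfs:subPropertyOf", "osrdf:legalRelationship"),
    ("osrdf:clientOf", "rdfs:subPropertyOf", "osrdf:legalRelationship"),
]


def relationship_types():
    """Imitate the pattern:  ?rel rdfs:subPropertyOf ?type ."""
    return {
        rel: rel_type
        for rel, pred, rel_type in owl_triples
        if pred == "rdfs:subPropertyOf"
    }


types = relationship_types()
for rel, rel_type in types.items():
    print(rel, "->", rel_type)
```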
Great, now we just need to combine it with results from the RDF so that we can attach information that says “parentOf” is a family relationship while “attorneyOf” is a legal one, etc. We can do that by combining the graphs with two FROM clauses. Here’s the whole shebang:
The first line of the WHERE clause should look familiar. osrdf:per.001253 ?rel01 ?per1 is just getting a list of everything related in some way to the person at the center of the visualization. Then we come across some optional triples.
Using our findings for the ?per1 variable, we look to find anything related and its fullName. An example would be “Mary Bell knows Person A and Person A knows Person B whose name is John Marbury.” The second leg is made optional because it is possible that Bell may be connected to somebody who has no other relationships of their own. Though unlikely, should this situation come up, if the Person A to Person B triple were not optional, it would omit more solitary individuals from the results.
Now it’s time for our new OWL query for the relationship types to make an appearance, also wrapped in an OPTIONAL. The subPropertyOf queries are optional because it’s possible that a connection has been added for two individuals that hasn’t been defined in the OWL file (perhaps a misspelling or a first of its kind relationship). AttrneyOf [sic] and personalPhysicianOf might not yet be valid according to the OWL file, but we want to make sure that these orphaned relationships still appear in the results or else we may not notice that something needs to be resolved.
The final line of the query is a FILTER. The filter is simply avoiding some duplication. If Person A knows Person B then we can probably assume that Person B knows Person A, too. We don’t need to return Person A in the search results. If that’s the kind of thing that you would like to know, then just remove the FILTER line to view all the results.
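Here is a rough Python imitation of how the pieces fit together: a required first hop, an optional second hop, an optional type lookup, and a filter that stops the second hop from looping back to the starting person. It is a sketch under assumptions, not the site’s query. All ids are placeholders, the personalPhysicianOf link to “Person A” is invented so that there is a solitary, untyped connection to look at, the names dict stands in for osrdf:fullName triples, and the filter is applied inside the optional second leg so that it cannot discard rows where that leg is empty.

```python
START = "osrdf:per.mary"  # placeholder id for Mary Bell

triples = [
    (START, "osrdf:spouseOf", "osrdf:per.daniel"),
    ("osrdf:per.daniel", "osrdf:spouseOf", START),
    ("osrdf:per.daniel", "osrdf:clientOf", "osrdf:per.gilbert"),
    (START, "osrdf:personalPhysicianOf", "osrdf:per.solo"),  # invented link
]

# Stand-in for the osrdf:fullName triples.
names = {
    START: "Mary Bell",
    "osrdf:per.daniel": "Daniel Bell",
    "osrdf:per.gilbert": "Gilbert L Giberson",
    "osrdf:per.solo": "Person A",
}

# Stand-in for the OWL data; personalPhysicianOf is deliberately missing.
rel_types = {"osrdf:spouseOf": "family", "osrdf:clientOf": "legal"}


def network(start):
    rows = []
    for subj, rel01, per1 in triples:
        if subj != start or per1 not in names:
            continue
        # OPTIONAL type lookup: None when the OWL data has no entry.
        type01 = rel_types.get(rel01)
        second_legs = [
            (rel12, per2)
            for subj2, rel12, per2 in triples
            # FILTER: the second hop must not lead back to the start.
            if subj2 == per1 and per2 in names and per2 != start
        ]
        if not second_legs:
            # OPTIONAL second hop: keep the first hop even with nothing after it.
            second_legs = [(None, None)]
        for rel12, per2 in second_legs:
            rows.append((names[per1], rel01, type01, rel12, names.get(per2)))
    return rows


for row in network(START):
    print(row)
```

Daniel Bell comes back with a typed relationship and one onward connection, while Person A comes back with blanks for both the type and the second hop instead of disappearing from the results.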
Because the type of relationship is returned as part of the query, now you can do things like this!
You can also look through all of the types of relationships in OSCYS, which is a handy tool if you’re particularly interested in families, or slaveholders, or judges, or whatever sort of connection suits your curiosity. The search is here. This is an example of what the list of people deposed by other people looks like:
This is another pretty straightforward query. No need to consult an OWL file this time. Just ask “find a person connected to another person with that relationship.”
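As a sketch of how little is involved, the same idea in Python is a single filter over the triples. The ids and the deposedBy spelling are placeholders chosen for illustration.

```python
# Placeholder data: two people deposed by the same person, plus one
# unrelated relationship that should not show up in the list.
triples = [
    ("osrdf:per.A", "osrdf:deposedBy", "osrdf:per.B"),
    ("osrdf:per.C", "osrdf:deposedBy", "osrdf:per.B"),
    ("osrdf:per.A", "osrdf:spouseOf", "osrdf:per.C"),
]


def connected_by(relationship):
    """Imitate the pattern:  ?per1 <relationship> ?per2 ."""
    return [(subj, obj) for subj, rel, obj in triples if rel == relationship]


print(connected_by("osrdf:deposedBy"))
```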
As attorneys rubbed elbows quite a lot, we built a tool to explore the various interactions between attorneys. This search looks for immediate connections, attorneys who might have been two removed from each other, and more, all the way up to four people removed. Some attorneys may never have had even vague connections with each other, even after several jumps through a social network, which is in itself noteworthy. Try it out yourself by visiting the Attorney Relationship Finder of Science.
For the purpose of this blogpost, we’ll be looking at the ways that Francis Scott Key worked with Elias Boudinot Caldwell who was an attorney, clerk, War of 1812 vet, and Presbyterian minister.
This is just a small snippet of the ways in which Francis Scott Key knew Elias Boudinot Caldwell:
I struggled a bit when trying to put together this aspect of the relationship querying. What I wanted was “given one node, find any possible connections to another given node.” I don’t know of a way to do this with SPARQL, and I doubt that it is even possible. If there is a way, please drop me a line! Undaunted, I began work on a mega-query that would approximate finding any possible connection by checking for immediate connections, then if that was a bust, trying to find a two removed connection, etc, all the way up to four people removed. My mega-query was quickly revealed to be a horrible monster.
Somewhere in the layers and layers of OPTIONAL clauses, I started losing confidence that my query was actually doing what I wanted it to do. I also wasn’t sure if I cared anymore. It took an eternity to run (and by that I mean like 8 minutes). Time to try something else.
Instead of one big query with lots of finicky moving pieces, I decided to send four queries. It’s not perfect, I’m not super excited about it, but it is working so I can’t complain too much. The direct relationship query is the simplest, as you might expect. It’s nearly identical to a query already being used on an individual’s page except that in this case, I wanted both a specific starting person and a specific ending person.
Besides the above query for direct relationships, I send queries for two and three connections removed, and then eventually end up with the query for four connections removed. You’ll notice that I’m filtering out some of the relationships. This is because if Person A knows person B, there’s a pretty good chance person B knows person A, too. It seems silly to have the results show up in big chains of the same people.
At the top I use BIND to assign the people involved to variables such as ?goal. That makes the query look a bit more readable, in my opinion. After that it’s pretty straightforward: find a person related to Francis Scott Key, then find people that those people know, then find people that those people know, etc. I’m also returning their names, which means that unnamed individuals and their connections will not show up in the results.
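The “one query per distance” approach can be sketched in Python as a function that looks for chains of exactly N relationships, called once for each of the four distances. This is an imitation of the idea, not the queries themselves. The id for Caldwell, the intermediate per.client, and every relationship below are invented, and names are left out to keep the sketch short.

```python
START = "osrdf:per.000001"   # Francis Scott Key, id from the post
GOAL = "osrdf:per.caldwell"  # placeholder id for Elias Boudinot Caldwell

# Invented relationships, just enough to give one direct and one
# two-removed connection.
triples = [
    (START, "osrdf:colleagueOf", GOAL),
    (GOAL, "osrdf:colleagueOf", START),
    (START, "osrdf:attorneyOf", "osrdf:per.client"),
    ("osrdf:per.client", "osrdf:clientOf", START),
    ("osrdf:per.client", "osrdf:clientOf", GOAL),
]


def chains(start, goal, hops):
    """Find chains of exactly `hops` relationships from start to goal."""
    # Each chain alternates person, relationship, person, ...
    partial = [[start]]
    for _ in range(hops - 1):
        partial = [
            chain + [rel, obj]
            for chain in partial
            for subj, rel, obj in triples
            # FILTER: intermediate people must be new and must not be the goal.
            if subj == chain[-1] and obj not in chain[::2] and obj != goal
        ]
    # The last relationship in the chain has to land on the goal.
    return [
        chain + [rel, obj]
        for chain in partial
        for subj, rel, obj in triples
        if subj == chain[-1] and obj == goal
    ]


# One lookup per distance, like sending four separate queries.
for hops in range(1, 5):
    print(hops, chains(START, GOAL, hops))
```

Without the filter, the three-removed lookup would happily report Key, his client, Key again, and then Caldwell, which is exactly the kind of chain of the same people that the filtering is there to avoid.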
These are just a sampling of the types of questions we can be asking of our RDF data. We could potentially use it to construct family trees. Maybe there would be something interesting revealed if we looked at networks of people related to a specific case. If we marked up other aspects of the OSCYS data besides the people (like locations, cases, events), there might be even more fascinating connections to find.
| osrdf:per.000738 | Brent, William Leigh | https://viaf.org/viaf/38728091 |
| osrdf:per.000001 | Key, Francis Scott | https://viaf.org/viaf/824500 |
That brings an end to a longwinded blogpost about the queries that are working behind the scenes for OSCYS. In the next installment, I’ll be talking about how I chose to query RDF with Fuseki from the Ruby on Rails framework!
- Figuring out RDF and SPARQL: Part I Triples
- Figuring out RDF and SPARQL: Part II Getting Set Up
- Figuring out RDF and SPARQL: Part III Some Queries