\documentclass[11pt,a4paper]{article}
\usepackage[OT1]{fontenc}
\usepackage{a4wide}
\usepackage{hyperref}
% Sweave
\usepackage{Sweave}
% Biblatex
\usepackage[style=chem-angew,hyperref,citestyle=numeric-comp,articletitle,sorting=none]{biblatex}
\setlength{\bibitemsep}{0cm}
\addbibresource{sparql.bib}
\title{Accessing the ToxBank Gold Compound wiki from R}
\date{}
\author{Egon Willighagen\\
Division of Molecular Toxicology\\
Institute for Environmental Medicine\\
Karolinska Institutet}
\begin{document}
\maketitle
\tableofcontents
\section{Introduction}
This document gives an overview of the content of the ToxBank Gold Compound Wiki.
The first section introduces how the wiki is accessed, and further sections
discuss the content of the wiki.

SPARQL is the query language for the Semantic Web. The ToxBank Gold Compound Wiki
has a SPARQL end point where SPARQL queries can be run. This can be done via
a wiki page (\url{http://wiki.toxbank.net/w/index.php/Special:SPARQLEndpoint}).
\section{Accessing the ToxBank Wiki}
The ToxBank wiki can be accessed from data analysis software via its SPARQL end point.
For example, the rrdf package for R can be used for this~\cite{rrdf}. After starting
an R session (from the command line, with RStudio, RStet, Bioclipse, or anything else),
you can load this library with:
<<>>=
library(rrdf)
@
To simplify the R code, we store the URL of the SPARQL end point in a
\textit{toxbank} variable:
<<>>=
toxbank = "http://wiki.toxbank.net/w/index.php/Special:SPARQLEndpoint"
@
Because the wiki is protected with LDAP-based security that limits access to
SEURAT-1 participants, we need to authenticate when running a SPARQL query;
the \textit{rrdf} package recently gained support for that. I suggest keeping the
user credentials in a separate file, for which you can easily control the
access on your local machine. For example, you can create a \textit{user.R} file that
looks like:
\begin{verbatim}
user = "username"
password = "password"
\end{verbatim}
This file can then be used in any further script to run some analysis of the
wiki content, by sourcing it:
<<>>=
source("user.R")
@
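As a minimal sketch of one way to guard this step, the following assumes that
restricting the file to its owner is sufficient protection on your machine. It
falls back to two environment variables, \texttt{TOXBANK\_USER} and
\texttt{TOXBANK\_PASSWORD}, whose names are made up for this example; the
\textit{loadCredentials} helper is likewise hypothetical:
\begin{verbatim}
# Hypothetical helper: load credentials from a file, or fall back to
# environment variables (the variable names are assumptions).
loadCredentials = function(file = "user.R") {
  creds = new.env()
  if (file.exists(file)) {
    # Make the file readable and writable by its owner only
    # (effective on Unix-alikes; limited effect on Windows).
    Sys.chmod(file, mode = "0600")
    # Evaluate the file in a private environment, so that it cannot
    # overwrite other variables in the workspace.
    sys.source(file, envir = creds)
  } else {
    assign("user", Sys.getenv("TOXBANK_USER"), envir = creds)
    assign("password", Sys.getenv("TOXBANK_PASSWORD"), envir = creds)
  }
  list(user = get("user", envir = creds),
       password = get("password", envir = creds))
}

creds = loadCredentials()
user = creds$user
password = creds$password
\end{verbatim}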
We can then run SPARQL queries against our wiki with the following code.
<<>>=
results = sparql.remote(toxbank,
  "SELECT DISTINCT ?predicate { [] ?predicate [] }",
  user=user, password=password
)
@
The resulting matrix lists all unique predicates used in the wiki
(currently \Sexpr{nrow(results)}).
In the remainder of this document, each SPARQL query is assigned to
the \textit{sparql} variable, which is then used in this fashion:
<<>>=
sparql = "SELECT DISTINCT ?class { [] a ?class }"
results = sparql.remote(toxbank, sparql,
user=user, password=password
)
@
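Since every query in this document uses the same end point and credentials, the
call can be wrapped in a small helper. The function name \textit{toxbankQuery}
below is hypothetical and not part of \textit{rrdf}; it is only a sketch that
assumes the \textit{toxbank}, \textit{user}, and \textit{password} variables
defined above:
\begin{verbatim}
# Hypothetical convenience wrapper around sparql.remote(); it relies on
# the toxbank, user, and password variables defined earlier.
toxbankQuery = function(sparql) {
  sparql.remote(toxbank, sparql, user=user, password=password)
}

results = toxbankQuery("SELECT DISTINCT ?class { [] a ?class }")
\end{verbatim}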
The remainder of this document will simply give the SPARQL query, rather
than the exact R command, but all output is created by running the
query in R. Note that SPARQL queries can easily be copied
into your R code, including double-quoted strings, by enclosing the query in single quotes:
\begin{verbatim}
sparql = '
SELECT ?class {
?class ?predicate "D00395" .
}
'
\end{verbatim}
<<echo=F>>=
sparql = '
SELECT ?class {
?class ?predicate "D00395" .
}
'
results = sparql.remote(toxbank, sparql, user=user, password=password)
@
This query returns \textit{troglitazone}, which has D00395 as its associated
KEGG identifier. This class has the URI
\url{\Sexpr{results[1,"class"]}}.
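The result of \textit{sparql.remote} is used above as a matrix with one column
per SPARQL variable. The sketch below mimics such a result with a hand-made
character matrix (the URI in it is a placeholder, not actual wiki content) to
show how rows and columns can be addressed without contacting the wiki:
\begin{verbatim}
# A stand-in for a query result: one column named after the ?class
# variable. The URI is a made-up placeholder.
results = matrix(
  c("http://example.org/resource/PlaceholderCompound"),
  ncol = 1, dimnames = list(NULL, "class")
)

nrow(results)         # number of solutions returned
colnames(results)     # names of the SPARQL variables
results[1, "class"]   # first value bound to ?class
\end{verbatim}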
\section{Classes}
All classes used in the wiki can be listed with this SPARQL query:
\begin{verbatim}
SELECT DISTINCT ?class { [] a ?class }
\end{verbatim}
This query makes use of the shortcut 'a', which maps to the type predicate
from RDF (\textit{rdf:type}).
This will return these classes:
\begin{scriptsize}
<<echo=F>>=
sparql = '
SELECT DISTINCT ?class { [] a ?class }
'
sparql.remote(
toxbank, sparql,
user=user,password=password
)
@
\end{scriptsize}
Here we see a typical feature of Semantic MediaWiki (SMW): if a page is
placed in a particular \textit{Category}, SMW automatically
assigns that category as the \textit{rdf:type} of the topic of that wiki page.
We can make use of this to ask for all substances associated with hepatotoxicity.
\subsection*{Substances}
To list all substances the wiki associates with hepatotoxicity, we just query
for all page topics of type \url{http://wiki.toxbank.net/w/index.php/Special:URIResolver/Category-3AHepatotoxicCompounds}.
To do this, we will use another SPARQL trick: prefixes. Prefixes shorten the URIs
in the query, making it easier to read for us mere humans:
\begin{verbatim}
PREFIX cat: <http://wiki.toxbank.net/w/index.php/Special:URIResolver/Category-3A>
PREFIX wiki: <http://wiki.toxbank.net/w/index.php/Special:URIResolver/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?substance ?name WHERE {
?substance a cat:HepatotoxicCompounds ;
rdfs:label ?name .
}
\end{verbatim}
This returns the following matrix to R:
\begin{scriptsize}
<<echo=F>>=
sparql = '
PREFIX cat: <http://wiki.toxbank.net/w/index.php/Special:URIResolver/Category-3A>
PREFIX wiki: <http://wiki.toxbank.net/w/index.php/Special:URIResolver/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?substance ?name WHERE {
?substance a cat:HepatotoxicCompounds ;
rdfs:label ?name .
}
'
sparql.remote(
toxbank, sparql,
user=user,password=password
)
@
\end{scriptsize}
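For further processing it can be convenient to convert such a matrix into a
data frame. A minimal sketch, using a hand-made matrix with placeholder rows
instead of the actual wiki content:
\begin{verbatim}
# Placeholder result with the two columns of the query above; the
# rows are invented for illustration only.
results = matrix(
  c("http://example.org/resource/CompoundB", "Compound B",
    "http://example.org/resource/CompoundA", "Compound A"),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("substance", "name"))
)

substances = as.data.frame(results, stringsAsFactors = FALSE)
# Order the substances alphabetically by their label.
substances = substances[order(substances$name), ]
substances$name
\end{verbatim}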
Furthermore, for any compound we can then list all properties with a SPARQL query like this one for acetaminophen:
\begin{verbatim}
PREFIX cat: <http://wiki.toxbank.net/w/index.php/Special:URIResolver/Category-3A>
PREFIX wiki: <http://wiki.toxbank.net/w/index.php/Special:URIResolver/>
PREFIX prop: <http://wiki.toxbank.net/w/index.php/Special:URIResolver/Property-3A>
SELECT ?predicate ?value WHERE {
wiki:Acetaminophen ?predicate ?value .
}
\end{verbatim}
This returns the following matrix to R:
\begin{scriptsize}
<<echo=F>>=
sparql = '
PREFIX cat: <http://wiki.toxbank.net/w/index.php/Special:URIResolver/Category-3A>
PREFIX wiki: <http://wiki.toxbank.net/w/index.php/Special:URIResolver/>
PREFIX prop: <http://wiki.toxbank.net/w/index.php/Special:URIResolver/Property-3A>
SELECT ?predicate ?value WHERE {
wiki:Acetaminophen ?predicate ?value
}
'
sparql.remote(
toxbank, sparql,
user=user,password=password
)
@
\end{scriptsize}
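The same query can be generated for other wiki pages. The sketch below uses a
hypothetical helper, \textit{propertiesQuery}, and assumes a page name that
needs no further escaping in the URI (as the category URIs above show, the
colon is encoded as \texttt{-3A}, so other characters may need similar
treatment):
\begin{verbatim}
# Hypothetical helper that fills a page name into the query above.
# It assumes the name is already encoded the way the wiki expects.
propertiesQuery = function(page) {
  paste(
    "PREFIX wiki: <http://wiki.toxbank.net/w/index.php/Special:URIResolver/>",
    "SELECT ?predicate ?value WHERE {",
    paste("  wiki:", page, " ?predicate ?value .", sep = ""),
    "}",
    sep = "\n"
  )
}

sparql = propertiesQuery("Acetaminophen")
cat(sparql)
\end{verbatim}
The resulting string can be passed to \textit{sparql.remote} in the same way as
the hand-written queries.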
\section{Predicates}
With the basic tools in place, we can start analyzing the content of the wiki.
We can list all predicates defined in the wiki with the following SPARQL:
\begin{verbatim}
PREFIX prop: <http://wiki.toxbank.net/w/index.php/Special:URIResolver/Property-3A>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?predicate ?name
WHERE {
[] ?predicate [] .
?predicate rdfs:label ?name .
FILTER regex(?predicate, "toxbank")
}
\end{verbatim}
This returns:
\begin{scriptsize}
<<echo=F>>=
sparql = '
PREFIX prop: <http://wiki.toxbank.net/w/index.php/Special:URIResolver/Property-3A>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?predicate ?name
WHERE {
[] ?predicate [] .
?predicate rdfs:label ?name .
FILTER regex(?predicate, "toxbank")
}
'
sparql.remote(
toxbank, sparql,
user=user,password=password
)
@
\end{scriptsize}
Now, this SPARQL query asked for all predicates with 'toxbank' in the URI,
and it is therefore surprising to see the FOAF predicates \textit{foaf:homepage}
and \textit{foaf:weblog} show up. This is because the ToxBank predicates
\url{http://wiki.toxbank.net/w/index.php/Property:Has_Webpage} and
\url{http://wiki.toxbank.net/w/index.php/Property:Has_Weblog} are declared
synonymous with their FOAF equivalents.
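Whether a returned predicate URI itself contains 'toxbank' can be checked
locally in R. A sketch with a hand-typed vector of two predicate URIs, the
first being the FOAF homepage predicate and the second built from the wiki's
property prefix:
\begin{verbatim}
predicates = c(
  "http://xmlns.com/foaf/0.1/homepage",
  "http://wiki.toxbank.net/w/index.php/Special:URIResolver/Property-3AHas_pKa"
)

# TRUE for URIs that contain the string "toxbank".
inWiki = grepl("toxbank", predicates, fixed = TRUE)
predicates[!inWiki]   # predicates from other vocabularies, such as FOAF
\end{verbatim}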
\section{Chemical properties}
OK, now that we have covered the basics, we can start analyzing the data in the wiki.
Let's plot the pKa values for all Gold Compounds (accepted, rejected,
and proposed). We use this query to retrieve the data:
\begin{verbatim}
PREFIX prop: <http://wiki.toxbank.net/w/index.php/Special:URIResolver/Property-3A>
SELECT ?pka WHERE {
?substance prop:Has_pKa ?pka
}
\end{verbatim}
and make a histogram of the values in the wiki:
<<echo=F>>=
sparql = '
PREFIX prop: <http://wiki.toxbank.net/w/index.php/Special:URIResolver/Property-3A>
SELECT ?pka WHERE {
?substance prop:Has_pKa ?pka
}
'
@
<<label=histpkas,fig=TRUE,include=FALSE>>=
pkas = as.numeric(sparql.remote(toxbank, sparql, user=user,password=password))
hist(pkas, main="", xlab="pKa", breaks=5)
@
The results are shown in Figure~\ref{fig:pkas}.
\begin{figure}[t!]
\begin{center}
\includegraphics[width=3in]{sparql-histpkas}
\end{center}
\caption{Distribution of pKa values of all gold compounds.}
\label{fig:pkas}
\end{figure}
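The histogram code converts the query result with \textit{as.numeric}; if an
entry in the wiki is not a plain number, that conversion yields \textit{NA}.
A small sketch with invented placeholder values (not actual wiki data) that
drops such entries before summarizing and plotting:
\begin{verbatim}
# Invented example values; "n/a" stands for a non-numeric entry.
values = c("4.5", "9.1", "n/a", "7.0")

# as.numeric() yields NA (with a warning) for non-numeric strings.
pkas = suppressWarnings(as.numeric(values))
pkas = pkas[!is.na(pkas)]

summary(pkas)
hist(pkas, main="", xlab="pKa", breaks=5)
\end{verbatim}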
\printbibliography
\end{document}