Add DBPedia Direct Lookup #389

Open
sfolsom opened this issue Nov 6, 2023 · 5 comments

@sfolsom
Contributor

sfolsom commented Nov 6, 2023

DBPedia's SPARQL Endpoint: https://dbpedia.org/sparql.

This might be too complicated: https://github.com/dbpedia/ontology-driven-api.

Likely want to be searching on a combination of rdfs:label and foaf:name. The rest of the modeling is too uneven across the different entity types to do anything general. I was considering dbo:abstract as a possibility for a display value, but the abstracts are too long to present to a cataloger in a lookup.

@chrisrlc chrisrlc self-assigned this Nov 7, 2023
@chrisrlc

chrisrlc commented Nov 7, 2023

Possible query:

    CONSTRUCT { ?uri rdfs:label ?label; foaf:name ?label. }
    WHERE {
      { ?uri rdfs:label ?label
        FILTER bif:contains(?label, '"Cristiano Ronaldo*"')
        FILTER (langMatches(lang(?label),"en")) }
      UNION
      { ?uri foaf:name ?label
        FILTER bif:contains(?label, '"Cristiano Ronaldo*"')
        FILTER (langMatches(lang(?label),"en")) }
    }

Lang can be set by the user in an optional lang= parameter in the QA endpoint, but will be set to "en" by default.
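
As a rough illustration of how the search phrase and the optional lang value could be substituted into the query above, here is a minimal Python sketch. The function name `build_dbpedia_query` and the character-stripping approach are assumptions made for this example only; they are not the QA implementation, and real escaping rules for bif:contains may differ.

```python
def build_dbpedia_query(term: str, lang: str = "en") -> str:
    """Build the CONSTRUCT query from this comment for a search term and language.

    Hypothetical helper for illustration; not part of QA.
    """
    # Drop quote and backslash characters so the term cannot break out of
    # the quoted phrase (a simplification, not a full escaping scheme).
    safe_term = "".join(ch for ch in term if ch not in "\"'\\").strip()
    # Keep only characters that can appear in a simple language tag;
    # fall back to "en" if nothing usable remains.
    safe_lang = "".join(ch for ch in lang if ch.isalnum() or ch == "-") or "en"
    # bif:contains takes a single-quoted string holding a double-quoted
    # phrase with a trailing wildcard, as in the query above.
    phrase = f"'\"{safe_term}*\"'"
    return (
        "CONSTRUCT { ?uri rdfs:label ?label; foaf:name ?label. } WHERE { "
        f"{{ ?uri rdfs:label ?label FILTER bif:contains(?label, {phrase}) "
        f"FILTER (langMatches(lang(?label),\"{safe_lang}\")) }} "
        "UNION "
        f"{{ ?uri foaf:name ?label FILTER bif:contains(?label, {phrase}) "
        f"FILTER (langMatches(lang(?label),\"{safe_lang}\")) }} }}"
    )


# Example: default language, then an explicit lang value.
default_query = build_dbpedia_query("Cristiano Ronaldo")
french_query = build_dbpedia_query("Cristiano Ronaldo", lang="fr")
```

Calling the helper without a lang argument reproduces the "en" default described above.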

@sfolsom Does this sparql query return the entities you expect? Are there additional fields you'd like to display with the context=true parameter set in the QA endpoint?

@sfolsom
Contributor Author

sfolsom commented Nov 7, 2023

This is starting to get beyond my SPARQL experience, but I'm wondering about how to have a single response for each URI. That might be as simple as constructing only the rdfs:label.

Does the query above require both foaf:name and rdfs:label to be present and identical? (I don't have much experience with UNION, and usually when the same variable, like ?label, is used in different statements, its value has to be the same.)

re: context, maybe we should add dbo:abstract. It's the only property I've found that's used consistently over all entity types. If Sinopia or other apps think the abstracts are too long they don't have to use them, or could set the app up to display up to a character limit with an ellipsis.
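
The character-limit idea could look something like the following sketch. The function name `truncate_abstract` and the default limit of 200 characters are assumptions for illustration, not anything Sinopia or QA currently does.

```python
def truncate_abstract(abstract: str, limit: int = 200) -> str:
    """Shorten a long dbo:abstract for display, ending with an ellipsis.

    Hypothetical display helper; the limit is an arbitrary example value.
    """
    if len(abstract) <= limit:
        return abstract
    # Cut at the limit, then back up to the last space so a word is not
    # split in the middle (if there is no space, the hard cut is kept).
    cut = abstract[:limit].rsplit(" ", 1)[0]
    # The ellipsis is appended after the cut, so the result can be up to
    # three characters longer than the limit.
    return cut.rstrip(" ,;:.") + "..."
```

An app that wants the full abstract can simply skip this step, since the lookup response would still carry the complete value.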

@chrisrlc

chrisrlc commented Nov 7, 2023

The HTML table format for a CONSTRUCT query displays a row for each subject-predicate-object triple, so it'll look like the URI is listed multiple times, but the RDF/XML output groups by URI correctly:

    CONSTRUCT { ?uri rdfs:label ?label; foaf:name ?name. }
    WHERE {
      { ?uri rdfs:label ?label
        FILTER bif:contains(?label, '"Cristiano Ronaldo*"')
        FILTER (langMatches(lang(?label),"en")) }
      UNION
      { ?uri foaf:name ?name
        FILTER bif:contains(?name, '"Cristiano Ronaldo*"')
        FILTER (langMatches(lang(?name),"en")) }
    }

The UNION should just be working on the ?uri subject, so the number of distinct returned uris should be correct, but I've updated the query above to make the results differentiate properly between matching rdfs:label vs foaf:name - good call on that.
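
To illustrate the point about one row per triple versus one entry per URI, here is a small Python sketch that groups triples by subject. The function name `group_by_subject` and the sample values are made up for the example; they only mimic the shape of a CONSTRUCT result.

```python
from collections import defaultdict


def group_by_subject(triples):
    """Collapse (subject, predicate, object) rows into one entry per subject."""
    grouped = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        grouped[subject][predicate].append(obj)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {subject: dict(props) for subject, props in grouped.items()}


# Illustrative rows: the same URI appears once per triple, as in the
# HTML table view of a CONSTRUCT result.
uri = "http://dbpedia.org/resource/Cristiano_Ronaldo"
rows = [
    (uri, "rdfs:label", "Cristiano Ronaldo"),
    (uri, "foaf:name", "Cristiano Ronaldo"),
    (uri, "foaf:name", "Cristiano Ronaldo dos Santos Aveiro"),
]
by_uri = group_by_subject(rows)  # one key, with labels and names kept apart
```

Three rows collapse into a single entry, with rdfs:label and foaf:name values kept in separate lists.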

@chrisrlc

chrisrlc commented Nov 8, 2023

This has been deployed to lookup-int and is ready for the first round of testing. Example endpoint: https://lookup-int.ld4l.org/authorities/search/linked_data/dbpedia_direct?q=Cristiano%20Ronaldo

With context: https://lookup-int.ld4l.org/authorities/search/linked_data/dbpedia_direct?q=Cristiano%20Ronaldo&context=true
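
For anyone scripting test requests, the two example URLs above can be assembled with a sketch like this one. The helper name `search_url` is hypothetical, and the `lang` parameter is assumed from the earlier comment about an optional lang= parameter; only `q` and `context=true` appear in the URLs above.

```python
from urllib.parse import quote, urlencode

# Base path taken from the example endpoint above.
BASE = "https://lookup-int.ld4l.org/authorities/search/linked_data/dbpedia_direct"


def search_url(q, context=False, lang=None):
    """Build a dbpedia_direct search URL (hypothetical helper for testing)."""
    params = {"q": q}
    if context:
        params["context"] = "true"
    if lang:
        # Assumed parameter name, based on the lang= option mentioned earlier.
        params["lang"] = lang
    # quote_via=quote encodes spaces as %20, matching the example URLs.
    return f"{BASE}?{urlencode(params, quote_via=quote)}"


plain = search_url("Cristiano Ronaldo")
with_context = search_url("Cristiano Ronaldo", context=True)
```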

It uses the following SPARQL query:

    CONSTRUCT { ?uri rdfs:label ?label; foaf:name ?name; dbo:abstract ?abstract. }
    WHERE {
      { ?uri rdfs:label ?labelMatch
        FILTER(bif:contains(?labelMatch, '"Cristiano Ronaldo*"') && langMatches(lang(?labelMatch),"en")) }
      UNION
      { ?uri foaf:name ?nameMatch
        FILTER(bif:contains(?nameMatch, '"Cristiano Ronaldo*"') && langMatches(lang(?nameMatch),"en")) }
      OPTIONAL { ?uri rdfs:label ?label }
      OPTIONAL { ?uri foaf:name ?name }
      OPTIONAL { ?uri dbo:abstract ?abstract }
      FILTER((!bound(?label) || langMatches(lang(?label),"en"))
          && (!bound(?name) || langMatches(lang(?name),"en"))
          && (!bound(?abstract) || langMatches(lang(?abstract),"en")))
    }

The query phrase is searched across rdfs:label and foaf:name, but the label returned by the QA endpoint is rdfs:label only. Currently, this behavior excludes results that don't have an rdfs:label, e.g. https://dbpedia.org/page/Elche_CF__Cristiano_Ronaldo__1 and https://dbpedia.org/page/2002%E2%80%9303_Sporting_CP_season__Cristiano_Ronaldo__1. If this is not desired behavior, I can adjust this to either include both rdfs:label AND foaf:name in the Label value (e.g. "label": "[2008 FIFA Club World Cup squads, Aaron Scott, Adriano, Agustín Delgado, Ahmad El-Sayed, Ahmad S..."), or I can try to add some logic to set Label equal to either rdfs:label OR the first foaf:name if rdfs:label isn't available. Please let me know if you have a preference.
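
The second option (rdfs:label, falling back to the first foaf:name) could be sketched roughly as follows. This is an illustration in Python with a made-up function name `choose_label`; it does not reflect how QA actually maps results to the Label value.

```python
def choose_label(labels, names):
    """Pick a display label: first rdfs:label, else first foaf:name, else None.

    `labels` and `names` are lists of strings for a single URI.
    """
    if labels:
        return labels[0]
    if names:
        return names[0]
    # Neither property is present, so there is nothing to display.
    return None


# A record with only foaf:name values would no longer be excluded.
fallback = choose_label([], ["Cristiano Ronaldo"])
```

With this kind of fallback, results that have a foaf:name but no rdfs:label would still get a Label value.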

Adding context=true to the endpoint displays rdfs:label, foaf:name, and dbo:abstract.

The accuracy test cases for dbpedia_direct (https://lookup-int.ld4l.org/check_status) are the same ones used for dbpedia_ld4l_cache, but a dbpedia_direct search for "volleyball" returns many more results because dbpedia_direct searches across different fields, so the expected position is off by thousands of records. These queries also take a long time because they fetch many more records. We can adjust these test cases if these benchmarks aren't very useful.

Results use the default QA sorting (alphabetic) because the SPARQL endpoint does not provide a search relevancy field that we can sort by.

Please let me know if any of this behavior should be different.

@chrisrlc

chrisrlc commented Nov 9, 2023

Updated lookup-int:

Labels
None yet
Projects
Status: Completed
Development

No branches or pull requests

2 participants