Images dataset contains wrong triples #720

jlareck · 2021-12-02T11:14:21Z

Issue validity

Some explanation: DBpedia Snapshot is produced every three months, see Release Frequency & Schedule, which is loaded into http://dbpedia.org/sparql . During these three months, Wikipedia changes and also the DBpedia Information Extraction Framework receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/ we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. Berlin or Joe_Biden here: http://dief.tools.dbpedia.org/server/extraction/en/
If the issue persists, please post the link from your browser here:

https://dbpedia.org/page/Borysthenia_goldfussiana
https://dbpedia.org/page/Ingoldiomyces
There are more triples in the DBpedia snapshot 2021-09 that contain this issue

Error Description

Please state the nature of your technical emergency:

It looks like ImageExtractorNew produces triples for Wikipedia pages that don't contain images. For example, https://en.wikipedia.org/wiki/Borysthenia_goldfussiana doesn't contain any image, but ImageExtractorNew produced a triple with the image http://commons.wikimedia.org/wiki/Special:FilePath/T-72_ATE_South_Africa.jpg from it. The same issue occurs with https://en.wikipedia.org/wiki/Ingoldiomyces: the page doesn't contain any picture, but ImageExtractorNew still produced a triple with the image https://upload.wikimedia.org/wikipedia/commons/c/cf/B%26N_nook_Logo.svg

Pinpointing the source of the error

Where did you find the data issue? Non-exhaustive options are:

Web/SPARQL, e.g. http://dbpedia.org/sparql or http://dbpedia.org/resource/Berlin, please provide query or link
Dumps: dumps are managed by the Databus. Please provide artifact & version or download link
DIEF: you ran the software and the error occurred then, please include all necessary information such as the extractor or log. If you had problems running the software, use another issue template

This error occurs in https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/ImageExtractorNew.scala

Details

please post the details

Wrong triples RDF snippet

<http://dbpedia.org/resource/Borysthenia_goldfussiana> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/T-72_ATE_South_Africa.jpg> .

<http://dbpedia.org/resource/Ingoldiomyces> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/B&N_nook_Logo.svg> .

Expected / corrected RDF outcome snippet

We must remove triples of this kind.
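
A minimal sketch of such a clean-up, in Python and under loud assumptions: `files_on_page` is a hypothetical mapping from a resource IRI to the file names its Wikipedia article actually embeds, and none of the helper names below come from the extraction framework. This is not how ImageExtractorNew works; it only illustrates dropping foaf:depiction triples whose file is not used on the page.

```python
import re
from urllib.parse import unquote

FOAF_DEPICTION = "http://xmlns.com/foaf/0.1/depiction"
# Matches a simple N-Triples line whose subject, predicate and object are all IRIs.
# The trailing dot is optional so that snippets pasted without it still parse.
TRIPLE_RE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.?\s*$")


def parse_iri_triple(line):
    """Return (subject, predicate, object) for an IRI-only triple, else None."""
    match = TRIPLE_RE.match(line.strip())
    return match.groups() if match else None


def file_name(image_iri):
    """Take the last path segment of an image IRI and decode percent-escapes."""
    name = unquote(image_iri.rsplit("/", 1)[-1])
    return name.replace(" ", "_")


def drop_unrelated_depictions(lines, files_on_page):
    """Drop foaf:depiction triples whose file is not embedded in the article.

    files_on_page is a hypothetical input: resource IRI -> set of file names.
    Lines that are not foaf:depiction triples are kept unchanged.
    """
    kept = []
    for line in lines:
        triple = parse_iri_triple(line)
        if triple is not None:
            subject, predicate, obj = triple
            if predicate == FOAF_DEPICTION:
                if file_name(obj) not in files_on_page.get(subject, set()):
                    continue  # image is not used on the page: skip the triple
        kept.append(line)
    return kept
```

With an empty entry for Borysthenia_goldfussiana in `files_on_page`, the first triple from the snippet above would be dropped, since the article embeds no images.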

Example DBpedia resource URL(s)

Other

jaygray0919 · 2022-01-07T19:19:27Z

I have an extensive sample set that we can use to test when this issue is resolved
/jay gray

jlareck · 2022-01-12T08:38:30Z

@jaygray0919 Could you please send this sample set? It looks like I resolved the issue, but I'm not sure it is resolved completely (at least the produced dataset doesn't contain the <http://dbpedia.org/resource/Borysthenia_goldfussiana> and <http://dbpedia.org/resource/Ingoldiomyces> triples, but it would be good to check for other wrong triples).

jaygray0919 · 2022-01-12T16:17:43Z

@jlareck try using this:
https://afdsi.com/sparql-species/#/specierch/gold
i can explain the app if you are interested
/jay

jaygray0919 · 2022-01-12T16:26:26Z

@jlareck this also worked well 6 months ago, but is now very slow/unresponsive:
https://afdsi.com/search-dbpedia-tv-shows/?#genre=&language=&country=&
it seems/feels-like the parser is 'in a twist'
do you see any obvious reasons for its sluggishness?
/jay

jaygray0919 · 2022-01-19T18:13:07Z

@jlareck anything we can do to help out here?
if possible, we'd like to feature this and related DBpedia/SPARQL apps as part of a product launch in the near future
so we're motivated to help restore the images previously served by the SPARQL queries
/jay

jlareck · 2022-01-19T19:22:57Z

> try using this:
> https://afdsi.com/sparql-species/#/specierch/gold
> i can explain the app if you are interested
> /jay

Hi @jaygray0919, thank you for providing this link with examples! I checked some triples in the image dataset of the upcoming release. As far as I can see, some wrong images are no longer extracted, but there are still some triples that contain images not related to the wiki page. So the image extractor that produces the data is only partially fixed.

> if possible, we'd like to feature this and related DBpedia/SPARQL apps as part of a product launch in the near future
> so we're motivated to help restore the images previously served by the SPARQL queries

Could you please provide more details about what you want to do?

jaygray0919 · 2022-01-19T21:20:22Z

The url Species is one of our DBpedia/SPARQL applications. To reprise the above: "the content ain't right"
Previously, when it "was right" the images for the queries were 100% correct (we checked extensively over a year ago - zero errors).
Our request: restore the last good version.
Now, we're not so naive to think that's easy; since the last solid data set, many changes have been applied.
But the bottom line: DBpedia content has been corrupted.
While we can determine that item images are corrupt, there may be other errors that also crept in somewhere during an update. It's highly unlikely that only the image files are fubar - my guess is that there are problems with other item properties.
An indicator is the performance problems we see with another SPARQL application - TV Shows
A year ago, this app worked very well. It is now very slow and produces irregular results.
We're far more concerned with Species than TV Shows and are willing to "pitch in" and find the last good dataset (the version with uncorrupted image property values).
Does that make sense? Anything short-term we can do to restore an uncorrupted dataset?

jlareck · 2022-01-24T09:23:37Z

Hi @jaygray0919, sorry, but it looks like we cannot restore an uncorrupted dataset at the moment. The image dataset should have better quality in the upcoming release, but it still contains some wrong triples. I am tracking down those triples now, and we will try to fix the image extraction by the next release.

jaygray0919 · 2022-01-24T12:55:37Z

Got it.
Then we'll be happy to work with you to incrementally identify misaligned images in the next release.
Then you can use that list to correct a subsequent release.
ITMT, the link we shared above will display - for biologics - misaligned images.
It's a one-at-a-time process, but it might help you identify patterns that we cannot easily see (e.g. a consistent pairing of biologics/non-biologics).
For example, there is a high concentration of military weapons in our biologic queries.

jlareck · 2022-02-10T08:49:49Z

Hi @jaygray0919, could you please check more images on your website to see whether any are still incorrect? It seems to me that I fixed the image extraction and all images should now be correct. Thank you

jaygray0919 · 2022-02-10T13:24:05Z

Hello @jlareck - will do; will report back today/tomorrow
Thank you for doing this work.

jaygray0919 · 2022-02-10T13:45:46Z

Previous errors that have been corrected:
https://afdsi.com/sparql-species/#/specierch/gold
https://afdsi.com/sparql-species/#/specierch/green
https://afdsi.com/sparql-species/#/specierch/taurus

Small problems:
https://afdsi.com/sparql-species/#/specierch/red
Feredayia graminosa

I'll look for other errors later today

jaygray0919 · 2022-02-10T17:43:50Z

foaf:depiction

https://dbpedia.org/page/Pseudocharopa_whiteleggei
https://commons.wikimedia.org/wiki/Special:Redirect/file/Lord_Howe_Island.png

https://dbpedia.org/page/Chiasmia_goldiei
https://commons.wikimedia.org/wiki/Special:Redirect/file/Chiasmia_goldiei.jpg

https://dbpedia.org/page/Golden_volute
https://commons.wikimedia.org/wiki/Special:Redirect/file/Iredalina_mirabilis.jpg

https://dbpedia.org/page/Pictured_rove_beetle
https://commons.wikimedia.org/wiki/Special:Redirect/file/thinopinus_pictus.jpg

https://dbpedia.org/page/Tenthredo_amoena
https://commons.wikimedia.org/wiki/Special:Redirect/file/Tenthredinidae_-_Tenthredo_amoena.jpg

https://dbpedia.org/page/Tenthredo_crassa
https://commons.wikimedia.org/wiki/Special:Redirect/file/Tenthredinidae_-_Tenthredo_crassa-001.jpg

jlareck · 2022-02-10T18:54:16Z

> Small problems:
> https://afdsi.com/sparql-species/#/specierch/red
> Feredayia graminosa

Actually, this is the correct image. Check the page https://en.wikipedia.org/wiki/Feredayia_graminosa ; this article contains 3 images. I think that, since the current version of the image extraction extracts all pictures from a wiki page and produces multiple triples with foaf:depiction, you could show all of those pictures on your website instead of only one. Otherwise, if you want to show only the first picture from the wiki page, you can try to use dbo:thumbnail instead of foaf:depiction.
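
The two options above can be sketched in Python as follows. This is only an illustration under assumptions: the query text and the `group_images` helper are hypothetical and are not taken from any DBpedia tooling, and the input is assumed to be the `results.bindings` list of a standard SPARQL JSON result.

```python
from collections import defaultdict

# Hypothetical query: fetches every foaf:depiction plus the dbo:thumbnail, if any.
IMAGES_QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?resource ?depiction ?thumbnail WHERE {
  VALUES ?resource { <http://dbpedia.org/resource/Feredayia_graminosa> }
  OPTIONAL { ?resource foaf:depiction ?depiction }
  OPTIONAL { ?resource dbo:thumbnail ?thumbnail }
}
"""


def group_images(bindings):
    """Group SPARQL JSON bindings into one entry per resource.

    Each entry holds the thumbnail (or None) and the list of distinct
    depictions in the order in which they were returned.
    """
    grouped = defaultdict(lambda: {"thumbnail": None, "depictions": []})
    for row in bindings:
        entry = grouped[row["resource"]["value"]]
        if "thumbnail" in row:
            entry["thumbnail"] = row["thumbnail"]["value"]
        if "depiction" in row:
            image = row["depiction"]["value"]
            if image not in entry["depictions"]:
                entry["depictions"].append(image)
    return dict(grouped)
```

A page that wants a single picture can read `thumbnail`; a page that wants a gallery can iterate over `depictions`.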

> foaf:depiction
>
> https://dbpedia.org/page/Pseudocharopa_whiteleggei
> https://commons.wikimedia.org/wiki/Special:Redirect/file/Lord_Howe_Island.png
>
> https://dbpedia.org/page/Chiasmia_goldiei
> https://commons.wikimedia.org/wiki/Special:Redirect/file/Chiasmia_goldiei.jpg

Regarding this, I think it is one more issue in the image extraction that I didn't notice before; this one is related to creating incorrect links to Wikimedia images.

jaygray0919 · 2022-02-10T20:27:13Z

Unfortunately, your (sensible) exception handling is difficult to implement.
We 'grab' the first instance and do not iterate on subsequent instances.
And dimensions for dbo:thumbnail do not look good on desktop (they are passable on mobile, but we need to keep it simple).

Returning to the big picture, your corrections seem to handle the glaring issues (biologics like Russian tanks; aircraft; etc.)
If you can correct the null values, that will further improve the display.
Bottom line: queries are dramatically improved; thank you for that.
/jay

JJ-Author · 2022-02-11T09:58:05Z

@jlareck good first milestone :-). But can you please write the documentation for the images dataset https://databus.dbpedia.org/dbpedia/generic/images/ and explain what to expect there?
I think this is important for users to understand the difference between foaf:depiction, dbo:thumbnail and foaf:thumbnail. I find it confusing: I had to look in the code to get an impression, which is not good...

@jaygray0919 thanks for testing and finding issues.
But I do not understand your issue with multiple images; there seems to be no complexity in that, right? Just write the SPARQL query so that only one image is returned, or use the thumbnail and cut off the size parameter at the end.
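
Both suggestions could look roughly like the Python sketch below. It rests on assumptions: the function names are hypothetical, `ORDER BY ... LIMIT 1` picks a stable but arbitrary image (not necessarily the first one on the wiki page), and thumbnail IRIs are assumed to carry their size as a `width` query parameter such as `?width=300`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def one_image_query(resource_iri):
    """Build a SPARQL query that returns at most one foaf:depiction value."""
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "SELECT ?image WHERE {\n"
        f"  <{resource_iri}> foaf:depiction ?image\n"
        "}\n"
        "ORDER BY ?image\n"
        "LIMIT 1"
    )


def drop_width(thumbnail_iri):
    """Remove the (assumed) width parameter from a thumbnail IRI.

    Any other query parameters are kept; an IRI without a query is unchanged.
    """
    parts = urlsplit(thumbnail_iri)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "width"
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Either approach leaves the application's single-select logic in place: the query still yields one row, and the thumbnail route only changes the IRI that is displayed.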

jaygray0919 · 2022-02-11T21:47:18Z

@JJ-Author I'll revisit the SPARQL query, which has some age to it.
When doing the original engineering, we did not see or foresee the need to test for more than one image; our single select on foaf:depiction worked 100% of the time.
However, it will be much more difficult to read multiple properties and test for multiple images.
Based on @jlareck's corrections, we're at ~90% of our previous results, which is acceptable.
I'm reluctant to make an isolated change to a large program at this time.
When we do reopen the beast, we'd like to add new features like autosuggest to limit the scope of the query.
The current version hits DBpedia fairly hard, and we'd like to implement a more refined query.
We'd also like to introduce a "You also may be interested in" using a reasoner (which, of course, adds back complexity).
Bottom line: we'd like to help improve data quality thru testing, but postpone changes to the app until we have a new plan.

jlareck · 2022-02-13T19:56:55Z

@JJ-Author I made a pull request with the documentation for the image dataset: dbpedia/marvin-config#4 . Could you please check it?

jlareck self-assigned this Dec 2, 2021

jlareck added the status: fix-required (PR related to issue is needed), status: minidump-test-required, and type: data labels Dec 2, 2021

Vehnem added the status: accepted label Dec 3, 2021
