Images dataset contains wrong triples #720

Open
jlareck opened this issue Dec 2, 2021 · 18 comments

@jlareck
Collaborator

jlareck commented Dec 2, 2021

Issue validity

Some explanation: DBpedia Snapshot is produced every three months, see Release Frequency & Schedule, which is loaded into http://dbpedia.org/sparql . During these three months, Wikipedia changes and also the DBpedia Information Extraction Framework receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/ we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. Berlin or Joe_Biden here: http://dief.tools.dbpedia.org/server/extraction/en/
If the issue persists, please post the link from your browser here:

https://dbpedia.org/page/Borysthenia_goldfussiana
https://dbpedia.org/page/Ingoldiomyces
There are more triples in the DBpedia snapshot 2021-09 that contain this issue

Error Description

Please state the nature of your technical emergency:

It looks like ImageExtractorNew produces triples for Wikipedia pages that don't contain any images. For example, https://en.wikipedia.org/wiki/Borysthenia_goldfussiana contains no image, but ImageExtractorNew produced a triple for it with the image http://commons.wikimedia.org/wiki/Special:FilePath/T-72_ATE_South_Africa.jpg . The same happens with https://en.wikipedia.org/wiki/Ingoldiomyces : the page contains no picture, but ImageExtractorNew produced a triple with the image https://upload.wikimedia.org/wikipedia/commons/c/cf/B%26N_nook_Logo.svg

Pinpointing the source of the error

Where did you find the data issue? Non-exhaustive options are:

This error occurs in https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/ImageExtractorNew.scala

Details

please post the details

Wrong triples RDF snippet

<http://dbpedia.org/resource/Borysthenia_goldfussiana> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/T-72_ATE_South_Africa.jpg> .

<http://dbpedia.org/resource/Ingoldiomyces> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/B&N_nook_Logo.svg> .

Expected / corrected RDF outcome snippet

Triples of this kind must be removed.
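A minimal sketch of the cleanup described above, assuming the affected subjects are already known. The function name `drop_depictions` and the `bad_subjects` set are hypothetical, and the third sample line uses a placeholder file name. This is only a post-hoc filter over N-Triples lines, not a description of how ImageExtractorNew works or how it was fixed.

```python
import re

FOAF_DEPICTION = "http://xmlns.com/foaf/0.1/depiction"

# Matches a simple N-Triples line whose subject, predicate and object are all IRIs.
TRIPLE_RE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def drop_depictions(lines, bad_subjects):
    """Yield every line except foaf:depiction triples whose subject is in bad_subjects.

    Lines that do not match the simple IRI-only pattern are passed through unchanged.
    """
    for line in lines:
        match = TRIPLE_RE.match(line)
        if match:
            subject, predicate, _ = match.groups()
            if predicate == FOAF_DEPICTION and subject in bad_subjects:
                continue
        yield line


triples = [
    "<http://dbpedia.org/resource/Borysthenia_goldfussiana> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/T-72_ATE_South_Africa.jpg> .",
    "<http://dbpedia.org/resource/Ingoldiomyces> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/B&N_nook_Logo.svg> .",
    # Placeholder object IRI for illustration only, not real data.
    "<http://dbpedia.org/resource/Feredayia_graminosa> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/Placeholder.jpg> .",
]

# Subjects reported in this issue as having no images on their Wikipedia page.
bad_subjects = {
    "http://dbpedia.org/resource/Borysthenia_goldfussiana",
    "http://dbpedia.org/resource/Ingoldiomyces",
}

cleaned = list(drop_depictions(triples, bad_subjects))
```

The regular expression only handles triples whose three terms are IRIs, which is the shape of the snippets above; a real cleanup would more likely use a proper RDF parser.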


Example DBpedia resource URL(s)


Other

@jaygray0919

I have an extensive sample set that we can use to test when this issue is resolved
/jay gray

@jlareck
Collaborator Author

jlareck commented Jan 12, 2022

@jaygray0919 Could you please send this sample set? It looks like I have resolved the issue, but I am not sure it is resolved completely. At least the produced dataset no longer contains the <http://dbpedia.org/resource/Borysthenia_goldfussiana> and <http://dbpedia.org/resource/Ingoldiomyces> triples, but it would be good to check for other wrong triples.

@jaygray0919

jaygray0919 commented Jan 12, 2022

@jlareck try using this:
https://afdsi.com/sparql-species/#/specierch/gold
i can explain the app if you are interested
/jay

@jaygray0919

jaygray0919 commented Jan 12, 2022

@jlareck this also worked well 6 months ago, but is now very slow/unresponsive:
https://afdsi.com/search-dbpedia-tv-shows/?#genre=&language=&country=&
it seems/feels-like the parser is 'in a twist'
do you see any obvious reasons for its sluggishness?
/jay

@jaygray0919

@jlareck anything we can do to help out here?
if possible, we'd like to feature this and related DBpedia/SPARQL apps as part of a product launch in the near future
so we're motivated to help restore the images previously served by the SPARQL queries
/jay

@jlareck
Collaborator Author

jlareck commented Jan 19, 2022

> try using this:
> https://afdsi.com/sparql-species/#/specierch/gold
> i can explain the app if you are interested
> /jay

Hi @jaygray0919, thank you for providing this link with examples! I checked some triples in the image dataset of the upcoming release. Some of the wrong images are no longer extracted, but there are still some triples that contain images unrelated to the wiki page. So the image extractor that produces the data is only partially fixed.

> if possible, we'd like to feature this and related DBpedia/SPARQL apps as part of a product launch in the near future
> so we're motivated to help restore the images previously served by the SPARQL queries

Could you please provide more details about what you want to do?

@jaygray0919

The url Species is one of our DBpedia/SPARQL applications. To reprise the above: "the content ain't right"
Previously, when it "was right" the images for the queries were 100% correct (we checked extensively over a year ago - zero errors).
Our request: restore the last good version.
Now, we're not so naive to think that's easy; since the last solid data set, many changes have been applied.
But the bottom line: DBpedia content has been corrupted.
While we can determine that item images are corrupt, there may be other errors that also crept in somewhere during an update. It's highly unlikely that only the image files are fubar - my guess is that there are problems with other item properties.
An indicator is the performance problems we see with another SPARQL application - TV Shows
A year ago, this app worked very well. It is now very slow and produces irregular results.
We're far more concerned with Species than TV Shows and are willing to "pitch in" and find the last good dataset (the version with uncorrupted image property values).
Does that make sense? Anything short-term we can do to restore an uncorrupted dataset?

@jlareck
Collaborator Author

jlareck commented Jan 24, 2022

Hi @jaygray0919, sorry, but it looks like we cannot restore an uncorrupted dataset at the moment. The image dataset should have better quality in the upcoming release, but it still contains some wrong triples. I am tracking down those triples now, and we will try to fix the image extraction before the next release.

@jaygray0919

Got it.
Then we'll be happy to work with you to incrementally identify misaligned images in the next release.
Then you can use that list to correct a subsequent release.
In the meantime, the link we shared above will display misaligned images for biologics.
It's a one-at-a-time process, but it might help you identify patterns that we cannot easily see (e.g. a consistent pairing of biologics/non-biologics).
For example, there is a high concentration of military weapons in our biologic queries.

@jlareck
Collaborator Author

jlareck commented Feb 10, 2022

Hi @jaygray0919, could you please check more images on your website and see whether any are still incorrect? It seems to me that I have fixed the image extraction and all images should now be correct. Thank you

@jaygray0919

Hello @jlareck - will do; will report back today/tomorrow
Thank you for doing this work.

@jaygray0919

Previous errors that have been corrected:
https://afdsi.com/sparql-species/#/specierch/gold
https://afdsi.com/sparql-species/#/specierch/green
https://afdsi.com/sparql-species/#/specierch/taurus

Small problems:
https://afdsi.com/sparql-species/#/specierch/red
Feredayia graminosa

I'll look for other errors later today

@jlareck
Collaborator Author

jlareck commented Feb 10, 2022

> Small problems:
> https://afdsi.com/sparql-species/#/specierch/red
> Feredayia graminosa

Actually, this is the correct image. Check the page https://en.wikipedia.org/wiki/Feredayia_graminosa : the article contains 3 images. I think that, since the current version of the image extraction extracts all pictures from a wiki page and produces multiple triples with foaf:depiction, you can show all of those pictures on your website rather than only one. Otherwise, if you want to show only the first picture from the wiki page, you can try using dbo:thumbnail instead of foaf:depiction .
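As a rough illustration of the two options above, here is a sketch that builds a SPARQL query returning at most one image per resource. The helper name `single_image_query` is hypothetical. Note that `SAMPLE` picks an arbitrary value, so with foaf:depiction it is not guaranteed to return the first picture on the page; passing `dbo:thumbnail` selects the thumbnail property instead.

```python
PREFIXES = (
    "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
    "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
)

ALLOWED_PROPERTIES = ("foaf:depiction", "dbo:thumbnail")


def single_image_query(resource_iri, prop="foaf:depiction"):
    """Return a SPARQL query string that yields at most one image for resource_iri."""
    if prop not in ALLOWED_PROPERTIES:
        raise ValueError(f"unsupported property: {prop}")
    # Reject characters that are not allowed inside a SPARQL IRI reference.
    if any(ch in resource_iri for ch in '<>"{}|^`\\ '):
        raise ValueError("resource_iri contains characters not allowed in an IRI")
    return (
        PREFIXES
        + "SELECT (SAMPLE(?img) AS ?image) WHERE {\n"
        + f"  <{resource_iri}> {prop} ?img .\n"
        + "}\n"
    )


query = single_image_query("http://dbpedia.org/resource/Feredayia_graminosa")
thumbnail_query = single_image_query(
    "http://dbpedia.org/resource/Feredayia_graminosa", prop="dbo:thumbnail"
)
```

The resulting string could be sent to a SPARQL endpoint such as http://dbpedia.org/sparql with any HTTP client; that step is left out here so the sketch stays free of network calls.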

foaf:depiction

https://dbpedia.org/page/Pseudocharopa_whiteleggei
https://commons.wikimedia.org/wiki/Special:Redirect/file/Lord_Howe_Island.png

https://dbpedia.org/page/Chiasmia_goldiei
https://commons.wikimedia.org/wiki/Special:Redirect/file/Chiasmia_goldiei.jpg

Regarding these, I think this is one more issue in the image extraction that I didn't notice before; this one is related to creating incorrect links to Wikimedia images.

@jaygray0919

Unfortunately, your (sensible) exception handling is difficult to implement.
We 'grab' the first instance and do not iterate on subsequent instances.
And dimensions for dbo:thumbnail do not look good on desktop (they are passable on mobile, but we need to keep it simple).

Returning to the big picture, your corrections seem to handle the glaring issues (biologics like Russian tanks; aircraft; etc.)
If you can correct the null values, that will further improve the display.
Bottom line: queries are dramatically improved; thank you for that.
/jay

@JJ-Author
Contributor

@jlareck good first milestone :-). But can you please write the documentation for the images dataset https://databus.dbpedia.org/dbpedia/generic/images/ and explain what to expect there?
I think this is important knowledge for users, so that they understand the difference between foaf:depiction, dbo:thumbnail and foaf:thumbnail. For me it is confusing; I had to look in the code to get an impression, and that is not good...

@jaygray0919 thanks for testing and finding issues.
But I do not understand your issue with multiple images; there seems to be no complexity in that, right? Just write the SPARQL query so that only one image is returned, or use the thumbnail and cut off the size parameter at the end?
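For the second suggestion, a small sketch of cutting off the size parameter. It assumes the thumbnail URL carries its size as a `width` query parameter (for example `...?width=300`); that parameter name and the helper `strip_width` are assumptions, not something confirmed in this thread.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_width(url):
    """Return url without a 'width' query parameter; other parameters are kept."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "width"
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# The '?width=300' suffix is an assumed example of a thumbnail size parameter.
full_size = strip_width(
    "http://commons.wikimedia.org/wiki/Special:FilePath/Chiasmia_goldiei.jpg?width=300"
)
```

URLs without a `width` parameter pass through unchanged, so the function can be applied to every value regardless of which property it came from.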

@jaygray0919

@JJ-Author I'll revisit the SPARQL query, which has some age to it.
When doing the original engineering, we did not see or foresee the need to test for more than one image; our single select on foaf:depiction worked 100% of the time.
However, it will be much more difficult to read multiple properties and test for multiple images.
Based on @jlareck's corrections, we're at ~90% of our previous results, which is acceptable.
I'm reluctant to make an isolated change to a large program at this time.
When we do reopen the beast, we'd like to add new features like autosuggest to limit the scope of the query.
The current version hits DBpedia fairly hard, and we'd like to implement a more refined query.
We'd also like to introduce a "You also may be interested in" using a reasoner (which, of course, adds back complexity).
Bottom line: we'd like to help improve data quality thru testing, but postpone changes to the app until we have a new plan.

@jlareck
Collaborator Author

jlareck commented Feb 13, 2022

@JJ-Author I made a pull request with the documentation for the image dataset: dbpedia/marvin-config#4 . Could you please check it?
