Problem with the client-side ("serviceprovider") implementation of ListRecords #278

Open
landreev opened this issue Oct 22, 2024 · 4 comments

Comments

@landreev
Collaborator

It appears that harvesting via ListRecords is broken. The reason we never noticed is that the Dataverse OAI client hasn't been using it; it relies instead on a ListIdentifiers call followed by a GetRecord call for each non-deleted identifier. I am, however, working on adding optional support for harvesting via ListRecords as well.

To skip directly to the punchline, I believe it all comes down to this line:

reader.next(elementName(localPart(equalTo("metadata"))));

The problem is that the <metadata> tag in question has already been consumed by the RecordParser before this parser is called, here:

reader.next(elementName(localPart(equalTo("metadata")))).next(aStartElement());

A larger fragment:

reader.next(elementName(localPart(equalTo("metadata")))).next(aStartElement());
String content = reader.retrieveCurrentAsString();
ByteArrayInputStream inputStream =
        new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
XSLPipeline pipeline =
        new XSLPipeline(inputStream, true)
                .apply(context.getMetadataTransformer(metadataPrefix));
if (context.hasTransformer()) pipeline.apply(context.getTransformer());
try {
    record.withMetadata(new Metadata(new MetadataParser().parse(pipeline.process())));
} catch (TransformerException e) {

In other words, when it's trying to parse this fragment of a ListRecords response:

<metadata>
  <oai_dc:dc xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
    <dc:title>This is my test dataset</dc:title>
    <dc:identifier>https://doi.org/10.5072/FK2/U6AEZM</dc:identifier>
    <dc:creator>Castro, Eleni</dc:creator>
    <dc:publisher>Demo Dataverse</dc:publisher>
    <dc:description>This is my dataset</dc:description>
    <dc:subject>Social Sciences</dc:subject>
    <dc:subject>Test</dc:subject>
    <dc:date>2015-04-20</dc:date>
    <dc:contributor>Admin, Dataverse</dc:contributor>
  </oai_dc:dc>
</metadata>

the content String (line 49 in the fragment quoted above) will only contain the <oai_dc:dc ...> ... </oai_dc:dc> part, without the enclosing <metadata> element, and that's where the next parser bombs with

java.util.NoSuchElementException null
StackTrace: 
com.ctc.wstx.evt.WstxEventReader.throwEndOfInput(WstxEventReader.java:511)
com.ctc.wstx.evt.WstxEventReader.nextEvent(WstxEventReader.java:270)
io.gdcc.xoai.xmlio.XmlReader.next(XmlReader.java:129)
io.gdcc.xoai.serviceprovider.parsers.MetadataParser.parse(MetadataParser.java:34)
io.gdcc.xoai.serviceprovider.parsers.RecordParser.parse(RecordParser.java:60)
io.gdcc.xoai.serviceprovider.parsers.ListRecordsParser.next(ListRecordsParser.java:58)
io.gdcc.xoai.serviceprovider.handler.ListRecordHandler.nextIteration(ListRecordHandler.java:67)
io.gdcc.xoai.serviceprovider.lazy.ItemIterator.hasNext(ItemIterator.java:31)
io.gdcc.xoai.serviceprovider.lazy.ItemIterator.<init>(ItemIterator.java:22)
io.gdcc.xoai.serviceprovider.ServiceProvider.listRecords(ServiceProvider.java:73)
edu.harvard.iq.dataverse.harvest.client.oai.OaiHandler.runListRecords(OaiHandler.java:266)
edu.harvard.iq.dataverse.harvest.client.HarvesterServiceBean.harvestOAIviaListRecords(HarvesterServiceBean.java:289)
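
For illustration, here is a minimal sketch of the failure mode described above: seeking forward for a <metadata> start element in a string that only holds the inner <oai_dc:dc> element runs off the end of the input. It uses the JDK's built-in StAX API rather than XOAI's XmlReader (so it is an analogy, not the library's actual code path), the class and helper names are hypothetical, and the xmlns declarations are added only so the shortened fragment parses on its own.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class; not part of XOAI or Dataverse.
class MetadataSeekDemo {

    // Shortened version of the fragment above. The namespace declarations
    // are an addition so that the string is parseable standalone.
    static final String INNER =
            "<oai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\""
                    + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                    + "<dc:title>This is my test dataset</dc:title>"
                    + "</oai_dc:dc>";

    // The same content with the enclosing <metadata> element still present.
    static final String WRAPPED = "<metadata>" + INNER + "</metadata>";

    // Returns true if a start element with the given local name can be
    // reached by reading forward; false if the input ends first.
    static boolean canSeek(String xml, String localName) {
        try {
            XMLEventReader reader =
                    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    XMLEvent event = reader.nextEvent();
                    if (event.isStartElement()
                            && localName.equals(
                                    event.asStartElement().getName().getLocalPart())) {
                        return true;
                    }
                }
                // End of input reached without finding the element. A reader
                // that calls nextEvent() unconditionally would throw here.
                return false;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("wrapped: " + canSeek(WRAPPED, "metadata"));
        System.out.println("inner only: " + canSeek(INNER, "metadata"));
    }
}
```

With the wrapper present the seek succeeds; with only the inner element it reaches the end of the input, which corresponds to the NoSuchElementException in the trace above.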

The fix appears to be as simple as commenting out line 34 in MetadataParser.java 😄.
But it would seem prudent to add a test or two that attempt to parse some example fragments.

@landreev
Collaborator Author

Hmm, I actually don't understand what's going on - looking at the existing RecordParser tests, I don't really get how they are passing.

@landreev
Collaborator Author

Ok, I see, the tests are passing because of
context = new Context().withMetadataTransformer("oai_dc", KnownTransformer.OAI_DC);
in the test setup.

@landreev
Collaborator Author

@poikilotherm I want to close this issue, since I opened it without understanding how that parser was supposed to work. (I warned upfront that that was a possibility.)
But can I keep it open for just a little longer, just to understand what's going on there?
Am I reading it correctly that XOAI can only harvest metadata for which it has a to_xoai XSL transform?

(as you can see, Dataverse hasn't been using this parser at all)

@landreev
Collaborator Author

I may ask for, and/or make a PR adding, an extra feature to record processing.
