Add support for OAI-harvesting from DataCite #10909

landreev · 2024-10-04T19:39:04Z

DataCite maintains an OAI server (https://oai.datacite.org/oai) serving records for every DOI they have registered. There is a lot of interest in being able to harvest from them (since these are all registered DOIs, they will be redirecting to the original archival location of the actual studies/datasets etc.)

There are a couple of issues that must be addressed before our OAI client implementation can do that.

The oai_dc import code in Dataverse expects the metadata fragment to be self-contained and, most importantly, expects the main persistent identifier (the DOI in this case) to be present in the <dc:identifier> field. DataCite, however, does not include the main DOI in the oai_dc metadata. Since they use these DOIs as the OAI identifiers as well, they assume that it is enough to include them in the OAI record header, in the <identifier> field, like this:

<record>
<header>
      <identifier>doi:10.7910/dvn/tjclkp</identifier>
      <datestamp>2023-01-03T21:08:00Z</datestamp>
      <setSpec>HARVARDU</setSpec>
      <setSpec>GDCC.HARVARD-DV</setSpec>
</header>
<metadata>
      <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
         <dc:title>Open Source at Harvard</dc:title>
         <dc:creator>Durbin, Philip</dc:creator>
         <dc:publisher>Harvard Dataverse</dc:publisher>
         <dc:date>2017</dc:date>
         <dc:date>Issued: 2017</dc:date>
         <dc:description>The tabular file contains information ...</dc:description>
         <dc:contributor>Durbin, Philip</dc:contributor>
         <dc:type>Dataset</dc:type>
     </oai_dc:dc>
</metadata>
</record>

Without the <dc:identifier>, our code in its current form fails to import the record above.
All we need to do is add some logic to use the identifier from the OAI header in situations like this. (We actually used to do that in one of the previous iterations of the harvester.)
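To illustrate the fallback described above, here is a minimal Python sketch (not the actual Dataverse import code, which is Java). The function name `pick_persistent_id` and the trimmed sample record are made up for illustration, and it naively assumes that any non-empty <dc:identifier> is usable as the persistent identifier; real import code would presumably want to validate it first.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Standard OAI-PMH and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def pick_persistent_id(record: ET.Element) -> Optional[str]:
    """Prefer a dc:identifier from the metadata; otherwise fall back to
    the identifier in the OAI record header (hypothetical helper)."""
    for dc_id in record.iterfind(".//dc:identifier", NS):
        text = (dc_id.text or "").strip()
        if text:
            return text
    header_id = record.findtext(
        "oai:header/oai:identifier", default="", namespaces=NS
    ).strip()
    return header_id or None


# Trimmed-down version of the DataCite record quoted above, with the
# OAI-PMH namespace declared so that it parses on its own.
SAMPLE_RECORD = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>doi:10.7910/dvn/tjclkp</identifier>
    <datestamp>2023-01-03T21:08:00Z</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
               xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
      <dc:title>Open Source at Harvard</dc:title>
    </oai_dc:dc>
  </metadata>
</record>
""".strip()

persistent_id = pick_persistent_id(ET.fromstring(SAMPLE_RECORD))
print(persistent_id)
```

With no <dc:identifier> in the metadata, the sketch falls back to the header value.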

The DataCite OAI implementation offers a very promising feature: it accepts arbitrary search queries as OAI set names (https://support.datacite.org/docs/datacite-oai-pmh#arbitrary-queries). This would make it possible to harvest individual records by their DOIs (something we've been asked for specifically) or any other subset of their offerings.
Example:

echo "doi%3A10.7910/DVN/TJCLKP" | base64 
ZG9pJTNBMTAuNzkxMC9EVk4vVEpDTEtQCg==

Now you can harvest this "set", made up of the single dataset above, as in
https://oai.datacite.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=~ZG9pJTNBMTAuNzkxMC9EVk4vVEpDTEtQCg==
Unfortunately, for whatever reason, the above notation only works in ListRecords, but not in ListIdentifiers, which is what Dataverse actually uses. From talking to DataCite, they may be able to fix it eventually, but not in an instant, "oh yeah, we just had this one line commented out" way.
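For reference, the set name from the shell example can be reproduced in Python roughly as follows. This is only a sketch: the function names are hypothetical, and it mirrors the plain `base64` command used above. Note that `echo` appends a newline to the query, which is why the encoded value above ends in `Cg==`; the `trailing_newline` flag exists only to reproduce that. Whether DataCite cares about the newline is not something established in this issue.

```python
import base64
from urllib.parse import quote

OAI_BASE = "https://oai.datacite.org/oai"


def datacite_query_set(query: str, trailing_newline: bool = False) -> str:
    """Turn a search query into a '~'-prefixed set name (hypothetical helper).

    The query is percent-encoded first ("doi:" becomes "doi%3A"),
    then base64-encoded, as in the shell example.
    """
    encoded_query = quote(query)
    if trailing_newline:
        # Reproduces what `echo ... | base64` does.
        encoded_query += "\n"
    return "~" + base64.b64encode(encoded_query.encode("utf-8")).decode("ascii")


def list_records_url(set_name: str, metadata_prefix: str = "oai_dc") -> str:
    """Build the ListRecords URL for a given set name."""
    return (
        f"{OAI_BASE}?verb=ListRecords"
        f"&metadataPrefix={metadata_prefix}&set={set_name}"
    )


set_name = datacite_query_set("doi:10.7910/DVN/TJCLKP", trailing_newline=True)
print(list_records_url(set_name))
```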
We should go ahead and implement support for harvesting using ListRecords. It should be faster, if nothing else: for various historical reasons, we currently harvest via ListIdentifiers followed by GetRecord, one record at a time. It may also come in handy in other situations to have both modes supported (and configurable, per client maybe?).
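A rough sketch of what a ListRecords-based harvesting loop looks like, in Python for brevity (the real harvester is Java and built on XOAI, so this is not its code). It relies only on standard OAI-PMH behavior: each response carries the full records, and a non-empty <resumptionToken> means another page must be requested with the token as the only argument besides the verb. The `fetch` callable, the canned pages and the example DOIs are placeholders, so the sketch runs without network access.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator, List

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_list_records(
    fetch: Callable[[Dict[str, str]], str],
    metadata_prefix: str,
    set_name: str,
) -> Iterator[ET.Element]:
    """Yield every <record> from a ListRecords harvest, page by page."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_name,
    }
    while True:
        root = ET.fromstring(fetch(params))
        yield from root.iter(OAI + "record")
        token = root.find(f".//{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no token, or an empty one: this was the last page
        # resumptionToken is an exclusive argument in OAI-PMH.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}


# Canned two-page response standing in for a real server (placeholder data).
_PAGE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
    "<record><header><identifier>{doi}</identifier></header></record>"
    "{token}</ListRecords></OAI-PMH>"
)
PAGES = {
    "": _PAGE.format(
        doi="doi:10.5555/example-1",
        token="<resumptionToken>page2</resumptionToken>",
    ),
    "page2": _PAGE.format(doi="doi:10.5555/example-2", token="<resumptionToken/>"),
}
calls: List[Dict[str, str]] = []


def fake_fetch(params: Dict[str, str]) -> str:
    calls.append(params)
    return PAGES[params.get("resumptionToken", "")]


ids = [
    record.findtext(f"{OAI}header/{OAI}identifier")
    for record in harvest_list_records(fake_fetch, "oai_dc", "~placeholder")
]
print(ids)
```

Compared to ListIdentifiers followed by one GetRecord call per record, this needs one request per page rather than one per record.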

Clearly, we don't want to touch the current, JSF-based harvesting clients UI. But making the changes above, in the import and harvesting back end code, and then making it possible to set up or configure a client via the /api/harvest/clients API to take advantage of these improvements should be both useful and sufficient.
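As a sketch of what configuring such a client through the API might look like, the snippet below only builds the JSON body. The fields `dataverseAlias`, `harvestUrl`, `metadataFormat` and `set` follow my understanding of the existing harvesting clients API and should be checked against the API guide. The two boolean options are purely hypothetical names for the new behaviors discussed above, and the collection alias is a placeholder.

```python
import json

# Set name from the example above (a base64-encoded DataCite query).
DATACITE_SET = "~ZG9pJTNBMTAuNzkxMC9EVk4vVEpDTEtQCg=="

client_config = {
    "dataverseAlias": "harvestedFromDataCite",  # placeholder collection alias
    "harvestUrl": "https://oai.datacite.org/oai",
    "metadataFormat": "oai_dc",
    "set": DATACITE_SET,
    # Hypothetical option names, NOT confirmed by this issue:
    "useListRecords": True,           # harvest via ListRecords
    "useOaiIdentifiersAsPids": True,  # take the PID from the OAI header
}

# This body would be sent to the /api/harvest/clients API.
request_body = json.dumps(client_config, indent=2)
print(request_body)
```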

DS-INRAE · 2024-10-08T08:44:28Z

Item 1. is also true for other repositories, and would greatly enhance Dataverse's harvesting capacity 😃

scolapasta · 2024-10-18T15:49:25Z

#2 has been split off as #10936

landreev · 2024-10-21T13:47:21Z

There is an extra issue @scolapasta pointed out in #10937 that I'm adding as task 3 here: currently the set name is stored in the database as a varchar(255). It should be changed to an unlimited text field, since it will be used for arbitrary DataCite search queries. For example, in our immediate use case this is likely going to be a very long list of individual DOIs.
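To give a feel for how quickly such a set name outgrows 255 characters, here is a small Python sketch. The DOIs are made-up placeholders, and joining them with `OR` is only an assumption about what a multi-DOI query would look like. The point is the arithmetic: base64 inflates the percent-encoded query by about a third.

```python
import base64
from typing import List
from urllib.parse import quote

VARCHAR_LIMIT = 255


def set_name_length(dois: List[str]) -> int:
    """Length of the '~'-prefixed set name for a query over the given DOIs."""
    # Assumed query syntax: "doi:A OR doi:B OR ..."
    query = " OR ".join(f"doi:{doi}" for doi in dois)
    encoded = base64.b64encode(quote(query).encode("utf-8")).decode("ascii")
    return len("~" + encoded)


# Ten placeholder DOIs (not real identifiers).
placeholder_dois = [f"10.5555/EXAMPLE/{i:06d}" for i in range(10)]

for count in (1, 5, 10):
    length = set_name_length(placeholder_dois[:count])
    status = "fits" if length <= VARCHAR_LIMIT else "exceeds varchar(255)"
    print(f"{count} DOI(s): {length} characters, {status}")
```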

fgassert · 2024-10-30T14:34:38Z

Hi Folks,
Glad to see this moving forward 🙇 !
Just a comment that might inform the implementation of this. If you end up changing the harvesting client behavior to hit only ListRecords, this could potentially also allow harvesting from any static XML document that mirrors a ListRecords response. This opens the door to other potential workarounds for harvesting other metadata.

Here's an example:
https://groups.google.com/g/dataverse-community/c/XrQsCTVZzAE/m/vVIFL6xeDwAJ

cmbz added GREI 3 Search and Browse NIH CAFE Issues related to and/or funded by the NIH CAFE project labels Oct 4, 2024

cmbz added this to IQSS Dataverse Project Oct 4, 2024

cmbz mentioned this issue Oct 4, 2024

GREI 3: HDV Task - Improve OAI-PMH Harvesting IQSS/dataverse-pm#171

Open

56 tasks

DS-INRAE moved this to ⚠️ Needed/Important in Recherche Data Gouv Oct 7, 2024

DS-INRAE added this to Recherche Data Gouv Oct 7, 2024

scolapasta moved this to This Sprint 🏃‍♀️ 🏃 in IQSS Dataverse Project Oct 18, 2024

scolapasta added the Size: 10 A percentage of a sprint. 7 hours. label Oct 18, 2024

landreev mentioned this issue Oct 18, 2024

For harvesting, allow advanced feature of using listRecords vs ListIdentifiers #10936

Closed

landreev self-assigned this Oct 18, 2024

landreev moved this from This Sprint 🏃‍♀️ 🏃 to In Progress 💻 in IQSS Dataverse Project Oct 18, 2024

cmbz mentioned this issue Oct 20, 2024

Project: NIH CAFE IQSS/dataverse-pm#161

Open

18 tasks

gwendoux added this to Cirad Dataverse Oct 21, 2024

gwendoux moved this to Interested in Cirad Dataverse Oct 21, 2024

landreev added a commit that referenced this issue Oct 21, 2024

quick draft implementation of addressing issue 1. from #10909.

6d336c8

landreev mentioned this issue Oct 22, 2024

Allow Harvesting to use arbitrary sets #10937

Closed

landreev added Size: 80 A percentage of a sprint. 56 hours. and removed Size: 10 A percentage of a sprint. 7 hours. labels Oct 22, 2024

cmbz added the FY25 Sprint 8 FY25 Sprint 8 (2024-10-09 - 2024-10-23) label Oct 23, 2024

landreev added a commit that referenced this issue Oct 23, 2024

draft ListRecords framework (only works with a locally patched xoai-5…

b330d3e

….2.0) #10909

cmbz added the FY25 Sprint 9 FY25 Sprint 9 (2024-10-23 - 2024-11-06) label Oct 23, 2024

landreev added a commit that referenced this issue Oct 25, 2024

removing character limit on the harvesting set field #10909

faba65d

landreev added a commit that referenced this issue Oct 29, 2024

Adding the new client options to the json printer and parser #10909

bcaf896

pdurbin mentioned this issue Oct 29, 2024

Add import for oai_datacite ("OpenAire") format (this will allow Dataverse to harvest this format too) #7727

Open

stevenferey mentioned this issue Oct 30, 2024

Feature Request: Request identifier support for OAI_DC harvesting #10982

Open

landreev added a commit that referenced this issue Oct 30, 2024

saving the current, work-in-progress of the harvester service #10909

b613e32

landreev added a commit that referenced this issue Nov 1, 2024

we DO want to include the persistent id in the search cards for all h…

f605c38

…arvested datasets. #10909. (that whole block of extra checks on the harvest "style" may be redundant by now - I'll think about it)

landreev moved this from In Progress 💻 to On Hold ⌛ in IQSS Dataverse Project Nov 7, 2024

landreev added a commit that referenced this issue Nov 8, 2024

renamed flyway script #10909

44cb3be

This was referenced Nov 8, 2024

10982 Request identifier support for oai dc harvesting #11010

Open

10909 Support for OAI-PMH harvesting from DataCite #11011

Draft
