v6.3
Dataverse 6.3
Summary
- New Contributor Guide. The UX Working Group released a new Dataverse Contributor Guide.
- Search Performance Improvements. Solr indexing and searching were improved, speeding up performance. Larger installations take note.
- Dataverse Now Supports File-level Retention Periods. See the Retention Periods section of the guide for details.
- API Optimizations for Large Datasets. Search API and permission checking have been improved for datasets with thousands of files.
- Improved Controlled Vocabulary Support. Improvements include updates to the citation metadata block's Language field and multiple extensions added to the external vocabulary mechanism.
- Improved Detection of RO-Crate Files. Dataverse now detects mime-types based on filename extensions and detects RO-Crate metadata files.
- Sitemap Now Supports More Than 50K Items. Dataverse can now handle more than 50,000 items when generating sitemap files. For details, see the sitemap section of the Installation Guide.
- Infrastructure Updates. Payara and Solr have been updated.
Please note: To read these instructions in full, please go to https://github.com/IQSS/dataverse/releases/tag/v6.3 rather than the list of releases, which will cut them off.
This release brings new features, enhancements, and bug fixes to Dataverse. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.
Table of Contents
- Release Highlights
- Features
- Bug Fixes
- API
- Settings
- Complete List of Changes
- Getting Help
- Upgrade instructions
Release Highlights
Solr Search and Indexing Improvements
Multiple improvements have been made to the way Solr indexing and searching is done. Response times should be significantly improved.
-
Two experimental features flag called "add-publicobject-solr-field" and "avoid-expensive-solr-join" have been added to change how Solr documents are indexed for public objects and how Solr queries are constructed to accommodate access to restricted content (drafts, etc.). It is hoped that it will help with performance, especially on large instances and under load.
-
Before the search feature flag ("avoid-expensive...") can be turned on, the indexing flag must be enabled, and a full reindex performed. Otherwise publicly available objects are NOT going to be shown in search results.
-
A feature flag called "reduce-solr-deletes" has been added to improve how datafiles are indexed. When the flag is enabled, Dataverse will avoid pre-emptively deleting existing Solr documents for the files prior to sending updated information. This
should improve performance and will allow additional optimizations going forward. -
The /api/admin/index/status and /api/admin/index/clear-orphans calls
(see https://guides.dataverse.org/en/latest/admin/solr-search-index.html#index-and-database-consistency)
will now find and remove (respectively) additional permissions related Solr documents that were not being detected before.
Reducing the overall number of documents will improve Solr performance and large sites may wish to periodically call the "clear-orphans" API. -
Dataverse now relies on the autoCommit and autoSoftCommit settings in the Solr configuration instead of explicitly committing documents to the Solr index. This improves indexing speed.
See also #10554, #10654, and #10579.
File Retention Period
Dataverse now supports file-level retention periods. The ability to set retention periods, with a minimum duration (in months), can be configured by a Dataverse installation administrator. For more information, see the Retention Periods section of the User Guide.
-
Users can configure a specific retention period, defined by an end date and a short reason, on a set of selected files or an individual file, by selecting the "Retention Period" menu item and entering information in a popup dialog. Retention periods can only be set, changed, or removed before a file has been published. After publication, only Dataverse installation administrators can make changes, using an API.
-
After the retention period expires, files can not be previewed or downloaded (as if restricted, with no option to allow access requests). The file (landing) page and all the metadata remains available.
Features
Large Datasets Improvements
For scenarios involving API calls related to large datasets (numerous files, for example: ~10k) the following have been been optimized:
- The Search API endpoint.
- The permission checking logic present in PermissionServiceBean.
See also #10415.
Improved Controlled Vocabulary for Citation Block
The Controlled Vocabuary Values list for the "Language" metadata field in the citation block has been improved, with some missing two- and three-letter ISO 639 codes added, as well as more alternative names for some of the languages, making all these extra language identifiers importable. See also #8243.
Updates on Support for External Vocabulary Services
Multiple extensions of the external vocabulary mechanism have been added. These extensions allow interaction with services based on the Ontoportal software and are expected to be generally useful for other service types.
These changes include:
-
Improved Indexing with Compound Fields: When using an external vocabulary service with compound fields, you can now specify which field(s) will include additional indexed information, such as translations of an entry into other languages. This is done by adding the
indexIn
inretrieval-filtering
. See also #10505 and GDCC/dataverse-external-vocab-support documentation. -
Broader Support for Indexing Service Responses: Indexing of the results from
retrieval-filtering
responses can now handle additional formats including JSON arrays of strings and values from arbitrary keys within a JSON Object. See #10505. -
HTTP Headers: You are now able to add HTTP request headers required by the service you are implementing. See #10331.
-
Flexible params in retrievalUri: You can now use
managed-fields
field names as well as theterm-uri-field
field name as parameters in theretrieval-uri
when configuring an external vocabulary service.{0}
as an alternative to using theterm-uri-field
name is still supported for backward compatibility. Also you can specify if the value must be url encoded withencodeUrl:
. See #10404.For example :
"retrieval-uri": "https://data.agroportal.lirmm.fr/ontologies/{keywordVocabulary}/classes/{encodeUrl:keywordermURL}"
-
Hidden HTML Fields External controlled vocabulary scripts, configured via :CVocConf, can now access the values of managed fields as well as the term-uri-field for use in constructing the metadata view for a dataset. These values are now added as hidden elements in the HTML and can be found with the HTML attribute
data-cvoc-metadata-name
. See also #10503.
A Contributor Guide is now available
A new Contributor Guide has been added by the UX Working Group (#10531 and #10532).
URL Validation Is More Permissive
URL validation now allows two slashes in the path component of the URL.
Among other things, this allows metadata fields of url
type to be filled with more complex url such as https://archive.softwareheritage.org/browse/directory/561bfe6698ca9e58b552b4eb4e56132cac41c6f9/?origin_url=https://github.com/gem-pasteur/macsyfinder&revision=868637fce184865d8e0436338af66a2648e8f6e1&snapshot=1bde3cb370766b10132c4e004c7cb377979928d1
Improved Detection of RO-Crate Files
Detection of mime-types based on a filename with extension and detection of the RO-Crate metadata files.
From now on, filenames with extensions can be added into MimeTypeDetectionByFileName.properties
file. Filenames added there will take precedence over simply recognizing files by extensions. For example, two new filenames are added into that file:
ro-crate-metadata.json=application/ld+json; profile="http://www.w3.org/ns/json-ld#flattened http://www.w3.org/ns/json-ld#compacted https://w3id.org/ro/crate"
ro-crate-metadata.jsonld=application/ld+json; profile="http://www.w3.org/ns/json-ld#flattened http://www.w3.org/ns/json-ld#compacted https://w3id.org/ro/crate"
Therefore, files named ro-crate-metadata.json
will be then detected as RO-Crated metadata files from now on, instead as generic JSON
files.
For more information on the RO-Crate specifications, see https://www.researchobject.org/ro-crate
See also #10015.
New S3 Tagging Configuration Option
If your S3 store does not support tagging and gives an error if you configure direct upload, you can disable the tagging by using the dataverse.files.<id>.disable-tagging
JVM option. For more details, see the section on S3 tags in the guides, #10022 and #10029.
Feature Flag To Remove the Required "Reason" Field in the "Return to Author" Dialog
A reason field, that is required to not be empty, was added to the "Return to Author" dialog in v6.2. Installations that handle author communications through email or another system may prefer to not be required to use this new field. v6.3 includes a new
disable-return-to-author-reason feature flag that can be enabled to drop the reason field from the dialog and make sending a reason optional in the api/datasets/{id}/returnToAuthor call. See also #10655.
Improved Use of Dataverse Thumbnail
Dataverse will use the dataset thumbnail, if one is defined, rather than the generic Dataverse logo in the Open Graph metadata header. This means the image will be seen when, for example, the dataset is referenced on Facebook. See also #5621.
Improved Email Notifications When Guestbook is Used for File Access Requests
Multiple improvements to guestbook response emails making it easier to organize and process them. The subject line of the notification email now includes the name and user identifier of the requestor. Additionally, the body of the email now includes the user id of the requestor. Finally the guestbook responses have been sorted and spaced to improve readability. See also #10581.
New keywordTermURI Metadata Field in the Citation Metadata Block
A new metadata field - keywordTermURI
, has been added in the citation metadata block (as a fourth child field under the keyword
parent field). This has been done to improve usability and to facilitate the integration of controlled vocabulary services, adding the possibility of saving the "term" and/or its associated URI. For more information, see #10288 and PR #10371.
Updated Computational Workflow Metadata Block
The optional computational workflow metadata block has been updated to present a clickable link for the External Code Repository URL field. See also #10339.
Metadata Source Facet Added
An option has been added to index the name of the harvesting client as the "Metadata Source" of harvested datasets and files; if enabled, the Metadata Source facet will show separate entries for the content harvested from different sources, instead of the current, default behavior where there is one "Harvested" facet for all such content.
Tho enable this feature, set the optional feature flage (jvm option) dataverse.feature.index-harvested-metadata-source=true
before reindexing.
Additional Facet Settings
Extra settings have been added giving an instance admin more choices in selectively limiting the availability of search facets on the collection and dataset pages.
See Disable Solr Facets under the configuration section of the Installation Guide for more info as well as #10570.
Sitemap Now Supports More Than 50k Items
Dataverse can now handle more than 50,000 items when generating sitemap files, splitting the content across multiple files to comply with the Sitemap protocol. For details, see the sitemap section of the Installation Guide. See also #8936 and #10321.
MIT and Apache 2.0 Licenses Added
New files have been added to import the MIT and Apache 2.0 Licenses to Dataverse:
- licenseMIT.json
- licenseApache-2.0.json
Guidance has been added to the guides to explain the procedure for adding new licenses to Dataverse.
See also #10425.
3D Viewer by Open Forest Data
3DViewer by openforestdata.pl has been added to the list of external tools. See also #10561.
Datalad Integration With Dataverse
DataLad has been integrated with Dataverse. For more information, see the integrations section of the guides. See also #10468.
Rsync Support Has Been Deprecated
Support for rsync has been deprecated. Information has been removed from the guides for rsync and related software such as Data Capture Module (DCM) and Repository Storage Abstraction Layer (RSAL). You can still find this information in older versions of the guides. See Settings, below, for deprecated settings. See also #8985.
Bug Fixes
OpenAPI Re-Enabled
In Dataverse 6.0 when Payara was updated it caused the url /openapi
to stop working:
In addition to fixing the /openapi
URL, we are also making some changes on how we provide the OpenAPI document:
When it worked in Dataverse 5.x, the /openapi
output was generated automatically by Payara, but in this release we have switched to OpenAPI output produced by the SmallRye OpenAPI plugin. This gives us finer control over the output.
For more information, see the section on OpenAPI in the API Guide and #10328.
Re-Addition of "Cell Counting" to Life Sciences Block
In the Life Sciences metadata block under the "Measurement Type" field the value cell counting
was accidentally removed in v5.1. It has been restored. See also #8655 and #9735.
Math Challenge Fixed on 403 Error Page
On the "forbidden" (403) error page, the math challenge now correctly displays so that the contact form can be submitted. See also #10466.
Ingest Option Bug Fixed
A bug that prevented the "Ingest" option in the file page "Edit File" menu from working has been fixed. See also #10568.
Incomplete Metadata Bug Fix
A bug was fixed where the incomplete metadata
label was being shown for published dataset with incomplete metadata in certain scenarios. This label will now be shown for draft versions of such datasets and published datasets that the user can edit. This label can also be made invisible for published datasets (regardless of edit rights) with the new option dataverse.ui.show-validity-label-when-published
set to false
. See also #10116.
Identical Role Error Message
An error is now correctly reported when an attempt is made to assign an identical role to the same collection, dataset, or file. See also #9729 and #10465.
API
Superuser Endpoint
The existing API endpoint for toggling the superuser status of a user has been deprecated in favor of a new API endpoint that allows you to explicitly and idempotently set the status as true or false. For details, see the API Guide, #9887 and #10440.
New Featured Collections Endpoints
New API endpoints have been added to allow you to add or remove featured collections from a collection.
See also the sections on listing, setting, and removing featured collections in the API Guide, #10242 and #10459.
Dataset Version Endpoint Extended
The API endpoint for getting the Dataset version has been extended to include latestVersionPublishingStatus. See also #10330.
New Optional Query Parameters for Metadatablocks Endpoints
New optional query parameters have been added to api/metadatablocks
and api/dataverses/{id}/metadatablocks
endpoints:
returnDatasetFieldTypes
: Whether or not to return the dataset field types present in each metadata block. If not set, the default value is false.- Setting the query parameter
onlyDisplayedOnCreate=true
also returns metadata blocks with dataset field type input levels configured as required on the General Information page of the collection, in addition to the metadata blocks and their fields with the propertydisplayOnCreate=true
.
See also #10389
Dataverse Payload Includes Release Status
The Dataverse object returned by /api/dataverses has been extended to include "isReleased": {boolean}. See also #10491.
New Field Type Input Level Endpoint
A new endpoint api/dataverses/{id}/inputLevels
has been created for updating the dataset field type input levels of a collection via API. See also #10477.
Banner Message Endpoint Extended
The endpoint api/admin/bannerMessage
has been extended so the ID is returned when created. See also #10565.
Settings
Database Settings:
New:
- :DisableSolrFacets
Deprecated (used with rsync):
- :DataCaptureModuleUrl
- :DownloadMethods
- :LocalDataAccessPath
- :RepositoryStorageAbstractionLayerUrl
New Configuration Options
dataverse.files.<id>.disable-tagging
dataverse.feature.add-publicobject-solr-field
dataverse.feature.avoid-expensive-solr-join
dataverse.feature.reduce-solr-deletes
dataverse.feature.disable-return-to-author-reason
dataverse.feature.index-harvested-metadata-source
dataverse.ui.show-validity-label-when-published
Complete List of Changes
For the complete list of code changes in this release, see the 6.3 Milestone in GitHub.
Getting Help
For help with upgrading, installing, or general questions please post to the Dataverse Community Google Group or email [email protected].
Upgrade Instructions
Upgrading requires a maintenance window and downtime. Please plan accordingly, create backups of your database, etc.
These instructions assume that you've already upgraded through all the 5.x releases and are now running Dataverse 6.2.
0. These instructions assume that you are upgrading from the immediate previous version. If you are running an earlier version, the only supported way to upgrade is to progress through the upgrades to all the releases in between before attempting the upgrade to this version.
If you are running Payara as a non-root user (and you should be!), remember not to execute the commands below as root. Use sudo
to change to that user first. For example, sudo -i -u dataverse
if dataverse
is your dedicated application user.
In the following commands, we assume that Payara 6 is installed in /usr/local/payara6
. If not, adjust as needed.
export PAYARA=/usr/local/payara6
(or setenv PAYARA /usr/local/payara6
if you are using a csh
-like shell)
1. Undeploy the previous version.
$PAYARA/bin/asadmin undeploy dataverse-6.2
2. Stop Payara and remove the following directories:
service payara stop
rm -rf $PAYARA/glassfish/domains/domain1/generated
rm -rf $PAYARA/glassfish/domains/domain1/osgi-cache
rm -rf $PAYARA/glassfish/domains/domain1/lib/databases
3. Upgrade Payara to v6.2024.6
With this version of Dataverse, we encourage you to upgrade to version 6.2024.6.
This will address security issues accumulated since the release of 6.2023.8.
Note that if you are using GDCC containers, this upgrade is included when pulling new release images.
No manual intervention is necessary.
The steps below are a simple matter of reusing your existing domain directory with the new distribution.
But we recommend that you review the Payara upgrade instructions as it could be helpful during any troubleshooting:
Payara Release Notes.
We also recommend you ensure you followed all update instructions from the past releases regarding Payara.
(The latest Payara update was for v6.0.)
Move the current Payara directory out of the way:
mv $PAYARA $PAYARA.2023.8
Download the new Payara version 6.2024.6 (from https://www.payara.fish/downloads/payara-platform-community-edition/), and unzip it in its place:
wget https://nexus.payara.fish/repository/payara-community/fish/payara/distributions/payara/6.2024.6/payara-6.2024.6.zip
cd /usr/local
unzip payara-6.2024.6.zip
Replace the brand new payara/glassfish/domains/domain1
with your old, preserved domain1:
mv payara6/glassfish/domains/domain1 payara6/glassfish/domains/domain1_DIST
mv payara6.2023.8/glassfish/domains/domain1 payara6/glassfish/domains/
Make sure that you have the following --add-opens
options in your payara6/glassfish/domains/domain1/config/domain.xml
. If not present, add them:
<jvm-options>--add-opens=java.management/javax.management=ALL-UNNAMED</jvm-options>
<jvm-options>--add-opens=java.management/javax.management.openmbean=ALL-UNNAMED</jvm-options>
<jvm-options>[17|]--add-opens=java.base/java.io=ALL-UNNAMED</jvm-options>
<jvm-options>[21|]--add-opens=java.base/jdk.internal.misc=ALL-UNNAMED</jvm-options>
(Note that you likely already have the java.base/java.io
option there, but without the [17|]
prefix. Make sure to replace it with the version above)
Start Payara:
sudo service payara start
4. Deploy this version.
$PAYARA/bin/asadmin deploy dataverse-6.3.war
5. For installations with internationalization:
- Please remember to update translations via Dataverse language packs.
6. Restart Payara
service payara stop
service payara start
7. Update the following metadata blocks to reflect the incremental improvements made to the handling of core metadata fields:
wget https://raw.githubusercontent.com/IQSS/dataverse/v6.3/scripts/api/data/metadatablocks/citation.tsv
curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file citation.tsv
wget https://raw.githubusercontent.com/IQSS/dataverse/v6.3/scripts/api/data/metadatablocks/biomedical.tsv
curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file biomedical.tsv
7a. If you are using the optional computational workflow metadata block, update it:
wget https://raw.githubusercontent.com/IQSS/dataverse/v6.3/scripts/api/data/metadatablocks/computational_workflow.tsv
curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file computational_workflow.tsv
8. Upgrade Solr
Solr 9.4.1 is now the version recommended in our Installation Guide and used with automated testing. There is a known security issue in the previously recommended version 9.3.0: https://nvd.nist.gov/vuln/detail/CVE-2023-36478. While the risk of an exploit should not be significant unless the Solr instance is accessible from outside networks (which we have always recommended against), we recommend to upgrade.
Install Solr 9.4.1 following the instructions from the Installation Guide.
The instructions in the guide suggest to use the config files from the installer zip bundle. Upgrading an existing instance, it may be easier to download them from the source tree:
wget https://raw.githubusercontent.com/IQSS/dataverse/v6.3/conf/solr/solrconfig.xml
wget https://raw.githubusercontent.com/IQSS/dataverse/v6.3/conf/solr/schema.xml
cp solrconfig.xml schema.xml /usr/local/solr/solr-9.4.1/server/solr/collection1/conf
8a. For installations with custom or experimental metadata blocks:
-
Stop Solr instance (usually
service solr stop
, depending on Solr installation/OS, see the Installation Guide). -
Run the
update-fields.sh
script that we supply, as in the example below (modify the command lines as needed to reflect the correct path of your Solr installation):
wget https://raw.githubusercontent.com/IQSS/dataverse/v6.3/conf/solr/update-fields.sh
chmod +x update-fields.sh
curl "http://localhost:8080/api/admin/index/solr/schema" | ./update-fields.sh /usr/local/solr/solr-9.4.1/server/solr/collection1/conf/schema.xml
- Start Solr instance (usually
service solr start
depending on Solr/OS).
9. Enable the Metadata Source facet for harvested content (Optional):
If you choose to enable this new feature, set the optional feature flag (jvm option) dataverse.feature.index-harvested-metadata-source=true
before reindexing.
10. Reindex Solr, if you upgraded Solr (recommended), or chose to enable any options that require a reindex:
curl http://localhost:8080/api/admin/index
Note: if you choose to perform a migration of your keywordValue
metadata fields (section below), that will require a reindex as well, so do that first.
Notes for Dataverse Installation Administrators
Data migration to the new keywordTermURI
field
You can migrate your keywordValue
data containing URIs to the new keywordTermURI
field.
In case of data migration, view the affected data with the following database query:
SELECT value FROM datasetfieldvalue dfv
INNER JOIN datasetfield df ON df.id = dfv.datasetfield_id
WHERE df.datasetfieldtype_id = (SELECT id FROM datasetfieldtype WHERE name = 'keywordValue')
AND value ILIKE 'http%';
If you wish to migrate your data, a database update is then necessary:
UPDATE datasetfield df
SET datasetfieldtype_id = (SELECT id FROM datasetfieldtype WHERE name = 'keywordTermURI')
FROM datasetfieldvalue dfv
WHERE dfv.datasetfield_id = df.id
AND df.datasetfieldtype_id = (SELECT id FROM datasetfieldtype WHERE name = 'keywordValue')
AND dfv.value ILIKE 'http%';
A reindex in place will be required. ReExportAll will need to be run to update the metadata exports of the dataset. Follow the directions in the Admin Guide.