All work links to Open Library are broken due to wrong casing #6739

Open
alexshpilkin opened this issue Mar 3, 2023 · 5 comments

@alexshpilkin

alexshpilkin commented Mar 3, 2023

Open Library IDs are case-insensitive in that their casing does not bear information, but the server requires the ID to be passed in uppercase: https://openlibrary.org/b/OL38581116M is a book, while https://openlibrary.org/b/ol38581116m is a 404. However, all (?) Open Library URLs that appear on ORCID web pages seem to be in lowercase, thus 404: see e.g. https://orcid.org/0000-0003-1199-7080.

This seems to happen because when the URL is generated from an ol-type work ID by the resolver service, it is passed (among other things) through org.orcid.core.utils.v3.identifier.normalizers.CaseSensitiveNormalizer, which is under the impression that case-insensitive identifiers entail that the URL can be harmlessly lowercased:

IdentifierType t = idman.fetchIdentifierTypesByAPITypeName(Locale.ENGLISH).get(apiTypeName);
if (t != null && !t.getCaseSensitive()) {
    return value.toLowerCase();
}
return value;

Open Library has a different opinion of what the normal form of the ID (and therefore URL) should be.

As far as possible solutions are concerned, there are three options: the case sensitivity flag becomes a tristate (uppercase, lowercase, preserve); a further normalization step that fixes the casing for OL links is added; or Open Library IDs are declared case sensitive. All of these seem a bit meh. None of them deals with the fact that there are plenty of wrong URLs already stored in the database (they are stored, right?).
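
To make the tristate option concrete, here is a minimal sketch. The names `CasePolicy` and `CasePolicyNormalizer` are hypothetical and do not exist in the ORCID codebase; the sketch only illustrates the idea of replacing a boolean flag with a three-way policy, not how ORCID would actually wire it in.

```java
import java.util.Locale;

// Hypothetical three-way casing policy replacing a boolean "case sensitive" flag.
enum CasePolicy { UPPER, LOWER, PRESERVE }

final class CasePolicyNormalizer {
    private CasePolicyNormalizer() {}

    // Applies the policy using Locale.ROOT so the result does not depend on
    // the JVM's default locale.
    static String normalise(String value, CasePolicy policy) {
        if (value == null) {
            return null;
        }
        switch (policy) {
            case UPPER:
                return value.toUpperCase(Locale.ROOT);
            case LOWER:
                return value.toLowerCase(Locale.ROOT);
            default:
                return value; // PRESERVE: leave the stored casing alone
        }
    }

    public static void main(String[] args) {
        System.out.println(normalise("ol38581116m", CasePolicy.UPPER));
        System.out.println(normalise("OL38581116M", CasePolicy.PRESERVE));
    }
}
```

Under this sketch an identifier type such as `ol` would be configured as `UPPER` (or `PRESERVE`), while types that really are harmless to lowercase would keep `LOWER`.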

@wjrsimpson
Member

Thanks for your thoughts @alexshpilkin. I am struggling to find the specification for Open Library IDs. Do you have a link?

@alexshpilkin
Author

@wjrsimpson That’s a fair question. While you can tell that there are broken Open Library links at the ORCID website and that the Open Library website as it is now is case sensitive by simply poking at them, I don’t actually know that Open Library IDs are supposed to be case-insensitive; that was just me trusting your implementation and my own experience. So maybe I shouldn’t have said that with such confidence.

I’ve looked around the OL developer and librarian docs and, surprisingly, I can’t find much about how OLIDs are supposed to work. The best I’ve seen is a brief mention in the “Understanding Identifiers” section of the librarians-in-training guide.

There are some schemas and schema-adjacent things in the Open Library code, though. First, the official client library has a JSON schema for the API, which contains a (case-sensitive) regular expression for work_key, ^/works/OL[0-9]+W$, and similar ones for author_key (with A) and edition_key (with M). Second, the database schema for the backend contains code implying that an OLID in general is OL[0-9]+[A-Z].

Finally, the Wikidata definition for this identifier type says OL[1-9]\d*[AMW], but doesn’t link to an official reference either.

That’s all I could find, unfortunately, but if you want an official word on this I guess asking the Open Library maintainers is also an option.
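
For reference, the three patterns quoted in this comment can be tried out with a small sketch. The class name `OlidPatterns` and its methods are made up for illustration; only the regular expressions come from the sources mentioned above, and the work ID in the demo is an arbitrary example rather than a known record.

```java
import java.util.regex.Pattern;

// Hypothetical helper collecting the patterns quoted in the comment above.
final class OlidPatterns {
    // work_key pattern from the client library's JSON schema.
    static final Pattern WORK_KEY = Pattern.compile("^/works/OL[0-9]+W$");
    // General OLID shape implied by the backend database schema.
    static final Pattern GENERIC_OLID = Pattern.compile("OL[0-9]+[A-Z]");
    // Format constraint from the Wikidata definition.
    static final Pattern WIKIDATA_OLID = Pattern.compile("OL[1-9]\\d*[AMW]");

    private OlidPatterns() {}

    // matches() requires the whole input to fit the pattern.
    static boolean isWorkKey(String s) {
        return WORK_KEY.matcher(s).matches();
    }

    static boolean isGenericOlid(String s) {
        return GENERIC_OLID.matcher(s).matches();
    }

    static boolean isWikidataOlid(String s) {
        return WIKIDATA_OLID.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isGenericOlid("OL38581116M"));  // uppercase form
        System.out.println(isGenericOlid("ol38581116m"));  // lowercased form
        System.out.println(isWorkKey("/works/OL12345W"));  // arbitrary example ID
    }
}
```

All three patterns are case sensitive as written, so a lowercased ID such as ol38581116m is rejected by each of them.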

@wjrsimpson
Member

@alexshpilkin Thanks for the additional info.

@TomDemeranville Do you happen to know?

@TomDemeranville
Contributor

I think we can pretty easily fix this to not alter the case we have in the database for OL identifiers. We still won't be able to guarantee they're correct, but we will be able to preserve the case.

I've raised a bug here: https://trello.com/c/z3efGnyn/781-preserve-case-when-normalising-open-library-identifiers

@alexshpilkin
Author

@TomDemeranville From my outside point of view that sounds like a boring but workable solution. One thing I’m concerned about is existing links: am I right that they are persisted in the database as normalized (currently lowercased) links? And if so, do you plan to fix those up in the data from before the normalization is fixed?
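
If the links do turn out to be stored in lowercased form, a one-off fix-up could look roughly like the sketch below. It assumes, without knowing the actual storage format, that a stored URL ends in an OLID-shaped path segment, as in the https://openlibrary.org/b/ol38581116m example from the original report. The class `OlidUrlRepair` is hypothetical.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical one-off repair for URLs whose trailing OLID was lowercased.
final class OlidUrlRepair {
    // Matches an OLID-shaped final path segment in any casing.
    private static final Pattern TRAILING_OLID =
            Pattern.compile("(?<=/)(ol[0-9]+[a-z])$", Pattern.CASE_INSENSITIVE);

    private OlidUrlRepair() {}

    // Uppercases only the trailing ID; everything before it is left untouched.
    // URLs that do not end in an OLID-shaped segment are returned unchanged.
    static String repair(String url) {
        Matcher m = TRAILING_OLID.matcher(url);
        if (!m.find()) {
            return url;
        }
        return url.substring(0, m.start(1)) + m.group(1).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(repair("https://openlibrary.org/b/ol38581116m"));
    }
}
```

Because only the final segment is rewritten, running the repair twice gives the same result as running it once. URLs with a trailing slash or query string are not handled by this sketch and would pass through unchanged.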
