Add support for searching within/across IIIF annotations for Japanese language transcription #201

Open
caaster opened this issue Jul 31, 2019 · 0 comments

caaster commented Jul 31, 2019

Use cases drawn from the SUL Text Search Study Report (July 2019). Please note: these use cases are likely to yield an overlapping set of requirements that needs further investigation and specification.

Use case 1:
The Magario diaries comprise 40 years of handwritten pages in Japanese by donor Steven Yoba, representing a rare instance of trans-Japanese history (Japan + US). The diary pages have been accessioned as individual images in the SDR. Because OCR does not work well for Japanese, transcriptions for each page were created by hand; they currently exist as individual MS Word pages that have not been accessioned. This is a high-profile collection with broad faculty support. The content should be searchable and, ideally, accessible to text mining. Curator: Regan Murphy Kao

Use case 2:
The NDC collection comprises Japanese books cataloged by Hoover using the Nippon Decimal Classification system and housed at SAL1/2. The collection, which was transferred to EAL in the early 2000s, contains many rare books related to 20th century history and was digitized by Google Books years ago. Curator: Regan Murphy Kao

@anarchivist comment:
Our implementation of the IIIF Content Search API does not currently analyze CJK query terms at the level that SearchWorks does. However, the Content Search API itself supports CJK text, and examples of CJK transcription via annotation do exist.
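To make "analysis for CJK query terms" concrete, here is a minimal sketch of one common approach, overlapping character bigrams, which avoids relying on whitespace to find word boundaries. This is an illustration only, not a description of how SearchWorks or our Content Search implementation works. The function names and the service base URL are hypothetical; the only API detail relied on is the `q` query parameter defined by the IIIF Content Search API.

```python
from urllib.parse import urlencode


def cjk_bigrams(text: str) -> list[str]:
    """Split text into overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def page_matches(query: str, transcription: str) -> bool:
    """Rough bigram match of a query against one page's transcription.

    Assumption: a page matches when every bigram of the query also occurs
    somewhere on the page. This can over-match, because the bigrams need
    not be contiguous on the page.
    """
    stripped = "".join(query.split())
    if not stripped:
        return False
    if len(stripped) == 1:
        # A single character has no bigrams; fall back to a substring test.
        return stripped in transcription
    page_grams = set(cjk_bigrams(transcription))
    return all(gram in page_grams for gram in cjk_bigrams(stripped))


def content_search_url(service_base: str, query: str) -> str:
    """Build a Content Search request URL; service_base is a placeholder."""
    return f"{service_base}?{urlencode({'q': query})}"


if __name__ == "__main__":
    page = "今日は日記を書いた"
    print(cjk_bigrams("日本語"))
    print(page_matches("日記", page))
    print(content_search_url("https://example.org/search", "日記"))
```

A production setup would more likely delegate this to the search engine's own CJK analysis chain than reimplement it by hand; the sketch only shows why query-side analysis matters for text without spaces between words.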

Quinn Dombrowski comment:
Dombrowski has experimented with creating page-level Japanese-language OCR files (TXT) for the Magario Family Diaries (see Use case 1 above for more information about this collection). Note that Japanese-language support in Content Search is only one requirement: remediating the accessioned diary page images with the OCR files and enabling text search for the collection would also require infrastructure development.
