Finder and Visualizer of Relevant Pages in a Book

This repository contains a notebook that prototypes an application for finding the pages in a book that are most relevant to a query string. As input, you define a query string and the ID of the volume from which you want to extract the relevant pages.

This prototype was created for the 2024 HathiTrust TorchLite hackathon project by: Lianet Sepulveda Torres.

Potential Use Cases

In the digital library, users cannot read books that are not public access. Such a book still appears in the list of query results, but when the user clicks through to see its content, they see only the book's title and a message stating that the book is not public.

Instead of showing only the message stating that the book is not public, we could present some information about the book without showing the full text. For example, we could show the user:

- the pages in the book that are most relevant to the query string
- a word cloud of the words that are most relevant to the book

Methodology

Use the HTRC Extracted Features (EF) dataset to get the text with enriched metadata (text annotated with linguistic features).
Analyze the book at page level.
- Filter out non-relevant words (stop words)
- Build the page text by joining all the remaining words on the page
- Use KeyBERT('distilbert-base-nli-mean-tokens') to extract the top 10 keywords of each page text. KeyBERT represents the text as a document embedding and uses cosine similarity to find the words or phrases that are most similar to the document. The most similar word can be considered the best candidate to describe the entire document.
Identify the most relevant pages based on similarities between the query string and the keywords at page level.
Filter out pages with a score less than 0.83.
Lemmatize the words using the root of the word to improve visualization.
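
The page-level steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the notebook's code: `STOP_WORDS`, `build_page_text`, `load_keyword_model`, and `extract_page_keywords` are hypothetical names introduced here, and the token-count dictionary only imitates the shape of per-page EF data. The KeyBERT import is deferred so the text-building helper can be run without installing the package or downloading the model.

```python
# Hypothetical stop-word list for illustration; the notebook may use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def build_page_text(token_counts: dict[str, int]) -> str:
    """Join a page's tokens into one string, dropping stop words.

    `token_counts` maps token -> count, an assumed simplification of the
    per-page token counts available in the EF dataset.
    """
    words = []
    for token, count in token_counts.items():
        word = token.lower()
        if word not in STOP_WORDS:
            words.extend([word] * count)  # repeat the word as often as it occurs
    return " ".join(words)


def load_keyword_model():
    """Create the KeyBERT model named in the methodology (needs the keybert package)."""
    from keybert import KeyBERT  # imported lazily on purpose

    return KeyBERT("distilbert-base-nli-mean-tokens")


def extract_page_keywords(model, page_text: str, top_n: int = 10):
    """Return up to `top_n` (keyword, similarity) pairs for one page."""
    return model.extract_keywords(
        page_text, keyphrase_ngram_range=(1, 1), top_n=top_n
    )


page_text = build_page_text({"The": 2, "whale": 2, "of": 1, "sea": 1})
```

Loading the model once and reusing it for every page avoids repeating the expensive model setup per page.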

Data

We are using the EF API to get the dataset of the volume osu.32436000578904.

Datasets

The notebook creates datasets that are used for visualizations:

* A list of pages that are most relevant to the query string.
* Each page is enriched with a score that measures the similarity between the query string and the keywords extracted from the page.
* The list of the top 10 keywords per page.
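
The scoring and filtering that produce this list could look like the sketch below. It rests on assumptions: the source does not say how the keyword similarities are combined into one page score, so this example takes the best keyword-to-query cosine similarity. `relevant_pages`, `cosine_similarity`, and the `embed` callable are hypothetical names, and the toy vectors are made-up values standing in for real sentence embeddings.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relevant_pages(
    query: str,
    page_keywords: dict[int, list[str]],
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.83,
) -> list[tuple[int, float, list[str]]]:
    """Score each page against the query and drop pages scoring below `threshold`.

    Returns (page_number, score, keywords) tuples sorted by descending score.
    """
    query_vec = embed(query)
    rows = []
    for page, keywords in page_keywords.items():
        if not keywords:
            continue
        # Assumption: a page's score is its best keyword-to-query similarity.
        score = max(cosine_similarity(query_vec, embed(k)) for k in keywords)
        if score >= threshold:
            rows.append((page, score, keywords))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy embeddings (made-up values) in place of a real embedding model.
toy_vectors = {"whale": (1.0, 0.0), "ocean": (0.9, 0.1), "tax": (0.0, 1.0)}
pages = {12: ["ocean", "tax"], 40: ["tax"]}
result = relevant_pages("whale", pages, lambda word: toy_vectors[word])
```

Passing the embedding function as a parameter keeps the filtering logic independent of whichever model supplies the vectors.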

Outputs

How to use the notebook

If you are using a virtual environment, activate it and run the following command:

jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10 --NotebookApp.rate_limit_window=3.0

Open the notebook and run the cells. The first cell of the notebook will install the requirements.

If you do not have a virtual environment, you can create one and install the requirements by running the following commands:

poetry env use python
source ~/venv/bin/activate

Then, run the notebook:

jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10 --NotebookApp.rate_limit_window=3.0

Relevant pages dataset:

For each page
    * text of the page.
    * score that measures the similarity between the query string and the keywords extracted from the page.
    * 10 keywords extracted from the page.
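
One way to hold such a record and prepare it for a word-level visualization is sketched below. The `RelevantPage` class, its field names (including the added `page` number), and `keyword_frequencies` are hypothetical and not taken from the notebook; counting on how many relevant pages each keyword occurs is an assumed choice of weighting.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RelevantPage:
    """One row of the relevant pages dataset (field names are hypothetical)."""

    page: int      # page identifier, added here for illustration
    text: str      # text of the page
    score: float   # similarity between the query string and the page keywords
    keywords: list[str] = field(default_factory=list)  # up to 10 keywords


def keyword_frequencies(rows: list[RelevantPage]) -> Counter:
    """Count on how many relevant pages each keyword appears."""
    counts: Counter = Counter()
    for row in rows:
        counts.update(set(row.keywords))  # set(): count a keyword once per page
    return counts


rows = [
    RelevantPage(page=12, text="...", score=0.91, keywords=["whale", "sea"]),
    RelevantPage(page=40, text="...", score=0.85, keywords=["whale"]),
]
frequencies = keyword_frequencies(rows)
```

A keyword-to-count mapping like `frequencies` is the kind of input a word cloud tool typically accepts, for example `WordCloud.generate_from_frequencies` in the `wordcloud` package; whether the notebook uses that package is an assumption.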

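The relevant pages dataset is used to create the visualizations saved in the repository as relevant_pages_wordcloud.png, relevant_word_pages.png, and words_relevant_pages.png.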


liseli/torchlite_hackathon
