Add structural cue chunking inspired by JinaAI's implementation #92

Open
AswanthManoj wants to merge 1 commit into main


AswanthManoj
Contributor

Inspired by https://jina.ai/tokenizer/#chunking, which leverages common structural cues and builds a set of rules and heuristics on top of them. The approach should perform well across diverse types of content, including Markdown, HTML, LaTeX, and more, segmenting text accurately into meaningful chunks.

Reference: https://gist.github.com/JeremiahZhang/2f8ae87dad836b25f40c02b8c43d16ec
Original x post: https://x.com/JinaAI_/status/1823756993108304135
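
To make the idea concrete, here is a minimal sketch of structural cue chunking. It is not the code from this PR and not JinaAI's rule set. The function name `chunk_by_structural_cues` and the patterns `HEADING_CUES` and `FENCE` are hypothetical, and the cues are assumed to be limited to Markdown headings, HTML heading tags, LaTeX sectioning commands, fenced code blocks, and blank lines.

```python
import re
from typing import List

# Hypothetical cue patterns for illustration only; a real rule set
# would cover many more structures (lists, tables, block quotes, ...).
HEADING_CUES = re.compile(
    r"^(#{1,6}\s+\S"                  # Markdown heading, e.g. "## Title"
    r"|<h[1-6][\s>]"                  # HTML heading tag, e.g. "<h2>"
    r"|\\(?:sub){0,2}section\*?\{)"   # LaTeX \section{, \subsection{, ...
)
FENCE = re.compile(r"^(```|~~~)")     # opening or closing code fence


def chunk_by_structural_cues(text: str) -> List[str]:
    """Split text into chunks at structural boundaries."""
    chunks: List[str] = []
    current: List[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the lines collected so far as one chunk, if non-empty.
        chunk = "\n".join(current).strip()
        if chunk:
            chunks.append(chunk)
        current.clear()

    for line in text.splitlines():
        if FENCE.match(line):
            if not in_fence:
                flush()               # a fence opens a new chunk
                current.append(line)
                in_fence = True
            else:
                current.append(line)
                in_fence = False
                flush()               # the closing fence ends that chunk
            continue
        if in_fence:
            current.append(line)      # never split inside a code block
        elif HEADING_CUES.match(line):
            flush()                   # a heading starts a new chunk
            current.append(line)
        elif not line.strip():
            flush()                   # a blank line ends a paragraph
        else:
            current.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "# Title\n\nSome text.\n\n```python\nx = 1\n```\n"
    for piece in chunk_by_structural_cues(sample):
        print(repr(piece))
```

A line-by-line pass keeps the sketch easy to follow and guarantees that a fenced code block is never split. A single large regular expression, as in the referenced gist, trades that readability for compactness.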

@AswanthManoj AswanthManoj changed the title Added structural cue chunking strategy based on JinaAI's tokenizer ch… Add structural cue chunking based inspired by JinaAI's implementation Sep 6, 2024