LLM_EXTRACT_TEXT implementation #18435
How do you know you are extracting a PDF file? How about adding a constant "file type" or "extractor type" argument to the function?
An extractor type argument might be better if we end up with several kinds of extractors for the same file type (or one extractor for many file types).
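One way to read this suggestion is a lookup table keyed by extractor type, so the function rejects types it does not know instead of assuming PDF. The following is a minimal sketch, not the PR's code: `extractFunc`, `extractors`, `extractPDFStub`, and `lookupExtractor` are hypothetical names introduced here for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// extractFunc is a hypothetical signature: raw file bytes in, extracted text out.
type extractFunc func(data []byte) (string, error)

// extractors maps an extractor type name to its implementation.
// Only "pdf" is registered, mirroring the single type discussed in the review.
var extractors = map[string]extractFunc{
	"pdf": extractPDFStub,
}

// extractPDFStub is a placeholder; real PDF parsing is out of scope for this sketch.
func extractPDFStub(data []byte) (string, error) {
	return "", fmt.Errorf("pdf extraction is not implemented in this sketch")
}

// lookupExtractor resolves the extractor type argument (case-insensitively)
// and returns an error for any type that is not registered.
func lookupExtractor(extractorType string) (extractFunc, error) {
	fn, ok := extractors[strings.ToLower(extractorType)]
	if !ok {
		return nil, fmt.Errorf("unsupported extractor type: %q", extractorType)
	}
	return fn, nil
}

func main() {
	if _, err := lookupExtractor("docx"); err != nil {
		fmt.Println(err)
	}
}
```

With this shape, adding a second extractor for the same file type only requires registering another key in the map.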
Please add a test.
Extractor type argument added.
Unit and BVT tests have been added.
@@ -0,0 +1,2 @@
select llm_extract_text(cast('file://$resources/llm_test/extract_text/MODocs1.pdf?offset=0&size=4' as datalink), cast('file://$resources/llm_test/extract_text/MODocs1.txt' as datalink), "pdf");
Why do we need offset and size here?
I included them here to demonstrate that this format of datalink is recognized by the function.
Offset and size are not utilized in the llm_extract_text function and will not affect the outcome.
Shall I delete them?
Yeah, just remove those.
Please add result verification in the bvt test.
}

// file service and write file
ctx := context.TODO()
Try to get the context from proc. Only use context.TODO() when you have no other choice.
fs, readPath, err := fileservice.GetForETL(ctx, proc.Base.FileService, txtPath)

// delete the file if txt file exist because Write() only works when a file does not exist
_, err = fs.StatFile(ctx, readPath)
Just return an error if the file exists. Otherwise, with a stage URL, you may end up deleting a customer's file in the cloud.
}

func readPdfToString(path string) (string, error) {
f, r, err := pdf.Open(path)
Can you use fileservice to open the PDF file?
What type of PR is this?
Which issue(s) this PR fixes:
issue #18664
What this PR does / why we need it:
As part of our document LLM support, we are introducing the LLM_EXTRACT_TEXT function. This function extracts text from PDF files and writes the extracted text to a specified text file. The extractor type is specified by the third argument.
Usage:
llm_extract_text(<input PDF datalink>, <output text file datalink>, <extractor type string>);
Return Value: A boolean indicating whether the extraction and writing process was successful.
Note:
Example SQL:
Example return: