LLM_EXTRACT_TEXT implementation #18435

charleschile · 2024-08-30T13:56:48Z

What type of PR is this?

Which issue(s) this PR fixes:

What this PR does / why we need it:

As part of our document LLM support, we are introducing the LLM_EXTRACT_TEXT function. This function extracts text from a PDF file and writes the extracted text to a specified text file. The extractor type is specified by the third argument.

Usage: llm_extract_text(<input PDF datalink>, <output text file datalink>, <extractor type string>);

Return Value: A boolean indicating whether the extraction and writing process was successful.

Note:

Both the input and output paths must be absolute paths.

Example SQL:

select llm_extract_text(cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4' as datalink), cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt' as datalink), "pdf");

Example return:

mysql> select llm_extract_text(cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4' as datalink), cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt' as datalink), "pdf");
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| llm_extract_text(cast(file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4 as datalink), cast(file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt as datalink), pdf)   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| true                                                                                                                                                                                                                                                                                                                    |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.10 sec)
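The usage and the absolute-path note above can be sketched from the client side. The helper below is hypothetical (`buildExtractTextSQL` is not part of this PR); it only assembles the statement text and rejects relative paths before anything is sent to the server. It assumes Unix-style paths and uses single-quoted SQL literals.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// buildExtractTextSQL is a hypothetical helper, not part of the PR. It rejects
// relative paths, as the note requires, and returns the llm_extract_text
// statement for the given PDF file, text file, and extractor type.
func buildExtractTextSQL(pdfPath, txtPath, extractor string) (string, error) {
	for _, p := range []string{pdfPath, txtPath} {
		if !path.IsAbs(p) {
			return "", fmt.Errorf("path %q is not absolute", p)
		}
	}
	// Double any single quote so each value stays inside its SQL literal.
	esc := func(s string) string { return strings.ReplaceAll(s, "'", "''") }
	return fmt.Sprintf(
		"select llm_extract_text(cast('file://%s' as datalink), cast('file://%s' as datalink), '%s');",
		esc(pdfPath), esc(txtPath), esc(extractor),
	), nil
}

func main() {
	q, err := buildExtractTextSQL("/tmp/example.pdf", "/tmp/example.txt", "pdf")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(q)
}
```

Checking the paths on the client is only a convenience; the server-side function remains the place where the rule is enforced.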

codiumai-pr-agent-pro · 2024-08-30T13:57:20Z

PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

PR Reviewer Guide 🔍

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 Security concerns
Potential security vulnerability: The removal of file type validation in the ParseDatalink function (pkg/container/types/datalink.go) could potentially allow processing of unexpected file types, which may lead to security risks such as arbitrary file read or write operations. This change should be carefully reviewed to ensure it doesn't introduce vulnerabilities in the system.
⚡ Key issues to review
Potential Security Risk: Removal of file type validation in ParseDatalink function may lead to security vulnerabilities by allowing processing of unexpected file types.
Error Handling: The function does not properly handle errors in the PDF reading process. It should return an error instead of a boolean.
Resource Management: The PDF file is not properly closed in case of an error. Consider using defer immediately after successful open.

codiumai-pr-agent-pro · 2024-08-30T13:57:49Z

PR Code Suggestions ✨

Category: Error handling · Score: 9

✅ Improve error handling by returning an error instead of a boolean

Suggestion Impact: The function extractTextFromPdfAndWriteToFile was modified to return an error along with a boolean, improving error handling by providing more context about failures.

code diff:

-func extractTextFromPdfAndWriteToFile(pdfPath string, txtPath string, proc process.Process) bool {
+func extractTextFromPdfAndWriteToFile(pdfPath string, txtPath string, proc process.Process) (bool, error) {
     // read PDF to string
     content, err := readPdfToString(pdfPath)
     if err != nil {
-        return false
+        return false, moerr.NewInvalidInputNoCtxf("Invalid PDF input.")
     }
 
     // file service and write file
     ctx := context.TODO()
     fs, readPath, err := fileservice.GetForETL(ctx, proc.Base.FileService, txtPath)
-    // NOTE: Write only works when a file does not exist
+    // delete the file if txt file exist because Write() only works when a file does not exist
+    _, err = fs.StatFile(ctx, readPath)
+    if err == nil {
+        err1 := fs.Delete(ctx, readPath)
+        if err1 != nil {
+            return false, moerr.NewInvalidInputNoCtxf("Cannot remove file %s", readPath)
+        }
+    }
+
     _, err = fileservice.DoWithRetry(
         "BackupWrite",
         func() (int, error) {
@@ -96,9 +125,9 @@
         fileservice.IsRetryableError,
     )
     if err != nil {
-        return false
+        return false, err
     }
-    return true
+    return true, nil

Consider using a more robust error handling mechanism. Instead of returning a boolean, return an error that can provide more context about what went wrong during the extraction process.

pkg/sql/plan/function/func_llm.go [69-74]

-func extractTextFromPdfAndWriteToFile(pdfPath string, txtPath string, proc process.Process) bool {
+func extractTextFromPdfAndWriteToFile(pdfPath string, txtPath string, proc process.Process) error {
     // read PDF to string
     content, err := readPdfToString(pdfPath)
     if err != nil {
-        return false
+        return fmt.Errorf("failed to read PDF: %w", err)
     }

Suggestion importance[1-10]: 9

Why: The suggestion to return an error instead of a boolean provides more context about failures, improving error handling and debugging capabilities significantly.

Category: Enhancement · Score: 8

Add context parameter for better control over function execution

Consider adding a context parameter to the `extractTextFromPdfAndWriteToFile` function to allow for cancellation and timeout handling.

pkg/sql/plan/function/func_llm.go [69]

-func extractTextFromPdfAndWriteToFile(pdfPath string, txtPath string, proc process.Process) bool {
+func extractTextFromPdfAndWriteToFile(ctx context.Context, pdfPath string, txtPath string, proc process.Process) bool {

Suggestion importance[1-10]: 8

Why: Adding a context parameter allows for better control over function execution, enabling cancellation and timeout handling, which is a valuable enhancement for long-running operations.

Category: Enhancement · Score: 8

Enhance PDF parsing robustness and error handling

Consider implementing a more robust PDF parsing logic that can handle different PDF structures and potential parsing errors.

pkg/sql/plan/function/func_llm.go [112-121]

 func readPdfToString(path string) (string, error) {
     f, r, err := pdf.Open(path)
     if err != nil {
-        return "", err
+        return "", fmt.Errorf("failed to open PDF: %w", err)
     }
-    defer func() {
-        if f != nil {
-            f.Close()
-        }
-    }()
+    defer f.Close()
+
+    if r.IsEncrypted() {
+        return "", fmt.Errorf("encrypted PDFs are not supported")
+    }

Suggestion importance[1-10]: 8

Why: Improving PDF parsing logic to handle different structures and potential errors enhances the robustness and reliability of the function, making it more resilient to various input conditions.

Category: Performance · Score: 7

Implement concurrent processing of PDFs for improved performance

Consider using a buffered channel or a worker pool to process multiple PDFs concurrently, which could improve performance for large datasets.

pkg/sql/plan/function/func_llm.go [22-29]

+results := make(chan bool, rowCount)
 for i := uint64(0); i < rowCount; i++ {
-    inputBytes, nullInput := input.GetStrValue(i)
-    if nullInput {
-        if err := rs.AppendMustNullForBytesResult(); err != nil {
-            return err
+    go func(index uint64) {
+        inputBytes, nullInput := input.GetStrValue(index)
+        if nullInput {
+            results <- false
+            return
         }
-        continue
+        // Process PDF and send result
+        results <- extractTextFromPdfAndWriteToFile(...)
+    }(i)
+}
+
+for i := uint64(0); i < rowCount; i++ {
+    success := <-results
+    if err := rs.Append(success, false); err != nil {
+        return err
     }
+}

Suggestion importance[1-10]: 7

Why: The suggestion to use concurrent processing can significantly improve performance for large datasets, although it introduces complexity and requires careful handling of concurrency issues.
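A note on the concurrent-processing suggestion: reading results from a shared channel yields them in completion order, so the value appended for row i is not guaranteed to belong to row i. A minimal sketch of one order-preserving alternative follows. `processRows` and the `work` callback are hypothetical stand-ins for the per-row extraction, not MatrixOne APIs.

```go
package main

import (
	"fmt"
	"sync"
)

// processRows runs work for every row concurrently and stores each result at
// the row's own index, so output order matches input order. At most `workers`
// goroutines run at once. All names here are hypothetical.
func processRows(rows []string, workers int, work func(string) bool) []bool {
	if workers < 1 {
		workers = 1
	}
	results := make([]bool, len(rows))
	sem := make(chan struct{}, workers) // limits the goroutines in flight
	var wg sync.WaitGroup
	for i, row := range rows {
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` goroutines are running
		go func(i int, row string) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i] = work(row) // each goroutine writes a distinct element
		}(i, row)
	}
	wg.Wait()
	return results
}

func main() {
	rows := []string{"a.pdf", "", "c.pdf"}
	ok := processRows(rows, 2, func(p string) bool { return p != "" })
	fmt.Println(ok) // prints [true false true]
}
```

Writing into a preallocated slice avoids a mutex because no two goroutines touch the same element, and the results can then be appended to the result vector in row order.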

fengttt

How do you know you are extracting a PDF file? How about adding a const "file type" or "extractor type" arg to the function?

An extractor type arg might be better, in case we end up with many different kinds of extractors for the same file type (or one extractor for many file types).

fengttt

Please add a test.

charleschile · 2024-09-03T09:34:48Z

How do you know you are extracting a PDF file? How about adding a const "file type" or "extractor type" arg to the function?

An extractor type arg might be better, in case we end up with many different kinds of extractors for the same file type (or one extractor for many file types).

Extractor type arg added.
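For illustration, an extractor type argument can be resolved through a small registry that maps the type string to an implementation. This is a sketch under assumptions, not the code in this PR: `extractFunc`, `extractors`, and `lookupExtractor` are hypothetical names, the "pdf" entry is a placeholder rather than a real parser, and case-insensitive matching is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// extractFunc turns the raw bytes of a document into plain text.
type extractFunc func(data []byte) (string, error)

// extractors maps an extractor type string to its implementation. Only "pdf"
// appears in this PR; the body below is a placeholder, not a real parser.
var extractors = map[string]extractFunc{
	"pdf": func(data []byte) (string, error) {
		return "", fmt.Errorf("pdf extraction is not implemented in this sketch")
	},
}

// lookupExtractor resolves the extractor type argument. Ignoring case and
// surrounding whitespace is an assumption made for this sketch.
func lookupExtractor(kind string) (extractFunc, error) {
	fn, ok := extractors[strings.ToLower(strings.TrimSpace(kind))]
	if !ok {
		return nil, fmt.Errorf("unsupported extractor type %q", kind)
	}
	return fn, nil
}

func main() {
	if _, err := lookupExtractor("pdf"); err == nil {
		fmt.Println("pdf extractor found")
	}
	if _, err := lookupExtractor("docx"); err != nil {
		fmt.Println(err)
	}
}
```

With this shape, several extractors for the same file type, or one extractor covering several file types, become additional map entries rather than new function signatures.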

charleschile · 2024-09-03T14:32:09Z

Please add a test.

Unit and BVT tests have been added.

fengttt · 2024-09-03T15:57:28Z

test/distributed/cases/function/func_llm_extract_file.sql

@@ -0,0 +1,2 @@
+select llm_extract_text(cast('file://$resources/llm_test/extract_text/MODocs1.pdf?offset=0&size=4' as datalink), cast('file://$resources/llm_test/extract_text/MODocs1.txt' as datalink), "pdf");


Why do we need offset and size here?

charleschile · 2024-09-03

I included them to demonstrate that this datalink format is recognized by the function.
Offset and size are not used by llm_extract_text and do not affect the result.
Shall I delete them?

fengttt · 2024-09-04

Yeah, just remove those.
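To make the offset and size discussion concrete, the sketch below splits a file URL into its path and the optional `offset` and `size` query parameters using the standard `net/url` package. It is not the ParseDatalink implementation from pkg/container/types/datalink.go; `parseLink` and the "-1 means absent" convention are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// parseLink is a hypothetical stand-in for datalink parsing. It returns the
// file path plus the optional offset and size query parameters. Offset
// defaults to 0 and size to -1 when the parameter is absent.
func parseLink(raw string) (path string, offset, size int64, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", 0, 0, err
	}
	if u.Scheme != "file" {
		return "", 0, 0, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	size = -1
	q := u.Query()
	if v := q.Get("offset"); v != "" {
		if offset, err = strconv.ParseInt(v, 10, 64); err != nil {
			return "", 0, 0, fmt.Errorf("bad offset: %w", err)
		}
	}
	if v := q.Get("size"); v != "" {
		if size, err = strconv.ParseInt(v, 10, 64); err != nil {
			return "", 0, 0, fmt.Errorf("bad size: %w", err)
		}
	}
	return u.Path, offset, size, nil
}

func main() {
	p, off, size, err := parseLink("file:///tmp/MODocs1.pdf?offset=0&size=4")
	fmt.Println(p, off, size, err)
}
```

A function that always reads the whole file can ignore the last two return values, which matches the statement in the thread that offset and size do not affect the outcome.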

fengttt

Please add result verification in the bvt test.

cpegeric · 2024-09-03T16:18:50Z

pkg/sql/plan/function/func_llm.go

+	}
+
+	// file service and write file
+	ctx := context.TODO()


Try to get the context from proc. Only use context.TODO() when you have no choice.

cpegeric · 2024-09-03T16:20:12Z

pkg/sql/plan/function/func_llm.go

+	fs, readPath, err := fileservice.GetForETL(ctx, proc.Base.FileService, txtPath)
+
+	// delete the file if txt file exist because Write() only works when a file does not exist
+	_, err = fs.StatFile(ctx, readPath)


Just error out if the file exists. Otherwise you may delete a customer's file in the cloud when a stage URL is used.

cpegeric · 2024-09-03T16:21:42Z

pkg/sql/plan/function/func_llm.go

+}
+
+func readPdfToString(path string) (string, error) {
+	f, r, err := pdf.Open(path)


Can you use fileservice to open a pdf file?

LLM_EXTRACT_TEXT implementation

8a76a0b

charleschile requested review from m-schen, XuPeng-SH and zhangxu19830126 as code owners August 30, 2024 13:56

codiumai-pr-agent-pro bot added Enhancement Dependencies Review effort [1-5]: 3 labels Aug 30, 2024

mergify bot added the kind/feature label Aug 30, 2024

Merge branch 'main' into ospp/LLM_EXTRACT_TEXT

76f2764

fengttt requested changes Aug 31, 2024

Merge branch 'matrixorigin:main' into ospp/LLM_EXTRACT_TEXT

b805a99

fengttt requested changes Sep 1, 2024

charleschile added 2 commits September 3, 2024 17:16

fix bug about Write() only works when a file does not exist

f619224

fix bug about Only pdf file supported

373d39d

charleschile added 2 commits September 3, 2024 17:35

add extractor type arg

11b1dcc

add unit tests

f02d9be

charleschile requested review from aressu1985 and heni02 as code owners September 3, 2024 10:45

add BVT tests

f12a9b1

fengttt reviewed Sep 3, 2024

fengttt requested changes Sep 3, 2024

cpegeric requested changes Sep 3, 2024

charleschile added 3 commits September 9, 2024 14:40

Merge branch 'matrixorigin:main' into ospp/LLM_EXTRACT_TEXT

4b487dd

Merge branch 'matrixorigin:main' into ospp/LLM_EXTRACT_TEXT

227e8e2

Merge branch 'main' into ospp/LLM_EXTRACT_TEXT

5498d07

matrix-meow added the size/M Denotes a PR that changes [100,499] lines label Sep 9, 2024

fix bug

16580c7

charleschile requested review from daviszhen and qingxinhome as code owners September 14, 2024 17:26

fengttt approved these changes Sep 16, 2024

qingxinhome approved these changes Sep 20, 2024

m-schen approved these changes Sep 20, 2024
