Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LLM_EXTRACT_TEXT implementation #18435

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

charleschile
Copy link
Contributor

@charleschile charleschile commented Aug 30, 2024

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #18664

What this PR does / why we need it:

As part of our document LLM support, we are introducing the LLM_EXTRACT_TEXT function. This function extracts text from PDF files and writes the extracted text to a specified text file, extractor type can be specified by the third argument.

Usage: llm_extract_text(<input PDF datalink>, <output text file datalink>, <extractor type string>);

Return Value: A boolean indicating whether the extraction and writing process was successful.

Note:

  • Both the input and output paths must be absolute paths.

Example SQL:

select llm_extract_text(cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4' as datalink), cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt' as datalink), "pdf");

Example return:

mysql> select llm_extract_text(cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4' as datalink), cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt' as datalink), "pdf");
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| llm_extract_text(cast(file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4 as datalink), cast(file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt as datalink), pdf)   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| true                                                                                                                                                                                                                                                                                                                    |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.10 sec)

Copy link

PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

PR Reviewer Guide 🔍

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 Security concerns

Potential security vulnerability:
The removal of file type validation in the ParseDatalink function (pkg/container/types/datalink.go) could potentially allow processing of unexpected file types, which may lead to security risks such as arbitrary file read or write operations. This change should be carefully reviewed to ensure it doesn't introduce vulnerabilities in the system.

⚡ Key issues to review

Potential Security Risk
Removal of file type validation in ParseDatalink function may lead to security vulnerabilities by allowing processing of unexpected file types.

Error Handling
The function does not properly handle errors in the PDF reading process. It should return an error instead of a boolean.

Resource Management
The PDF file is not properly closed in case of an error. Consider using defer immediately after successful open.

Copy link

codiumai-pr-agent-pro bot commented Aug 30, 2024

PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

PR Code Suggestions ✨

CategorySuggestion                                                                                                                                    Score
Error handling
✅ Improve error handling by returning an error instead of a boolean
Suggestion Impact:The function extractTextFromPdfAndWriteToFile was modified to return an error along with a boolean, improving error handling by providing more context about failures.

code diff:

-func extractTextFromPdfAndWriteToFile(pdfPath string, txtPath string, proc *process.Process) bool {
+func extractTextFromPdfAndWriteToFile(pdfPath string, txtPath string, proc *process.Process) (bool, error) {
 	// read PDF to string
 	content, err := readPdfToString(pdfPath)
 	if err != nil {
-		return false
+		return false, moerr.NewInvalidInputNoCtxf("Invalid PDF input.")
 	}
 
 	// file service and write file
 	ctx := context.TODO()
 	fs, readPath, err := fileservice.GetForETL(ctx, proc.Base.FileService, txtPath)
 
-	// NOTE: Write only works when a file does not exist
+	// delete the file if txt file exist because Write() only works when a file does not exist
+	_, err = fs.StatFile(ctx, readPath)
+	if err == nil {
+		err1 := fs.Delete(ctx, readPath)
+		if err1 != nil {
+			return false, moerr.NewInvalidInputNoCtxf("Cannot remove file %s", readPath)
+		}
+	}
+
 	_, err = fileservice.DoWithRetry(
 		"BackupWrite",
 		func() (int, error) {
@@ -96,9 +125,9 @@
 		fileservice.IsRetryableError,
 	)
 	if err != nil {
-		return false
+		return false, err
 	}
-	return true
+	return true, nil

Consider using a more robust error handling mechanism. Instead of returning a
boolean, return an error that can provide more context about what went wrong during
the extraction process.

pkg/sql/plan/function/func_llm.go [69-74]

-func extractTextFromPdfAndWriteToFile(pdfPath string, txtPath string, proc *process.Process) bool {
+func extractTextFromPdfAndWriteToFile(pdfPath string, txtPath string, proc *process.Process) error {
     // read PDF to string
     content, err := readPdfToString(pdfPath)
     if err != nil {
-        return false
+        return fmt.Errorf("failed to read PDF: %w", err)
     }
 
  • Apply this suggestion
Suggestion importance[1-10]: 9

Why: The suggestion to return an error instead of a boolean provides more context about failures, improving error handling and debugging capabilities significantly.

9
Enhancement
Add context parameter for better control over function execution

Consider adding a context parameter to the extractTextFromPdfAndWriteToFile function
to allow for cancellation and timeout handling.

pkg/sql/plan/function/func_llm.go [69]

-func extractTextFromPdfAndWriteToFile(pdfPath string, txtPath string, proc *process.Process) bool {
+func extractTextFromPdfAndWriteToFile(ctx context.Context, pdfPath string, txtPath string, proc *process.Process) bool {
 
  • Apply this suggestion
Suggestion importance[1-10]: 8

Why: Adding a context parameter allows for better control over function execution, enabling cancellation and timeout handling, which is a valuable enhancement for long-running operations.

8
Enhance PDF parsing robustness and error handling

Consider implementing a more robust PDF parsing logic that can handle different PDF
structures and potential parsing errors.

pkg/sql/plan/function/func_llm.go [112-121]

 func readPdfToString(path string) (string, error) {
     f, r, err := pdf.Open(path)
     if err != nil {
-        return "", err
+        return "", fmt.Errorf("failed to open PDF: %w", err)
     }
-    defer func() {
-        if f != nil {
-            f.Close()
-        }
-    }()
+    defer f.Close()
+    
+    if r.IsEncrypted() {
+        return "", fmt.Errorf("encrypted PDFs are not supported")
+    }
 
  • Apply this suggestion
Suggestion importance[1-10]: 8

Why: Improving PDF parsing logic to handle different structures and potential errors enhances the robustness and reliability of the function, making it more resilient to various input conditions.

8
Performance
Implement concurrent processing of PDFs for improved performance

Consider using a buffered channel or a worker pool to process multiple PDFs
concurrently, which could improve performance for large datasets.

pkg/sql/plan/function/func_llm.go [22-29]

+results := make(chan bool, rowCount)
 for i := uint64(0); i < rowCount; i++ {
-    inputBytes, nullInput := input.GetStrValue(i)
-    if nullInput {
-        if err := rs.AppendMustNullForBytesResult(); err != nil {
-            return err
+    go func(index uint64) {
+        inputBytes, nullInput := input.GetStrValue(index)
+        if nullInput {
+            results <- false
+            return
         }
-        continue
+        // Process PDF and send result
+        results <- extractTextFromPdfAndWriteToFile(...)
+    }(i)
+}
+
+for i := uint64(0); i < rowCount; i++ {
+    success := <-results
+    if err := rs.Append(success, false); err != nil {
+        return err
     }
+}
 
  • Apply this suggestion
Suggestion importance[1-10]: 7

Why: The suggestion to use concurrent processing can significantly improve performance for large datasets, although it introduces complexity and requires careful handling of concurrency issues.

7

@mergify mergify bot added the kind/feature label Aug 30, 2024
Copy link
Contributor

@fengttt fengttt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you know you are extracting a pdf file? How about add a const "file type" or "extractor type" arg to the function?

Extractor type arg might be better, if we happen to have many different kind of extractors for same file type (or a extractor for many file types).

Copy link
Contributor

@fengttt fengttt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test.

@charleschile
Copy link
Contributor Author

How do you know you are extracting a pdf file? How about add a const "file type" or "extractor type" arg to the function?

Extractor type arg might be better, if we happen to have many different kind of extractors for same file type (or a extractor for many file types).

Extractor type arg added

@charleschile
Copy link
Contributor Author

Please add a test.

Unit and BVT tests have been added.

@@ -0,0 +1,2 @@
select llm_extract_text(cast('file://$resources/llm_test/extract_text/MODocs1.pdf?offset=0&size=4' as datalink), cast('file://$resources/llm_test/extract_text/MODocs1.txt' as datalink), "pdf");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need offset and size here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I included them here to demonstrate that this format of datalink is recognized by the function.
Offset and size are not utilized in the llm_extract_text function and will not affect the outcome.
Shall I delete them?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, just remove those.

Copy link
Contributor

@fengttt fengttt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add result verification in the bvt test.

}

// file service and write file
ctx := context.TODO()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try to get the context from proc. Only use context.TODO() when you have no choice.

fs, readPath, err := fileservice.GetForETL(ctx, proc.Base.FileService, txtPath)

// delete the file if txt file exist because Write() only works when a file does not exist
_, err = fs.StatFile(ctx, readPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just error out if file exists. You may delete the customer file in the cloud if you use stage URL.

}

func readPdfToString(path string) (string, error) {
f, r, err := pdf.Open(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use fileservice to open a pdf file?

@matrix-meow matrix-meow added the size/M Denotes a PR that changes [100,499] lines label Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants