README for Data Ingestion Module

This module provides functions for fetching and processing data from several kinds of sources: plain text, URLs, YouTube videos, and PDF files. It relies on libraries such as BeautifulSoup and OpenAI, plus a YouTube client library, to extract and transform the data.

Functions

fetch_pdf:
- Description: Fetches content from a PDF file.
- Parameters:
  - path: Path to the PDF file.
  - support_image: Whether to extract images from the PDF (default: True).
  - base_path: Base directory for temporary files (default: "Data/").
- Return: The content of the PDF file.
tag_visible:
- Description: Identifies visible text elements from a given webpage.
- Parameters: The HTML text element to check.
- Return: A boolean indicating whether the element is visible.
text_from_html:
- Description: Extracts text from a webpage.
- Parameters: The HTML content of the webpage.
- Return: The extracted text.
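
The two helpers above follow a common pattern: walk the parsed page, skip text that sits inside non-rendered tags, and join what remains. Here is a minimal sketch of that idea. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies. The names `VisibleTextParser`, `HIDDEN_TAGS`, and `visible_text_from_html` are hypothetical and the list of hidden tags is an assumption; this is not the module's actual code.

```python
from html.parser import HTMLParser

# Tags whose text a browser does not render in the page body
# (an assumed list, chosen for illustration).
HIDDEN_TAGS = {"style", "script", "head", "title"}


class VisibleTextParser(HTMLParser):
    """Collects text that is not nested inside a hidden tag."""

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0  # greater than 0 while inside a hidden tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # HTML comments never reach this method; they go to handle_comment.
        text = data.strip()
        if text and self._hidden_depth == 0:
            self.chunks.append(text)


def visible_text_from_html(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The depth counter plays the role of the visibility check: a text node is kept only when no enclosing tag is in the hidden set.
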
extract_text_from_website:
- Description: Extracts text content from a given website.
- Parameters: The URL of the website.
- Return: The extracted text.
download_audio:
- Description: Downloads an audio file from a YouTube video.
- Parameters:
  - url: The URL of the YouTube video.
  - base_path: Base directory for the downloaded file (default: "Data/").
- Return: The path of the downloaded file.
youtube_to_transcript:
- Description: Converts a YouTube video into a transcript.
- Parameters:
  - url: The URL of the YouTube video.
  - del_audio: Whether to delete the audio file after transcription (default: True).
  - base_path: Base directory for the audio file (default: "Data/").
- Return: The transcript of the video.
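
The parameters suggest a three-step flow: download the audio, transcribe it, then clean up. Below is a rough sketch of that control flow. The download and transcription steps are passed in as callables so that the sketch does not depend on any particular YouTube or speech-to-text library. The function name `transcribe_with_cleanup` and both callables are hypothetical; only `url`, `del_audio`, and `base_path` come from the description above.

```python
import os


def transcribe_with_cleanup(url, download, transcribe, del_audio=True, base_path="Data/"):
    """Download audio, transcribe it, and optionally delete the audio file.

    download(url, base_path) -> path of the audio file  (hypothetical callable)
    transcribe(path)         -> transcript text          (hypothetical callable)
    """
    audio_path = download(url, base_path)
    try:
        return transcribe(audio_path)
    finally:
        # Runs whether or not transcription raised, so no stray files remain
        # when del_audio is set.
        if del_audio and os.path.exists(audio_path):
            os.remove(audio_path)
```

Putting the cleanup in a `finally` block is a design choice of this sketch: the audio file is removed even when transcription fails.
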
fetch_input:
- Description: Returns text from a given input. The input could be either text or a URL.
- Parameters:
  - content: The input content.
  - type: The type of input (default: "text").
  - input_images_pdf: Whether to include images from PDF files (default: False).
- Return: The extracted text.
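
One way to picture fetch_input is as a dispatcher that picks a handler based on the type argument. The sketch below is illustrative only. The README does not list the accepted type values, so "url" and any other key in the handler table are assumptions, as are the names `dispatch_input` and `handlers`.

```python
def dispatch_input(content, input_type="text", handlers=None):
    """Return text for `content`, using the handler registered for `input_type`.

    handlers maps a type name to a callable that takes the content and
    returns text (hypothetical stand-ins for the module's fetch functions).
    """
    if input_type == "text":
        return content  # plain text needs no fetching
    handler = (handlers or {}).get(input_type)
    if handler is None:
        raise ValueError(f"Unsupported input type: {input_type!r}")
    return handler(content)
```

For example, `dispatch_input("https://example.com", "url", {"url": extract_text_from_website})` would route a URL to the website extractor, assuming "url" is the type name used.
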

Usage

To use this module, install the required libraries and set the environment variables described below. Each function's parameters and return value are documented in the Functions section above.

Environment Variables

The module reads API keys and other configuration settings from environment variables. Set these variables in a .env file in the root directory of your project.
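
A .env file is a plain list of KEY=value lines. Projects commonly load it with the python-dotenv package; the sketch below does a simplified version of the same job using only the standard library, so the behaviour is easy to inspect. The variable name `OPENAI_API_KEY` is an assumption (this README does not name the variables), and `load_env_file` is a hypothetical helper.

```python
import os


def load_env_file(path=".env"):
    """Read KEY=value lines from `path` into os.environ; return the parsed pairs."""
    loaded = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            # Skip blank lines, comments, and anything that is not KEY=value.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key = key.strip()
            value = value.strip().strip("'\"")  # drop optional surrounding quotes
            # Values already present in the real environment take precedence.
            os.environ.setdefault(key, value)
            loaded[key] = value
    return loaded


# Example (assumed variable name):
# load_env_file()
# api_key = os.environ.get("OPENAI_API_KEY")
```
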

Example Usage

Here are some examples of how to use the functions in this module:

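# Assumes data_ingest.py is on the import path
from data_ingest import fetch_pdf, extract_text_from_website, youtube_to_transcript

# Fetch text from a PDF file
pdf_content = fetch_pdf("path/to/pdf.pdf")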

# Fetch text from a webpage
web_content = extract_text_from_website("https://example.com")

# Fetch transcript from a YouTube video
transcript = youtube_to_transcript("https://www.youtube.com/watch?v=dQw4w9WgXcQ")

Limitations

- Transcription currently supports English-language videos only.
- Audio files are currently limited to 25MB.
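
Because of the 25MB cap, it can help to check an audio file before sending it for transcription. Here is a small sketch of such a check. It assumes the limit means 25 * 1024 * 1024 bytes (the README only says "25MB"), and `check_audio_size` is a hypothetical helper name.

```python
import os

MAX_AUDIO_BYTES = 25 * 1024 * 1024  # assumed interpretation of "25MB"


def check_audio_size(path, limit=MAX_AUDIO_BYTES):
    """Return the file size in bytes, or raise ValueError if it exceeds `limit`."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(f"{path} is {size} bytes; the limit is {limit} bytes")
    return size
```
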

Future Development

- Support for other languages for transcription.
- Support for larger audio files.
- Additional features for handling different types of inputs.

Repository: knitlabs/llmPipelines
