DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers a low-code, declarative YAML interface to define LLM-powered operations on complex data.
DocETL is the ideal choice when you're looking to maximize correctness and output quality for complex tasks over a collection of documents or unstructured datasets. You should consider using DocETL if:
- You want to perform semantic processing on a collection of data
- You have complex tasks that you want to represent via map-reduce
- You're unsure how to best express your task to maximize LLM accuracy
- You're working with long documents that don't fit into a single prompt
- You have validation criteria and want tasks to automatically retry when validation fails
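
To make the map-reduce and retry-on-validation ideas above concrete, here is a minimal plain-Python sketch. It does not use DocETL's API: `map_with_retry`, `map_reduce`, `toy_extract`, and `has_topic` are hypothetical names, and `toy_extract` is a stand-in for an LLM call that deliberately returns a bad answer on its first attempt.

```python
from collections import defaultdict


def map_with_retry(doc, extract, validate, max_attempts=3):
    """Run extract on one document, retrying until validate accepts the result."""
    for attempt in range(1, max_attempts + 1):
        result = extract(doc, attempt)
        if validate(result):
            return result
    raise ValueError(f"validation failed after {max_attempts} attempts")


def map_reduce(docs, extract, validate, key, combine):
    """Map every document, group the results by key, then combine each group."""
    groups = defaultdict(list)
    for doc in docs:
        result = map_with_retry(doc, extract, validate)
        groups[key(result)].append(result)
    return {k: combine(results) for k, results in groups.items()}


def toy_extract(doc, attempt):
    # Hypothetical stand-in for an LLM call: the topic is missing on attempt 1.
    topic = doc.split(":")[0] if attempt > 1 else ""
    return {"topic": topic, "words": len(doc.split())}


def has_topic(result):
    # Validation rule: the extracted topic must be non-empty.
    return bool(result["topic"])


docs = [
    "billing: refund not received",
    "billing: charged twice today",
    "login: password reset fails",
]
summary = map_reduce(
    docs,
    toy_extract,
    has_topic,
    key=lambda r: r["topic"],
    combine=lambda results: sum(r["words"] for r in results),
)
print(summary)
```

Each document is retried until its output passes validation, and only validated results reach the reduce step. In DocETL itself these steps are declared in YAML rather than written by hand.
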
To install DocETL, you need:
- Python 3.10 or later
- OpenAI API key

Then install from PyPI:
pip install docetl
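
If the install or a first run fails, a quick check of the two prerequisites can help. The sketch below is not part of DocETL: `check_prerequisites` is a hypothetical helper, and it assumes the OpenAI API key is supplied through the `OPENAI_API_KEY` environment variable.

```python
import os
import sys


def check_prerequisites(env=None, version=None):
    """Return a list of human-readable problems; an empty list means all good."""
    env = os.environ if env is None else env
    version = sys.version_info if version is None else version
    problems = []
    # DocETL requires Python 3.10 or later.
    if tuple(version[:2]) < (3, 10):
        problems.append(f"Python 3.10+ required, found {version[0]}.{version[1]}")
    # Assumption: the key is read from the OPENAI_API_KEY environment variable.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    for issue in issues:
        print("Problem:", issue)
    if not issues:
        print("Prerequisites look fine.")
```
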
To see examples of how to use DocETL, check out the tutorial.
We offer a simple UI for building pipelines. We recommend building up complex pipelines one operation at a time, so you can see the results of each operation as you go and iterate on your pipeline. To run it locally, follow these steps:
- Clone the repository:
git clone https://github.com/ucbepic/docetl.git
cd docetl
- Install dependencies:
make install # Install Python package
make install-ui # Install UI dependencies
- Set up environment variables in .env:
OPENAI_API_KEY=your_api_key_here
BACKEND_ALLOW_ORIGINS=
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000
- Start the development server:
make run-ui-dev
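
If the development server does not start, one thing worth checking is whether .env defines every variable listed above. The following is a rough sketch, not part of the repository: `parse_env` and `missing_keys` are hypothetical helpers, and the parser only handles simple `KEY=value` lines (no quoting or `export` prefixes).

```python
from pathlib import Path

# Variable names taken from the .env example above.
EXPECTED_KEYS = [
    "OPENAI_API_KEY",
    "BACKEND_ALLOW_ORIGINS",
    "BACKEND_HOST",
    "BACKEND_PORT",
    "BACKEND_RELOAD",
    "FRONTEND_HOST",
    "FRONTEND_PORT",
]


def parse_env(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text, expected=EXPECTED_KEYS):
    """Return expected keys that are absent; an empty value counts as present."""
    values = parse_env(text)
    return [key for key in expected if key not in values]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        missing = missing_keys(env_path.read_text())
        print("Missing keys:", missing if missing else "none")
    else:
        print("No .env file found in the current directory.")
```
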
If you're planning to contribute or modify DocETL, you can verify your setup by running the test suite:
make tests-basic # Runs basic test suite (costs < $0.01 with OpenAI)
For detailed documentation and tutorials, visit our documentation.