Development

This document describes the process for running this application on your local machine.

Important

This software was developed and tested only on Ubuntu 22.04.

Development

Getting started

git clone https://github.com/stoyanK7/BG-DE-Anki-Decks.git
cd BG-DE-Anki-Decks
pipenv sync --dev

Activating the environment

pipenv shell

Running the whole pipeline

./run.sh

Running an individual step

python3 src/XX_step_you_want_to_run.py
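To see which steps exist before picking one, a small helper can list them in order. This is only a sketch: it assumes the `XX` placeholder stands for a two-digit numeric prefix and that all step scripts sit directly in `src/`. The `list_steps` function and the pattern are hypothetical and not part of the repository.

```python
import re
from pathlib import Path

# Assumption: step scripts start with a two-digit prefix followed by "_",
# as the XX placeholder suggests. The real naming scheme may differ.
STEP_PATTERN = re.compile(r"^(\d{2})_.+\.py$")


def list_steps(src_dir):
    """Return step script names found in src_dir, ordered by numeric prefix."""
    names = [p.name for p in Path(src_dir).iterdir() if p.is_file()]
    steps = [n for n in names if STEP_PATTERN.match(n)]
    return sorted(steps, key=lambda n: int(STEP_PATTERN.match(n).group(1)))


if __name__ == "__main__":
    # Print a ready-to-copy command for each step, if src/ exists.
    if Path("src").is_dir():
        for name in list_steps("src"):
            print(f"python3 src/{name}")
```

Run from the repository root, this prints one `python3 src/...` command per step, which can then be copied and executed individually.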

Running the linter

ruff format . && ruff check --fix .

Pipeline explanations

flowchart LR
    inputPdfFile["`**data/input/Goethe-Zertifikat_B1_Wortliste.pdf**
    -----------
    Input file. Downloaded from the Goethe-Institut's website.`"]
    
    convertPdfToTxtScript[["`**convert_pdf_to_txt.py**
    -----------
    Convert *data/input/\*.pdf* to a text file - *data/output/raw/\*.txt*`"]]
    
    rawTxtFile["`**data/output/raw/Goethe-Zertifikat_B1_Wortliste.txt**
    -----------
    Text representation of the input PDF file.`"]
    
    manuallyEdit["`**Manually edit data/output/raw/\*.txt**
    -----------
    This step is unavoidable. It is faster to catch edge cases and fix them manually than to come up with an algorithm that handles them.
    This step can occur during any of the steps below.`"]

    cleanTxtScript[["`**clean_txt.py**
    -----------
    Clean *data/output/raw/\*.txt*`"]]
    cleanedTxtFile["Cleaned TXT file"]
    preprocessTxtScript[["Preprocess TXT"]]
    preprocessedTxtFile["Preprocessed TXT file"]
    parseTxtScript[["Parse TXT"]]
    rawCsvFile[("Raw CSV file")]
    cleanCsvScript[["Clean CSV"]]
    cleanedCsvFile[("Cleaned CSV file")]
    preprocessCsvScript[["Preprocess CSV"]]
    preprocessedCsvFile[("Preprocessed CSV file")]

    convertPdfToTxtScript -->|Reads| inputPdfFile
    convertPdfToTxtScript -->|Writes| rawTxtFile
    manuallyEdit -->|Edits| rawTxtFile
    cleanTxtScript -->|Reads| rawTxtFile
    cleanTxtScript -->|Writes| cleanedTxtFile
    preprocessTxtScript -->|Reads| cleanedTxtFile
    preprocessTxtScript -->|Writes| preprocessedTxtFile
    
    parseTxtScript -->|Reads| preprocessedTxtFile
    parseTxtScript -->|Writes| rawCsvFile
    
    cleanCsvScript -->|Reads| rawCsvFile
    cleanCsvScript -->|Writes| cleanedCsvFile
    
    preprocessCsvScript -->|Reads| cleanedCsvFile
    preprocessCsvScript -->|Writes| preprocessedCsvFile
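
The chain in the flowchart can also be written down as data, which makes it easy to work out what to re-run after a manual edit. The sketch below is illustrative only: stage and artifact labels are taken from the flowchart (only the first two stages name actual script files there), and `is_linear_chain` and `stages_after` are hypothetical helpers, not code from this repository.

```python
# Each stage: (label from the flowchart, artifact it reads, artifact it writes).
# Assumption: every stage has exactly one input and one output.
STAGES = [
    ("convert_pdf_to_txt.py", "input PDF file", "raw TXT file"),
    ("clean_txt.py", "raw TXT file", "cleaned TXT file"),
    ("Preprocess TXT", "cleaned TXT file", "preprocessed TXT file"),
    ("Parse TXT", "preprocessed TXT file", "raw CSV file"),
    ("Clean CSV", "raw CSV file", "cleaned CSV file"),
    ("Preprocess CSV", "cleaned CSV file", "preprocessed CSV file"),
]


def is_linear_chain(stages):
    """True if every stage reads exactly what the previous stage wrote."""
    return all(prev[2] == nxt[1] for prev, nxt in zip(stages, stages[1:]))


def stages_after(artifact, stages=STAGES):
    """Labels of the stages to re-run once the given artifact has changed."""
    for i, (_, reads, _) in enumerate(stages):
        if reads == artifact:
            return [label for label, _, _ in stages[i:]]
    return []


if __name__ == "__main__":
    # After manually editing the raw TXT file, everything from the
    # cleaning stage onward needs to run again.
    print(stages_after("raw TXT file"))
```

Keeping the stages in one ordered list reflects that the pipeline is strictly sequential: changing any intermediate file invalidates every artifact downstream of it.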
