Home
The goal of this project is to scrape the previous day’s New Criminal Filings at 6AM daily and email a .csv of the information and a roundup summary to people at the Philly Bail Fund (PBF). Every Sunday at 6:15AM, a second .csv and roundup, summarizing the week, is emailed. An example format can be seen at https://www.phillybailfund.org/weekly-bail-reports
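As a rough sketch of the schedule described above, the helpers below work out which dates each report would cover. The function names are hypothetical, and the weekly window (the seven days ending the day before the Sunday run) is an assumption; the text only says the Sunday email is a weekly summary.

```python
from datetime import date, timedelta
from typing import Tuple


def daily_window(run_date: date) -> Tuple[date, date]:
    """Return (start, end) for the daily report: the day before the run."""
    previous_day = run_date - timedelta(days=1)
    return previous_day, previous_day


def weekly_window(run_date: date) -> Tuple[date, date]:
    """Return (start, end) for the weekly report.

    Assumption: the report covers the seven days ending the day
    before the Sunday run.
    """
    end = run_date - timedelta(days=1)
    start = end - timedelta(days=6)
    return start, end


def is_weekly_run(run_date: date) -> bool:
    """True on Sundays (date.weekday() counts Monday as 0, Sunday as 6)."""
    return run_date.weekday() == 6
```

A scheduled job could call `daily_window` every morning and call `weekly_window` only when `is_weekly_run` is true.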
The goal of this project is to programmatically download dockets, extract data from them, and enter that data into the database. The main technical problem to solve is that dockets cannot easily be downloaded programmatically. Some people have done work on this using a headless/embedded browser. The data we need from the dockets is noted in the Data Dictionary in the project’s GitHub repository.
Once we can download dockets, there will be a one-time need to download and parse everything going back to January 1, 2020. There is also an ongoing need to scrape the previous day’s dockets, which will get correlated with New Criminal Filing data and entered into the database.
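The two needs above (a one-time backfill and an ongoing daily run) both come down to producing a list of dates to fetch. Here is a minimal sketch, assuming the backfill runs through yesterday; the function names are hypothetical and the download step itself is left out.

```python
from datetime import date, timedelta
from typing import Iterator

# Start of the one-time backfill, as stated in the text above.
BACKFILL_START = date(2020, 1, 1)


def filing_dates(start: date, end: date) -> Iterator[date]:
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill_dates(today: date) -> Iterator[date]:
    """Dates for the one-time backfill: January 1, 2020 through yesterday."""
    return filing_dates(BACKFILL_START, today - timedelta(days=1))
```

The daily job is then the special case where `start` and `end` are both yesterday.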
The database/data lake is the destination for the New Criminal Filing, Docket, and Court Summary data. Each row is a docket number. The Data Dictionary notes which data needs to end up in the database. The database is the backend to the dashboard.
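One way to picture "each row is a docket number" is a merge of records from the different sources, keyed on the docket number. This is only a sketch: the field names (`docket_number`, `bail_set`, `magistrate`) and the docket number values are placeholders, and the real columns are defined in the Data Dictionary.

```python
from typing import Dict, Iterable, Mapping


def merge_by_docket(*sources: Iterable[Mapping[str, str]]) -> Dict[str, Dict[str, str]]:
    """Combine records from several sources into one row per docket number.

    If two sources provide the same field, the later source wins.
    """
    rows: Dict[str, Dict[str, str]] = {}
    for source in sources:
        for record in source:
            docket_number = record["docket_number"]
            rows.setdefault(docket_number, {}).update(record)
    return rows


# Placeholder records, not real data.
filings = [{"docket_number": "EXAMPLE-0001", "bail_set": "yes"}]
dockets = [{"docket_number": "EXAMPLE-0001", "magistrate": "Magistrate A"}]
merged = merge_by_docket(filings, dockets)
```

Here `merged` holds a single row for `EXAMPLE-0001` that carries both the filing fields and the docket fields.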
The end goal of the data collection/storage is to drive an interactive dashboard embedded on the Philly Bail Fund Squarespace site. In general it should be something like https://data.philadao.com/Bail_Report.html, but the data team has a lot of leeway in terms of telling a story with the data. PBF is especially interested in examining whether some magistrates set higher bail while controlling for other constants. This requires the data from the Dockets.
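As a very simple illustration of comparing magistrates while holding something else constant, the sketch below groups bail amounts by charge and magistrate and takes the median of each group. This is a basic stratified comparison, not the analysis the data team will necessarily use, and the field names and values are made up for the example.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Mapping, Tuple


def median_bail_by_charge_and_magistrate(
    rows: Iterable[Mapping[str, object]],
) -> Dict[Tuple[str, str], float]:
    """Return the median bail for each (charge, magistrate) pair.

    Comparing magistrates within the same charge keeps the charge constant.
    """
    groups: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for row in rows:
        key = (str(row["charge"]), str(row["magistrate"]))
        groups[key].append(float(row["bail_amount"]))
    return {key: median(amounts) for key, amounts in groups.items()}


# Invented placeholder rows, not real cases.
rows = [
    {"charge": "Charge X", "magistrate": "Magistrate A", "bail_amount": 1000},
    {"charge": "Charge X", "magistrate": "Magistrate A", "bail_amount": 3000},
    {"charge": "Charge X", "magistrate": "Magistrate B", "bail_amount": 5000},
    {"charge": "Charge Y", "magistrate": "Magistrate B", "bail_amount": 200},
]
medians = median_bail_by_charge_and_magistrate(rows)
```

Within "Charge X", the two magistrates can then be compared directly. A fuller analysis would control for more than one factor at a time, which is why the docket data is needed.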
New Criminal Filings: Every day, all new criminal cases filed in the city are posted to this website: https://www.courts.phila.gov/NewCriminalFilings/date/. These contain general information about the case, like the person’s name, the main charge, whether bail was set, and the docket number.
Docket: Dockets are PDFs containing detailed information about a given case. They are publicly available at this website: https://ujsportal.pacourts.us/DocketSheets/MC.aspx. Dockets contain information that the New Criminal Filings website does not, like the name of the magistrate who set the bail and the defendant’s race. A docket is updated as the case progresses to reflect the next court hearing, as well as the case outcome when it concludes. The URL to access a given docket contains a unique hash, so dockets are difficult to scrape.
Court Summary: Court Summaries are also available from the docket website. They contain a defendant’s entire PA court history, so they’ll have information about a defendant’s current case as well as any other ongoing or concluded ones. They also have unique hashed URLs.
Scraping: Scraping is the process of extracting data, specifically from New Criminal Filings and dockets, so that the data can be compiled and analyzed.
Parsing: Once data has been scraped, or pulled from New Criminal Filings and dockets, the data has to be converted, or parsed, into a more readable format. Parsing the data allows for better analysis.
Script: A script is a sequence of instructions that is interpreted and run by a program, such as the Python interpreter.
Query: Once data has been scraped and parsed into a large database, specific metrics or measures can be pulled in smaller pieces using a written request, or query.
API: An application programming interface (API) is a computing interface which defines interactions between multiple software intermediaries. It defines the kinds of calls or requests that can be made, how to make them, the data formats that should be used, the conventions to follow, etc.
Cloud: The cloud is a general metaphor that is used to refer to the Internet. For our purposes, the "cloud" refers to online database storage.
CSV: A comma-separated values (CSV) file is a plain-text file that stores a table of data. Each line of the file is a row, and the values in a row are separated by a delimiter, usually a comma and sometimes a semicolon. CSV files are primarily used to move data between programs or databases that use different formats.
Dashboard: After specific metrics have been queried from the data, it's often useful to visualize and display key findings using a dashboard, a graphical user interface that allows an at-a-glance look at the data.
Athena (AWS): Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
S3 bucket (AWS): Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.
R (programming language): R is a programming language and free software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
GitHub Actions: GitHub Actions is GitHub's automation service. It runs a workflow in response to an event in a repository, such as a push or a scheduled time, while GitHub manages the execution and reports the results.
Issue (GitHub): Issues are a way to keep track of tasks, enhancements, and bugs for projects. They're kind of like email, except they can be shared and discussed with the rest of the team.
Pull request (GitHub): Pull requests are the heart of collaboration on GitHub. When you open a pull request, you’re proposing your changes and requesting that someone review your contribution and merge it into their branch.
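To make the CSV and Query entries in the glossary above more concrete, here is a small sketch that reads CSV text into rows and then runs a SQL query over them. It uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in; the project's actual query service is Athena. The table name, column names, and values are placeholders.

```python
import csv
import io
import sqlite3
from typing import Dict, List

# Placeholder CSV content, not real data.
CSV_TEXT = "docket_number,bail_amount\nEXAMPLE-0001,1000\nEXAMPLE-0002,2500\n"


def load_csv(text: str) -> List[Dict[str, str]]:
    """Parse CSV text into a list of rows, one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


def total_bail(records: List[Dict[str, str]]) -> int:
    """Load rows into an in-memory table and query the total bail amount."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE filings (docket_number TEXT PRIMARY KEY, bail_amount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO filings VALUES (:docket_number, :bail_amount)", records
    )
    (total,) = conn.execute("SELECT SUM(bail_amount) FROM filings").fetchone()
    conn.close()
    return total


records = load_csv(CSV_TEXT)
```

Calling `total_bail(records)` runs the `SELECT SUM(...)` query, which is the kind of written request the Query entry describes.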