This is a collection of tools for scraping the website of SIAN (the National Archives of Brazil): https://sian.an.gov.br/. The website holds an assortment of national documents, grouped into dossiers according to their origin. Files are saved in the format the source provides; most of them are PDFs.
The fundo.php script generates two lists:
- a list of all pages in a fund on SIAN's website
- a list of links to all files on the specified page, each with the link to its respective dossier.
You must edit the file in two places before running it.
First, edit your login credentials:
//Login
$params2 = [
"login" => "YOUR_LOGIN",
"senha" => "YOUR_PASSWORD",
Then, on line 82, replace ENDEREÇO with the address of the host where the script is deployed. If the server runs locally, use localhost.
header("Location: https://ENDEREÇO/fundo.php?colecao=".$_GET["colecao"]."&pag=".$_GET["pag"]."&time=".time());
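The redirect above sends the script back to itself with the same fund and page plus a timestamp. The following is a minimal sketch of how that URL is assembled, not the code from fundo.php: the helper name buildFundoRedirect and its parameters are hypothetical, and $host stands in for the ENDEREÇO placeholder. Unlike the plain string concatenation in the original line, http_build_query also URL-encodes the values.

```php
<?php
// Hypothetical helper (not part of fundo.php): builds the self-redirect URL.
// $host plays the role of the ENDEREÇO placeholder, e.g. "localhost".
function buildFundoRedirect(string $host, string $colecao, string $pag, int $now): string
{
    // http_build_query URL-encodes each value and joins the pairs with "&".
    $query = http_build_query(
        ['colecao' => $colecao, 'pag' => $pag, 'time' => $now],
        '',
        '&'
    );

    return "https://{$host}/fundo.php?{$query}";
}

// Example with made-up values for the fund ID and page number.
echo buildFundoRedirect('localhost', '123', '2', time()), PHP_EOL;
```

In the script itself the resulting string is passed to header("Location: ..."), so only the host part needs to be changed by hand.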
To run the script, open it in your browser and replace "ID" with the ID of the fund you want to download:
https://localhost/fundo.php?colecao=ID
For performance reasons, we recommend using the script together with IDM's Grabber tool or another download manager of your choice.
The dossie.php script generates local HTML files with the entire content of the pages of a dossier listed on SIAN's website. You must edit the file in two places before running it.
First, edit your login credentials:
//Login
$params2 = [
"login" => "YOUR_LOGIN",
"senha" => "YOUR_PASSWORD",
Then, on line 66, replace ENDEREÇO with the address of the host where the script is deployed. If the server runs locally, use localhost.
header("Location: https://ENDEREÇO/dossie.php?id=".$_GET["id"]."&time=".time());
To run the script, open it in your browser and replace "ID" with the ID of the file you want to download:
https://localhost/dossie.php?id=ID
The tabelador.php script converts the HTML files generated by dossie.php into a CSV file. The resulting file contains information such as the code of the PDF related to each dossier or fund, the PDF's title, and the date on which it was created. The resulting file may require some minor manual editing.
To run the script, place it in the same folder as the dossier files you wish to convert, together with the file Html2Text.php, which can be found here: https://github.com/mtibben/html2text/blob/master/src/Html2Text.php
Then, use the following command:
php tabelador.php
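As a rough illustration of the HTML-to-CSV step, the sketch below reduces each page to plain text and writes one CSV row per page. It is not the code of tabelador.php: the function names, the "file" and "text" columns, and the sample input are hypothetical, and a simple tag-stripping regular expression stands in for Html2Text so that the example is self-contained. The real script extracts specific fields (code, title, date) rather than the whole text.

```php
<?php
// Hypothetical sketch: a simple regex stands in for Html2Text here.
function htmlToPlainText(string $html): string
{
    // Replace every tag with a space so adjacent elements do not run together.
    $text = preg_replace('/<[^>]+>/', ' ', $html) ?? '';
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse whitespace so each page fits in a single CSV cell.
    return trim(preg_replace('/\s+/', ' ', $text) ?? '');
}

// Takes [file name => HTML content] and returns the CSV as a string.
function pagesToCsv(array $pages): string
{
    $out = fopen('php://memory', 'w+');
    fputcsv($out, ['file', 'text'], ',', '"', '\\'); // assumed column layout
    foreach ($pages as $name => $html) {
        fputcsv($out, [$name, htmlToPlainText($html)], ',', '"', '\\');
    }
    rewind($out);
    $csv = stream_get_contents($out);
    fclose($out);

    return $csv;
}

// Example with a made-up page; real input would come from the saved HTML files.
echo pagesToCsv(['example.html' => '<h1>Title</h1><p>Created in 1970</p>']);
```

Writing to php://memory keeps the sketch free of side effects; a real run would open a file on disk and load the pages with file_get_contents.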
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
GNU General Public License v3.0
This application was developed by the Wiki Movimento Brasil User Group, supported by the University of São Paulo and the University of São Paulo Support Foundation.