Annif is used to provide subject suggestions for vocabularies. Annif has to be trained separately for every vocabulary on sample data; we are currently experimenting with the Hochschulfächersystematik. Since we do not have training data for that vocabulary, we mapped the Hochschulfächersystematik to the GND, for which plenty of data exists in lobid.org.
The approach we took is as follows:
- Download data that uses the GND concepts that have been mapped to the Hochschulfachersystematik:
$ curl --header "Accept-Encoding: gzip" 'https://lobid.org/resources/search?q=NOT+type%3APublicationIssue+AND+%28%224039457-8%22+%2C+%224166076-6%22+%2C+%224019838-8%22+%2C+%224041447-4%22+%2C+%224162468-3%22+%2C+%224184586-9%22+%2C+%224170208-6%22+%2C+%224121930-2%22+%2C+%227504419-5%22+%2C+%224066486-7%22+%2C+%224005924-8%22+%2C+%224130526-7%22+%2C+%224040802-4%22+%2C+%224048816-0%22+%2C+%224055864-2%22+%2C+%224056837-4%22+%2C+%224113615-9%22+%2C+%224164044-5%22+%2C+%224830608-3%22+%2C+%224049426-3%22+%2C+%224062992-2%22+%2C+%224196910-8%22+%2C+%224034873-8%22+%2C+%224116533-0%22+%2C+%224031888-6%22+%2C+%224148885-4%22+%2C+%224372194-1%22+%2C+%227593577-6%22+%2C+%224113791-7%22+%2C+%224129090-2%22+%2C+%224037944-9%22+%2C+%224152829-3%22+%2C+%224161557-8%22+%2C+%224169194-5%22+%2C+%224045705-9%22+%2C+%224020470-4%22+%2C+%224830611-3%22+%2C+%224122851-0%22+%2C+%224079184-1%22+%2C+%224036034-9%22+%2C+%224126242-6%22+%2C+%224153261-2%22+%2C+%224160760-0%22+%2C+%224061916-3%22+%2C+%224028532-7%22+%2C+%224055916-6%22+%2C+%224112736-5%22+%2C+%224129410-5%22+%2C+%224045791-6%22+%2C+%224241223-7%22+%2C+%224142845-6%22+%2C+%224160775-2%22+%2C+%224038243-6%22+%2C+%224122766-9%22+%2C+%224044302-4%22+%2C+%224120316-1%22+%2C+%224133320-2%22+%2C+%224014390-9%22+%2C+%224014346-6%22+%2C+%224025763-0%22+%2C+%224147095-3%22+%2C+%224126655-9%22+%2C+%224020759-6%22+%2C+%224155069-9%22+%2C+%224006801-8%22+%2C+%224466880-6%22+%2C+%224122614-8%22+%2C+%224611085-9%22+%2C+%224015875-5%22+%2C+%224020517-4%22+%2C+%224004955-3%22+%2C+%224114056-4%22+%2C+%224179595-7%22+%2C+%224052397-4%22+%2C+%224042178-8%22+%2C+%224079346-1%22+%2C+%224002718-1%22+%2C+%224062781-0%22+%2C+%227505710-4%22+%2C+%224026978-4%22+%2C+%227603053-2%22+%2C+%224517988-8%22+%2C+%224015428-2%22+%2C+%224112643-9%22+%2C+%224076570-2%22+%2C+%224051158-3%22+%2C+%224066528-8%22+%2C+%224038261-8%22+%2C+%224078943-3%22+%2C+%224794555-2%22+%2C+%224064700-6%22+%2C+%224392528-5%22+%2C+%224034901-9%22+%2C+%224012435-6%22+%2C+%224241276-6%22+%2C+%224120314-8%22+%2C+%224038953-4%22+%2C+%224077608-6%22+%2C+%224006851-1%22+%2C+%224039207-7%22+%2C+%224061644-7%22+%2C+%224056415-0%22+%2C+%224130615-6%22+%2C+%224129951-6%22+%2C+%224073499-7%22+%2C+%224141476-7%22+%2C+%227503709-9%22+%2C+%224120632-0%22+%2C+%224037220-0%22+%2C+%224009816-3%22+%2C+%224378168-8%22+%2C+%224017212-0%22+%2C+%224146518-0%22+%2C+%2244236073-0%22+%2C+%224021845-4%22+%2C+%224047704-6%22+%2C+%224026926-7%22+%2C+%224032950-1%22+%2C+%224051038-4%22+%2C+%224002046-0%22+%2C+%224129566-3%22+%2C+%224177403-6%22+%2C+%224030198-9%22+%2C+%224065136-8%22+%2C+%224066510-0%22+%2C+%224162713-1%22+%2C+%224139318-1%22+%2C+%224078951-2%22+%2C+%227508942-7%22+%2C+%224207188-4%22+%2C+%224188027-4%22+%2C+%224178059-0%22+%2C+%224003311-9%22+%2C+%224003326-0%22+%2C+%224064821-7%22+%2C+%224113411-4%22+%2C+%224134381-5%22+%2C+%224142081-0%22+%2C+%224172819-1%22+%2C+%224061084-6%22+%2C+%227502913-3%22+%2C+%224185058-0%22+%2C+%224002859-8%22+%2C+%224006777-4%22+%2C+%221033784729%22+%2C+%224185935-2%22+%2C+%224113594-5%22+%2C+%224040843-7%22+%2C+%224127822-7%22+%2C+%224003910-9%22+%2C+%224016654-5%22+%2C+%224704302-7%22+%2C+%224029823-1%22+%2C+%224020202-1%22+%2C+%224183231-0%22+%2C+%224191032-1%22+%2C+%224079454-4%22+%2C+%224065590-8%22+%2C+%224117223-1%22+%2C+%224168244-0%22+%2C+%224177061-4%22+%2C+%224030005-5%22+%2C+%224162398-8%22+%2C+%224137304-2%22+%2C+%224020216-1%22+%2C+%224576777-4%22+%2C+%224056442-3%22+%2C+%224064324-4%22+%2C+%224056995-0%22+%2C+%224072819-5%22+%2C+%224161576-1%22+%2C+%224069402-1%22+%2C+%224113262-2%22+%2C+%224137670-5%22+%2C+%227503845-6%22+%2C+%224238812-0%22+%2C+%224057808-2%22+%2C+%224005614-4%22+%2C+%224059643-6%22+%2C+%224046277-8%22+%2C+%224020288-4%22+%2C+%224050484-0%22+%2C+%224055768-6%22+%2C+%224066446-6%22+%2C+%224115348-0%22+%2C+%224249464-3%22+%2C+%224044375-9%22+%2C+%224020227-6%22+%2C+%224875411-0%22+%2C+%224172951-1%22+%2C+%224507887-7%22+%2C+%224647152-2%22+%2C+%224039289-2%22+%2C+%224251085-5%22+%2C+%224002851-3%22+%2C+%224037790-8%22+%2C+%224030736-0%22+%2C+%224006439-6%22+%2C+%224827059-3%22+%2C+%224025684-4%22+%2C+%224113292-0%22+%2C+%224020383-9%22+%2C+%224000107-6%22+%2C+%224535278-1%22+%2C+%224136254-8%22+%2C+%224444561-1%22+%2C+%224030270-2%22+%2C+%224159290-6%22+%2C+%224048590-0%22+%2C+%224047770-8%22+%2C+%224120588-1%22+%2C+%224077478-8%22+%2C+%224033637-2%22+%2C+%224349723-8%22+%2C+%224019558-2%22+%2C+%227506104-1%22+%2C+%224045956-1%22+%2C+%224014725-3%22+%2C+%224062803-6%22+%2C+%224046595-0%22+%2C+%224069491-4%22+%2C+%224066472-7%22+%2C+%224146998-7%22+%2C+%224062901-6%22+%2C+%224077208-1%22+%2C+%224055676-1%22+%2C+%224345139-1%22+%2C+%224130930-3%22+%2C+%224006465-7%22+%2C+%224035843-4%22+%2C+%224020252-5%22+%2C+%224142196-6%22+%2C+%224310260-8%22+%2C+%224130392-1%22+%2C+%224076229-4%22+%2C+%224068598-6%22+%2C+%224043687-1%22+%2C+%224019294-5%22+%2C+%224137703-5%22+%2C+%224078931-7%22+%2C+%224025677-7%22+%2C+%224074685-9%22+%2C+%224291279-9%22+%2C+%224055287-1%22+%2C+%224078315-7%22+%2C+%224186473-6%22+%2C+%224221617-5%22+%2C+%224061650-2%22+%2C+%224114364-4%22+%2C+%224077640-2%22+%2C+%224158406-5%22+%2C+%224127385-0%22+%2C+%221072438291%22+%2C+%224329079-6%22+%2C+%224077624-4%22+%2C+%224078937-8%22+%2C+%224120278-8%22+%2C+%224131523-6%22+%2C+%224196734-3%22+%2C+%224112526-5%22+%2C+%224015602-3%22+%2C+%224034350-9%22+%2C+%224034265-7%22+%2C+%224137926-3%22+%2C+%224201431-1%22+%2C+%224038872-4%22+%2C+%224077513-6%22+%2C+%224049540-1%22+%2C+%224124801-6%22+%2C+%224661954-9%22+%2C+%224068473-8%22+%2C+%224005878-5%22+%2C+%224499546-5%22+%2C+%224047340-5%22+%2C+%224072788-9%22+%2C+%224026894-9%22+%2C+%224012656-0%22+%2C+%224128313-2%22+%2C+%224016825-6%22+%2C+%224017102-4%22+%2C+%224055905-1%22+%2C+%224162796-9%22+%2C+%224023922-6%22+%2C+%224002827-6%22+%2C+%224257369-5%22+%2C+%224114040-0%22%29&format=jsonl' > hfs-train.gz
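The query string in the command above is long because every mapped GND identifier is listed explicitly. A minimal sketch of how such a URL could be assembled from a list of identifiers, assuming Node.js with its built-in URLSearchParams; the function name buildLobidUrl is made up for illustration, and only the first two identifiers are passed in:

```javascript
// Hypothetical helper: builds a lobid.org search URL for a list of GND identifiers.
function buildLobidUrl(gndIds) {
  // Quote each identifier and join with " , " as in the curl command above.
  const ids = gndIds.map((id) => `"${id}"`).join(' , ');
  const q = `NOT type:PublicationIssue AND (${ids})`;
  // URLSearchParams applies form encoding: spaces become "+", quotes "%22", etc.
  const params = new URLSearchParams({ q, format: 'jsonl' });
  return `https://lobid.org/resources/search?${params}`;
}

console.log(buildLobidUrl(['4039457-8', '4166076-6']));
```

Generating the URL this way keeps the identifier list in one place, so it can be regenerated whenever the mapping changes.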
- Extract the title strings and write them in Annif's short text document corpus format:
$ zcat hfs-train.gz | node lobidtraining.js > hfs-train.tsv
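The script lobidtraining.js is not reproduced here. A rough sketch of what such a conversion step might look like follows. It assumes that each lobid record has a title string and a subject array whose entries carry a GND URI in id, that a lookup table from GND identifiers to vocabulary URIs is available, and that the corpus format is one document per line (text, a tab, then subject URIs in angle brackets). The names toCorpusLine and gndToHfs are made up for illustration, and the single table entry is a placeholder, not a confirmed match:

```javascript
// Placeholder lookup table (GND identifier -> vocabulary URI); a real table
// would be derived from the mapping in the vocabulary file.
const gndToHfs = new Map([
  ['4056995-0', 'https://w3id.org/kim/hochschulfaechersystematik/n237'],
]);

// Turns one record into a corpus line: text, a tab, then the mapped subject
// URIs in angle brackets separated by spaces.
// Returns null when the record has no title or no mapped subject.
function toCorpusLine(record, mapping) {
  if (typeof record.title !== 'string') return null;
  const uris = new Set();
  for (const subject of record.subject || []) {
    // Assumed shape: subject.id looks like "https://d-nb.info/gnd/4056995-0".
    const gndId = String(subject.id || '').split('/').pop();
    if (mapping.has(gndId)) uris.add(`<${mapping.get(gndId)}>`);
  }
  if (uris.size === 0) return null;
  // Tabs and line breaks inside the title would break the TSV layout.
  const text = record.title.replace(/\s+/g, ' ').trim();
  return `${text}\t${[...uris].join(' ')}`;
}

const sample = {
  title: 'Grundlagen der\nStatistik',
  subject: [{ id: 'https://d-nb.info/gnd/4056995-0' }],
};
console.log(toCorpusLine(sample, gndToHfs));
```

To use this in the pipe above, each line read from standard input would be parsed with JSON.parse and every non-null result printed on its own line.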
- Create a configuration section in projects.cfg:
[hfs-de]
name=HFS
language=de
backend=tfidf
vocab=hfs-de
analyzer=snowball(german)
- Load the vocabulary:
$ wget https://raw.githubusercontent.com/dini-ag-kim/hochschulfaechersystematik/addGndMatches/hochschulfaechersystematik.ttl
$ annif loadvoc hfs-de hochschulfaechersystematik.ttl
- Train and test:
$ annif train hfs-de hfs-train.tsv
$ echo "Was ist Statistik?" | annif suggest hfs-de
<https://w3id.org/kim/hochschulfaechersystematik/n237> Mathematische Statistik/Wahrscheinlichkeitsrechnung 0.47248679399490356
<https://w3id.org/kim/hochschulfaechersystematik/n020> Bergbau/Bergtechnik 0.03955048695206642
<https://w3id.org/kim/hochschulfaechersystematik/n148> Sozialwissenschaft 0.034819480031728745
<https://w3id.org/kim/hochschulfaechersystematik/n161> Diakoniewissenschaft 0.03423655405640602
<https://w3id.org/kim/hochschulfaechersystematik/n078> Indologie 0.028748050332069397
<https://w3id.org/kim/hochschulfaechersystematik/n371> Tierproduktion 0.02805497497320175
<https://w3id.org/kim/hochschulfaechersystematik/n050> Geographie/Erdkunde 0.027537360787391663
<https://w3id.org/kim/hochschulfaechersystematik/n544> Evang. Religionspädagogik, kirchliche Bildungsarbeit 0.02738470770418644
<https://w3id.org/kim/hochschulfaechersystematik/n184> Wirtschaftswissenschaften 0.0272254329174757
<https://w3id.org/kim/hochschulfaechersystematik/n361> Schulpädagogik 0.02666403166949749
- Start the API and query it:
$ annif run --host 0.0.0.0
$ curl -X POST --header 'Content-Type: application/x-www-form-urlencoded' --header 'Accept: application/json' -d 'text=Was ist Statistik?&limit=1' 'http://127.0.0.1:5000/v1/projects/hfs-de/suggest'
{
"results": [
{
"label": "Mathematische Statistik/Wahrscheinlichkeitsrechnung",
"score": 0.47248679399490356,
"uri": "https://w3id.org/kim/hochschulfaechersystematik/n237"
}
]
}
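
The same request can also be issued from code. A minimal sketch, assuming Node.js 18 or later (for the built-in fetch) and an Annif instance listening on 127.0.0.1:5000 as started above; the helper names buildSuggestRequest and suggest are made up for illustration:

```javascript
// Hypothetical helper: prepares the form-encoded POST for the suggest endpoint.
function buildSuggestRequest(baseUrl, projectId, text, limit) {
  return {
    url: `${baseUrl}/v1/projects/${encodeURIComponent(projectId)}/suggest`,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded',
        Accept: 'application/json',
      },
      // Form-encode the parameters, e.g. "text=Was+ist+Statistik%3F&limit=1".
      body: new URLSearchParams({ text, limit: String(limit) }).toString(),
    },
  };
}

// Sends the request and returns the "results" array from the JSON response.
async function suggest(baseUrl, projectId, text, limit = 10) {
  const { url, options } = buildSuggestRequest(baseUrl, projectId, text, limit);
  const response = await fetch(url, options);
  if (!response.ok) throw new Error(`Annif returned HTTP ${response.status}`);
  return (await response.json()).results;
}

// Example call (requires the API to be running):
// suggest('http://127.0.0.1:5000', 'hfs-de', 'Was ist Statistik?', 1).then(console.log);
```

Keeping the request construction separate from the network call makes it possible to check the encoded body without a running server.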