
skohub-io/skohub-suggest


SkoHub Suggest

To provide suggestions for vocabularies, we use Annif. Annif has to be trained separately for every vocabulary, based on sample data. We are currently experimenting with the Hochschulfächersystematik. Since we do not have training data for that vocabulary, we mapped the Hochschulfächersystematik to the GND, for which plenty of data exists in lobid.org.
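The mapping idea can be illustrated with a minimal sketch. It is not the repository's `lobidtraining.js`, which is not shown here. The function names, the placeholder mapping, and the simplified record shape (`title` plus a `subject` array whose entries carry a GND URI in `id`) are assumptions made for illustration. The query syntax follows the lobid request used in this README, and the output line follows what we understand to be Annif's short text corpus format: text, a tab, then subject URIs in angle brackets.

```javascript
// Sketch only: names and data shapes below are assumptions, not the project's actual code.

// Build a lobid search URL for all records tagged with any of the given GND IDs,
// using the same query syntax as the curl command in this README.
function buildLobidQueryUrl(gndIds) {
  const idList = gndIds.map((id) => `"${id}"`).join(" , ");
  const q = `NOT type:PublicationIssue AND (${idList})`;
  const params = new URLSearchParams({ q, format: "jsonl" });
  return `https://lobid.org/resources/search?${params.toString()}`;
}

// Invert a mapping { hfsUri: [gndId, ...] } into Map(gndId -> Set(hfsUri)).
function invertMapping(hfsToGnd) {
  const gndToHfs = new Map();
  for (const [hfsUri, gndIds] of Object.entries(hfsToGnd)) {
    for (const gndId of gndIds) {
      if (!gndToHfs.has(gndId)) gndToHfs.set(gndId, new Set());
      gndToHfs.get(gndId).add(hfsUri);
    }
  }
  return gndToHfs;
}

// Turn one record into a training line "title<TAB><uri> <uri> ...".
// Returns null when the record has no title or no mapped subject.
function toTrainingLine(record, gndToHfs) {
  const uris = new Set();
  for (const subject of record.subject || []) {
    // Assumed: subject.id looks like "https://d-nb.info/gnd/<GND ID>".
    const gndId = (subject.id || "").split("/").pop();
    for (const uri of gndToHfs.get(gndId) || []) uris.add(uri);
  }
  if (!record.title || uris.size === 0) return null;
  // Collapse whitespace so tabs or newlines in titles cannot break the TSV.
  const text = record.title.replace(/\s+/g, " ").trim();
  return `${text}\t${[...uris].map((u) => `<${u}>`).join(" ")}`;
}
```

Inverting the mapping once up front keeps the per-record work to a lookup per subject, which matters when streaming a large JSONL download line by line.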

The approach we took is as follows:

  1. Download data that uses the GND concepts that have been mapped to the Hochschulfachersystematik:
$ curl --header "Accept-Encoding: gzip" 'https://lobid.org/resources/search?q=NOT+type%3APublicationIssue+AND+%28%224039457-8%22+%2C+%224166076-6%22+%2C+%224019838-8%22+%2C+%224041447-4%22+%2C+%224162468-3%22+%2C+%224184586-9%22+%2C+%224170208-6%22+%2C+%224121930-2%22+%2C+%227504419-5%22+%2C+%224066486-7%22+%2C+%224005924-8%22+%2C+%224130526-7%22+%2C+%224040802-4%22+%2C+%224048816-0%22+%2C+%224055864-2%22+%2C+%224056837-4%22+%2C+%224113615-9%22+%2C+%224164044-5%22+%2C+%224830608-3%22+%2C+%224049426-3%22+%2C+%224062992-2%22+%2C+%224196910-8%22+%2C+%224034873-8%22+%2C+%224116533-0%22+%2C+%224031888-6%22+%2C+%224148885-4%22+%2C+%224372194-1%22+%2C+%227593577-6%22+%2C+%224113791-7%22+%2C+%224129090-2%22+%2C+%224037944-9%22+%2C+%224152829-3%22+%2C+%224161557-8%22+%2C+%224169194-5%22+%2C+%224045705-9%22+%2C+%224020470-4%22+%2C+%224830611-3%22+%2C+%224122851-0%22+%2C+%224079184-1%22+%2C+%224036034-9%22+%2C+%224126242-6%22+%2C+%224153261-2%22+%2C+%224160760-0%22+%2C+%224061916-3%22+%2C+%224028532-7%22+%2C+%224055916-6%22+%2C+%224112736-5%22+%2C+%224129410-5%22+%2C+%224045791-6%22+%2C+%224241223-7%22+%2C+%224142845-6%22+%2C+%224160775-2%22+%2C+%224038243-6%22+%2C+%224122766-9%22+%2C+%224044302-4%22+%2C+%224120316-1%22+%2C+%224133320-2%22+%2C+%224014390-9%22+%2C+%224014346-6%22+%2C+%224025763-0%22+%2C+%224147095-3%22+%2C+%224126655-9%22+%2C+%224020759-6%22+%2C+%224155069-9%22+%2C+%224006801-8%22+%2C+%224466880-6%22+%2C+%224122614-8%22+%2C+%224611085-9%22+%2C+%224015875-5%22+%2C+%224020517-4%22+%2C+%224004955-3%22+%2C+%224114056-4%22+%2C+%224179595-7%22+%2C+%224052397-4%22+%2C+%224042178-8%22+%2C+%224079346-1%22+%2C+%224002718-1%22+%2C+%224062781-0%22+%2C+%227505710-4%22+%2C+%224026978-4%22+%2C+%227603053-2%22+%2C+%224517988-8%22+%2C+%224015428-2%22+%2C+%224112643-9%22+%2C+%224076570-2%22+%2C+%224051158-3%22+%2C+%224066528-8%22+%2C+%224038261-8%22+%2C+%224078943-3%22+%2C+%224794555-2%22+%2C+%224064700-6%22+%2C+%224392528-5%22+%2C+%224034901-9%22+%2C+%224012435-6%22+%2C+%224241
276-6%22+%2C+%224120314-8%22+%2C+%224038953-4%22+%2C+%224077608-6%22+%2C+%224006851-1%22+%2C+%224039207-7%22+%2C+%224061644-7%22+%2C+%224056415-0%22+%2C+%224130615-6%22+%2C+%224129951-6%22+%2C+%224073499-7%22+%2C+%224141476-7%22+%2C+%227503709-9%22+%2C+%224120632-0%22+%2C+%224037220-0%22+%2C+%224009816-3%22+%2C+%224378168-8%22+%2C+%224017212-0%22+%2C+%224146518-0%22+%2C+%2244236073-0%22+%2C+%224021845-4%22+%2C+%224047704-6%22+%2C+%224026926-7%22+%2C+%224032950-1%22+%2C+%224051038-4%22+%2C+%224002046-0%22+%2C+%224129566-3%22+%2C+%224177403-6%22+%2C+%224030198-9%22+%2C+%224065136-8%22+%2C+%224066510-0%22+%2C+%224162713-1%22+%2C+%224139318-1%22+%2C+%224078951-2%22+%2C+%227508942-7%22+%2C+%224207188-4%22+%2C+%224188027-4%22+%2C+%224178059-0%22+%2C+%224003311-9%22+%2C+%224003326-0%22+%2C+%224064821-7%22+%2C+%224113411-4%22+%2C+%224134381-5%22+%2C+%224142081-0%22+%2C+%224172819-1%22+%2C+%224061084-6%22+%2C+%227502913-3%22+%2C+%224185058-0%22+%2C+%224002859-8%22+%2C+%224006777-4%22+%2C+%221033784729%22+%2C+%224185935-2%22+%2C+%224113594-5%22+%2C+%224040843-7%22+%2C+%224127822-7%22+%2C+%224003910-9%22+%2C+%224016654-5%22+%2C+%224704302-7%22+%2C+%224029823-1%22+%2C+%224020202-1%22+%2C+%224183231-0%22+%2C+%224191032-1%22+%2C+%224079454-4%22+%2C+%224065590-8%22+%2C+%224117223-1%22+%2C+%224168244-0%22+%2C+%224177061-4%22+%2C+%224030005-5%22+%2C+%224162398-8%22+%2C+%224137304-2%22+%2C+%224020216-1%22+%2C+%224576777-4%22+%2C+%224056442-3%22+%2C+%224064324-4%22+%2C+%224056995-0%22+%2C+%224072819-5%22+%2C+%224161576-1%22+%2C+%224069402-1%22+%2C+%224113262-2%22+%2C+%224137670-5%22+%2C+%227503845-6%22+%2C+%224238812-0%22+%2C+%224057808-2%22+%2C+%224005614-4%22+%2C+%224059643-6%22+%2C+%224046277-8%22+%2C+%224020288-4%22+%2C+%224050484-0%22+%2C+%224055768-6%22+%2C+%224066446-6%22+%2C+%224115348-0%22+%2C+%224249464-3%22+%2C+%224044375-9%22+%2C+%224020227-6%22+%2C+%224875411-0%22+%2C+%224172951-1%22+%2C+%224507887-7%22+%2C+%224647152-2%22+%2C+%224039289-2%22+%2C+%224251085-5%22+%2C+%2240
02851-3%22+%2C+%224037790-8%22+%2C+%224030736-0%22+%2C+%224006439-6%22+%2C+%224827059-3%22+%2C+%224025684-4%22+%2C+%224113292-0%22+%2C+%224020383-9%22+%2C+%224000107-6%22+%2C+%224535278-1%22+%2C+%224136254-8%22+%2C+%224444561-1%22+%2C+%224030270-2%22+%2C+%224159290-6%22+%2C+%224048590-0%22+%2C+%224047770-8%22+%2C+%224120588-1%22+%2C+%224077478-8%22+%2C+%224033637-2%22+%2C+%224349723-8%22+%2C+%224019558-2%22+%2C+%227506104-1%22+%2C+%224045956-1%22+%2C+%224014725-3%22+%2C+%224062803-6%22+%2C+%224046595-0%22+%2C+%224069491-4%22+%2C+%224066472-7%22+%2C+%224146998-7%22+%2C+%224062901-6%22+%2C+%224077208-1%22+%2C+%224055676-1%22+%2C+%224345139-1%22+%2C+%224130930-3%22+%2C+%224006465-7%22+%2C+%224035843-4%22+%2C+%224020252-5%22+%2C+%224142196-6%22+%2C+%224310260-8%22+%2C+%224130392-1%22+%2C+%224076229-4%22+%2C+%224068598-6%22+%2C+%224043687-1%22+%2C+%224019294-5%22+%2C+%224137703-5%22+%2C+%224078931-7%22+%2C+%224025677-7%22+%2C+%224074685-9%22+%2C+%224291279-9%22+%2C+%224055287-1%22+%2C+%224078315-7%22+%2C+%224186473-6%22+%2C+%224221617-5%22+%2C+%224061650-2%22+%2C+%224114364-4%22+%2C+%224077640-2%22+%2C+%224158406-5%22+%2C+%224127385-0%22+%2C+%221072438291%22+%2C+%224329079-6%22+%2C+%224077624-4%22+%2C+%224078937-8%22+%2C+%224120278-8%22+%2C+%224131523-6%22+%2C+%224196734-3%22+%2C+%224112526-5%22+%2C+%224015602-3%22+%2C+%224034350-9%22+%2C+%224034265-7%22+%2C+%224137926-3%22+%2C+%224201431-1%22+%2C+%224038872-4%22+%2C+%224077513-6%22+%2C+%224049540-1%22+%2C+%224124801-6%22+%2C+%224661954-9%22+%2C+%224068473-8%22+%2C+%224005878-5%22+%2C+%224499546-5%22+%2C+%224047340-5%22+%2C+%224072788-9%22+%2C+%224026894-9%22+%2C+%224012656-0%22+%2C+%224128313-2%22+%2C+%224016825-6%22+%2C+%224017102-4%22+%2C+%224055905-1%22+%2C+%224162796-9%22+%2C+%224023922-6%22+%2C+%224002827-6%22+%2C+%224257369-5%22+%2C+%224114040-0%22%29&format=jsonl' > hfs-train.gz
  2. Extract the title strings and write them in Annif's short text document corpus format:
$ zcat hfs-train.gz | node lobidtraining.js > hfs-train.tsv
  3. Create a configuration section in projects.cfg:
[hfs-de]
name=HFS
language=de
backend=tfidf
vocab=hfs-de
analyzer=snowball(german)
  4. Load the vocabulary:
$ wget https://raw.githubusercontent.com/dini-ag-kim/hochschulfaechersystematik/addGndMatches/hochschulfaechersystematik.ttl
$ annif loadvoc hfs-de hochschulfaechersystematik.ttl
  5. Train and test:
$ annif train hfs-de hfs-train.tsv
$ echo "Was ist Statistik?" | annif suggest hfs-de
<https://w3id.org/kim/hochschulfaechersystematik/n237>  Mathematische Statistik/Wahrscheinlichkeitsrechnung     0.47248679399490356
<https://w3id.org/kim/hochschulfaechersystematik/n020>  Bergbau/Bergtechnik     0.03955048695206642
<https://w3id.org/kim/hochschulfaechersystematik/n148>  Sozialwissenschaft      0.034819480031728745
<https://w3id.org/kim/hochschulfaechersystematik/n161>  Diakoniewissenschaft    0.03423655405640602
<https://w3id.org/kim/hochschulfaechersystematik/n078>  Indologie       0.028748050332069397
<https://w3id.org/kim/hochschulfaechersystematik/n371>  Tierproduktion  0.02805497497320175
<https://w3id.org/kim/hochschulfaechersystematik/n050>  Geographie/Erdkunde     0.027537360787391663
<https://w3id.org/kim/hochschulfaechersystematik/n544>  Evang. Religionspädagogik, kirchliche Bildungsarbeit    0.02738470770418644
<https://w3id.org/kim/hochschulfaechersystematik/n184>  Wirtschaftswissenschaften       0.0272254329174757
<https://w3id.org/kim/hochschulfaechersystematik/n361>  Schulpädagogik  0.02666403166949749
  6. Start the API and query it:
$ annif run --host 0.0.0.0
$ curl -X POST --header 'Content-Type: application/x-www-form-urlencoded' --header 'Accept: application/json' -d 'text=Was ist Statistik?&limit=1' 'http://127.0.0.1:5000/v1/projects/hfs-de/suggest'
{
  "results": [
    {
      "label": "Mathematische Statistik/Wahrscheinlichkeitsrechnung",
      "score": 0.47248679399490356,
      "uri": "https://w3id.org/kim/hochschulfaechersystematik/n237"
    }
  ]
}
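The same request can be made from Node instead of curl. The following is a minimal sketch, assuming Node 18 or later (for the global `fetch`) and an Annif instance running locally as started above. The helper names `buildSuggestRequest` and `suggest` are hypothetical. The endpoint path, form fields, and response shape are taken from the curl example in this README.

```javascript
// Sketch only: helper names are hypothetical; endpoint and fields mirror the curl call above.

// Build the URL and fetch options for a suggest request without sending it.
function buildSuggestRequest(baseUrl, projectId, text, limit) {
  const body = new URLSearchParams({ text, limit: String(limit) });
  return {
    url: `${baseUrl}/v1/projects/${encodeURIComponent(projectId)}/suggest`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/x-www-form-urlencoded",
        Accept: "application/json",
      },
      body: body.toString(),
    },
  };
}

// Send the request and return the "results" array from the JSON response.
async function suggest(baseUrl, projectId, text, limit = 10) {
  const { url, options } = buildSuggestRequest(baseUrl, projectId, text, limit);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Annif request failed with HTTP status ${response.status}`);
  }
  const data = await response.json();
  return data.results;
}

// Usage (requires a running Annif instance):
// suggest("http://127.0.0.1:5000", "hfs-de", "Was ist Statistik?", 1)
//   .then((results) => results.forEach((r) => console.log(r.uri, r.label, r.score)));
```

Keeping request construction separate from the network call makes the encoding easy to inspect without a running server.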
