Dataset | Paper | PPT | Presentation
Code and dataset for Bhola et al. (2020), "Retrieving Skills from Job Descriptions: A Language Model Based Extreme Multi-label Classification Framework".
Default model weights and the dataset are available via the links above.
The dataset was collected from the Singapore government job portal mycareersfuture.sg and consists of over 20,000 richly structured job posts. Detailed statistics of the dataset are shown below:
| MyCareersFuture.sg dataset | Stats |
|---|---|
| Number of job posts | 20,298 |
| Number of distinct skills | 2,548 |
| Number of skills with 20 or more mentions | 1,209 |
| Average skill tags per job post | 19.98 |
| Average token count per job post | 162.27 |
| Maximum token count in a job post | 1,127 |
This dataset includes the following fields (a minimal loading sketch follows the list):
- company_name
- job_title
- employment_type
- seniority
- job_category
- location
- salary
- min_experience
- skills_required
- requirements_and_role
- job_requirements
- company_info
- posting_date
- expiry_date
- no_of_applications
- job_id
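For orientation, here is a minimal sketch of loading the dataset and inspecting its skill tags. The file name `job_dataset.csv`, the CSV format, and the stringified-list encoding of `skills_required` are assumptions; adapt them to the files distributed in the `dataset` folder.

```python
# Minimal sketch (assumptions: a CSV export named "job_dataset.csv" with the
# fields listed above, and skills_required stored as a stringified list).
import ast
import pandas as pd

df = pd.read_csv("job_dataset.csv")

# Parse the per-post skill tags, e.g. "['python', 'sql']" -> ['python', 'sql'].
df["skills_required"] = df["skills_required"].apply(ast.literal_eval)

print("Number of job posts:", len(df))
print("Distinct skills:", len({s for skills in df["skills_required"] for s in skills}))
print("Average skill tags per post:", df["skills_required"].apply(len).mean())
```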
The proposed model consists of a pre-trained BERT-based text encoder that uses WordPiece embeddings. The encoded textual representation is passed into a bottleneck layer, which alleviates overfitting by significantly limiting the number of trainable parameters. The activations are then passed through a fully connected layer, and probability scores for each skill are finally produced with a sigmoid activation function.
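The sketch below illustrates that architecture (BERT encoder → bottleneck → fully connected classifier → per-skill sigmoid). It is not the repository's exact implementation; the bottleneck size, dropout rate, and use of the Hugging Face `transformers` API are assumptions.

```python
# Illustrative sketch: BERT encoder -> bottleneck -> fully connected layer -> sigmoid.
# Bottleneck size (64) and dropout (0.1) are assumptions, not the repo's actual values.
import torch
import torch.nn as nn
from transformers import BertModel

class BertXMLC(nn.Module):
    def __init__(self, num_skills: int, bottleneck_dim: int = 64):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size            # 768 for bert-base
        self.bottleneck = nn.Linear(hidden, bottleneck_dim)
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(bottleneck_dim, num_skills)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        z = self.dropout(torch.relu(self.bottleneck(out.pooler_output)))
        return torch.sigmoid(self.classifier(z))         # independent probability per skill
```

In practice the sigmoid would typically be kept out of the model and the raw logits trained with `nn.BCEWithLogitsLoss`, one binary target per skill.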
Setup

- Run `bash setup.sh`

or

- Transfer all files from the `checkpoint` folder (in Google Drive) to the `pybert/pretrain/bert/bert-uncased` folder
- Transfer dataset files from the `dataset` folder (in Google Drive) to the `pybert/dataset` folder
Training

`python run_bert.py --train --data_name job_dataset`

Testing

`python run_bert.py --test --data_name job_dataset`
Note: Configurations for training, validation, and testing of the model are provided in `pybert/configs/basic_config.py`.
Additionally, `pybert/model_setup/CAB_dataset_script.py` is provided to implement Correlation Aware Bootstrapping (CAB).
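As described in the paper's abstract (cited below), CAB augments a post's skill labels using the structured semantic representation of skills and their co-occurrences. The sketch below only illustrates the co-occurrence-counting idea behind it; the function names, the conditional-probability threshold, and the augmentation rule are assumptions, not the script's actual logic.

```python
# Illustrative sketch of skill co-occurrence statistics for Correlation Aware
# Bootstrapping (CAB): count how often skill pairs are tagged together, then
# propose strongly correlated skills missing from a post's label set.
# The 0.5 threshold is an arbitrary assumption for illustration.
from collections import Counter
from itertools import combinations

def cooccurrence_counts(skill_sets):
    pair_counts, skill_counts = Counter(), Counter()
    for skills in skill_sets:
        skill_counts.update(set(skills))
        pair_counts.update(combinations(sorted(set(skills)), 2))
    return pair_counts, skill_counts

def propose_missing_skills(skills, pair_counts, skill_counts, threshold=0.5):
    """Suggest skills whose co-occurrence rate with an already-tagged skill exceeds the threshold."""
    skills, proposals = set(skills), set()
    for (a, b), count in pair_counts.items():
        for tagged, other in ((a, b), (b, a)):
            if tagged in skills and other not in skills and count / skill_counts[tagged] >= threshold:
                proposals.add(other)
    return proposals
```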
Docker image deployment

To create a Docker deployment:

- First download and set up all the dataset files and pre-trained model weights in the `pybert` dir
- Run `docker build .` to create the Docker image
- To start training with the default config, run `docker run <docker_image_name>`
- For testing and further operations, execute commands in a `/bin/bash` terminal in the container: `docker run -it <docker_image_name> /bin/bash`
Experimental results on the skill prediction task are reported in the paper cited below.

Note: The model has been further fine-tuned.
@inproceedings{bhola-etal-2020-retrieving,
title = "Retrieving Skills from Job Descriptions: A Language Model Based Extreme Multi-label Classification Framework",
author = "Bhola, Akshay and
Halder, Kishaloy and
Prasad, Animesh and
Kan, Min-Yen",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.coling-main.513",
doi = "10.18653/v1/2020.coling-main.513",
pages = "5832--5842",
abstract = "We introduce a deep learning model to learn the set of enumerated job skills associated with a job description. In our analysis of a large-scale government job portal mycareersfuture.sg, we observe that as much as 65{\%} of job descriptions miss describing a significant number of relevant skills. Our model addresses this task from the perspective of an extreme multi-label classification (XMLC) problem, where descriptions are the evidence for the binary relevance of thousands of individual skills. Building upon the current state-of-the-art language modeling approaches such as BERT, we show our XMLC method improves on an existing baseline solution by over 9{\%} and 7{\%} absolute improvements in terms of recall and normalized discounted cumulative gain. We further show that our approach effectively addresses the missing skills problem, and helps in recovering relevant skills that were missed out in the job postings by taking into account the structured semantic representation of skills and their co-occurrences through a Correlation Aware Bootstrapping process. To facilitate future research and replication of our work, we have made the dataset and the implementation of our model publicly available.",
}