The processed dataset contains 620k tweets with their corresponding coordinates.
Processing includes geocoding US cities (done with Nominatim) and checking that each location falls within the country.
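The exact country check is not spelled out in these notes; a minimal sketch, assuming a simple bounding-box filter for the contiguous US, could look like this:

```python
# Rough bounding box for the contiguous US. This is an illustrative
# assumption -- the actual country check in the pipeline may differ
# (e.g. it could use the geocoder's country field instead).
US_LAT_RANGE = (24.5, 49.4)
US_LON_RANGE = (-125.0, -66.9)

def in_contiguous_us(lat: float, lon: float) -> bool:
    """Return True if (lat, lon) falls inside the rough US bounding box."""
    return (US_LAT_RANGE[0] <= lat <= US_LAT_RANGE[1]
            and US_LON_RANGE[0] <= lon <= US_LON_RANGE[1])
```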
Train / val / test split: 80% / 10% / 10%; batch size = 64.
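The split above can be sketched as follows (a hypothetical helper; the actual split code is not shown in these notes):

```python
import random

def train_val_test_split(items, seed: int = 42):
    """Shuffle and split items into 80% train, 10% val, 10% test.
    Illustrative sketch; the seed and exact rounding are assumptions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```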
The task is to predict coordinates (latitude and longitude) from tweet texts.
To estimate the distance between predicted and true coordinates, the haversine distance is used.
It treats the Earth as a sphere with a fixed radius, which is its simplest representation.
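The haversine distance under this spherical approximation can be computed as:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

One degree of longitude at the equator is about 111.2 km, which gives a quick sanity check for the formula.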
- BERT tokenizer is used with max_length=32, truncation=True
- Only the BERT <CLS> token embedding is used
- It is fed through two linear layers, followed by a final linear regression layer
- Each hidden layer uses batch normalization
- ReLU is used as the activation function
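The regression head described above might be sketched as follows (hidden sizes are assumptions, since the original layer widths are not given):

```python
import torch
import torch.nn as nn

class CoordRegressor(nn.Module):
    """Regression head over the BERT [CLS] embedding: two hidden linear
    layers with batch normalization and ReLU, then a final linear layer
    predicting (lat, lon). Hidden size 256 is an illustrative assumption."""

    def __init__(self, bert_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(bert_dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # (lat, lon)
        )

    def forward(self, cls_emb: torch.Tensor) -> torch.Tensor:
        return self.net(cls_emb)
```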
- An autoencoder is used for dimensionality reduction
- Denoising architecture (with a configurable noise factor)
- BERT weights are frozen while training the autoencoder
- MSE loss is used for autoencoder training
- Both the encoder and decoder consist of two layers with ReLU activations
- Encoder states are saved during training and later reused in the regression model
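A minimal sketch of such a denoising autoencoder, assuming Gaussian input noise scaled by the noise factor and illustrative layer sizes (neither is specified in the notes):

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """Denoising autoencoder over frozen BERT [CLS] embeddings.
    Gaussian noise scaled by `noise_factor` corrupts the input;
    the decoder reconstructs the clean embedding. Sizes are assumptions."""

    def __init__(self, in_dim: int = 768, hidden: int = 256,
                 code_dim: int = 64, noise_factor: float = 0.1):
        super().__init__()
        self.noise_factor = noise_factor
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, code_dim), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, in_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        noisy = x + self.noise_factor * torch.randn_like(x)
        return self.decoder(self.encoder(noisy))

# Training minimizes nn.MSELoss()(model(x), x) against the clean input;
# afterwards the trained encoder can be reused by the regression model.
```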