This document describes the solution: the model used for training and inference, the postprocessing applied to its predictions, and the ablation study conducted.
The training data consists of two components:
- Organizers' Data: This refers to the initial version of the dataset provided by the organizers. It includes the train and validation splits.
- UAVid: During the training experiments, it was discovered that incorporating UAVid significantly improved the accuracy of the model, particularly for the person class.
The PIDNet model was used as the baseline. It is known for its good balance between speed and quality. [paper, code]
One modification was made to the model architecture: the last convolutional layer, together with the resizing operation that follows it, was replaced with a transposed convolution. This change improved both the accuracy of the model and the speed of the pipeline.
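To illustrate the idea, here is a minimal PyTorch sketch of such a head. It is not the solution's actual code: the class name `TransposedConvHead`, the channel count, the number of classes, and the upsampling factor of 8 are all assumptions chosen for illustration.

```python
import torch
from torch import nn


class TransposedConvHead(nn.Module):
    """Hypothetical head: a single transposed convolution maps low-resolution
    features directly to full-resolution class logits, standing in for a
    final convolution followed by a separate resize."""

    def __init__(self, in_channels: int, num_classes: int, scale: int = 8):
        super().__init__()
        # With kernel_size == stride, each input pixel expands into its own
        # scale x scale output patch, so the output is exactly `scale` times larger.
        self.up = nn.ConvTranspose2d(
            in_channels, num_classes, kernel_size=scale, stride=scale
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.up(features)


# Assumed sizes: 128 feature channels, 8 classes, features at 1/8 resolution.
head = TransposedConvHead(in_channels=128, num_classes=8)
features = torch.randn(1, 128, 16, 32)
logits = head(features)
```

With these assumed sizes, a 16x32 feature map becomes a 128x256 logit map in one learned operation, so no separate interpolation step is needed afterwards.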
The postprocessing step applies several heuristics derived from observations of model predictions, real data, and validation metrics. The main heuristics are as follows:
- Cracks/Fissures/Subsidence: If cracks, fissures, or subsidence are detected in the predicted image, the pixels predicted as lava_flow are assigned to the background class. In the data, the lava_flow class did not occur together with these anomalies, yet the model often predicted it in such images. This heuristic removes those mistakes.
- Class Occupancy: If any class occupies fewer than 500 pixels in the image, the pixels of that class are assigned to the background. This heuristic was developed after analyzing how the metrics behaved on the model's predictions: it is better to drop a prediction covering a small area than to predict the wrong class.
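The two heuristics above can be sketched with NumPy as follows. This is a sketch under stated assumptions, not the solution's code: the class ids, the function name `postprocess`, and the order in which the two rules are applied (the order in which they are listed) are hypothetical. Only the 500-pixel threshold comes from the description above.

```python
import numpy as np

# Hypothetical class ids; the real label map is not given in this document.
BACKGROUND = 0
LAVA_FLOW = 1
ANOMALY_IDS = (2,)  # assumed id(s) for cracks / fissures / subsidence
MIN_PIXELS = 500    # threshold stated in the description


def postprocess(mask: np.ndarray) -> np.ndarray:
    """Apply both heuristics to a 2D array of predicted class ids."""
    out = mask.copy()

    # Heuristic 1: if any anomaly pixel is predicted, drop lava_flow pixels.
    if np.isin(out, ANOMALY_IDS).any():
        out[out == LAVA_FLOW] = BACKGROUND

    # Heuristic 2: drop every non-background class covering too few pixels.
    class_ids, counts = np.unique(out, return_counts=True)
    for class_id, count in zip(class_ids, counts):
        if class_id != BACKGROUND and count < MIN_PIXELS:
            out[out == class_id] = BACKGROUND

    return out


# Small usage example on a synthetic 100x100 prediction.
pred = np.zeros((100, 100), dtype=np.int64)
pred[0:30, 0:30] = LAVA_FLOW   # 900 lava_flow pixels
pred[50:80, 50:80] = 2         # 900 anomaly pixels
pred[90:95, 90:95] = 5         # 25 pixels of some other class
cleaned = postprocess(pred)
```

In this example the lava_flow region is cleared because an anomaly is present, the 25-pixel region is cleared because it falls below the threshold, and the anomaly region is kept.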
An ablation study was conducted to assess the impact of certain modifications on the model's performance.
- Replacement of the Last Convolutional Layer + Resize: The last convolutional layer along with the resizing operation was replaced with a transposed convolution. This modification increased the accuracy by 2.0 points (55.2 -> 57.2) and reduced the inference time from 73 ms to 67 ms.
- Addition of Postprocessing: Adding the postprocessing heuristics further improved the accuracy by 2.8 points (57.2 -> 60.0).