This repository contains the code to reproduce all experiments in our paper
ASTRIDE: Adaptive Symbolization for Time Series Databases
available on arXiv. All the code is written in Python (scripts and notebooks).
Toggle for the paper's abstract!
We introduce ASTRIDE (Adaptive Symbolization for Time seRIes DatabasEs), a novel symbolic representation of time series, along with its accelerated variant FASTRIDE (Fast ASTRIDE). Unlike most symbolization procedures, ASTRIDE is adaptive during both the segmentation step by performing change-point detection and the quantization step by using quantiles. Instead of proceeding signal by signal, ASTRIDE builds a dictionary of symbols that is common to all signals in a data set. We also introduce D-GED (Dynamic General Edit Distance), a novel similarity measure on symbolic representations based on the general edit distance. We demonstrate the performance of the ASTRIDE and FASTRIDE representations compared to SAX (Symbolic Aggregate approXimation), 1d-SAX, SFA (Symbolic Fourier Approximation), and ABBA (Adaptive Brownian Bridge-based Aggregation) on reconstruction and, when applicable, on classification tasks. These algorithms are evaluated on 86 univariate equal-size data sets from the UCR Time Series Classification Archive. An open source GitHub repository called astride is made available to reproduce all the experiments in Python.Please let us know of any issue you might encounter when using this code, either by opening an issue on this repository or by sending an email to sylvain.combettes [at] ens-paris-saclay.fr
. Or if you just need some clarification or help.
For ASTRIDE and FASTRIDE, a symbolic representation (with an associated distance) is a scikit-learn pipeline based on the following classes in the src
folder:
SegmentFeature
(insegmentation.py
)Segmentation
(insegment_features.py
)Symbolization
(insymbolization.py
)SymbolicSignalDistance
(insymbolic_signal_distance.py
)
An example usage is given in 07_run_reconstruction.py
.
Download the ABBA/ABBA.py
file from https://github.com/nla-group/ABBA, as it is used in the signal reconstruction benchmark.
date_exp
is a string (for example "2023_02_08"
) in order to version the experiments.
The code outputs the following kinds of csv
files:
data
folder: the UCR archive meta data filtered on the list of data sets under considerationresults/{date_exp}/acc
folder: the results from the 1-NN classifcation taskresults/{date_exp}/clean_acc
folder: the results from the 1-NN classifcation taskresults/{date_exp}/reconstruction
folder: the results from signal reconstruction task
The code outputs all the figures in the paper in the results/{date_exp}/img
folder.
For the Python notebooks, the configuration parameters (at the top) do the following:
DATE_EXP
sets the date of launch of the experiment (for versioning)- if
IS_EXPORT_DF=False
, then no CSV files (pandas DataFrame) are exported - if
IS_SAVE_FIG=False
, then no figures are exported - if
IS_COMPUTE=False
, then long computations are made (no more than a few mintutes on a standard laptop)
Note that more details are given at the top of each notebook and that the code is commented.
- Explore the UCR archive.
Run the01_explore_ucr_metadata.ipynb
notebook.
It generates:- the
data/DataSummary_prep_equalsize.csv
file which contains the 117 univariate and equal-size data sets from the UCR archive. - the
data/DataSummary_prep_equalsize_min100samples.csv
file which contains the 94 univariate and equal-size data sets with at least 100 samples from the UCR archive. - Table 4.
- Data for Section 2.3.3 and Section 3.4 about the total memory usage of a symbolization method.
- the
- Look into the normality assumption oof the means per segment.
Run the02_explore_gaussian_assumption.ipynb
notebook.
- It performs the normality test for section 2.3.1.
- It generates Figure 2.
- Intrepret the symbolization methods, how they transform a signal.
Run the03_interpret_symbolization.ipynb
notebook- It generates Figures 1, 3 and 4.
- It generates Table 3.
- For the classification benchmark of SAX, 1d-SAX, ASTRIDE, and FASTRIDE of Section 4.1.
- Compute the test accuracies of each method.
Run the04_run_classification.py
script for the four symbolization methods:Note that the$ python3 04_run_classification.py --method_name "saxtslearn" --date_exp "2023_02_08" $ python3 04_run_classification.py --method_name "1dsax" --date_exp "2023_02_08" $ python3 04_run_classification.py --method_name "fastride" --date_exp "2023_02_08" $ python3 04_run_classification.py --method_name "astride" --date_exp "2023_02_08"
date_exp
variable is for versioning the experiments.
It generates theresults/{date_exp}/acc/df_acc_{method}_{dataset}.csv
files, which are the test accuracies per method per data set and for all combination of hyper-parameters. - Clean the classification results for each method.
Run the05_clean_classification_results.ipynb
notebook. It generates theresults/{date_exp}/acc_clean/df_acc_{method}_alldatasets_clean.csv
files, which are the test accuracies per method for all data sets and combination of hyper-parameters, cleaned. - Explore the results for each method: plot the accuracy as a function of the word length.
Run the06_explore_classification_results.ipynb
notebook.
It generates Figure 5.
- Compute the test accuracies of each method.
- For the signal reconstruction benchmark of SAX, 1d-SAX, SFA, ABBA, ASTRIDE and FASTRIDE of Section 4.2.
- Compute the reconstructed signals for each method and all signals of all data sets as well as the reconstruction error.
Run the07_run_recontruction.py
script for several target memory usage ratiosNote that thepython3 07_run_reconstruction.py --denom 3 --date_exp "2023_02_08" python3 07_run_reconstruction.py --denom 4 --date_exp "2023_02_08" python3 07_run_reconstruction.py --denom 5 --date_exp "2023_02_08" python3 07_run_reconstruction.py --denom 6 --date_exp "2023_02_08" python3 07_run_reconstruction.py --denom 10 --date_exp "2023_02_08" python3 07_run_reconstruction.py --denom 15 --date_exp "2023_02_08" python3 07_run_reconstruction.py --denom 20 --date_exp "2023_02_08"
denom
variable is the inverse of the target memory usage ratio.
It generates:- the
results/reconstruction/{denom}/reconstruction_errors_{dataset}.csv
files, which are the - the
results/reconstruction/{denom}/reconstructed_signals/reconstructed_{dataset}_{method}.csv
files, which are the reconstructed signals for each symbolization method.
- the
- Compute the reconstruction errors.
Run the08_explore_recontruction_results.ipynb
notebook.
It generates Figures 6 and 7. - Interpret the signal reconstruction for each symbolization method.
Run the09_interpret_reconstruction.ipynb
notebook.
It generates Figure 8.
- Compute the reconstructed signals for each method and all signals of all data sets as well as the reconstruction error.
- Compare the symbolization and classification durations of SAX, 1d-SAX, ASTRIDE, and FASTRIDE.
Run the10_processing_time.ipynb
notebook.
It generates Table 6.
- loadmydata==0.0.9
- matplotlib==3.4.1
- numpy==1.20.2
- pandas==1.2.4
- plotly==5.5.0
- ruptures==1.1.6.dev3+g1def4ff
- scikit-learn==1.0.1
- scipy==1.6.2
- seaborn==0.11.1
- statsmodels==0.13.1
- tslearn==0.5.2
- weighted-levenshtein==0.2.1
If you use this code or publication, please cite (arXiv: https://arxiv.org/abs/2302.04097):
@article{2023_combettes_astride,
doi = {10.48550/ARXIV.2302.04097},
url = {https://arxiv.org/abs/2302.04097},
author = {Combettes, Sylvain W. and Truong, Charles and Oudre, Laurent},
title = {ASTRIDE: Adaptive Symbolization for Time Series Databases},
journal = {arXiv preprint arXiv:2302.04097},
year = {2023},
}
This project is licensed under the MIT License, see the LICENSE.md
file for more information.
- Sylvain W. Combettes (Centre Borelli, ENS Paris-Saclay)
- Charles Truong (Centre Borelli, ENS Paris-Saclay)
- Laurent Oudre (Centre Borelli, ENS Paris-Saclay)
Sylvain W. Combettes is supported by the IDAML chair (ENS Paris-Saclay) and UDOPIA (ANR-20-THIA-0013-01). Charles Truong is funded by the PhLAMES chair (ENS Paris-Saclay). Part of the computations has been executed on Atos Edge computer, funded by the IDAML chair (ENS Paris-Saclay).