Releases: pytorch/audio
v0.10.0
torchaudio 0.10.0 Release Note
Highlights
torchaudio 0.10.0 release includes:
- New models (Tacotron2, HuBERT) and datasets (CMUDict, LibriMix)
- Pretrained model support for ASR (Wav2Vec2, HuBERT) and TTS (WaveRNN, Tacotron2)
- New operations (RNN Transducer loss, MVDR beamforming, PitchShift, etc)
- CUDA-enabled binaries
[Beta] Wav2Vec2 / HuBERT Models and Pretrained Weights
HuBERT model architectures (“base”, “large” and “extra large” configurations) are added. In addition, support for pretrained weights from wav2vec 2.0, Unsupervised Cross-lingual Representation Learning and HuBERT is added.
These pretrained weights can be used for feature extraction and downstream task adaptation.
>>> import torchaudio
>>>
>>> # Build the model and load pretrained weight.
>>> model = torchaudio.pipelines.HUBERT_BASE.get_model()
>>> # Perform feature extraction.
>>> features, lengths = model.extract_features(waveforms)
>>> # Pass the features to downstream task
>>> ...
Some of the pretrained weights are fine-tuned for ASR tasks. The following example illustrates how to use the weights and access associated information, such as labels, which can be used in subsequent CTC decoding steps. (Note: torchaudio does not provide a CTC decoding mechanism.)
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.HUBERT_ASR_LARGE
>>>
>>> # Build the model and load pretrained weight.
>>> model = bundle.get_model()
Downloading:
100%|███████████████████████████████| 1.18G/1.18G [00:17<00:00, 73.8MB/s]
>>> # Check the corresponding labels of the output.
>>> labels = bundle.get_labels()
>>> print(labels)
('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')
>>>
>>> # Infer the label probability distribution
>>> waveform, sample_rate = torchaudio.load('hello-world.wav')
>>>
>>> emissions, _ = model(waveform)
>>>
>>> # Pass emission to (hypothetical) decoder
>>> transcripts = ctc_decode(emissions, labels)
>>> print(transcripts[0])
HELLO WORLD
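The `ctc_decode` call above is a placeholder. A minimal greedy (best-path) decoder could look like the sketch below. It is not part of torchaudio: the function name `greedy_ctc_decode`, the choice of index 0 as the blank token, dropping `<pad>`, `</s>` and `<unk>`, and treating `|` as the word separator are all assumptions made for illustration. It takes per-frame scores as nested sequences, so the emissions tensor for one utterance would be converted first, for example with `emissions[0].tolist()`.

```python
from typing import Sequence


def greedy_ctc_decode(
    frames: Sequence[Sequence[float]],
    labels: Sequence[str],
    blank: int = 0,                     # assumption: index 0 acts as the CTC blank
    ignore: Sequence[int] = (1, 2, 3),  # assumption: <pad>, </s>, <unk> are dropped
    word_sep: str = "|",                # assumption: '|' marks word boundaries
) -> str:
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring label index for every frame.
    best = [max(range(len(row)), key=row.__getitem__) for row in frames]

    # Merge consecutive duplicates, then remove blank and ignored tokens.
    collapsed = [idx for i, idx in enumerate(best) if i == 0 or idx != best[i - 1]]
    kept = [idx for idx in collapsed if idx != blank and idx not in ignore]

    return "".join(labels[idx] for idx in kept).replace(word_sep, " ").strip()
```

Greedy decoding ignores any language model and keeps only the single most likely label per frame, so it is a baseline rather than a replacement for a beam-search decoder.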
[Beta] Tacotron2 and TTS Pipeline
A new model architecture, Tacotron2, is added, alongside several pretrained weights for TTS (text-to-speech). Because these TTS pipelines are composed of multiple models and specific data processing, the notion of a bundle is introduced to make the associated objects easy to use. Bundles provide a common access point to create a pipeline with a set of pretrained weights. They are available under the torchaudio.pipelines module.
The following example illustrates a TTS pipeline where two models (Tacotron2 and WaveRNN) are used together.
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
>>>
>>> # Build text processor, Tacotron2 and vocoder (WaveRNN) model
>>> processor = bundle.get_text_preprocessor()
>>> tacotron2 = bundle.get_tacotron2()
Downloading:
100%|███████████████████████████████| 107M/107M [00:01<00:00, 87.9MB/s]
>>> vocoder = bundle.get_vocoder()
Downloading:
100%|███████████████████████████████| 16.7M/16.7M [00:00<00:00, 78.1MB/s]
>>>
>>> text = "Hello World!"
>>>
>>> # Encode text
>>> input, lengths = processor(text)
>>>
>>> # Generate (mel-scale) spectrogram
>>> specgram, lengths, _ = tacotron2.infer(input, lengths)
>>>
>>> # Convert spectrogram to waveform
>>> waveforms, lengths = vocoder(specgram, lengths)
>>>
>>> # Save audio
>>> torchaudio.save('hello-world.wav', waveforms, vocoder.sample_rate)
[Beta] RNN Transducer Loss
The loss function used in the RNN transducer architecture, which is widely used for speech recognition tasks, is added. The loss function (torchaudio.functional.rnnt_loss or torchaudio.transforms.RNNTLoss) supports float16 and float32 logits, has autograd and TorchScript support, and can be run on both CPU and GPU. The GPU path uses a custom CUDA kernel implementation for improved performance.
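To make concrete what this loss measures, the following is a pure-Python reference sketch of the RNN-T negative log-likelihood for a single utterance, computed with the standard forward recursion over the (time, target) lattice. It is an illustration only and not torchaudio's implementation; the function name `rnnt_nll` and the input layout (log-probabilities indexed as `log_probs[t][u][v]`) are assumptions.

```python
import math
from typing import List, Sequence


def _logaddexp(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_nll(
    log_probs: Sequence[Sequence[Sequence[float]]],  # shape (T, U + 1, V)
    targets: Sequence[int],                           # length U
    blank: int,
) -> float:
    """Negative log-likelihood of `targets` under an RNN-T output lattice."""
    T, U = len(log_probs), len(targets)
    # alpha[t][u]: log-probability of reaching lattice node (t, u),
    # i.e. after emitting t blanks and the first u target labels.
    alpha: List[List[float]] = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive by emitting blank (advance in time)
                alpha[t][u] = _logaddexp(
                    alpha[t][u], alpha[t - 1][u] + log_probs[t - 1][u][blank])
            if u > 0:  # arrive by emitting the next target label
                alpha[t][u] = _logaddexp(
                    alpha[t][u], alpha[t][u - 1] + log_probs[t][u - 1][targets[u - 1]])
    # Every valid alignment ends with a final blank from the last lattice node.
    return -(alpha[T - 1][U] + log_probs[T - 1][U][blank])
```

The nested Python loops make the recursion easy to follow but are far too slow for training, which is the motivation for a dedicated kernel.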
[Beta] MVDR Beamforming
This release adds support for MVDR beamforming on multi-channel audio using Time-Frequency masks. There are three solutions (ref_channel, stv_evd, stv_power), and both single-channel and multi-channel masks are supported (multi-channel masks are averaged inside the method). It provides an online option that recursively updates the parameters for streaming audio.
Please refer to the MVDR tutorial.
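For orientation, the reference-channel form of MVDR is commonly written as below (the formulation usually attributed to Souden et al.). The notation is ours and the mapping to the ref_channel solution is an assumption to be checked against the tutorial: Φ_SS and Φ_NN are the speech and noise power spectral density matrices estimated with the masks, u is a one-hot vector selecting the reference channel, and Y is the multi-channel STFT.

```latex
\mathbf{w}_{\mathrm{MVDR}}(f) =
  \frac{\boldsymbol{\Phi}_{NN}^{-1}(f)\,\boldsymbol{\Phi}_{SS}(f)}
       {\operatorname{Trace}\!\left(\boldsymbol{\Phi}_{NN}^{-1}(f)\,\boldsymbol{\Phi}_{SS}(f)\right)}\,
  \mathbf{u},
\qquad
\hat{S}(f) = \mathbf{w}_{\mathrm{MVDR}}^{\mathsf{H}}(f)\,\mathbf{Y}(f)
```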
GPU Build
This release adds GPU builds that support custom CUDA kernels in torchaudio, like the one being used for RNN transducer loss. Following this change, torchaudio’s binary distribution now includes CPU-only versions and CUDA-enabled versions. To use CUDA-enabled binaries, PyTorch also needs to be compatible with CUDA.
Additional Features
torchaudio.functional.lfilter now supports batch processing and multiple filters. Additional operations, including pitch shift, LFCC, and inverse spectrogram, are now supported in this release. The datasets CMUDict and LibriMix are added as well.
Backward Incompatible Changes
I/O
- Default to PCM_16 for flac on soundfile backend (#1604)
  - When saving FLAC format with “soundfile” backend, PCM_24 (the previous default) could cause warping. The default has been changed to PCM_16, which does not suffer from this.
Ops
- Default to native complex type when returning raw spectrogram (#1549)
  - When power=None, torchaudio.functional.spectrogram and torchaudio.transforms.Spectrogram now default to return_complex=True, which returns a Tensor of native complex type (such as torch.cfloat and torch.cdouble). To use a pseudo complex type, pass the resulting tensor to torch.view_as_real.
- Remove deprecated kaldi.resample_waveform (#1555)
  - Please use torchaudio.functional.resample.
- Replace waveform with specgram in SlidingWindowCmn (#1859)
  - The argument name was corrected to specgram.
- Ensure integer input frequencies for resample (#1857)
  - Sampling rates were silently cast to integers in the resampling implementation, so it now requires integer sampling rate inputs to ensure expected resampling quality.
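One way to see why integer sampling rates matter: rational resamplers usually reduce the two rates to a small up/down ratio via their greatest common divisor, which is only defined for integers. The sketch below illustrates that reduction in plain Python. The helper name `resample_ratio` is hypothetical and this is not torchaudio's code.

```python
import math
from typing import Tuple


def resample_ratio(orig_freq: int, new_freq: int) -> Tuple[int, int]:
    """Return the reduced (up, down) factors for converting orig_freq to new_freq."""
    for name, value in (("orig_freq", orig_freq), ("new_freq", new_freq)):
        # Reject floats (and bools) instead of silently truncating them.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an int, got {type(value).__name__}")
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
    gcd = math.gcd(orig_freq, new_freq)
    return new_freq // gcd, orig_freq // gcd


print(resample_ratio(8000, 16000))   # (2, 1)
print(resample_ratio(16000, 44100))  # (441, 160)
```

A rate such as 44100.5 has no exact ratio of this kind, so truncating it would resample to a slightly different rate than the one requested.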
Wav2Vec2
- Update extract_features of Wav2Vec2Model (#1776)
  - The previous implementation returned outputs from convolutional feature extractors. To match the behavior of the original fairseq implementation, the method was changed to return the outputs of the intermediate transformer layers. To achieve the original behavior, please use Wav2Vec2Model.feature_extractor().
- Move fine-tune specific module out of wav2vec2 encoder (#1782)
  - The internal structure of Wav2Vec2Model was updated. The Wav2Vec2Model.encoder.read_out module is moved to Wav2Vec2Model.aux. If you have a serialized state dict, please replace the key encoder.read_out with aux.
- Updated wav2vec2 factory functions for more customizability (#1783, #1804, #1830)
  - The signatures of wav2vec2 factory functions are changed. The num_out parameter has been changed to aux_num_out and other parameters are added before it. Please update the code from wav2vec2_base(num_out) to wav2vec2_base(aux_num_out=num_out).
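The `encoder.read_out` to `aux` key replacement described above can be applied to a previously serialized state dict with a small helper like the sketch below. The function name `migrate_wav2vec2_state_dict` is hypothetical, and the sketch assumes the affected keys carry the prefix `encoder.read_out.` (for example `encoder.read_out.weight`); inspect your own checkpoint before relying on it.

```python
from typing import Any, Dict


def migrate_wav2vec2_state_dict(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `state_dict` with `encoder.read_out.*` keys renamed to `aux.*`."""
    # Assumption: the old module's parameters are stored under this prefix.
    old_prefix, new_prefix = "encoder.read_out.", "aux."
    migrated = {}
    for key, value in state_dict.items():
        if key.startswith(old_prefix):
            key = new_prefix + key[len(old_prefix):]
        migrated[key] = value
    return migrated
```

The returned dictionary can then be passed to the model's `load_state_dict` in place of the original one.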
Deprecations
- Add melscale_fbanks and deprecate create_fb_matrix (#1653)
  - As linear_fbanks is introduced, create_fb_matrix is renamed to melscale_fbanks. The original create_fb_matrix is now deprecated. Please use melscale_fbanks.
- Deprecate VCTK dataset (#1810)
  - This dataset has been taken down and is no longer available. Please use VCTK_092 dataset.
- Deprecate data utils (#1809)
  - bg_iterator and diskcache_iterator are known to not improve the throughput of data loaders. Please cease their usage.
New Features
Models
Tacotron2
- Add Tacotron2 model (#1621, #1647, #1844)
- Add Tacotron2 loss function (#1764)
- Add Tacotron2 inference method (#1648, #1839, #1849)
- Add phoneme text preprocessing for Tacotron2 (#1668)
- Move Tacotron2 out of prototype (#1714)
HuBERT
Pretrained Weights and Pipelines
- Add pretrained weights for wavernn (#1612)
- Add Tacotron2 pretrained models (#1693)
- Add HUBERT pretrained weights (#1821, #1824)
- Add pretrained weights from wav2vec2.0 and XLSR papers (#1827)
- Add customization support to wav2vec2 labels (#1834)
- Default pretrained weights to eval mode (#1843)
- Move wav2vec2 pretrained models to pipelines module (#1876)
- Add TTS bundle/pipelines (#1872)
- Fix vocoder interface (#1895)
- Fix Phonemizer download (#1897)
RNN Transducer Loss
- Add reduction parameter for RNNT loss (#1590)
- Rename RNNT loss C++ parameters (#1602)
- Rename transducer to RNNT (#1603)
- Remove gradient variable from RNNT loss Python code (#1616)
- Remove reuse_logits_for_grads option for RNNT loss (#1610)
- Remove fused_log_softmax option from RNNT loss (#1615)
- RNNT loss resolve null gradient (#1707)
- Move RNNT loss out of prototype (#1711)
MVDR Beamforming
- Add MVDR module to example (#1709)
- Add normalization to steering vector solutions in MVDR Module (#1765)
- Move MVDR and PSD modules to transforms (#1771)
- Add MVDR beamforming tutorial to example directory (#1768)
Ops
- Add edit_distance (#1601)
- Add PitchShift to functional and transform (#1629)
- Add LFCC feature to transforms (#1611)
- Add InverseSpectrogram to transforms and functional (#1652)
Datasets
Improvements
I/O
- Make buffer size for function info configurable (#1634)
Ops
torchaudio 0.9.1 Minor bugfix release
This release depends on pytorch 1.9.1
No functional changes other than minor updates to CI rules.
v0.9.0
torchaudio 0.9.0 Release Note
Highlights
torchaudio 0.9.0 release includes:
- Lots of performance improvements. (filtering, resampling, spectral operation)
- Popular wav2vec2.0 model architecture.
- Improved autograd support.
[Beta] Wav2Vec2.0 Model
This release includes model architectures from the wav2vec2.0 paper with utility functions that allow importing pretrained model parameters published on fairseq and Hugging Face Hub. Now you can easily run speech recognition with torchaudio. These model architectures also support TorchScript, and you can deploy them with ONNX or in non-Python environments, such as C++, Android and iOS. Please check out our C++, Android and iOS examples. The following snippets illustrate how to create a deployable model.
# Import fine-tuned model from Hugging Face Hub
from transformers import Wav2Vec2ForCTC
from torchaudio.models.wav2vec2.utils import import_huggingface_model
original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
imported = import_huggingface_model(original)
# Import fine-tuned model from fairseq
import fairseq
from torchaudio.models.wav2vec2.utils import import_fairseq_model
original, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    ["wav2vec_small_960h.pt"], arg_overrides={'data': "<data_dir>"})
imported = import_fairseq_model(original[0].w2v_encoder)
# Build uninitialized model and load state dict
from torchaudio.models import wav2vec2_base
model = wav2vec2_base(num_out=32)
model.load_state_dict(imported.state_dict())
# Quantize / script / optimize for mobile
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile
quantized_model = torch.quantization.quantize_dynamic(
    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
scripted_model = torch.jit.script(quantized_model)
optimized_model = optimize_for_mobile(scripted_model)
optimized_model.save("model_for_deployment.pt")
Filtering Improvement
The internal implementation of lfilter has been updated to support autograd on both CPU and CUDA. Additionally, the performance on CPU is significantly improved. These improvements also apply to biquad variants.
The following table illustrates the performance improvements compared against the previous releases. lfilter was applied on float32 tensors with one channel and different number of frames.
torchaudio version | 256 frames | 512 frames | 1024 frames |
0.9 | 0.282 | 0.381 | 0.564 |
0.8 | 0.493 | 0.780 | 1.37 |
0.7 | 5.42 | 10.8 | 22.3 |
Unit: msec
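For readers unfamiliar with the operation being timed, lfilter applies an IIR filter defined by a difference equation. The plain-Python sketch below shows that recurrence for a single channel. It is an illustration rather than torchaudio's implementation: the name `iir_filter` is hypothetical, and details such as output clamping and batching are left out.

```python
from typing import List, Sequence


def iir_filter(x: Sequence[float], a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Apply a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k] to one channel."""
    if not a or a[0] == 0:
        raise ValueError("a[0] must be non-zero")
    y: List[float] = []
    for n in range(len(x)):
        # Feed-forward part: weighted sum of current and past inputs.
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        # Feedback part: weighted sum of past outputs.
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y


# Two-tap moving average: each output is the mean of the current and previous input.
print(iir_filter([1.0, 1.0, 1.0], a=[1.0, 0.0], b=[0.5, 0.5]))  # [0.5, 1.0, 1.0]
```

Because each output depends on earlier outputs, the loop over `n` is inherently sequential, which is why an optimized native implementation makes such a difference for this operation.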
Complex Tensor Migration
torchaudio has functions that handle complex-valued tensors. In the early days, when PyTorch did not have a complex dtype, torchaudio adopted the convention of using an extra dimension to represent real and imaginary parts. In PyTorch 1.6, new dtypes, such as torch.cfloat and torch.cdouble, were introduced to represent complex values natively. (In the following, we refer to torchaudio’s original convention as pseudo complex types, and PyTorch’s native dtype as native complex types.)
As the native complex types have become mature and stable, torchaudio has started to migrate complex functions to use the native complex type. In this release, the internal implementation was updated to use the native complex types, and interfaces were updated to allow passing/receiving native complex type directly. Users can choose to keep using the pseudo complex type or opt in to use native complex type. However, please note that the use of the pseudo complex type is now deprecated. These functions are tested to support TorchScript and autograd. For details of this migration plan, please refer to #1337.
Additionally, switching the internal implementation to the native complex types improved the performance. Since the internal implementation uses native complex type regardless of which complex type is passed/returned, users will automatically benefit from this performance improvement.
The following table illustrates the performance improvements from the previous release by comparing the time it takes for complex transforms to perform operation on float32 Tensor with two channels and 256 frames.
CPU
torchaudio version | Spectrogram | TimeStretch | GriffinLim |
0.9 | 0.229 | 12.6 | 3320 |
0.8 | 0.283 | 126 | 5320 |
Unit: msec
CUDA
torchaudio version | Spectrogram | TimeStretch | GriffinLim |
0.9 | 0.195 | 0.599 | 36 |
0.8 | 0.219 | 0.687 | 60.2 |
Unit: msec
Improved Autograd Support
Along with the work of Complex Tensor Migration and Filtering Improvement mentioned above, more tests were added to ensure the autograd support. Now the following operations are guaranteed to support autograd up to second order.
Functionals
lfilter
allpass_biquad
biquad
band_biquad
bandpass_biquad
bandreject_biquad
bass_biquad
equalizer_biquad
treble_biquad
highpass_biquad
lowpass_biquad
Transforms
AmplitudeToDB
ComputeDeltas
Fade
GriffinLim
TimeMasking
FrequencyMasking
MFCC
MelScale
MelSpectrogram
Resample
SpectralCentroid
Spectrogram
SlidingWindowCmn
TimeStretch
Vol
NOTE:
- Autograd test for transforms also covers the following functionals.
amplitude_to_DB
spectrogram
griffinlim
resample
phase_vocoder
mask_along_axis_iid
mask_along_axis
gain
spectral_centroid
- torchaudio.transforms.TimeStretch and torchaudio.functional.phase_vocoder call atan2, which is not differentiable around zero. Therefore these functions are differentiable only when the input spectrogram does not contain values around zero.
[Beta] Resampling Improvement
In release 0.8, the resampling operation was vectorized and its performance improved. In this release, the implementation of the resampling algorithm has been further revised.
- Kaiser window has been added for a wider range of resampling quality.
- rolloff parameter has been added for anti-aliasing control.
- torchaudio.transforms.Resample precomputes the kernel using float64 precision and caches it for even faster operation.
- New entry point, torchaudio.functional.resample, has been added and the original entry point, torchaudio.compliance.kaldi.resample_waveform, is deprecated.
The following table illustrates the performance improvements from the previous release by comparing the time it takes for torchaudio.transforms.Resample to complete the operation on float32 tensor with two channels and one-second duration.
CPU
torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
0.9 | 0.192 | 0.559 | 0.478 | 0.467 |
0.8 | 0.537 | 0.753 | 43.9 | 17.6 |
Unit: msec
CUDA
...torchaudio version | 8k → 16k | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
v0.8.1
v0.8.0
Highlights
This release supports Python 3.9.
I/O Improvements
Continuing from the previous release, torchaudio improves the audio I/O mechanism. In this release, we have four major updates.
- Backend migration.
  We have migrated the default backend for audio I/O. The new default backend is “sox_io” (for Linux/macOS). The interface for “soundfile” backend has also been changed to align with that of “sox_io”. Following the change of default backends, the legacy backend/interface have been marked as deprecated. The legacy backend/interface are still accessible, though it is strongly discouraged to use them. For details on the migration, please refer to #903.
- File-like object support.
  We have added file-like object support to I/O functions and sox_effects. You can perform the info, load, save and apply_effects_file operations on file-like objects.
  import io
  import tarfile
  import boto3
  import requests
  import torchaudio

  # Query audio metadata over HTTP
  # Will only fetch the first few kB
  with requests.get(URL, stream=True) as response:
      metadata = torchaudio.info(response.raw)
  # Load audio from tar file
  # No need to extract TAR file.
  with tarfile.open(TAR_PATH, mode='r') as tarfile_:
      fileobj = tarfile_.extractfile(SAMPLE_TAR_ITEM)
      waveform, sample_rate = torchaudio.load(fileobj)
  # Saving to Bytes buffer
  # Using BytesIO, you can perform in-memory encoding/decoding.
  buffer_ = io.BytesIO()
  torchaudio.save(buffer_, waveform, sample_rate, format="wav")
  # Apply effects (lowpass filter / resampling) while loading audio from S3
  client = boto3.client('s3')
  response = client.get_object(Bucket=S3_BUCKET, Key=S3_KEY)
  waveform, sample_rate = torchaudio.sox_effects.apply_effects_file(
      response['Body'], [["lowpass", "-1", "300"], ["rate", "8000"]])
- [Beta] Codec Application.
  Built upon the file-like object support, we added the functional.apply_codec function, which can degrade audio data in memory by applying audio codecs supported by the “sox_io” backend.
  import torchaudio.functional as F

  # Apply MP3 codec
  degraded = F.apply_codec(
      waveform, sample_rate, format="mp3", compression=-9)
  # Apply GSM codec
  degraded = F.apply_codec(waveform, sample_rate, format="gsm")
- Encoding options.
  We have added encoding options to the save function of the new backends. Now you can change the format and encodings with the format, encoding and bits_per_sample options.
  # Save without any encoding option.
  # The function will pick the encoding which the provided data fit
  # For Tensor of float32 type, that is 32-bit floating-point PCM.
  torchaudio.save("data.wav", waveform, sample_rate)
  # Save as 16-bit signed integer Linear PCM
  # The resulting file occupies half the storage but loses precision
  torchaudio.save(
      "data.wav", waveform, sample_rate, encoding="PCM_S", bits_per_sample=16)
- More format support to "sox_io"’s save function.
  We have added support for GSM, HTK, AMB, and AMR-NB formats to "sox_io"’s save function.
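The storage remark in the encoding options item can be checked with simple arithmetic: for uncompressed PCM, the audio payload is frames × channels × bytes per sample. The helper below is a hypothetical illustration (its name and the one-second stereo example are assumptions) and it ignores container headers.

```python
def pcm_payload_bytes(num_frames: int, num_channels: int, bits_per_sample: int) -> int:
    """Size in bytes of raw PCM audio data, excluding any file header."""
    if bits_per_sample % 8 != 0:
        raise ValueError("bits_per_sample must be a multiple of 8")
    return num_frames * num_channels * (bits_per_sample // 8)


# One second of stereo audio at 16 kHz (illustrative numbers).
float32_size = pcm_payload_bytes(16000, 2, 32)
int16_size = pcm_payload_bytes(16000, 2, 16)
print(float32_size, int16_size)  # 128000 64000
```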
Switch to CMake-based build
torchaudio was utilizing CMake to build third party dependencies. Now torchaudio uses CMake to build its C++ extension. This will open the door to integrate torchaudio in non-Python environments (such as C++ applications and mobile). We will work on adding example applications and mobile integrations in upcoming releases.
Backwards Incompatible Changes
- Removed deprecated transform and target_transform arguments from VCTK and YESNO datasets. (#1120) If you were relying on the previous behavior, we recommend that you apply the transforms in the collate function.
- Removed torchaudio.datasets.utils.walk_files (#1111) and replaced by Path and glob. (#1069, #1101). If you relied on the function, we recommend that you use glob instead.
- Removed torchaudio.data.utils.unicode_csv_reader. (#1086) If you relied on the function, we recommend that you replace it with csv.reader.
- Disabled CommonVoice download as users are required to sign user agreement. Please download and extract the dataset manually, and replace the root argument by the subfolder for the version and language of interest, see #1082 for more details. (#1018, #1079, #1080, #1082)
- Removed legacy sox effects (#977, #1001). Please migrate to apply_effects_file or apply_effects_tensor.
- Switched the default backend to the ones with new interfaces (#978). If you were relying on the previous behavior, you can return to the previous behavior by following instructions in #975 for one more release.
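For code that relied on the removed walk_files, a replacement based on pathlib might look like the following sketch. The helper name `list_files` and its sorted, recursive behavior are assumptions chosen for illustration; adapt them to what your code expected from the old function.

```python
from pathlib import Path
from typing import List, Union


def list_files(root: Union[str, Path], suffix: str) -> List[str]:
    """Recursively collect files under `root` whose names end with `suffix`."""
    # Sorting makes the order deterministic across operating systems.
    return sorted(str(p) for p in Path(root).glob(f"**/*{suffix}") if p.is_file())
```

For example, `list_files("data", ".wav")` returns the paths of all `.wav` files below a (hypothetical) `data` directory as strings.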
New Features
- Added GSM, HTK, AMB, AMR-NB and AMR-WB format support to “sox_io” backend. (#1276, #1291, #1277, #1275, #1066)
- Added encoding options (format, bits_per_sample and encoding) to save function. (#1226, #1177, #1129, #1104)
- Added new attributes (bits_per_sample and encoding) to the info function return type (AudioMetaData) (#1177, #1206, #1324)
- Added format override to libsox-based file input. (load, info, sox_effects.apply_effects_file) (#1104)
- Added file-like object support in “sox_io”, and “soundfile” backend and sox_effects.apply_effects_file. (#1115)
- [Beta] Added the Kaldi Pitch feature. (#1243, #1260)
- [Beta] Added the SpectralCentroid transform. (#1167, #1216, #1316)
- [Beta] Added codec transformation apply_codec. (#1200)
Improvements
- Exposed normalization method to Mel transforms. (#1212)
- Exposed additional STFT arguments to Spectrogram (#892) and to MelSpectrogram (#1211).
- Added support for pathlib.Path to apply_effects_file (#1048) and to CMUARCTIC (#1025), YESNO (#1015), COMMONVOICE (#1027), VCTK and LJSPEECH (#1028), GTZAN (#1032), SPEECHCOMMANDS (#1039), TEDLIUM (#1045), LIBRITTS and LIBRISPEECH (#1046).
- Added SpeechCommands train/valid/test split. (#966, #1012)
Internals
- Replaced if-elseif-else with switch in sox C++ code. (#1270)
- Refactored C++ interface for sox_io's get_info_file (#1232) and get_encodinginfo (#1233).
- Add explicit functional import in init. (#1228)
- Refactored YESNO dataset (#1127), LJSPEECH dataset (#1143).
- Removed Python 2.7 reference from setup.py. (#1182)
- Merged flake8 configurations into single .flake8 file. (#1172, #1214)
- Updated calls to torch.stft to use return_complex=True. (#1096, #1013)
- Cleaned up handling of optional args in C++ with c10::optional. (#1043)
- Removed unused imports in sox effects. (#1052)
- Introduced functional submodule to organize functionals. (#1003)
- [Testing] Refactored MelSpectrogram librosa compatibility test to decouple from other tests. (#1267)
- [Testing] Moved batch tests for functionals. (#1254)
- [Testing] Refactored tests for backend (#1239) and for functionals (#1237).
- [Testing] Removed dependency on pytest from testing (#1157, #1188)
- [Testing] Refactored unit tests for VCTK (#1134), SPEECHCOMMANDS (#1136), LIBRISPEECH (#1140), TEDLIUM (#1135), LJSPEECH (#1138), LIBRITTS (#1139), CMUARCTIC (#1147), GTZAN (#1148), COMMONVOICE and YESNO (#1133).
- [Testing] Removed dependency on COMMONVOICE dataset from tests. (#1132)
- [Build] Fixed Python 3.9 support (#1242)
- [Build] Switched to cmake for build. (#1187, #1246, #1249)
- [Build] Restructured C++ code to allow per file registration of custom ops. (#1221)
- [Build] Added logging to sox/CMakeLists.txt. (#1190)
- [Build] Disabled C++11 ABI when necessary for libtorch compatibility. (#880)
- [Build] Reorganized libsox source and build directory to accommodate additional third party code. (#1161, #1176)
- [Build] Refactored sox source files and moved into dedicated subfolder. (#1106)
- [Build] Enabled custom clean function for python setup.py clean. (#1142)
- [CI] Documented undocumented parameters. Added CI check. (#1248)
- [CI] Fixed sphinx warnings in documentation. Turned warnings into errors. (#1247)
- [CI] Print CPU info before running unit test. (#1218)
- [CI] Fixed clang-format job and fixed newly detected formatting issues. (#981, #1198, #1222)
- [CI] Updated unit test base Docker image. (#1193)
- [CI] Disabled CCI cache which is now known to be flaky. (#1189)
- [CI] Disabled torchscript BC test which is known to fail. (#1192)
- [CI] Stripped version suffix for pytorch. (#1185)
- [CI] Ran smoke test with CPU package for pytorch due to known issue with CUDA 11. (#1105)
- [CI] Added missing empty line at the end of config.yml. (#1020)
- [CI] Added automatic documentation build and push to branch in CI. (#1006, #1034, #1041, #1049, #1091, #1093, #1098, #1100, #1121)
- [CI] Ran GPU test for all pull requests and fixed current setup. (#998, #1014, #1191)
- [CI] Skipped tests that are known to fail on macOS Python 3.6/3.7. (#999)
- [CI] Changed the order of installation and aligned with Windows. (#987)
- [CI] Fixed documentation rendering by using Sphinx 2.4.4. (#974)
- [Doc] Added subcategories to functional documentation. (#1325)
- [Doc] Added a version selector in documentation. (#1273)
- [Doc] Updated compilation recommendation in README. (#1263)
- [Doc] Added CONTRIBUTING.md. (#1241)
- [Doc] Added instructions to install parametrized package. (#1164)
- [Doc] Fixed the return type for load functions. (#1122)
- [Doc] Added missing modules and minor fixes. (#1022, #1056, #1117)
- [Doc] Fixed spelling and links in README. (#1029, #1037, #1062, #1110, #1261)
- [Doc] Grouped filtering functionals in documentation page. (#1005, #1004)
- [Doc] Updated the compatibility matrix with torchaudio 0.7 (#979)
- [Doc] Added description of prototype/beta/stable features. (#968)
Bug Fixes
- Fixed amplitude_to_DB clamping behaviour on batches. (#1113)
- Disabled audio devices in sox builds which could interfere in the build process when detected. (#1153)
- Fixed COMMONVOICE for French where the audio file extension was missing on load. (#1126)
- Disabled OpenMP support for libsox which can produce errors when used i...
v0.7.2
Highlights
This release introduces support for python 3.9. There is no 0.7.1 release, and the following changes are compared to 0.7.0.
Improvements
- Add python 3.9 support (#1061)
Bug Fixes
- Temporarily disable OpenMP support for libsox (#1054)
Deprecations
- Disallow download=True in CommonVoice (#1076)
v0.7.0
Highlights
Example Pipelines
torchaudio is expanding its support for models and end-to-end applications. Please file an issue on github to provide feedback on them.
- Speech Recognition: Building on the addition of the Wav2Letter model for speech recognition in the last release, we added an example training pipeline for speech recognition that uses the LibriSpeech dataset.
- Text-to-Speech: With the goal of supporting text-to-speech applications, we added a vocoder based on the WaveRNN model. WaveRNN model is based on the implementation from this repository. The original implementation was introduced in "Efficient Neural Audio Synthesis". We provide an example training pipeline in the example folder that uses the LibriTTS dataset added to torchaudio in this release.
- Source Separation: We also support source separation with the addition of the ConvTasNet model, based on the paper "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation." An example training pipeline is provided with the wsj0-mix dataset.
I/O Improvements
As you are likely already aware from the last release, we’re currently in the process of making sox_io, which ships with new features such as TorchScript support and performance improvements, the new default. If you want to benefit from these features now, we encourage you to migrate. For more information see issue #903.
Backwards Incompatible Changes
- Switched all %-based string formatting to str.format to adopt changes in PyTorch, leading to improved error messages for TorchScript (#850)
- Split sox_utils.list_formats() for read and write (#811)
- Made directory traversal order alphabetical and breadth-first, consistent across operating systems (#814)
- Changed GTZAN so that it only traverses filenames belonging to the dataset (#791)
New Features
- Added ConvTasNet model (#920, #933) with pipeline (#894)
- Added canonical pipeline with wav2letter (#632)
- The WaveRNN model (#705, #797, #801, #810, #836) is available with a canonical pipeline (#749, #802, #831, #863)
- Added all 3 releases of tedlium dataset (#882, #934, #945, #895)
- Added VCTK_092 dataset (#812)
- Added LibriTTS (#790, #820)
- Added SPHERE support to sox_io backend (#871)
- Added torchscript sox effects (#760)
- Added a flag to change the interface of soundfile backend to the one identical to sox_io backend. (#922)
Improvements
- Added soundfile compatibility backend. (#922)
- Improved the speed of torchaudio.compliance.kaldi.fbank (#947)
- Improved the speed of phaser (#660)
- Added warning when a Mel filter is all zero (#914)
- Added pathlib.Path support to sox_io backend (#907)
- Simplified C++ registration with TORCH_LIBRARY (#840)
- Merged sox effect and sox_io C++ implementation (#779)
Internal
- CI: Added test to validate torchscript backward compatibility (#838)
- CI: Used mocked datasets to test CMUArctic (#829), CommonVoice (#827), Speech Commands (#824), LJSpeech (#826), LibriSpeech (#825), YESNO (#792, #832)
- CI: Made *nix unit test fail if C++ extension is not available (#847, #849)
- CI: Separated I/O in testing. (#813, #773, #783)
- CI: Added smoke tests to sox_io and sox_effects (#806)
- CI: Tested utilities have been refactored (#805, #808, #809, #817, #822, #831)
- Doc: Added how to run tests (#843)
- Doc: Added 0.6.0 to version matrix in README (#833)
Bug Fixes
- Fixed device in interactive ASR example (#900)
- Fixed incorrect extension parsing (#885)
- Fixed dither with noise_shaping = True (#865)
- Run unit test with non-editable installation (#845), and set zip_safe = False to disable egg installation (#842)
- Sorted GTZAN dataset and use on-the-fly data in GTZAN test (#819)
Deprecations
v0.6.0
Highlights
torchaudio now includes a new model module (with wav2letter included), new functionals (contrast, cvm, dcshift, overdrive, vad, phaser, flanger, biquad), datasets (GTZAN, CMU), and a new optional sox backend with support for torchscript. torchaudio now also supports Windows, with the soundfile backend.
torchaudio requires python 3.6 or more recent.
Backwards Incompatible Changes
- We reorganized the C++ resources (#630) and replaced C++ bindings for sox_effects init/list/shutdown with torch binding (#748).
- We removed code specific to python 2 (#691), and we no longer test against python 2 (#575) and 3.5 (#577)
New Features
- We now support Windows. (#604, #637, #642, #655, #743)
- We now have a model module which includes wav2letter. (#462, #722)
- We added the GTZAN and CMU datasets. (#668, #710)
- We now have the contrast functional (#551), cvm (#540), dcshift (#558), overdrive (#569), vad (#578, #599), phaser (#587, #607, #702), flanger (#651, #702), biquad (#661).
- We added a new sox_io backend (#718, #728, #734, #727, #763, #752, #731, #732, #726, #780) that is compatible with torchscript with a new AudioMetaData class (#761).
- MelSpectrogram now has power and normalized parameters (#633), and slaney normalization (#589, #641).
- lfilter now has a clamp option. (#600)
- Griffin-Lim can now have zero momentum. (#601)
- sliding_window_cmn now supports batching. (#570)
- Downloaded datasets now verify checksums. (#499)
Improvements
- We added ogg/vorbis/opus support to binary distribution (#750, #755).
- We replaced the use of torch.norm in spectrogram to improve performance (#747).
- We now use fused operations in lfilter for faster computation. (#517, #564)
- STFT is now called directly from torchaudio. (#531)
- We redesigned the backend mechanism to support torchscript, by restructuring the code (#695, #696, #700, #706, #707, #698), adding dynamic listing (#697)
- torchaudio can be built along with sox, or can use external sox. (#625, #669, #739)
- We redesigned the sox_effects module. (#708)
- We added more details to compilation instructions. (#667)
- We updated the README with instructions on changing the backend. (#553)
- We now have a version compatibility matrix in README. (#685)
- We now use cmake to build third party libraries (#753).
- We now use CircleCI instead of travis (#576, #584, #598, #603, #636, #738) and we test on GPU (#586, #777).
- We run the test suite against nightlies. (#538, #678)
- We redesigned our test suite with new helper functions (#514, #519, #521, #565, #616, #690, #692, #694), standard pytorch test utilities (#513, #640, #643, #645, #646, #652, #650, #712), separate CPU and GPU tests (#513, #528, #644), more descriptive names (#532), clearer organization (#539, #541, #542, #664, #672, #687, #703, #716, #732), standardized names (#559), and backend-aware tests (#719). This is detailed in a new README for testing (#566, #759).
- We now support typing, for datasets (#511, #522), for backends (#527), for init (#526), and inline (#530), with mypy configuration (#524, #544, #590).
Bug Fixes
- We removed in place operations so that Griffin-Lim can be backpropagated through. (#730)
- We fixed kaldi MFCC on GPU. (#681)
- We removed multiple definitions of SoxEffect in C++. (#635)
- We fixed the docstring of masking. (#612)
- We replaced views by reshape for batching. (#594)
- We fixed missing conda environment when testing in python 3.8. (#582)
- We ensured that sox is not exposed on Windows. (#579)
- We corrected the instructions to install nightlies. (#547, #552)
- We fixed the seed of mask_along_iid. (#529)
- We correctly report GPU tests as skipped instead of passed. (#516)
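The difference between a skipped and a passed test in the last item can be sketched with the standard `unittest` module. The `CUDA_AVAILABLE` flag and the test class are hypothetical stand-ins. A real suite would typically query the device, for example with `torch.cuda.is_available()`, and this is not the project's actual test code.

```python
import unittest

# Hypothetical stand-in for a real device check such as torch.cuda.is_available().
CUDA_AVAILABLE = False


class GpuOnlyTest(unittest.TestCase):
    @unittest.skipIf(not CUDA_AVAILABLE, "CUDA is not available")
    def test_runs_on_gpu(self):
        # A real test would move tensors to the GPU and check results here.
        self.assertTrue(CUDA_AVAILABLE)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(GpuOnlyTest)
result = unittest.TestResult()
suite.run(result)

# The test lands in result.skipped instead of silently counting as a pass.
skipped_reasons = [reason for _test, reason in result.skipped]
```

Marking the test with a skip decorator, rather than returning early from the test body, is what lets the runner report it as skipped.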
Deprecations
v0.5.1
v0.5.0
Highlights
torchaudio includes new transforms (e.g. Griffin-Lim and inverse Mel scale), new filters (e.g. all pass, fade, band pass/reject, band, treble, deemph, riaa), and datasets (LJ Speech and SpeechCommands).
Backwards Incompatible Changes
- torchaudio no longer supports python 2. We removed future and six imports. We added inline typing. (#413, #478, #479, #482, #486)
- We fixed CommonVoice dataset download, and updated to the latest version. (#498)
- We now skip data point with missing data in VCTK dataset. (#484)
New Features
- We now have the Vol transform and DB_to_amplitude. (#468, #469)
- We now have the InverseMelScale transform. (#448)
- We now have the Griffin-Lim functional. (#365)
- We now support allpass, fade, bandpass, bandreject, band, treble, deemph, riaa. (#444, #449, #464, #470, #508)
- We now offer LJSpeech and SpeechCommands datasets. (#439, #437)
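For readers unfamiliar with the decibel conversion behind DB_to_amplitude, the following scalar sketch shows the arithmetic as I understand it from the torchaudio documentation: the dB value is mapped back to the power scale and then raised to `power`. The function below is a plain-Python stand-in written for illustration, not the library implementation, which operates on tensors.

```python
def db_to_amplitude(x_db, ref=1.0, power=1.0):
    """Convert a decibel value to the power scale (power=1.0) or amplitude scale (power=0.5)."""
    # 10 ** (x_db / 10) undoes 10 * log10(value / ref) on the power scale.
    return ref * (10.0 ** (0.1 * x_db)) ** power


# 20 dB corresponds to a power ratio of 100 and an amplitude ratio of 10.
power_value = db_to_amplitude(20.0, ref=1.0, power=1.0)
amplitude_value = db_to_amplitude(20.0, ref=1.0, power=0.5)
```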
Improvements
- We added inline typing to SoxEffects and Kaldi compliance. (#490, #497)
- We refactored the tests. (#480, #485, #496, #491, #501, #502, #503, #506, #507, #509)
- We now run tests with sox only when sox is available. (#419)
- We extended batch support to MelScale, MelSpectrogram, MFCC, Resample. (#391, #435)
- The speed of torchaudio.functional.istft was improved. (#471)
- We now have transform and functional tests for AmplitudeToDB. (#463)
- We now ignore pycharm and OSX files in git. (#461)
- TimeStretch now has a batch test. (#459)
- Docstrings in transforms were polished. (#442)
- TimeStretch and AmplitudeToDB are now torch.nn.Module. (#456)
- Resample is now jitable. (#441)
- We support python 3.8. (#397)
- We added a CUDA test for complex norm. (#421)
- Dither is jitable with the latest version of pytorch. (#417)
- Batching uses view instead of reshape. (#409)
- We refactored the jitability test. (#395)
- In .circleci, we removed a conditional block that wasn't doing anything. (#399)
- We now have Windows CI for building. (#394, #398)
- We corrected the use of standard variable names in code. (#393)
- We adopted native-Python code generation convention. (#378)
- torchaudio.istft creates tensors directly on device. (#377)
- torchaudio.compliance.kaldi.resample_waveform is now jitable. (#362)
- The runtime of torchaudio.functional.lfilter was decreased. (#374)
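To show why the runtime of lfilter matters, here is a minimal pure-Python sketch of the IIR difference equation that such a filter evaluates. The function name `iir_filter` is an assumption for this example. The argument order (`a_coeffs` before `b_coeffs`) mirrors the lfilter signature, but this is not torchaudio's code and it handles only a single list of samples.

```python
def iir_filter(samples, a_coeffs, b_coeffs):
    """Apply a direct-form I IIR filter to a list of floats."""
    if len(a_coeffs) != len(b_coeffs):
        raise ValueError("a_coeffs and b_coeffs must have the same length")
    order = len(a_coeffs)
    output = []
    for n in range(len(samples)):
        acc = 0.0
        # Feed-forward part: weighted sum of current and past inputs.
        for k in range(order):
            if n - k >= 0:
                acc += b_coeffs[k] * samples[n - k]
        # Feedback part: subtract weighted past outputs.
        for k in range(1, order):
            if n - k >= 0:
                acc -= a_coeffs[k] * output[n - k]
        # a_coeffs[0] normalizes the result.
        output.append(acc / a_coeffs[0])
    return output


# One-pole smoother: y[n] = 0.5 * x[n] + 0.5 * y[n-1], driven by a unit impulse.
impulse = [1.0, 0.0, 0.0, 0.0]
response = iir_filter(impulse, a_coeffs=[1.0, -0.5], b_coeffs=[0.5, 0.0])
```

Each output sample depends on earlier outputs, so the loop over `n` cannot be trivially parallelized. That sequential dependency is why the per-step cost of the recurrence dominates the runtime.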
Bug Fixes
- We fixed flake8 errors. (#504, #505)
- We fixed Windows test by only testing with cpu-only binaries. (#489)
- We corrected spelling in the docstrings for transforms.FrequencyMasking and transforms.TimeMasking. (#474)
- In .circleci, we switched to use token for conda uploads. (#460)
- The default value of dither parameter was changed. (#453)
- TimeStretch moves device correctly. (#457)
- We added the dev-other option to librispeech. (#433)
- In build script, we install the correct version of pytorch for pip. (#412)
- We upgraded the dataset DeprecationWarning to UserWarning so that the user gets the warning. (#402)
- We made the power of spectrogram a float so that it works with complex norm. (#392)
- We fixed the random seed for the flaky test_griffinlim test. (#388)
- We applied the 'nightly' branch filter to binary uploads. (#385)
- We fixed build errors by adding an explicit utf8 decoration, adding an explicit utf_8_encoder definition if not available, and explicitly casting to int. (#380)
Deprecations
- None