
Releases: pytorch/audio

v0.10.0

21 Oct 15:55

torchaudio 0.10.0 Release Note

Highlights

torchaudio 0.10.0 release includes:

  • New models (Tacotron2, HuBERT) and datasets (CMUDict, LibriMix)
  • Pretrained model support for ASR (Wav2Vec2, HuBERT) and TTS (WaveRNN, Tacotron2)
  • New operations (RNN Transducer loss, MVDR beamforming, PitchShift, etc)
  • CUDA-enabled binaries

[Beta] Wav2Vec2 / HuBERT Models and Pretrained Weights

HuBERT model architectures (“base”, “large” and “extra large” configurations) are added. In addition, support is added for pretrained weights from wav2vec 2.0, Unsupervised Cross-lingual Representation Learning and HuBERT.

These pretrained weights can be used for feature extraction and downstream task adaptation.

>>> import torchaudio
>>>
>>> # Build the model and load pretrained weight.
>>> model = torchaudio.pipelines.HUBERT_BASE.get_model()
>>> # Perform feature extraction.
>>> features, lengths = model.extract_features(waveforms)
>>> # Pass the features to downstream task
>>> ...

Some of the pretrained weights are fine-tuned for ASR tasks. The following example illustrates how to use the weights and access associated information, such as labels, which can be used in subsequent CTC decoding steps. (Note: torchaudio does not provide a CTC decoding mechanism.)

>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.HUBERT_ASR_LARGE
>>>
>>> # Build the model and load pretrained weight.
>>> model = bundle.get_model()
Downloading:
100%|███████████████████████████████| 1.18G/1.18G [00:17<00:00, 73.8MB/s]
>>> # Check the corresponding labels of the output.
>>> labels = bundle.get_labels()
>>> print(labels)
('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')
>>>
>>> # Infer the label probability distribution
>>> waveform, sample_rate = torchaudio.load('hello-world.wav')
>>>
>>> emissions, _ = model(waveform)
>>>
>>> # Pass emission to (hypothetical) decoder
>>> transcripts = ctc_decode(emissions, labels)
>>> print(transcripts[0])
HELLO WORLD
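
Because torchaudio does not ship a CTC decoder, the ctc_decode call above has to be supplied by the user. The following is a minimal greedy sketch in plain Python, not a torchaudio API. It assumes the emission of one utterance has been converted to a list of per-frame score lists (for example with emissions[0].tolist()), and that the first four labels shown above ('<s>', '<pad>', '</s>', '<unk>') are non-text tokens that should be skipped. The function name and the ignore default are assumptions made for illustration.

```python
from itertools import groupby


def greedy_ctc_decode(frames, labels, ignore=(0, 1, 2, 3), word_sep="|"):
    """Greedy CTC decoding for a single utterance (hypothetical helper).

    frames: sequence of per-frame score lists, one score per label.
    labels: sequence of label strings, indexed like the scores.
    ignore: label indices treated as blank/special tokens (assumption).
    """
    # 1. Pick the highest-scoring label index in every frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frames]
    # 2. Collapse runs of the same index into a single occurrence.
    collapsed = [index for index, _ in groupby(best)]
    # 3. Drop blank/special tokens and map the rest to characters.
    text = "".join(labels[i] for i in collapsed if i not in ignore)
    # 4. The label set marks word boundaries with a separator symbol.
    return text.replace(word_sep, " ").strip()
```

Greedy decoding only follows the single best label per frame. A beam search with a language model usually gives better transcripts, at the cost of more code and compute.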

[Beta] Tacotron2 and TTS Pipeline

A new model architecture, Tacotron2, is added, alongside several pretrained weights for TTS (text-to-speech). Because these TTS pipelines are composed of multiple models and specific data processing, the notion of a bundle is introduced to make the associated objects easy to use. Bundles provide a common access point to create a pipeline with a set of pretrained weights. They are available under the torchaudio.pipelines module.
The following example illustrates a TTS pipeline where two models (Tacotron2 and WaveRNN) are used together.

>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
>>>
>>> # Build text processor, Tacotron2 and vocoder (WaveRNN) model
>>> processor = bundle.get_text_preprocessor()
>>> tacotron2 = bundle.get_tacotron2()
Downloading:
100%|███████████████████████████████| 107M/107M [00:01<00:00, 87.9MB/s]
>>> vocoder = bundle.get_vocoder()
Downloading:
100%|███████████████████████████████| 16.7M/16.7M [00:00<00:00, 78.1MB/s]
>>>
>>> text = "Hello World!"
>>>
>>> # Encode text
>>> input, lengths = processor(text)
>>>
>>> # Generate (mel-scale) spectrogram
>>> specgram, lengths, _ = tacotron2.infer(input, lengths)
>>>
>>> # Convert spectrogram to waveform
>>> waveforms, lengths = vocoder(specgram, lengths)
>>>
>>> # Save audio
>>> torchaudio.save('hello-world.wav', waveforms, vocoder.sample_rate)

[Beta] RNN Transducer Loss

The loss function used in the RNN transducer architecture, which is widely used for speech recognition tasks, is added. The loss function (torchaudio.functional.rnnt_loss or torchaudio.transforms.RNNTLoss) supports float16 and float32 logits, has autograd and TorchScript support, and runs on both CPU and GPU. The GPU path uses a custom CUDA kernel implementation for improved performance.
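
To make the computed quantity concrete, here is a small pure-Python reference of the forward (alpha) recursion from the RNN transducer formulation. It is only a sketch for one utterance and is not how torchaudio implements the loss. It assumes the joint network output has already been log-normalized into log_probs[t][u][k], with T time steps, U + 1 target positions and blank at index 0. The name rnnt_loss_reference is hypothetical.

```python
import math


def _logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b)) that tolerates -inf."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_loss_reference(log_probs, targets, blank=0):
    """Negative log-likelihood of `targets` under an RNN transducer lattice.

    log_probs[t][u][k]: log-probability of label k at time t after u targets.
    """
    T, U = len(log_probs), len(targets)
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive from (t-1, u) by emitting blank (advance in time).
            stay = alpha[t - 1][u] + log_probs[t - 1][u][blank] if t > 0 else -math.inf
            # Arrive from (t, u-1) by emitting the next target label.
            emit = alpha[t][u - 1] + log_probs[t][u - 1][targets[u - 1]] if u > 0 else -math.inf
            alpha[t][u] = _logaddexp(stay, emit)
    # Every alignment ends with a final blank from the last lattice node.
    return -(alpha[T - 1][U] + log_probs[T - 1][U][blank])
```

A reference like this is mainly useful for checking a fast implementation on tiny inputs. Its nested Python loops are far too slow for training.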

[Beta] MVDR Beamforming

This release adds support for MVDR beamforming on multi-channel audio using time-frequency masks. Three solutions are available (ref_channel, stv_evd, stv_power), and both single-channel and multi-channel masks are supported (multi-channel masks are averaged inside the method). An online option recursively updates the parameters for streaming audio.
Please refer to the MVDR tutorial.

GPU Build

This release adds GPU builds that support custom CUDA kernels in torchaudio, like the one being used for RNN transducer loss. Following this change, torchaudio’s binary distribution now includes CPU-only versions and CUDA-enabled versions. To use CUDA-enabled binaries, PyTorch also needs to be compatible with CUDA.

Additional Features

torchaudio.functional.lfilter now supports batch processing and multiple filters. Additional operations, including pitch shift, LFCC, and inverse spectrogram, are now supported in this release. The datasets CMUDict and LibriMix are added as well.

Backward Incompatible Changes

I/O

  • Default to PCM_16 for flac on soundfile backend (#1604)
    • When saving FLAC format with “soundfile” backend, PCM_24 (the previous default) could cause warping. The default has been changed to PCM_16, which does not suffer from this.

Ops

  • Default to native complex type when returning raw spectrogram (#1549)
    • When power=None, torchaudio.functional.spectrogram and torchaudio.transforms.Spectrogram now default to return_complex=True, which returns a Tensor of native complex type (such as torch.cfloat and torch.cdouble). To use a pseudo complex type, pass the resulting tensor to torch.view_as_real.
  • Remove deprecated kaldi.resample_waveform (#1555)
    • Please use torchaudio.functional.resample.
  • Replace waveform with specgram in SlidingWindowCmn (#1859)
    • The argument name was corrected to specgram.
  • Ensure integer input frequencies for resample (#1857)
    • Sampling rates were silently cast to integers in the resampling implementation, so it now requires integer sampling rate inputs to ensure expected resampling quality.
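
One way to see why integer rates matter for resampling: a rational resampler works with the ratio of the two rates reduced by their greatest common divisor, which is only well defined for integers. The sketch below validates the inputs and computes that reduced ratio. The helper resample_ratio is hypothetical and illustrates the reasoning only; it is not torchaudio's code.

```python
import math


def resample_ratio(orig_freq, new_freq):
    """Return (orig, new) reduced by their GCD, rejecting non-integer rates."""
    for name, value in (("orig_freq", orig_freq), ("new_freq", new_freq)):
        # bool is a subclass of int, so rule it out explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an int, got {type(value).__name__}")
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
    divisor = math.gcd(orig_freq, new_freq)
    return orig_freq // divisor, new_freq // divisor
```

If a rate such as 44100.0 arrives from elsewhere as a float, converting it explicitly with int(...) at the call site keeps any truncation visible rather than silent.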

Wav2Vec2

  • Update extract_features of Wav2Vec2Model (#1776)
    • The previous implementation returned outputs from convolutional feature extractors. To match the behavior of the original fairseq implementation, the method was changed to return the outputs of the intermediate transformer layers. To achieve the original behavior, please use Wav2Vec2Model.feature_extractor().
  • Move fine-tune specific module out of wav2vec2 encoder (#1782)
    • The internal structure of Wav2Vec2Model was updated. The Wav2Vec2Model.encoder.read_out module is moved to Wav2Vec2Model.aux. If you have a serialized state dict, please replace the key encoder.read_out with aux.
  • Updated wav2vec2 factory functions for more customizability (#1783, #1804, #1830)
    • The signatures of wav2vec2 factory functions are changed. The num_out parameter has been renamed to aux_num_out, and other parameters are added before it. Please update the code from wav2vec2_base(num_out) to wav2vec2_base(aux_num_out=num_out).
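
The state dict key replacement described for #1782 can be sketched as below. It assumes the serialized keys start with the prefix "encoder.read_out." (for example "encoder.read_out.weight"). The helper name is hypothetical, and plain dictionaries stand in for a real state dict.

```python
from collections import OrderedDict


def migrate_wav2vec2_state_dict(state_dict):
    """Return a copy with the `encoder.read_out.` prefix renamed to `aux.`."""
    old_prefix, new_prefix = "encoder.read_out.", "aux."
    migrated = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(old_prefix):
            # Keep the parameter name (e.g. "weight"), swap only the prefix.
            key = new_prefix + key[len(old_prefix):]
        migrated[key] = value
    return migrated
```

The result can then be passed to load_state_dict of a model built with the updated factory functions.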

Deprecations

  • Add melscale_fbanks and deprecate create_fb_matrix (#1653)
    • As linear_fbanks is introduced, create_fb_matrix is renamed to melscale_fbanks. The original create_fb_matrix is now deprecated. Please use melscale_fbanks.
  • Deprecate VCTK dataset (#1810)
    • This dataset has been taken down and is no longer available. Please use VCTK_092 dataset.
  • Deprecate data utils (#1809)
    • bg_iterator and diskcache_iterator are known to not improve the throughput of data loaders. Please cease their usage.

New Features

Models

Tacotron2

  • Add Tacotron2 model (#1621, #1647, #1844)
  • Add Tacotron2 loss function (#1764)
  • Add Tacotron2 inference method (#1648, #1839, #1849)
  • Add phoneme text preprocessing for Tacotron2 (#1668)
  • Move Tacotron2 out of prototype (#1714)

HuBERT

Pretrained Weights and Pipelines

  • Add pretrained weights for wavernn (#1612)

  • Add Tacotron2 pretrained models (#1693)

  • Add HUBERT pretrained weights (#1821, #1824)

  • Add pretrained weights from wav2vec2.0 and XLSR papers (#1827)

  • Add customization support to wav2vec2 labels (#1834)

  • Default pretrained weights to eval mode (#1843)

  • Move wav2vec2 pretrained models to pipelines module (#1876)

  • Add TTS bundle/pipelines (#1872)

  • Fix vocoder interface (#1895)

  • Fix Phonemizer download (#1897)


RNN Transducer Loss

  • Add reduction parameter for RNNT loss (#1590)

  • Rename RNNT loss C++ parameters (#1602)

  • Rename transducer to RNNT (#1603)

  • Remove gradient variable from RNNT loss Python code (#1616)

  • Remove reuse_logits_for_grads option for RNNT loss (#1610)

  • Remove fused_log_softmax option from RNNT loss (#1615)

  • RNNT loss resolve null gradient (#1707)

  • Move RNNT loss out of prototype (#1711)


MVDR Beamforming

  • Add MVDR module to example (#1709)

  • Add normalization to steering vector solutions in MVDR Module (#1765)

  • Move MVDR and PSD modules to transforms (#1771)

  • Add MVDR beamforming tutorial to example directory (#1768)


Ops

  • Add edit_distance (#1601)

  • Add PitchShift to functional and transform (#1629)

  • Add LFCC feature to transforms (#1611)

  • Add InverseSpectrogram to transforms and functional (#1652)


Datasets

  • Add CMUDict dataset (#1627)

  • Move LibriMix dataset to datasets directory (#1833)


Improvements

I/O

  • Make buffer size for function info configurable (#1634)


Ops

  • Replace deprecated AutoNonVariableTypeMode (#1583)

  • Remove lazy behavior from MelScale (#163...

torchaudio 0.9.1 Minor bugfix release

27 Sep 04:42
a85b239

This release depends on pytorch 1.9.1
No functional changes other than minor updates to CI rules.

v0.9.0

15 Jun 15:32

torchaudio 0.9.0 Release Note

Highlights

torchaudio 0.9.0 release includes:

  • Lots of performance improvements. (filtering, resampling, spectral operation)
  • Popular wav2vec2.0 model architecture.
  • Improved autograd support.

[Beta] Wav2Vec2.0 Model

This release includes model architectures from the wav2vec2.0 paper, with utility functions that allow importing pretrained model parameters published on fairseq and Hugging Face Hub. Now you can easily run speech recognition with torchaudio. These model architectures also support TorchScript, and you can deploy them with ONNX or in non-Python environments, such as C++, Android and iOS. Please check out our C++, Android and iOS examples. The following snippets illustrate how to create a deployable model.

# Import fine-tuned model from Hugging Face Hub
from transformers import Wav2Vec2ForCTC
from torchaudio.models.wav2vec2.utils import import_huggingface_model

original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
imported = import_huggingface_model(original)

# Import fine-tuned model from fairseq
import fairseq
from torchaudio.models.wav2vec2.utils import import_fairseq_model

# load_model_ensemble_and_task returns a list of models as its first element
original, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    ["wav2vec_small_960h.pt"], arg_overrides={'data': "<data_dir>"})
imported = import_fairseq_model(original[0].w2v_encoder)

# Build uninitialized model and load state dict
from torchaudio.models import wav2vec2_base

model = wav2vec2_base(num_out=32)
model.load_state_dict(imported.state_dict())

# Quantize / script / optimize for mobile
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

quantized_model = torch.quantization.quantize_dynamic(
    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
scripted_model = torch.jit.script(quantized_model)
optimized_model = optimize_for_mobile(scripted_model)
optimized_model.save("model_for_deployment.pt")

Filtering Improvement

The internal implementation of lfilter has been updated to support autograd on both CPU and CUDA. Additionally, the performance on CPU is significantly improved. These improvements also apply to biquad variants.

The following table illustrates the performance improvements compared against the previous releases. lfilter was applied on float32 tensors with one channel and different number of frames.

torchaudio version    256      512      1024
0.9                   0.282    0.381    0.564
0.8                   0.493    0.780    1.37
0.7                   5.42     10.8     22.3

Columns: number of frames. Unit: msec
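
To read the lfilter table as speedup factors, the timings can be divided directly. The snippet below copies the numbers from the table above; the dictionary layout and the speedup helper are illustrative only.

```python
# lfilter timings in milliseconds, copied from the table above.
# Outer key: torchaudio version, inner key: number of frames.
timings_ms = {
    "0.9": {256: 0.282, 512: 0.381, 1024: 0.564},
    "0.8": {256: 0.493, 512: 0.780, 1024: 1.37},
    "0.7": {256: 5.42, 512: 10.8, 1024: 22.3},
}


def speedup(baseline, improved, frames):
    """How many times faster `improved` is than `baseline` for a given length."""
    return timings_ms[baseline][frames] / timings_ms[improved][frames]


for frames in (256, 512, 1024):
    print(
        f"{frames:>5} frames: "
        f"{speedup('0.8', '0.9', frames):.1f}x vs 0.8, "
        f"{speedup('0.7', '0.9', frames):.1f}x vs 0.7"
    )
```

For both baselines the ratio grows with the number of frames, so longer inputs benefit most from the new implementation.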

Complex Tensor Migration

torchaudio has functions that handle complex-valued tensors. In the early days, when PyTorch did not have a complex dtype, torchaudio adopted the convention of using an extra dimension to represent real and imaginary parts. In PyTorch 1.6, new dtypes, such as torch.cfloat and torch.cdouble, were introduced to represent complex values natively. (In the following, we refer to torchaudio’s original convention as pseudo complex types, and PyTorch’s native dtype as native complex types.)

As the native complex types have become mature and stable, torchaudio has started to migrate complex functions to use the native complex type. In this release, the internal implementation was updated to use the native complex types, and interfaces were updated to allow passing/receiving native complex type directly. Users can choose to keep using the pseudo complex type or opt in to use native complex type. However, please note that the use of the pseudo complex type is now deprecated. These functions are tested to support TorchScript and autograd. For details of this migration plan, please refer to #1337.

Additionally, switching the internal implementation to the native complex types improved the performance. Since the internal implementation uses native complex type regardless of which complex type is passed/returned, users will automatically benefit from this performance improvement.
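
The difference between the two layouts can be illustrated without tensors. The following plain-Python analogy is not torchaudio code and uses hypothetical helper names. It treats a pseudo complex value as a trailing [real, imag] pair and a native complex value as a Python complex. In PyTorch, the corresponding tensor-level conversions are torch.view_as_complex and torch.view_as_real.

```python
def pseudo_to_native(pairs):
    """[[re, im], ...] (extra trailing dimension of size 2) -> [complex, ...]."""
    return [complex(re, im) for re, im in pairs]


def native_to_pseudo(values):
    """[complex, ...] -> [[re, im], ...], the legacy pseudo complex layout."""
    return [[z.real, z.imag] for z in values]


pseudo = [[1.0, 2.0], [0.5, -0.25]]
native = pseudo_to_native(pseudo)
# Arithmetic works directly on the native form, e.g. the magnitude of each bin.
magnitudes = [abs(z) for z in native]
```

With the pseudo layout every operation has to index the last dimension by hand, which is part of why the native type is easier to work with.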

The following table illustrates the performance improvements from the previous release by comparing the time it takes for complex transforms to perform operation on float32 Tensor with two channels and 256 frames.

CPU
torchaudio version    Spectrogram    TimeStretch    GriffinLim
0.9                   0.229          12.6           3320
0.8                   0.283          126            5320

Unit: msec

CUDA
torchaudio version    Spectrogram    TimeStretch    GriffinLim
0.9                   0.195          0.599          36
0.8                   0.219          0.687          60.2

Unit: msec

Improved Autograd Support

Along with the work of Complex Tensor Migration and Filtering Improvement mentioned above, more tests were added to ensure the autograd support. Now the following operations are guaranteed to support autograd up to second order.

Functionals
  • lfilter
  • allpass_biquad
  • biquad
  • band_biquad
  • bandpass_biquad
  • bandreject_biquad
  • bass_biquad
  • equalizer_biquad
  • treble_biquad
  • highpass_biquad
  • lowpass_biquad
Transforms
  • AmplitudeToDB
  • ComputeDeltas
  • Fade
  • GriffinLim
  • TimeMasking
  • FrequencyMasking
  • MFCC
  • MelScale
  • MelSpectrogram
  • Resample
  • SpectralCentroid
  • Spectrogram
  • SlidingWindowCmn
  • TimeStretch*
  • Vol

NOTE:

  1. Autograd test for transforms also covers the following functionals.
    • amplitude_to_DB
    • spectrogram
    • griffinlim
    • resample
    • phase_vocoder*
    • mask_along_axis_iid
    • mask_along_axis
    • gain
    • spectral_centroid
  2. torchaudio.transforms.TimeStretch and torchaudio.functional.phase_vocoder call atan2, which is not differentiable around zero. Therefore these functions are differentiable only when the input spectrogram does not contain values around zero.

[Beta] Resampling Improvement

In release 0.8, the resampling operation was vectorized and its performance improved. In this release, the implementation of the resampling algorithm has been further revised.

  • Kaiser window has been added for a wider range of resampling quality.
  • rolloff parameter has been added for anti-aliasing control.
  • torchaudio.transforms.Resample precomputes the kernel using float64 precision and caches it for even faster operation.
  • New entry point, torchaudio.functional.resample has been added and the original entry point, torchaudio.compliance.kaldi.resample_waveform is deprecated.

The following table illustrates the performance improvements from the previous release by comparing the time it takes for torchaudio.transforms.Resample to complete the operation on float32 tensor with two channels and one-second duration.

CPU
torchaudio version    8k → 16k [Hz]    16k → 8k    16k → 44.1k    44.1k → 16k
0.9                   0.192            0.559       0.478          0.467
0.8                   0.537            0.753       43.9           17.6

Unit: msec

CUDA
...
torchaudio version 8k → 16k 16k → 8k 16k → 44.1k 44.1k → 16k

v0.8.1

25 Mar 16:30
e4e171a

Highlights

This release depends on pytorch 1.8.1.

Bug Fixes

  • Added back support for 24-bit signed LPCM wav via sox_io backend. (#1389)

v0.8.0

04 Mar 20:43
099d788

Highlights

This release supports Python 3.9.

I/O Improvements

Continuing from the previous release, torchaudio improves the audio I/O mechanism. In this release, we have four major updates.

  1. Backend migration.
    We have migrated the default backend for audio I/O. The new default backend is “sox_io” (for Linux/macOS). The interface of the “soundfile” backend has also been changed to align with that of “sox_io”. Following the change of default backends, the legacy backend/interface have been marked as deprecated. They are still accessible, though their use is strongly discouraged. For details on the migration, please refer to #903.

  2. File-like object support.
    We have added file-like object support to I/O functions and sox_effects. You can perform the info, load, save and apply_effects_file operations on file-like objects.

    # Query audio metadata over HTTP
    # Will only fetch the first few kB
    with requests.get(URL, stream=True) as response:
      metadata = torchaudio.info(response.raw)
    
    # Load audio from tar file
    # No need to extract TAR file.
    with tarfile.open(TAR_PATH, mode='r') as tarfile_:
      fileobj = tarfile_.extractfile(SAMPLE_TAR_ITEM)
      waveform, sample_rate = torchaudio.load(fileobj)
    
    # Saving to Bytes buffer
    # Using BytesIO, you can perform in-memory encoding/decoding.
    buffer_ = io.BytesIO()
    torchaudio.save(buffer_, waveform, sample_rate, format="wav")
    
    # Apply effects (lowpass filter / resampling) while loading audio from S3
    client = boto3.client('s3')
    response = client.get_object(Bucket=S3_BUCKET, Key=S3_KEY)
    waveform, sample_rate = torchaudio.sox_effects.apply_effects_file(
      response['Body'], [["lowpass", "-1", "300"], ["rate", "8000"]])
  3. [Beta] Codec Application.
    Built upon the file-like object support, we added the functional.apply_codec function, which can degrade audio data in memory by applying audio codecs supported by the “sox_io” backend.

    # Apply MP3 codec
    degraded = F.apply_codec(
      waveform, sample_rate, format="mp3", compression=-9)
    # Apply GSM codec
    degraded = F.apply_codec(waveform, sample_rate, format="gsm")
  4. Encoding options.
    We have added encoding options to the save function of the new backends. Now you can change the format and encoding with the format, encoding and bits_per_sample options.

    # Save without any encoding option.
    # The function will pick the encoding which the provided data fit
    # For Tensor of float32 type, that is 32-bit floating-point PCM.
    torchaudio.save("data.wav", waveform, sample_rate)
    
    # Save as 16-bit signed integer Linear PCM
    # The resulting file occupies half the storage but loses precision
    torchaudio.save(
      "data.wav", waveform, sample_rate, encoding="PCM_S", bits_per_sample=16)
  5. More format support to "sox_io"’s save function.
    We have added support for GSM, HTK, AMB, and AMR-NB formats to "sox_io"’s save function.
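
The "half the storage" remark in the encoding example above follows from simple arithmetic on uncompressed PCM data. The sketch below computes the audio payload size only (the container header is ignored). The helper name is hypothetical.

```python
def pcm_payload_bytes(num_frames, num_channels, bits_per_sample):
    """Size in bytes of uncompressed PCM samples, excluding any file header."""
    if bits_per_sample % 8 != 0:
        raise ValueError("bits_per_sample must be a multiple of 8")
    return num_frames * num_channels * (bits_per_sample // 8)


# One second of mono audio at 16 kHz.
float32_size = pcm_payload_bytes(16000, 1, 32)  # 32-bit floating-point PCM
int16_size = pcm_payload_bytes(16000, 1, 16)    # 16-bit signed integer PCM
```

Here float32_size is exactly twice int16_size, which is the trade-off between file size and precision that the bits_per_sample option controls.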

Switch to CMake-based build

torchaudio was utilizing CMake to build third party dependencies. Now torchaudio uses CMake to build its C++ extension. This will open the door to integrate torchaudio in non-Python environments (such as C++ applications and mobile). We will work on adding example applications and mobile integrations in upcoming releases.

Backwards Incompatible Changes

  • Removed deprecated transform and target_transform arguments from VCTK and YESNO datasets. (#1120) If you were relying on the previous behavior, we recommend that you apply the transforms in the collate function.
  • Removed torchaudio.datasets.utils.walk_files (#1111) and replaced by Path and glob. (#1069, #1101). If you relied on the function, we recommend that you use glob instead.
  • Removed torchaudio.data.utils.unicode_csv_reader. (#1086) If you relied on the function, we recommend that you replace it with csv.reader.
  • Disabled CommonVoice download as users are required to sign user agreement. Please download and extract the dataset manually, and replace the root argument by the subfolder for the version and language of interest, see #1082 for more details. (#1018, #1079, #1080, #1082)
  • Removed legacy sox effects (#977, #1001). Please migrate to apply_effects_file or apply_effects_tensor.
  • Switched the default backend to the ones with new interfaces (#978). If you were relying on the previous behavior, you can return to the previous behavior by following instructions in #975 for one more release.
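
For the two removed helpers above, standard-library replacements can look like the following sketch. The function names find_files and read_rows are hypothetical, and the sorted traversal order and UTF-8 encoding are assumptions that may need adjusting for a given dataset.

```python
import csv
from pathlib import Path


def find_files(root, suffix):
    """Yield files below `root` whose names end with `suffix`, in sorted order."""
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        if path.is_file():
            yield path


def read_rows(csv_path, delimiter=","):
    """Yield rows of a UTF-8 encoded CSV file as lists of strings."""
    # newline="" lets the csv module handle line endings itself.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        yield from csv.reader(handle, delimiter=delimiter)
```

Both helpers are generators, so large directory trees and metadata files are processed lazily.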

New Features

  • Added GSM, HTK, AMB, AMR-NB and AMR-WB format support to “sox_io” backend. (#1276, #1291, #1277, #1275, #1066)
  • Added encoding options (format, bits_per_sample and encoding) to save function. (#1226, #1177, #1129, #1104)
  • Added new attributes (bits_per_sample and encoding) to the info function return type (AudioMetaData) (#1177, #1206, #1324)
  • Added format override to libsox-based file input. (load, info, sox_effects.apply_effects_file) (#1104)
  • Added file-like object support in “sox_io”, and “soundfile” backend and sox_effects.apply_effects_file. (#1115)
  • [Beta] Added the Kaldi Pitch feature. (#1243, #1260)
  • [Beta] Added the SpectralCentroid transform. (#1167, #1216, #1316)
  • [Beta] Added codec transformation apply_codec. (#1200)

Improvements

  • Exposed normalization method to Mel transforms. (#1212)
  • Exposed additional STFT arguments to Spectrogram (#892) and to MelSpectrogram (#1211).
  • Added support for pathlib.Path to apply_effects_file (#1048) and to CMUARCTIC (#1025), YESNO (#1015), COMMONVOICE (#1027), VCTK and LJSPEECH (#1028), GTZAN (#1032), SPEECHCOMMANDS (#1039), TEDLIUM (#1045), LIBRITTS and LIBRISPEECH (#1046).
  • Added SpeechCommands train/valid/test split. (#966, #1012)

Internals

  • Replaced if-elseif-else with switch in sox C++ code. (#1270)
  • Refactored C++ interface for sox_io's get_info_file (#1232) and get_encodinginfo (#1233).
  • Add explicit functional import in init. (#1228)
  • Refactored YESNO dataset (#1127), LJSPEECH dataset (#1143).
  • Removed Python 2.7 reference from setup.py. (#1182)
  • Merged flake8 configurations into single .flake8 file. (#1172, #1214)
  • Updated calls to torch.stft to use return_complex=True. (#1096, #1013)
  • Cleaned up handling of optional args in C++ with c10::optional. (#1043)
  • Removed unused imports in sox effects. (#1052)
  • Introduced functional submodule to organize functionals. (#1003)
  • [Testing] Refactored MelSpectrogram librosa compatibility test to decouple from other tests. (#1267)
  • [Testing] Moved batch tests for functionals. (#1254)
  • [Testing] Refactored tests for backend (#1239) and for functionals (#1237).
  • [Testing] Removed dependency on pytest from testing (#1157, #1188)
  • [Testing] Refactored unit tests for VCTK (#1134), SPEECHCOMMANDS (#1136), LIBRISPEECH (#1140), TEDLIUM (#1135), LJSPEECH (#1138), LIBRITTS (#1139), CMUARCTIC (#1147), GTZAN (#1148), COMMONVOICE and YESNO (#1133).
  • [Testing] Removed dependency on COMMONVOICE dataset from tests. (#1132)
  • [Build] Fixed Python 3.9 support (#1242)
  • [Build] Switched to cmake for build. (#1187, #1246, #1249)
  • [Build] Restructured C++ code to allow per file registration of custom ops. (#1221)
  • [Build] Added logging to sox/CMakeLists.txt. (#1190)
  • [Build] Disabled C++11 ABI when necessary for libtorch compatibility. (#880)
  • [Build] Reorganized libsox source and build directory to accommodate additional third party code. (#1161, #1176)
  • [Build] Refactored sox source files and moved into dedicated subfolder. (#1106)
  • [Build] Enabled custom clean function for python setup.py clean. (#1142)
  • [CI] Documented undocumented parameters. Added CI check. (#1248)
  • [CI] Fixed sphinx warnings in documentation. Turned warnings into errors. (#1247)
  • [CI] Print CPU info before running unit test. (#1218)
  • [CI] Fixed clang-format job and fixed newly detected formatting issues. (#981, #1198, #1222)
  • [CI] Updated unit test base Docker image. (#1193)
  • [CI] Disabled CCI cache which is now known to be flaky. (#1189)
  • [CI] Disabled torchscript BC test which is known to fail. (#1192)
  • [CI] Stripped version suffix for pytorch. (#1185)
  • [CI] Ran smoke test with CPU package for pytorch due to known issue with CUDA 11. (#1105)
  • [CI] Added missing empty line at the end of config.yml. (#1020)
  • [CI] Added automatic documentation build and push to branch in CI. (#1006, #1034, #1041, #1049, #1091, #1093, #1098, #1100, #1121)
  • [CI] Ran GPU test for all pull requests and fixed current setup. (#998, #1014, #1191)
  • [CI] Skipped tests that are known to fail on macOS Python 3.6/3.7. (#999)
  • [CI] Changed the order of installation and aligned with Windows. (#987)
  • [CI] Fixed documentation rendering by using Sphinx 2.4.4. (#974)
  • [Doc] Added subcategories to functional documentation. (#1325)
  • [Doc] Added a version selector in documentation. (#1273)
  • [Doc] Updated compilation recommendation in README. (#1263)
  • [Doc] Added CONTRIBUTING.md. (#1241)
  • [Doc] Added instructions to install parametrized package. (#1164)
  • [Doc] Fixed the return type for load functions. (#1122)
  • [Doc] Added missing modules and minor fixes. (#1022, #1056, #1117)
  • [Doc] Fixed spelling and links in README. (#1029, #1037, #1062, #1110, #1261)
  • [Doc] Grouped filtering functionals in documentation page. (#1005, #1004)
  • [Doc] Updated the compatibility matrix with torchaudio 0.7 (#979)
  • [Doc] Added description of prototype/beta/stable features. (#968)

Bug Fixes

  • Fixed amplitude_to_DB clamping behaviour on batches. (#1113)
  • Disabled audio devices in sox builds which could interfere in the build process when detected. (#1153)
  • Fixed COMMONVOICE for French where the audio file extension was missing on load. (#1126)
  • Disabled OpenMP support for libsox which can produce errors when used i...

v0.7.2

10 Dec 17:02
a853dff

Highlights

This release introduces support for python 3.9. There is no 0.7.1 release, and the following changes are compared to 0.7.0.

Improvements

  • Add python 3.9 support (#1061)

Bug Fixes

  • Temporarily disable OpenMP support for libsox (#1054)

Deprecations

  • Disallow download=True in CommonVoice (#1076)

v0.7.0

27 Oct 16:17
ac17b64

Highlights

Example Pipelines

torchaudio is expanding its support for models and end-to-end applications. Please file an issue on github to provide feedback on them.

  • Speech Recognition: Building on the addition of the Wav2Letter model for speech recognition in the last release, we added an example training pipeline for speech recognition that uses the LibriSpeech dataset.
  • Text-to-Speech: With the goal of supporting text-to-speech applications, we added a vocoder based on the WaveRNN model. WaveRNN model is based on the implementation from this repository. The original implementation was introduced in "Efficient Neural Audio Synthesis". We provide an example training pipeline in the example folder that uses the LibriTTS dataset added to torchaudio in this release.
  • Source Separation: We also support source separation with the addition of the ConvTasNet model, based on the paper "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation." An example training pipeline is provided with the wsj0-mix dataset.

I/O Improvements

As you are likely aware from the last release, we are in the process of making sox_io the new default backend. It ships with new features such as TorchScript support and performance improvements. If you want to benefit from these features now, we encourage you to migrate. For more information, see issue #903.

Backwards Incompatible Changes

  • Switched all %-based string formatting to str.format to adopt changes in PyTorch, leading to improved error messages for TorchScript (#850)
  • Split sox_utils.list_formats() for read and write (#811)
  • Made directory traversal order alphabetical and breadth-first, consistent across operating systems (#814)
  • Changed GTZAN so that it only traverses filenames belonging to the dataset (#791)

New Features

  • Added ConvTasNet model (#920, #933) with pipeline (#894)
  • Added canonical pipeline with wav2letter (#632)
  • The WaveRNN model (#705, #797, #801, #810, #836) is available with a canonical pipeline (#749, #802, #831, #863)
  • Added all 3 releases of tedlium dataset (#882, #934, #945, #895)
  • Added VCTK_092 dataset (#812)
  • Added LibriTTS (#790, #820)
  • Added SPHERE support to sox_io backend (#871)
  • Added torchscript sox effects (#760)
  • Added a flag to change the interface of soundfile backend to the one identical to sox_io backend. (#922)

Improvements

  • Added soundfile compatibility backend. (#922)
  • Improved the speed of torchaudio.compliance.kaldi.fbank (#947)
  • Improved the speed of phaser (#660)
  • Added warning when a Mel filter is all zero (#914)
  • Added pathlib.Path support to sox_io backend (#907)
  • Simplified C++ registration with TORCH_LIBRARY (#840)
  • Merged sox effect and sox_io C++ implementation (#779)

Internal

  • CI: Added test to validate torchscript backward compatibility (#838)
  • CI: Used mocked datasets to test CMUArctic (#829), CommonVoice (#827), Speech Commands (#824), LJSpeech (#826), LibriSpeech (#825), YESNO (#792, #832)
  • CI: Made *nix unit test fail if C++ extension is not available (#847, #849)
  • CI: Separated I/O in testing. (#813, #773, #783)
  • CI: Added smoke tests to sox_io and sox_effects (#806)
  • CI: Tested utilities have been refactored (#805, #808, #809, #817, #822, #831)
  • Doc: Added how to run tests (#843)
  • Doc: Added 0.6.0 to version matrix in README (#833)

Bug Fixes

  • Fixed device in interactive ASR example (#900)
  • Fixed incorrect extension parsing (#885)
  • Fixed dither with noise_shaping = True (#865)
  • Run unit test with non-editable installation (#845), and set zip_safe = False to disable egg installation (#842)
  • Sorted GTZAN dataset and use on-the-fly data in GTZAN test (#819)

Deprecations

  • Removed istft wrapper in favor of torch.istft. (#841)
  • Deprecated SoxEffect and SoxEffectsChain (#787)
  • I/O: Deprecated sox backend. (#904)
  • I/O: Deprecated the current interface of soundfile. (#922)
  • I/O: Deprecated load_wav functions. (#905)

v0.6.0

28 Jul 15:19
f17ae39

Highlights

torchaudio now includes a new model module (with wav2letter included), new functionals (contrast, cvm, dcshift, overdrive, vad, phaser, flanger, biquad), datasets (GTZAN, CMU), and a new optional sox backend with support for torchscript. torchaudio now also supports Windows, with the soundfile backend.

torchaudio requires python 3.6 or more recent.

Backwards Incompatible Changes

  • We reorganized the C++ resources (#630) and replaced C++ bindings for sox_effects init/list/shutdown with torch binding (#748).
  • We removed code specific to python 2 (#691), and we no longer test against python 2 (#575) and 3.5 (#577).

New Features

  • We now support Windows. (#604, #637, #642, #655, #743)
  • We now have a model module which includes wav2letter. (#462, #722)
  • We added the GTZAN and CMU datasets. (#668, #710)
  • We now have the contrast functional (#551), cvm (#540), dcshift (#558), overdrive (#569), vad (#578, #599), phaser (#587, #607, #702), flanger (#651, #702), biquad (#661).
  • We added a new sox_io backend (#718, #728, #734, #727, #763, #752, #731, #732, #726, #780) that is compatible with torchscript with a new AudioMetaData class (#761).
  • MelSpectrogram now has power and normalized parameters (#633), and slaney normalization (#589, #641).
  • lfilter now has a clamp option. (#600)
  • Griffin-Lim can now have zero momentum. (#601)
  • sliding_window_cmn now supports batching. (#570)
  • Downloaded datasets now verify checksums. (#499)
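
As an illustration of the checksum verification mentioned in the last item, here is a generic sketch using only the Python standard library. It is not torchaudio's implementation: the helper names file_sha256 and verify_checksum are hypothetical, and the choice of SHA-256 is an assumption, since these notes do not say which hash is used.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """Hypothetical helper: raise if the file does not match the expected digest."""
    actual = file_sha256(path)
    if actual != expected_hex:
        raise RuntimeError(
            f"Checksum mismatch for {path}: expected {expected_hex}, got {actual}"
        )


# Demonstration on a throwaway file standing in for a downloaded archive.
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "dataset.tar.gz")
    payload = b"fake archive contents"
    with open(archive, "wb") as f:
        f.write(payload)

    # A matching digest passes silently.
    verify_checksum(archive, hashlib.sha256(payload).hexdigest())
    checksum_ok = True

    # A wrong digest is reported as an error.
    try:
        verify_checksum(archive, "0" * 64)
        mismatch_detected = False
    except RuntimeError:
        mismatch_detected = True
```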

Improvements

  • We added ogg/vorbis/opus support to binary distribution (#750, #755).
  • We replaced the use of torch.norm in spectrogram to improve performance (#747).
  • We now use fused operations in lfilter for faster computation. (#517, #564)
  • STFT is now called directly from torchaudio. (#531)
  • We redesigned the backend mechanism to support torchscript, by restructuring the code (#695, #696, #700, #706, #707, #698) and adding dynamic listing (#697).
  • torchaudio can be built along with sox, or can use external sox. (#625, #669, #739)
  • We redesigned the sox_effects module. (#708)
  • We added more details to compilation instructions. (#667)
  • We updated the README with instructions on changing the backend. (#553)
  • We now have a version compatibility matrix in README. (#685)
  • We now use cmake to build third party libraries (#753).
  • We now use CircleCI instead of travis (#576, #584, #598, #603, #636, #738) and we test on GPU (#586, #777).
  • We run the test suite against nightlies. (#538, #678)
  • We redesigned our test suite with new helper functions (#514, #519, #521, #565, #616, #690, #692, #694), standard pytorch test utilities (#513, #640, #643, #645, #646, #652, #650, #712), separate CPU and GPU tests (#513, #528, #644), more descriptive names (#532), clearer organization (#539, #541, #542, #664, #672, #687, #703, #716, #732), standardized names (#559), and backend awareness (#719). This is detailed in a new README for testing (#566, #759).
  • We now support typing, for datasets (#511, #522), for backends (#527), for init (#526), and inline (#530), with mypy configuration (#524, #544, #590).

Bug Fixes

  • We removed in place operations so that Griffin-Lim can be backpropagated through. (#730)
  • We fixed kaldi MFCC on GPU. (#681)
  • We removed multiple definitions of SoxEffect in C++. (#635)
  • We fixed the docstring of masking. (#612)
  • We replaced views by reshape for batching. (#594)
  • We fixed missing conda environment when testing in python 3.8. (#582)
  • We ensured that sox is not exposed on Windows. (#579)
  • We corrected the instructions to install nightlies. (#547, #552)
  • We fixed the seed of mask_along_iid. (#529)
  • We correctly report GPU tests as skipped instead of passed. (#516)
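
The batching fix listed above (replacing views with reshape) addresses a general PyTorch behavior. The sketch below is not the torchaudio code itself: it uses a made-up tensor to show that view raises an error on a non-contiguous input when its strides do not allow the requested shape, while reshape falls back to copying the data.

```python
import torch

# A made-up batch; the transpose makes it non-contiguous in memory.
batch = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4).transpose(1, 2)

# Flattening the leading dimensions with view fails for this layout.
try:
    batch.view(-1, batch.shape[-1])
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape handles the same input by copying when a view is not possible.
flat = batch.reshape(-1, batch.shape[-1])

# After processing, the original batch shape can be restored.
restored = flat.reshape(batch.shape)
```

Flattening the leading dimensions, applying an operation, and restoring the shape is a common way to add batch support to an operation written for 2D input.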

Deprecations

  • Since sox_effects is now automatically initialized and shut down (#572, #693), we are deprecating these functions (#709).
  • ISTFT is migrating to torch. (#523)

v0.5.1

22 Jun 18:17
7143479

Highlights

  • Updated pinned version of PyTorch to v1.5.1

v0.5.0

21 Apr 14:55

Highlights

torchaudio includes new transforms (e.g. Griffin-Lim and inverse Mel scale), new filters (e.g. all pass, fade, band pass/reject, band, treble, deemph, riaa), and datasets (LJ Speech and SpeechCommands).

Backwards Incompatible Changes

  • torchaudio no longer supports python 2. We removed future and six imports. We added inline typing. (#413, #478, #479, #482, #486)
  • We fixed CommonVoice dataset download, and updated to the latest version. (#498)
  • We now skip data points with missing data in the VCTK dataset. (#484)

New Features

  • We now have the Vol transform and DB_to_amplitude. (#468, #469)
  • We now have the InverseMelScale transform. (#448)
  • We now have the Griffin-Lim functional. (#365)
  • We now support allpass, fade, bandpass, bandreject, band, treble, deemph, riaa. (#444, #449, #464, #470, #508)
  • We now offer LJSpeech and SpeechCommands datasets. (#439, #437)

Improvements

  • We added inline typing to SoxEffects and Kaldi compliance. (#490, #497)
  • We refactored the tests. (#480, #485, #496, #491, #501, #502, #503, #506, #507, #509)
  • We now run tests with sox only when sox is available. (#419)
  • We extended batch support to MelScale, MelSpectrogram, MFCC, Resample. (#391, #435)
  • The speed of torchaudio.functional.istft was improved. (#471)
  • We now have transform and functional tests for AmplitudeToDB. (#463)
  • We now ignore pycharm and OSX files in git. (#461)
  • TimeStretch now has a batch test. (#459)
  • Docstrings in transforms were polished. (#442)
  • TimeStretch and AmplitudeToDB are now torch.nn.Module. (#456)
  • Resample is now jitable. (#441)
  • We support python 3.8. (#397)
  • We added a CUDA test for complex norm. (#421)
  • Dither is jitable with the latest version of pytorch. (#417)
  • Batching uses view instead of reshape. (#409)
  • We refactored the jitability test. (#395)
  • In .circleci, we removed a conditional block that wasn't doing anything. (#399)
  • We now have Windows CI for building. (#394 and #398)
  • We corrected the use of standard variable names in code. (#393)
  • We adopted native-Python code generation convention. (#378)
  • torchaudio.istft creates tensors directly on device. (#377)
  • torchaudio.compliance.kaldi.resample_waveform is now jitable. (#362)
  • The runtime of torchaudio.functional.lfilter was decreased. (#374)

Bug Fixes

  • We fixed flake8 errors. (#504, #505)
  • We fixed Windows test by only testing with cpu-only binaries. (#489)
  • Spelling correction in docstrings for transforms.FrequencyMasking and transforms.TimeMasking. (#474)
  • In .circleci, we switched to use token for conda uploads. (#460)
  • The default value of dither parameter was changed. (#453)
  • TimeStretch moves device correctly. (#457)
  • Adding dev-other option in librispeech. (#433)
  • In build script, we install the correct version of pytorch for pip. (#412)
  • Upgrading dataset DeprecationWarning to UserWarning so that the user gets the warning. (#402)
  • Make power of spectrogram a float to work with complex norm. (#392)
  • Fix random seed for flaky test_griffinlim test. (#388)
  • Apply 'nightly' branch filter to binary uploads. (#385)
  • Fixed build errors: added an explicit utf8 decoration, added an explicit utf_8_encoder definition if not available, and explicitly cast to int. (#380)

Deprecations

  • None