v0.10.0

@carolineechen carolineechen released this 21 Oct 15:55

torchaudio 0.10.0 Release Note

Highlights

torchaudio 0.10.0 release includes:

  • New models (Tacotron2, HuBERT) and datasets (CMUDict, LibriMix)
  • Pretrained model support for ASR (Wav2Vec2, HuBERT) and TTS (WaveRNN, Tacotron2)
  • New operations (RNN Transducer loss, MVDR beamforming, PitchShift, etc)
  • CUDA-enabled binaries

[Beta] Wav2Vec2 / HuBERT Models and Pretrained Weights

HuBERT model architectures (“base”, “large” and “extra large” configurations) are added. In addition, support is added for pretrained weights from wav2vec 2.0, Unsupervised Cross-lingual Representation Learning, and HuBERT.

These pretrained weights can be used for feature extraction and downstream task adaptation.

>>> import torchaudio
>>>
>>> # Build the model and load pretrained weight.
>>> model = torchaudio.pipelines.HUBERT_BASE.get_model()
>>> # Perform feature extraction.
>>> features, lengths = model.extract_features(waveforms)
>>> # Pass the features to downstream task
>>> ...

Some of the pretrained weights are fine-tuned for ASR tasks. The following example illustrates how to use these weights and access associated information, such as labels, which can be used in subsequent CTC decoding steps. (Note: torchaudio does not provide a CTC decoding mechanism.)

>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.HUBERT_ASR_LARGE
>>>
>>> # Build the model and load pretrained weight.
>>> model = bundle.get_model()
Downloading:
100%|███████████████████████████████| 1.18G/1.18G [00:17<00:00, 73.8MB/s]
>>> # Check the corresponding labels of the output.
>>> labels = bundle.get_labels()
>>> print(labels)
('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')
>>>
>>> # Infer the label probability distribution
>>> waveform, sample_rate = torchaudio.load('hello-world.wav')
>>>
>>> emissions, _ = model(waveform)
>>>
>>> # Pass emission to (hypothetical) decoder
>>> transcripts = ctc_decode(emissions, labels)
>>> print(transcripts[0])
HELLO WORLD
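
Because torchaudio does not ship a decoder, the ctc_decode call above is a placeholder. A minimal sketch of greedy (best-path) CTC decoding could look like the following. It is written in plain Python so that it runs without a model. The function name ctc_greedy_decode, the blank index 0, the treatment of '|' as a word separator, and the set of ignored special tokens are all assumptions made for illustration and should be checked against the bundle in use.

```python
from typing import Sequence


def ctc_greedy_decode(
    emission: Sequence[Sequence[float]],
    labels: Sequence[str],
    blank: int = 0,  # assumption: index 0 acts as the CTC blank
    ignore: Sequence[str] = ("<s>", "<pad>", "</s>", "<unk>"),  # assumption
) -> str:
    """Decode one utterance from per-frame label scores (frames x labels)."""
    # 1. Pick the best-scoring label index for every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in emission]
    # 2. Collapse consecutive repeats of the same index.
    collapsed = [idx for n, idx in enumerate(best) if n == 0 or idx != best[n - 1]]
    # 3. Drop blanks and special tokens, then map indices to characters.
    chars = [labels[i] for i in collapsed if i != blank and labels[i] not in ignore]
    # 4. Assumption: '|' marks a word boundary.
    return "".join(chars).replace("|", " ").strip()


labels = ('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I',
          'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P',
          'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')


def one_hot(index: int, size: int) -> list:
    """Build a toy score vector that peaks at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Toy frame-level path: H H <blank> I I | H I
path = [11, 11, 0, 10, 10, 4, 11, 10]
emission = [one_hot(i, len(labels)) for i in path]
print(ctc_greedy_decode(emission, labels))  # HI HI
```

With a real model output of shape (batch, frames, labels), the first utterance could be passed as emissions[0].tolist(). Greedy decoding ignores language-model information, so a beam-search decoder would normally give better transcripts.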

[Beta] Tacotron2 and TTS Pipeline

A new model architecture, Tacotron2, is added, alongside several pretrained weights for TTS (text-to-speech). Because these TTS pipelines are composed of multiple models and specific data processing, the notion of a bundle is introduced to make the associated objects easy to use. Bundles provide a common access point for creating a pipeline with a set of pretrained weights. They are available under the torchaudio.pipelines module.
The following example illustrates a TTS pipeline where two models (Tacotron2 and WaveRNN) are used together.

>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
>>>
>>> # Build text processor, Tacotron2 and vocoder (WaveRNN) model
>>> processor = bundle.get_text_preprocessor()
>>> tacotron2 = bundle.get_tacotron2()
Downloading:
100%|███████████████████████████████| 107M/107M [00:01<00:00, 87.9MB/s]
>>> vocoder = bundle.get_vocoder()
Downloading:
100%|███████████████████████████████| 16.7M/16.7M [00:00<00:00, 78.1MB/s]
>>>
>>> text = "Hello World!"
>>>
>>> # Encode text
>>> input, lengths = processor(text)
>>>
>>> # Generate (mel-scale) spectrogram
>>> specgram, lengths, _ = tacotron2.infer(input, lengths)
>>>
>>> # Convert spectrogram to waveform
>>> waveforms, lengths = vocoder(specgram, lengths)
>>>
>>> # Save audio
>>> torchaudio.save('hello-world.wav', waveforms, vocoder.sample_rate)

[Beta] RNN Transducer Loss

The loss function used in the RNN transducer architecture, which is widely used for speech recognition tasks, is added. The loss function (torchaudio.functional.rnnt_loss or torchaudio.transforms.RNNTLoss) supports float16 and float32 logits, has autograd and torchscript support, and runs on both CPU and GPU. The GPU path uses a custom CUDA kernel implementation for improved performance.

[Beta] MVDR Beamforming

This release adds support for MVDR beamforming on multi-channel audio using time-frequency masks. Three solutions are available (ref_channel, stv_evd, stv_power), and both single-channel and multi-channel masks are supported (multi-channel masks are averaged within the method). An online option recursively updates the parameters for streaming audio.
Please refer to the MVDR tutorial.

GPU Build

This release adds GPU builds that support custom CUDA kernels in torchaudio, like the one used for the RNN transducer loss. Following this change, torchaudio’s binary distribution now includes CPU-only versions and CUDA-enabled versions. To use CUDA-enabled binaries, the installed PyTorch must also be a CUDA-enabled build.

Additional Features

torchaudio.functional.lfilter now supports batch processing and multiple filters. Additional operations, including pitch shift, LFCC, and inverse spectrogram, are now supported in this release. The datasets CMUDict and LibriMix are added as well.

Backward Incompatible Changes

I/O

  • Default to PCM_16 for flac on soundfile backend (#1604)
    • When saving FLAC format with the “soundfile” backend, PCM_24 (the previous default) could cause warping. The default has been changed to PCM_16, which does not suffer from this.

Ops

  • Default to native complex type when returning raw spectrogram (#1549)
    • When power=None, torchaudio.functional.spectrogram and torchaudio.transforms.Spectrogram now default to return_complex=True, which returns a Tensor of native complex type (such as torch.cfloat and torch.cdouble). To use a pseudo complex type, pass the resulting tensor to torch.view_as_real.
  • Remove deprecated kaldi.resample_waveform (#1555)
    • Please use torchaudio.functional.resample.
  • Replace waveform with specgram in SlidingWindowCmn (#1859)
    • The argument name was corrected to specgram.
  • Ensure integer input frequencies for resample (#1857)
    • Sampling rates were silently cast to integers in the resampling implementation, so it now requires integer sampling rate inputs to ensure expected resampling quality.
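
For the integer sampling rate requirement (#1857), code that computes rates as floats may need an explicit conversion before calling the resampling functions. The sketch below shows one way to do that. The helper name to_int_rate is hypothetical and not part of torchaudio, and rejecting fractional rates (instead of rounding them) is a design choice made for this example.

```python
def to_int_rate(rate) -> int:
    """Convert a sampling rate to int, refusing values that would lose precision."""
    as_int = int(rate)
    if as_int != rate:
        # Silently truncating 22050.5 to 22050 would change the resampling ratio.
        raise ValueError(f"sampling rate must be a whole number, got {rate!r}")
    if as_int <= 0:
        raise ValueError(f"sampling rate must be positive, got {rate!r}")
    return as_int


orig_freq = to_int_rate(44100.0)  # a float that holds a whole number is accepted
new_freq = to_int_rate(16000)
print(orig_freq, new_freq)  # 44100 16000
```

The converted values can then be passed as the original and target frequencies to torchaudio.functional.resample.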

Wav2Vec2

  • Update extract_features of Wav2Vec2Model (#1776)
    • The previous implementation returned outputs from the convolutional feature extractor. To match the behavior of the original fairseq implementation, the method was changed to return the outputs of the intermediate transformer layers. To get the original behavior, please use Wav2Vec2Model.feature_extractor().
  • Move fine-tune specific module out of wav2vec2 encoder (#1782)
    • The internal structure of Wav2Vec2Model was updated. The Wav2Vec2Model.encoder.read_out module was moved to Wav2Vec2Model.aux. If you have a serialized state dict, please replace the key encoder.read_out with aux.
  • Updated wav2vec2 factory functions for more customizability (#1783, #1804, #1830)
    • The signatures of the wav2vec2 factory functions have changed. The num_out parameter has been renamed to aux_num_out, and other parameters were added before it. Please update the code from wav2vec2_base(num_out) to wav2vec2_base(aux_num_out=num_out).
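
The state dict key rename described for #1782 can be sketched as a small migration helper. This is an illustration only: the helper name migrate_read_out_keys is hypothetical, and the example keys (encoder.read_out.weight, encoder.read_out.bias) are assumed from the module names in the note rather than taken from an actual checkpoint.

```python
def migrate_read_out_keys(state_dict: dict) -> dict:
    """Return a copy of `state_dict` with 'encoder.read_out.' keys moved under 'aux.'."""
    old_prefix, new_prefix = "encoder.read_out.", "aux."
    migrated = {}
    for key, value in state_dict.items():
        if key.startswith(old_prefix):
            key = new_prefix + key[len(old_prefix):]
        migrated[key] = value  # insertion order of the original is preserved
    return migrated


# Placeholder values stand in for tensors.
old_state = {
    "encoder.transformer.layer_norm.weight": "w0",
    "encoder.read_out.weight": "w1",  # assumed key name
    "encoder.read_out.bias": "b1",    # assumed key name
}
new_state = migrate_read_out_keys(old_state)
print(list(new_state))
```

The same function would work on a dict loaded with torch.load, since only the keys are rewritten and the values are passed through untouched.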

Deprecations

  • Add melscale_fbanks and deprecate create_fb_matrix (#1653)
    • As linear_fbanks is introduced, create_fb_matrix is renamed to melscale_fbanks. The original create_fb_matrix is now deprecated. Please use melscale_fbanks.
  • Deprecate VCTK dataset (#1810)
    • This dataset has been taken down and is no longer available. Please use the VCTK_092 dataset.
  • Deprecate data utils (#1809)
    • bg_iterator and diskcache_iterator are known to not improve the throughput of data loaders. Please cease their usage.

New Features

Models

Tacotron2

  • Add Tacotron2 model (#1621, #1647, #1844)
  • Add Tacotron2 loss function (#1764)
  • Add Tacotron2 inference method (#1648, #1839, #1849)
  • Add phoneme text preprocessing for Tacotron2 (#1668)
  • Move Tacotron2 out of prototype (#1714)

HuBERT

Pretrained Weights and Pipelines

  • Add pretrained weights for wavernn (#1612)

  • Add Tacotron2 pretrained models (#1693)

  • Add HUBERT pretrained weights (#1821, #1824)

  • Add pretrained weights from wav2vec2.0 and XLSR papers (#1827)

  • Add customization support to wav2vec2 labels (#1834)

  • Default pretrained weights to eval mode (#1843)

  • Move wav2vec2 pretrained models to pipelines module (#1876)

  • Add TTS bundle/pipelines (#1872)

  • Fix vocoder interface (#1895)

  • Fix Phonemizer download (#1897)


RNN Transducer Loss

  • Add reduction parameter for RNNT loss (#1590)

  • Rename RNNT loss C++ parameters (#1602)

  • Rename transducer to RNNT (#1603)

  • Remove gradient variable from RNNT loss Python code (#1616)

  • Remove reuse_logits_for_grads option for RNNT loss (#1610)

  • Remove fused_log_softmax option from RNNT loss (#1615)

  • RNNT loss resolve null gradient (#1707)

  • Move RNNT loss out of prototype (#1711)


MVDR Beamforming

  • Add MVDR module to example (#1709)

  • Add normalization to steering vector solutions in MVDR Module (#1765)

  • Move MVDR and PSD modules to transforms (#1771)

  • Add MVDR beamforming tutorial to example directory (#1768)


Ops

  • Add edit_distance (#1601)

  • Add PitchShift to functional and transform (#1629)

  • Add LFCC feature to transforms (#1611)

  • Add InverseSpectrogram to transforms and functional (#1652)
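
As background for the new edit_distance operation (#1601), the following plain-Python sketch computes the Levenshtein distance between two sequences using the usual dynamic-programming recurrence. It is a reference for what such a metric measures, not torchaudio's implementation, and the name edit_distance_reference is made up for this example.

```python
from typing import Sequence


def edit_distance_reference(seq1: Sequence, seq2: Sequence) -> int:
    """Minimum number of insertions, deletions and substitutions turning seq1 into seq2."""
    # prev[j] is the distance between the processed prefix of seq1 and seq2[:j].
    prev = list(range(len(seq2) + 1))
    for i, a in enumerate(seq1, start=1):
        curr = [i]  # distance from seq1[:i] to an empty sequence
        for j, b in enumerate(seq2, start=1):
            substitution = prev[j - 1] + (0 if a == b else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


# Word-level comparison, as typically used when scoring ASR transcripts.
reference = "hello there world".split()
hypothesis = "hello world".split()
print(edit_distance_reference(reference, hypothesis))  # 1

# Character-level comparison works the same way.
print(edit_distance_reference("kitten", "sitting"))  # 3
```

Dividing the word-level distance by the number of reference words gives the word error rate commonly reported for speech recognition.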


Datasets

  • Add CMUDict dataset (#1627)

  • Move LibriMix dataset to datasets directory (#1833)


Improvements

I/O

  • Make buffer size for function info configurable (#1634)


Ops

  • Replace deprecated AutoNonVariableTypeMode (#1583)

  • Remove lazy behavior from MelScale (#1636)

  • Simplify axis value checks (#1501)

  • Use at::parallel_for in lfilter core loop (#1557)

  • Add filterbanks support to lfilter (#1587)

  • Add batch support to lfilter (#1638)

  • Use integer rates in pitch shift resample (#1861)


Models

  • Rename infer method to forward for WaveRNNInferenceWrapper (#1650)

  • Refactor WaveRNN infer and move it to the codebase (#1704)

  • Make the core wav2vec2 factory function public (#1829)

  • Refactor WaveRNNInferenceWrapper (#1845)

  • Store n_bits in WaveRNN (#1847)

  • Replace custom padding with torch’s native impl (#1846)

  • Avoid concatenation in loop (#1850)

  • Add lengths param to WaveRNN.infer (#1851)

  • Add sample rate to wav2vec2 bundle (#1878)

  • Remove factory functions of Tacotron2 and WaveRNN (#1874)


Datasets

  • Fix encoding of CMUDict data reading (#1665)

  • Rename utterance to transcript in datasets (#1841)

  • Clean up constructor of CMUDict (#1852)


Performance

  • Refactor transforms.Fade on GPU computation (#1871)

CUDA

Tensor shape   [1,4,8000]   [1,4,16000]   [1,4,32000]
0.10           119          120           123
0.9            160          184           240

Unit: msec

Examples

  • Add text preprocessing utilities for TTS pipeline (#1639)

  • Replace simple_ctc with Python greedy decoder (#1558)

  • Add an inference example for WaveRNN (#1637)

  • Refactor coding style for WaveRNN example (#1663)

  • Add style checks on example files on CI (#1667)

  • Add Tacotron2 training script (#1642)

  • Add an inference example for Tacotron2 (#1654)

  • Fix Tacotron2 inference example (#1716)

  • Fix WaveRNN training example (#1740)

  • Training recipe for ConvTasNet on Libri2Mix dataset (#1757)


Build

  • Update skipIfNoCuda decorator and force GPU tests in GPU CIs (#1559)

  • Temporarily pin nightly version on Linux/macOS CPU unittest (#1598)

  • Temporarily pin nightly version on Linux GPU unittest (#1606)

  • Revert CI hot fix (#1614)

  • Expose USE_CUDA in build (#1609)

  • Pin MKL to 2021.2.0 (#1655)

  • Simplify extension initialization (#1649)

  • Synchronize extension initialization mechanism with fbcode (#1682)

  • Ensure we’re propagating BUILD_VERSION (#1697)

  • Guard Kaldi’s version generation (#1715)

  • Update sphinx to 3.5.4 (#1685)

  • Default to BUILD_SOX=1 in non-Windows systems (#1725)

  • Add CUDA install step to Win Packaging jobs (#1732)

  • setup.py should parse TORCH_CUDA_ARCH_LIST (#1733)

  • Simplify the extension initialization process (#1734)

  • Fix CUDA build logic for _torchaudio.so (#1737)

  • Enable Linux wheel/conda GPU package builds (#1730)

  • Increase no_output_timeout to 20m for WinConda (#1738)

  • Build torchaudio for 11.3 as well (#1747)

  • Upload wheels to respective folders (#1751)

  • Extract PyBind11 feature implementations (#1739)

  • Update the way to access libsox global config (#1755)

  • Fix ROCM build error (#1729)

  • Fix compile warnings (#1762)

  • Migrate CircleCI docker image (#1767)

  • Split extension into custom impl and Python wrapper libraries (#1752)

  • Put libtorchaudio in lib directory (#1773)

  • Update win gpu image from previous to stable (#1786)

  • Set libtorch audio suffix as pyd on Windows (#1788)

  • Fix build on Windows with CUDA (#1787)

  • Enable audio windows cuda tests (#1777)

  • Set release and base PyTorch version (#1816)

  • Exclude prototype if it is in release (#1870)

  • Log prototype exclusion (#1882)

  • Update prototype exclusion (#1885)

  • Remove alpha from version number (#1901)


Testing

  • Migrate resample tests from kaldi to functional (#1520)

  • Add autograd gradcheck test for RNN transducer loss (#1532)

  • Fix HF wav2vec2 test (#1585)

  • Update unit test CUDA to 10.2 (#1605)

  • Fix CircleCI unittest environment

  • Remove skipIfRocm from test_fileobj_flac in soundfile.save_test (#1626)

  • MFCC test refactor (#1618)

  • Refactor RNNT Loss Unit Tests (#1630)

  • Reduce sample rate to avoid test time out (#1640)

  • Refactor text preprocessing tests in Tacotron2 example (#1635)

  • Move test initialization logic to dedicated directory (#1680)

  • Update pitch shift batch consistency test (#1700)

  • Refactor scripting in test (#1727)

  • Update the version of fairseq used for testing (#1745)

  • Put output tensor on proper device in get_whitenoise (#1744)

  • Refactor batch consistency test in transforms (#1772)

  • Tweak test name by appending factory function name (#1780)

  • Enable audio windows cuda tests (#1777)

  • Skip hubert_asr_xlarge TS test on Windows (#1800)

  • Skip hubert_xlarge TS test on Windows (#1807)


Others

  • Remove unused files (#1588)

  • Remove residuals for removed modules (#1599)

  • Remove torchscript bc test references (#1623)

  • Remove torchaudio._internal.fft module (#1631)


Misc

  • Rename master branch to main (#1649)

  • Fix Python spacing (#1670)

  • Lint fix (#1726)

  • Add .gitattributes (#1731)

  • Style fixes (#1766)

  • Update reference from master to main elsewhere (#1784)


Bug Fixes

  • Fix models import (#1664)

  • Fix HF model integration (#1781)


Documentation

  • README Updates

    • Update README (#1544)

    • Remove NumPy dependency from README (#1582)

    • Fix typos and sentence structure in README.md (#1633)

    • Update and move convention section to CONTRIBUTING.md (#1635)

    • Remove unnecessary README (#1728)

    • Add link to TTS colab example to README (#1748)

    • Fix typo in source separation README (#1774)

  • Docstring Changes

    • Set removal version of pseudo complex support (#1553)

    • Update docs (#1584)

    • Add return type in doc for RNNT loss (#1591)

    • Improve RNNT loss docstrings (#1642)

    • Add documentation for CMUDict’s property (#1683)

    • Refactor lfilter docs (#1698)

    • Standardize optional types in docstrings (#1746)

    • Fix return type of wav2vec2 model (#1790)

    • Add equations to MVDR docstring (#1789)

    • Standardize tensor shapes format in docs (#1838)

    • Add license to pre-trained model doc (#1836)

    • Update Tacotron2 docs (#1840)

    • Fix PitchShift docstring (#1866)

    • Update descriptions of lengths parameters (#1890)

    • Standardization and minor fixes (#1892)

    • Update models/pipelines doc (#1894)

  • Docs formatting

    • Remove override CSS (#1554)

    • Add prototype.tacotron2 page to docs (#1695)

    • Add doc for InverseSpectrogram (#1706)

    • Add sections to transforms docs (#1720)

    • Add edit_distance to documentation with a new category Metric (#1743)

    • Fix model subsections (#1775)

    • List all the pre-trained models on right bar (#1828)

    • Put pretrained weights to subsection (#1879)

  • Examples (see #1564)

    • Add example code for Resample (#1644)

    • Fix examples in transforms (#1646)

    • Add example for ComplexNorm (#1658)

    • Add example for MuLawEncoding (#1586)

    • Add example for Spectrogram (#1566)

    • Add example for GriffinLim (#1671)

    • Add example for MuLawDecoding (#1684)

    • Add example for Fade transform (#1719)

    • Update RNNT loss docs and add example (#1835)

    • Add SpecAugment figure/citation (#1887)

    • Add filter bank figures (#1891)