Releases: pytorch/audio

v0.4.0

15 Jan 22:13
8afed30

torchaudio 0.4 improves on current transformations, datasets, and backend support.

  • We introduce an interactive speech recognition demo. (#266, #229, #248)
  • SoX is now optional, and a new extensible backend dispatch mechanism exposes SoundFile as an alternative to SoX.
  • The interface for datasets has been unified. This enables the addition of two large datasets: LibriSpeech and Common Voice.
  • New filters (such as biquad), data augmentations (such as time and frequency masking), transforms (such as gain and dither), and feature computations (such as deltas) are now available.
  • Transformations now support batches and are jitable.

We would like to thank again our contributors and the wider community for their significant contributions to this release. In particular we'd like to thank @keunwoochoi, @ksanjeevan, and all the other maintainers and contributors of torchaudio-contrib for their significant and valuable additions around augmentations (#285) and batching (#327).

Breaking Changes

  • torchaudio now requires PyTorch 1.3.0 or newer, see https://pytorch.org/ for installation instructions. (#312)
  • We make jit compilation optional for functions and use nn.Module where possible. (#314, #326, #342, #369)
  • By unifying the interface for datasets, we changed the interface for VCTK and YESNO (#303, #316). In particular, the construction parameters downsample, transform, target_transform, and return_dict are being deprecated.
  • SoxEffectsChain.EFFECTS_AVAILABLE is replaced by SoxEffectsChain().EFFECTS_AVAILABLE. (#355)
  • This is the last version to support Python 2.

New Features

  • SoX is now optional, and a new extensible backend dispatch mechanism exposes SoundFile as an alternative to SoX. This makes it possible to use torchaudio even when SoX or SoundFile are not installed or available. (#355)
  • We now have a unified dataset interface that loads in memory only one item at a time enabling new large datasets: LibriSpeech and CommonVoice. (#303, #316, #330)
  • We introduce a pitch detection algorithm: torchaudio.functional.detect_pitch_frequency. (#313, #322)
  • We offer data augmentations in torchaudio.transforms: TimeStretch, FrequencyMasking, TimeMasking. (#285, #333, #348)
  • We introduce a complex norm transform: torchaudio.transforms.ComplexNorm. (#285, #333)
  • We now have a new audio feature generation for computing deltas: torchaudio.functional.compute_deltas. (#268, #326)
  • We introduce torchaudio.functional.gain and torchaudio.functional.dither (#319, #360). We welcome work to continue the effort to implement features available in SoX, see #260.
  • We now include equalizer_biquad (#315, #340), lowpass_biquad, highpass_biquad (#275), lfilter, and biquad (#275, #291, #326) in torchaudio.functional.
  • MFCC is available as torchaudio.functional.mfcc. (#228)
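
As an illustration of the delta features listed above, here is a minimal sketch of the standard regression formula d_t = sum_n n (c_{t+n} - c_{t-n}) / (2 sum_n n^2). It assumes edge-replicating padding and works on a plain Python list for one feature track; the function name and the win_length default are illustrative assumptions, and this is not the tensor implementation behind torchaudio.functional.compute_deltas.

```python
def deltas_sketch(frames, win_length=5):
    """Delta coefficients along time for one feature track (list of floats)."""
    n = (win_length - 1) // 2                      # context frames on each side
    denom = 2 * sum(i * i for i in range(1, n + 1))
    # Replicate the edge values so every frame has n neighbours on both sides.
    padded = [frames[0]] * n + list(frames) + [frames[-1]] * n
    deltas = []
    for t in range(n, n + len(frames)):
        acc = sum(i * (padded[t + i] - padded[t - i]) for i in range(1, n + 1))
        deltas.append(acc / denom)
    return deltas


ramp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ramp_deltas = deltas_sketch(ramp)
print(ramp_deltas)
```

Away from the edges, a ramp that rises by one unit per frame has a delta of 1.0; the values near the edges are smaller because of the replicated padding.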

Improvements

  • We now support batching in transforms. (#327, #337, #404)
  • Functions are now jitable, and nn.Module is used where possible. (#314, #326, #342, #362, #369, #395)
  • Downloads of large files are now automatically resumed with a new download function. (#320)
  • New tests for ISTFT are added. (#279)
  • We introduce nightly builds. (#301)
  • We now have smoke tests for builds. (#346, #359)
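
The download helper itself is not shown in these notes. As a rough sketch of how resuming a download generally works, assuming the server honours HTTP range requests, a client can look at how many bytes it already has on disk and ask only for the rest. The function name and the ".part" file name below are hypothetical.

```python
import os
import tempfile


def resume_request_headers(partial_path):
    """Return (headers, offset) asking the server only for the missing bytes."""
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    # "Range: bytes=N-" requests everything from byte offset N to the end.
    headers = {"Range": "bytes={}-".format(offset)} if offset else {}
    return headers, offset


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "archive.tar.gz.part")   # hypothetical partial file
    with open(path, "wb") as f:
        f.write(b"\0" * 1024)                         # pretend 1024 bytes arrived
    headers, offset = resume_request_headers(path)
    print(headers, offset)
```

The received bytes would then be appended to the partial file instead of overwriting it.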

Bug Fixes

  • Fix mismatch between MelScale and librosa. (#294)
  • Fix torchaudio.compliance.kaldi.resample_waveform where internal variables were not moved to the GPU when used. (#277)
  • Fix a bug that occurred when importing torchaudio built outside of a git repository. (#276)
  • Fix istft where internal parameters were not created with the same dtype and device as the tensor provided by the user. (#264)
  • Fix size mismatch when saving and loading from state dictionary (load_state_dict). (#246)
  • Clarified internal naming convention within transforms and functionals. (#298)
  • Fix build script to be more tolerant to download drops. (#280, #284, #305)
  • Correct documentation for SoxEffectsChain. (#283)
  • Fix resample error with cuda tensors. (#277)
  • Fix error when importing version outside of git. (#276)
  • Fix missing asound in linux build. (#254)
  • Fix deprecated torch. (#254)
  • Fix link in README. (#253)
  • Fix window device in ISTFT. (#240)
  • Documentation: Fix range in documentation for torchaudio.load to [-1, 1]. (#283)

v0.3.2

14 Jan 15:56

This release is to update the dependency to PyTorch 1.3.1.

v0.3.1

08 Jan 22:27

This release is to update the dependency to PyTorch 1.3.0.

Minor Fix

  • Updated settings for curl in build scripts (#280, #284, #297).

v0.3.0 Standardization, JIT/CUDA Support, Kaldi Compliance Interface, ISTFT

08 Aug 16:25
b2c73b6

Highlights

torchaudio as an extension of PyTorch

torchaudio has been redesigned to be an extension of PyTorch and part of the domain APIs (DAPI) ecosystem. Domain-specific libraries such as this one are kept separate in order to maintain a coherent environment for each of them. As such, torchaudio is an ML library that provides relevant signal processing functionality, but it is not a general signal processing library. The full rationale of this new standardization can be found in the README.md.

In light of these changes some transforms have been removed or have different argument names and conventions. See the section on backwards breaking changes for a migration guide.

We provide binaries via pip and conda. They require PyTorch 1.2.0 and newer. See https://pytorch.org/ for installation instructions.

Community

We would like to thank our contributors and the wider community for their significant contributions to this release. We are happy to see an active community around torchaudio and are eager to further grow and support it.

In particular we'd like to thank @keunwoochoi, @ksanjeevan, and all the other maintainers and contributors of torchaudio-contrib for their significant and valuable additions around standardization and the support of complex numbers (#131, #110, keunwoochoi/torchaudio-contrib#61, keunwoochoi/torchaudio-contrib#36).

Kaldi Compliance Interface

An implementation of basic transforms with a Kaldi-like interface.

We added the functions spectrogram, fbank, and resample_waveform (#119, #127, and #134). For more details see the documentation on torchaudio.compliance.kaldi which mirrors the arguments and outputs of Kaldi features.

As an example we can look at the sinc interpolation resampling similar to Kaldi’s implementation. In the figure below, the blue dots are the original signal and red dots are the downsampled signal with half the original frequency. The red dot elements are approximately every other original element.

[Figure: resampling]

specgram = torchaudio.compliance.kaldi.spectrogram(waveform, frame_length=...)
fbank = torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=...)
resampled_waveform = torchaudio.compliance.kaldi.resample_waveform(waveform, orig_freq=...)
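
To make the sinc interpolation idea concrete, here is a dependency-free sketch of a Hann-windowed sinc resampler on a plain list of samples. The function name, the `zeros` parameter, and the cutoff choice are assumptions made for illustration; this is not the code behind torchaudio.compliance.kaldi.resample_waveform.

```python
import math


def sinc_resample(x, orig_freq, new_freq, zeros=6):
    """Resample a list of samples with a Hann-windowed sinc low-pass filter."""
    cutoff = 0.5 * min(orig_freq, new_freq)   # low-pass cutoff in Hz
    width = zeros / (2.0 * cutoff)            # filter half-width in seconds
    n_out = len(x) * new_freq // orig_freq
    out = []
    for i in range(n_out):
        t = i / new_freq                      # time of the output sample
        first = max(0, math.ceil((t - width) * orig_freq))
        last = min(len(x) - 1, math.floor((t + width) * orig_freq))
        acc = 0.0
        for j in range(first, last + 1):
            dt = t - j / orig_freq            # distance to input sample j
            window = 0.5 * (1.0 + math.cos(math.pi * dt / width))
            arg = 2.0 * math.pi * cutoff * dt
            sinc = 1.0 if arg == 0.0 else math.sin(arg) / arg
            acc += x[j] * (2.0 * cutoff / orig_freq) * sinc * window
        out.append(acc)
    return out


# A slow 100 Hz sine sampled at 16 kHz, downsampled to half the rate.
signal = [math.sin(2 * math.pi * 100 * n / 16000) for n in range(320)]
half = sinc_resample(signal, orig_freq=16000, new_freq=8000)
print(len(half), half[20], signal[40])
```

For a slowly varying signal and away from the edges, half[i] stays close to signal[2 * i], which is the "every other original element" behaviour described above.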

Inverse short time Fourier transform

Reconstructing a signal from a spectrogram is useful in applications such as source separation, or to generate audio signals to listen to. More specifically, torchaudio.functional.istft is the inverse of torch.stft. It takes the same parameters (plus an additional optional parameter, length) and returns the least-squares estimate of the original signal.

torch.manual_seed(0)
n_fft = 5
waveform = torch.rand(2, 5)
stft = torch.stft(waveform, n_fft=n_fft)
approx_waveform = torchaudio.functional.istft(stft, n_fft=n_fft, length=waveform.size(1))
>>> waveform
tensor([[0.4963, 0.7682, 0.0885, 0.1320, 0.3074],
        [0.6341, 0.4901, 0.8964, 0.4556, 0.6323]])
>>> approx_waveform
tensor([[0.4963, 0.7682, 0.0885, 0.1320, 0.3074],
        [0.6341, 0.4901, 0.8964, 0.4556, 0.6323]])

Breaking Changes

  • Removed Compose:
    Please use core abstractions such as nn.Sequential() or a for-loop over a list of transforms.
  • SPECTROGRAM, F2M, and MEL have been removed. Please use Spectrogram, MelScale, and MelSpectrogram.
  • Removed formatting transforms (LC2CL and BLC2CBL): While the LC layout might be common in signal processing, support for it is out of scope of this library and transforms such as LC2CL only aid its proliferation. Please use transpose if you need this behavior.
  • Removed Scale, PadTrim, DownmixMono: Please use division in place of Scale, torch.nn.functional.pad/trim in place of PadTrim, and torch.mean on the channel dimension in place of DownmixMono.
  • torchaudio.legacy has been removed. Please use torchaudio.load and torchaudio.save.
  • Spectrogram used to be of dimension (channel, time, freq) and is now (channel, freq, time). Similarly for MelScale, MelSpectrogram, and MFCC, time is the last dimension. Please see our README for an explanation of the rationale behind these changes. Please use transpose to get the previous behavior.
  • MuLawExpanding was renamed to MuLawDecoding as the inverse of MuLawEncoding (#159).
  • SpectrogramToDB was renamed to AmplitudeToDB (#170). The input does not necessarily have to be a spectrogram and as such can be used in many more cases as the name should reflect.
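
The replacements suggested in the list above are small enough to sketch without any library. The following uses nested Python lists in (channel, time) layout as a stand-in for tensors, and the helper names are hypothetical. With real tensors the same steps would be a division, torch.mean on the channel dimension, and nn.Sequential or a for-loop over the transforms.

```python
def scale(waveform, factor=2**31):
    """Stand-in for the removed Scale: plain division by a constant."""
    return [[sample / factor for sample in channel] for channel in waveform]


def downmix_mono(waveform):
    """Stand-in for the removed DownmixMono: mean over the channel dimension."""
    n_channels = len(waveform)
    return [[sum(frame) / n_channels for frame in zip(*waveform)]]


def apply_all(transforms, x):
    """Stand-in for the removed Compose: a for-loop over a list of callables."""
    for transform in transforms:
        x = transform(x)
    return x


stereo = [[2**30, -2**30, 0], [2**30, 2**30, 2**31]]   # (channel, time)
mono = apply_all([scale, downmix_mono], stereo)
print(mono)  # [[0.5, 0.0, 0.5]]
```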

New Features

Performance

JIT and CUDA

  • JIT support added to Spectrogram, AmplitudeToDB, MelScale, MelSpectrogram, MFCC, MuLawEncoding, and MuLawDecoding. (#118)
  • CUDA support added to Spectrogram, AmplitudeToDB, MelScale, MelSpectrogram, MFCC, MuLawEncoding, and MuLawDecoding (#118)

Bug Fixes

  • Fix test_transforms.py where double tensors were compared with floats (#132)
  • Fix vctk.read_audio (issue #143) as there were issues with downsampling using SoxEffectsChain (#145)
  • Fix segfault passing null to sox_close (#174)

torchaudio's First Official Release (v0.2.0)

08 Aug 15:56
e3c7784

Background

The goal of this release is to fix the current API, as there will be future changes that break backward compatibility in order to improve the library as more thought is given to design, capabilities, and usability.

While this release is compatible with all currently known PyTorch versions (<=1.2.0), the available binaries require only PyTorch 1.1.0. Installation commands:

# Wheels for Python 2 are NOT supported
# Python 3.5
$ pip3 install http://download.pytorch.org/whl/torchaudio-0.2-cp35-cp35m-linux_x86_64.whl
# Python 3.6
$ pip3 install http://download.pytorch.org/whl/torchaudio-0.2-cp36-cp36m-linux_x86_64.whl
# Python 3.7
$ pip3 install http://download.pytorch.org/whl/torchaudio-0.2-cp37-cp37m-linux_x86_64.whl

What's new?

  • Fixed broken tests and set up an automatic testing environment
  • Read in Kaldi files (“.ark”, “.scp”)
  • Separation of state and computation into transforms.py and functional.py
  • Loading and saving to file
  • Datasets VCTK and YESNO
  • SoxEffects and SoxEffectsChain in torchaudio.sox_effects

CI and Testing

Continuous integration (Travis CI) has been set up in #117. This means all the tests have been fixed and their status can be checked at https://travis-ci.org/pytorch/audio. The test files have to be run separately via build_tools/travis/test_script.sh because closing sox after a test file is completed prevents it from being reopened. The testing framework is pytest.

# Run the whole test suite
$ build_tools/travis/test_script.sh
# Run an individual test
$ python -m pytest test/test_transforms.py

Kaldi IO

Kaldi IO has been added as an optional dependency in #111. torchaudio provides a simple wrapper around this by converting the np.ndarray into torch.Tensor. Functions include: read_vec_int_ark, read_vec_flt_scp, read_vec_flt_ark, read_mat_scp, and read_mat_ark.

>>> # read ark to a 'dictionary'
>>> d = { u:d for u,d in torchaudio.kaldi_io.read_vec_int_ark(file) }

Separation of State and Computation

In #105, the computations have been moved into functional.py. The reasoning behind this is that tracking state is a separate problem by itself and should be separate from computing a function. It also allows us to annotate the functional as weak scriptable, which in turn allows us to utilize the JIT and create efficient code. The functional itself might then also be used by other functionals, which is much easier and more efficient than having another Module create an instance of the class. This also makes it easier to implement performance improvements and create a generic API. If someone implements a function that adheres to the contract of your functional, it can be an immediate drop-in. This is important if we want to support different backends (e.g. move a functional entirely into C++).

>>> torchaudio.transforms.Spectrogram(n_fft=...)(waveform)
>>> torchaudio.functional.spectrogram(waveform, …)
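
The split between state and computation can be sketched with mu-law encoding, whose transform and functional names appear in the listings further down. This is a simplified stand-in that works on Python lists of floats in [-1, 1] rather than tensors, so treat it as an illustration of the pattern, not as the library code.

```python
import math


def mu_law_encoding(x, qc):
    """Stateless computation: mu-law companding, then quantization to [0, qc - 1]."""
    mu = qc - 1.0
    encoded = []
    for sample in x:
        compressed = math.copysign(
            math.log1p(mu * abs(sample)) / math.log1p(mu), sample
        )
        encoded.append(int((compressed + 1) / 2 * mu + 0.5))
    return encoded


class MuLawEncoding:
    """Stateful wrapper: stores the configuration and delegates the math."""

    def __init__(self, quantization_channels=256):
        self.qc = quantization_channels

    def __call__(self, x):
        return mu_law_encoding(x, self.qc)


samples = [-1.0, 0.0, 1.0]
encoded = MuLawEncoding()(samples)
print(encoded)  # [0, 128, 255]
```

Because the class only holds quantization_channels, any other function with the same contract could be dropped in behind it without touching callers of the transform.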

Loading and saving to file

Tensors can be read and written to various file formats (e.g. “mp3”, “wav”, etc.) through torchaudio.

sound, sample_rate = torchaudio.load('input.wav')
torchaudio.save('output.wav', sound, sample_rate)

Transforms and functionals

Transforms

class Compose(object):
    def __init__(self, transforms):
    def __call__(self, audio):
        
class Scale(object):
    def __init__(self, factor=2**31):
    def __call__(self, tensor):
        
class PadTrim(object):
    def __init__(self, max_len, fill_value=0, channels_first=True):
    def __call__(self, tensor):
       
class DownmixMono(object):
    def __init__(self, channels_first=None):
    def __call__(self, tensor):

class LC2CL(object):
    def __call__(self, tensor):

def SPECTROGRAM(*args, **kwargs):

class Spectrogram(object):
    def __init__(self, n_fft=400, ws=None, hop=None,
                 pad=0, window=torch.hann_window,
                 power=2, normalize=False, wkwargs=None):
    def __call__(self, sig):
        
def F2M(*args, **kwargs):

class MelScale(object):
    def __init__(self, n_mels=128, sr=16000, f_max=None, f_min=0., n_stft=None):
    def __call__(self, spec_f):

class SpectrogramToDB(object):
    def __init__(self, stype="power", top_db=None):
    def __call__(self, spec):
       
class MFCC(object):
    def __init__(self, sr=16000, n_mfcc=40, dct_type=2, norm='ortho', log_mels=False,
                 melkwargs=None):
    def __call__(self, sig):

class MelSpectrogram(object):
    def __init__(self, sr=16000, n_fft=400, ws=None, hop=None, f_min=0., f_max=None,
                 pad=0, n_mels=128, window=torch.hann_window, wkwargs=None):
    def __call__(self, sig):

def MEL(*args, **kwargs):

class BLC2CBL(object):
    def __call__(self, tensor):

class MuLawEncoding(object):
    def __init__(self, quantization_channels=256):
    def __call__(self, x):

class MuLawExpanding(object):
    def __init__(self, quantization_channels=256):
    def __call__(self, x_mu):

Functional

def scale(tensor, factor):
    # type: (Tensor, int) -> Tensor

def pad_trim(tensor, ch_dim, max_len, len_dim, fill_value):
    # type: (Tensor, int, int, int, float) -> Tensor

def downmix_mono(tensor, ch_dim):
    # type: (Tensor, int) -> Tensor

def LC2CL(tensor):
    # type: (Tensor) -> Tensor

def spectrogram(sig, pad, window, n_fft, hop, ws, power, normalize):
    # type: (Tensor, int, Tensor, int, int, int, int, bool) -> Tensor

def create_fb_matrix(n_stft, f_min, f_max, n_mels):
    # type: (int, float, float, int) -> Tensor

def mel_scale(spec_f, f_min, f_max, n_mels, fb=None):
    # type: (Tensor, float, float, int, Optional[Tensor]) -> Tuple[Tensor, Tensor]

def spectrogram_to_DB(spec, multiplier, amin, db_multiplier, top_db=None):
    # type: (Tensor, float, float, float, Optional[float]) -> Tensor

def create_dct(n_mfcc, n_mels, norm):
    # type: (int, int, string) -> Tensor

def MFCC(sig, mel_spect, log_mels, s2db, dct_mat):
    # type: (Tensor, MelSpectrogram, bool, SpectrogramToDB, Tensor) -> Tensor

def BLC2CBL(tensor):
    # type: (Tensor) -> Tensor

def mu_law_encoding(x, qc):
    # type: (Tensor, int) -> Tensor

def mu_law_expanding(x_mu, qc):
    # type: (Tensor, int) -> Tensor
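
The signatures above omit the function bodies. As one example of what such a functional computes, here is a sketch of a decibel conversion with the same argument names as spectrogram_to_DB. It assumes the usual definition (clamp at amin, take log10, subtract a reference term, optionally limit the dynamic range with top_db) and operates on a flat list; whether the 0.2.0 implementation matches it step for step is an assumption.

```python
import math


def spectrogram_to_db_sketch(spec, multiplier, amin, db_multiplier, top_db=None):
    """Convert non-negative values to decibels (multiplier=10 for power input)."""
    # db_multiplier is assumed to be log10 of the reference value.
    db = [
        multiplier * math.log10(max(value, amin)) - multiplier * db_multiplier
        for value in spec
    ]
    if top_db is not None:
        floor = max(db) - top_db          # keep at most top_db of dynamic range
        db = [max(value, floor) for value in db]
    return db


power = [1.0, 0.1, 1e-12]
db_values = spectrogram_to_db_sketch(
    power, multiplier=10.0, amin=1e-10, db_multiplier=0.0, top_db=80.0
)
print(db_values)
```

In this example the smallest value is first clamped to amin and then raised to the top_db floor, 80 dB below the maximum.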

Datasets VCTK and YESNO

All datasets are subclasses of torch.utils.data.Dataset, i.e., they have __getitem__ and __len__ methods implemented. Hence, they can all be passed to a torch.utils.data.DataLoader, which can load multiple samples in parallel using torch.multiprocessing workers. For example:

yesno_data = torchaudio.datasets.YESNO('.', download=True)
data_loader = torch.utils.data.DataLoader(yesno_data,
                                          batch_size=1,
                                          shuffle=True,
                                          num_workers=args.nThreads)

The two datasets available are VCTK and YESNO. They download the datasets and preprocess them so that the loaded data is in a convenient format.
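
The dataset contract described above comes down to two methods. The sketch below is a hypothetical class, not one of the torchaudio datasets: it keeps only a list of item ids and loads a sample when it is indexed. A real dataset would subclass torch.utils.data.Dataset and read audio files; here a fake loader stands in for that.

```python
class LazyDataset:
    """Hypothetical dataset: stores item ids only and loads items on demand."""

    def __init__(self, item_ids, loader):
        self.item_ids = list(item_ids)
        self.loader = loader               # callable: item id -> sample

    def __len__(self):
        return len(self.item_ids)

    def __getitem__(self, index):
        # Only the requested item is materialised in memory.
        return self.loader(self.item_ids[index])


loaded = []


def fake_loader(item_id):
    loaded.append(item_id)                 # record which items were actually read
    return [0.0] * 4, item_id              # (stand-in waveform, label)


data = LazyDataset(["yes_0", "no_1", "yes_2"], fake_loader)
waveform, label = data[1]
print(len(data), label, loaded)  # 3 no_1 ['no_1']
```

Since indexing triggers exactly one load, a DataLoader with several workers can pull items independently without the whole dataset ever being held in memory.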

SoxEffects and SoxEffectsChain

SoxEffects and SoxEffectsChain in torchaudio.sox_effects expose sox operations through a Python interface. Various useful effects like downmixing a multichannel signal or resampling a signal can be done here.

torchaudio.initialize_sox()
E = torchaudio.sox_effects.SoxEffectsChain()
E.append_effect_to_chain("rate", [16000])  # resample to 16000hz
E.append_effect_to_chain("channels", ["1"])  # mono signal
E.set_input_file(fn)
waveform, sample_rate = E.sox_build_flow_effects()
torchaudio.shutdown_sox()