Original TOC
- Related Work and Inspiration
- Open Speech To Text (Russian)
- Making a Great Speech To Text Model
- Model Benchmarks and Generalization Gap
- Further Work
Abstract
- The high compute requirements typically reported in papers erect artificially high entry barriers;
- Speech requiring significant data due to the diverse vocabulary, speakers, and compression artifacts;
- A mentality where practical solutions are abandoned in favor of impractical, yet state of the art (SOTA) solutions.
- Introducing the diverse 20,000 hour Open STT dataset, published under a CC BY-NC license;
- Demonstrating that it is possible to achieve competitive results using only TWO consumer-grade and widely available GPUs;
- Offering a plethora of design patterns that democratize entry to the speech domain for a wide range of researchers and practitioners.
Introduction
- The architectures and model building blocks required to solve 95% of standard “useful” tasks are widely available as standard and tested open-source framework modules;
- Most popular models are available with pre-trained weights;
- Knowledge transfer from standard tasks using pre-trained models to different everyday tasks is solved;
- The compute required to train models for everyday tasks is minimal (e.g. 1–10 GPU days in STT) compared to the compute requirements previously reported in papers (100–1000 GPU days in STT);
- The compute for pre-training large models is available to small independent companies and research groups;
Related Work and Inspiration
- Feed-forward neural networks for acoustic modelling (mostly grouped 1D convolutions with squeeze and excitation and transformer blocks);
- Connectionist temporal classification loss (CTC loss);
- Composite tokens consisting of graphemes (i.e. alphabet letters) as modelling units (as opposed to phonemes);
- Beam search with a pre-trained language model (LM) as a decoder.
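The acoustic-model building blocks listed above can be sketched in PyTorch. This is an illustrative block only, not the exact architecture from the article; the channel count, kernel size, group count, and SE reduction ratio are all arbitrary placeholders.

```python
import torch
import torch.nn as nn

class SEConvBlock(nn.Module):
    """Grouped 1D convolution with a squeeze-and-excitation gate.
    An illustrative sketch, not the article's exact architecture."""
    def __init__(self, channels: int = 64, kernel_size: int = 7,
                 groups: int = 4, se_reduction: int = 8):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=groups)
        self.bn = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()
        # Squeeze-and-excitation: global pooling -> bottleneck -> sigmoid gate
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(channels, channels // se_reduction, 1),
            nn.ReLU(),
            nn.Conv1d(channels // se_reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.act(self.bn(self.conv(x)))
        return y * self.se(y)  # re-weight channels by learned gates

x = torch.randn(2, 64, 100)  # (batch, channels, time)
out = SEConvBlock()(x)
print(out.shape)  # torch.Size([2, 64, 100])
```

Grouped convolutions cut parameters and compute roughly by the group count, which is one reason such blocks train fast compared to RNN acoustic models.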
- Scalability. You can scale your compute by adding GPUs;
- Future proofing. Should a new neural network block become mainstream, it can be integrated and tested within days. Migrating to another framework is also easy;
- Simplicity. Using Python and PyTorch, you can focus on experimentation rather than on working around legacy constraints;
- Flexibility. Building proper code in Python, you can test new features (e.g. speaker diarization) in days;
- By avoiding attention in the decoder, phonemes, and recurrent neural networks, we achieve faster convergence and our models need less maintenance;
Open Speech To Text (Russian)
- Too ideal. Recorded in a studio, or too clean compared to real-world applications;
- Too narrow of a domain. Difficulty in STT follows this simple formula: noise level * vocabulary size * number of speakers;
- Mostly only English. Though projects like Common Voice alleviate this constraint to some extent, you cannot reliably find a lot of data in languages other than German and English. Also, Common Voice is probably better suited to speaker identification than to speech-to-text, because its texts are not very diverse;
- Different compression. WAV files have little to no compression artifacts and therefore do not represent real-world sound bites, which are compressed in various ways;
- Collect some data then clean it using heuristics;
- Train some models and use those models to further clean the data;
- Collect more data and use alignment to align transcripts with audio;
- Train better models and use those models to further clean the data;
- Collect more data and manually annotate some data;
- Repeat all the steps.
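The "use models to further clean the data" steps above can be sketched as filtering out samples where the model transcript diverges too much from the label. The dataset, field names, and the 0.3 threshold are made up for illustration; CER here is plain character-level Levenshtein distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(hyp: str, ref: str) -> float:
    """Character error rate of a hypothesis against a reference."""
    return levenshtein(hyp, ref) / max(len(ref), 1)

# Keep samples whose model transcript roughly agrees with the label.
dataset = [
    {"ref": "hello world", "hyp": "hello world"},  # clean
    {"ref": "hello world", "hyp": "hallo world"},  # small error, keep
    {"ref": "hello world", "hyp": "goodbye"},      # bad label, drop
]
clean = [s for s in dataset if cer(s["hyp"], s["ref"]) < 0.3]
print(len(clean))  # 2
```

In practice the same idea extends to alignment: segments where the forced alignment or model output disagrees badly with the transcript are the ones to drop or re-annotate.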
- Do some housekeeping, clean the data more, and clean-up some legacy code;
- Migrate to .ogg in order to minimize data storage space while maintaining quality;
- Add several new domains (courtroom dialogues, medical lectures and seminars, poetry).
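The .ogg migration can be done with a stock ffmpeg invocation. The sketch below only builds the command line (the paths and the bitrate are placeholders, not the article's settings), leaving actual execution to `subprocess.run`.

```python
import subprocess
from pathlib import Path

def ogg_command(src: Path, dst: Path, bitrate: str = "32k") -> list:
    """Build an ffmpeg command that transcodes an audio file to Ogg Vorbis.
    The bitrate is an illustrative placeholder."""
    return [
        "ffmpeg", "-i", str(src),
        "-ac", "1",            # downmix to mono
        "-c:a", "libvorbis",   # Ogg Vorbis encoder
        "-b:a", bitrate,       # target audio bitrate
        str(dst),
    ]

cmd = ogg_command(Path("sample.wav"), Path("sample.ogg"))
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually transcode
```

Lossy Vorbis at a modest bitrate shrinks storage by an order of magnitude relative to WAV while keeping speech intelligible, which is the trade-off the migration targets.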
P.S. We did all of this; our dataset was even featured on Azure Open Datasets, and we are now planning to release pre-trained models for three new languages: English / German / Spanish.
Making a Great Speech To Text Model
- Quick inference;
- Parameter-efficient;
- Easy to maintain and improve;
- Does not require a lot of compute to train, a 2 x 1080Ti machine or less should suffice;
We have already explained why this is sub-optimal if you have real-world usage in mind and the only datasets available are academic datasets. Given limited resources, you need a radically different approach to properly compare models, which we present in this section. Also keep in mind that there is no "ideal" validation dataset when you are dealing with real in-the-wild data: you need to validate on each domain separately.
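Validating on each domain separately rather than on one pooled set can be sketched as follows. The domains and the metric values are made up for illustration, and `wer` stands in for whatever per-sample error metric you track.

```python
from collections import defaultdict

def per_domain_metric(samples):
    """Average a per-sample metric separately for each domain,
    instead of reporting one pooled number."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for s in samples:
        sums[s["domain"]] += s["wer"]
        counts[s["domain"]] += 1
    return {d: sums[d] / counts[d] for d in sums}

# Made-up validation results: a single pooled average would hide
# that phone calls are much harder than audiobooks.
val = [
    {"domain": "audiobooks", "wer": 0.05},
    {"domain": "audiobooks", "wer": 0.07},
    {"domain": "phone_calls", "wer": 0.35},
]
print(per_domain_metric(val))
```

Reporting the per-domain breakdown is what exposes the generalization gap discussed later in the article.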
In an academic setting (e.g. on ImageNet), researchers allegedly run full experiments with different hyper-parameters from scratch until convergence. A good practice is also to run so-called ablation tests, i.e. experiments that test whether additional features of a model were actually useful by comparing the performance of the model with and without those features.
We could not afford running hundreds or thousands of experiments from scratch till convergence, or building some fancy reinforcement learning code to control experiments. Also, the dominance of over-parameterized methods in the literature and the availability of enterprise-oriented toolkits discourage researchers from deeply optimizing their pipelines. When you explore the hardware options, the professional and cloud segments are biased towards expensive and impractical solutions.
Overall Progress Made
- Reduce the model size around 5x;
- Speed up its convergence 5–10x;
- The small (25M-35M params) final model can be trained on 2x1080 Ti GPUs instead of 4;
- The large model still requires 4x1080 Ti but has a bit lower final CER (1–1.5 percentage point lower) compared to the small model.
- Used an existing implementation of Deep Speech 2;
- Ran a few experiments on LibriSpeech, where we noticed that RNN models are typically very slow compared to their convolutional counterparts;
- Added a plain Wav2Letter inspired model, which was actually underparameterized for Russian, so we increased the model size;
- Noticed that the model was okay, but very slow to train, so we tried to optimize the training time.
- Idea 1 — Model Stride;
- Idea 2 — Compact Regularized Networks;
- Idea 3 — Using Byte-Pair Encoding;
- Idea 4 — Better Encoder;
- Idea 5 — Balance Capacity — Never Use 4 GPUs Again;
- Idea 6 — Stabilize the Training in Different Domains, Balance Generalization;
- Idea 7 — Make A Very Fast Decoder;
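As a baseline for Idea 7: greedy CTC decoding is just an argmax over time followed by collapsing repeats and removing blanks; the beam search with an LM then refines this. The alphabet below is a toy example.

```python
def ctc_greedy_collapse(indices, blank: int = 0) -> list:
    """Collapse a per-frame argmax sequence CTC-style:
    merge consecutive repeats, then drop blank tokens."""
    out = []
    prev = None
    for idx in indices:
        if idx != prev and idx != blank:
            out.append(idx)
        prev = idx
    return out

# Toy alphabet: 0 = blank, 1 = 'c', 2 = 'a', 3 = 't'
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3]
tokens = ctc_greedy_collapse(frames)
alphabet = {1: "c", 2: "a", 3: "t"}
print("".join(alphabet[t] for t in tokens))  # cat
```

Note that a blank between two identical labels (e.g. `[1, 0, 1]`) correctly yields a doubled letter, which is exactly why CTC needs the blank token.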
Model Benchmarks and Generalization Gap
- Overall noise level;
- Vocabulary and pronunciation;
- The codecs or hardware used to compress audio;
This benchmark includes both an acoustic model and a language model. The acoustic model is run on GPU, the results are accumulated, and then language model post-processing is run on multiple CPUs.
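The two-stage setup can be sketched structurally: run the acoustic model batch by batch, accumulate outputs, then fan the LM post-processing out over a worker pool. Both stage functions below are dummy stand-ins, and the thread-backed `multiprocessing.dummy.Pool` is used for portability of the sketch (real LM rescoring would use processes to occupy multiple CPUs).

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, portable stand-in

def acoustic_model(batch):
    """Stand-in for the GPU acoustic model: emits raw transcripts."""
    return [f"raw transcript {i}" for i in batch]

def lm_rescore(raw: str) -> str:
    """Stand-in for CPU language-model post-processing of one utterance."""
    return raw.replace("raw", "rescored")

# Stage 1: accumulate acoustic model outputs (would run on GPU).
raw_outputs = []
for batch in [[0, 1], [2, 3]]:
    raw_outputs.extend(acoustic_model(batch))

# Stage 2: LM post-processing fanned out over workers (would run on CPUs).
with Pool(4) as pool:
    final = pool.map(lm_rescore, raw_outputs)

print(final[0])  # rescored transcript 0
```

Decoupling the stages this way keeps the GPU saturated and lets the slower LM step scale independently with the number of CPU workers.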
Further Work
- Getting rid of gradient clipping. Gradient clipping takes from 25% to 40% of batch time. We tried various hacks to get rid of it, but could not do it without suffering a severe drop in convergence speed;
- ADAM, Novograd and other new and promising optimizers. In our experience, they worked only with simpler non speech related domains or toy datasets;
- Sequence-to-sequence decoder, double supervision. These ideas work. Attention-based decoders with categorical cross-entropy loss instead of CTC are notoriously slow starters (you add speech decoding to the already burdensome task of alignment). Hybrid networks did not perform well enough to justify their added complexity. This probably just means that hybrid networks require a lot of parameter fine-tuning;
- Phoneme-based and phoneme-augmented methods. Though these helped us regularize a few over-parameterized models (100–150M params), they proved not very useful for smaller models. Surprisingly, an extensive tokenization study by Google arrived at a similar result;
- Networks that increase in width gradually. A common design pattern in computer vision; so far such networks converged worse than their constant-width counterparts;
- Usage of IdleBlocks. At first glance, this did not work, but maybe more time was needed to make it work;
- Try any sort of tunable filters instead of STFT. We tried various implementations of tunable STFT filters and SincNet filters, but in most cases we could not even stabilize the training of the models with such filters;
- Train a pyramid-shaped model with different strides. We failed to achieve any improvement here;
- Use model distillation and quantization to speed up inference. When we tried native quantization in PyTorch, it was still in beta and did not yet support our modules;
- Add complementary objectives like speaker diarization or noise cancelling. Noise cancelling works, but it proved to be of more aesthetic than practical use;
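To see why gradient clipping eats a fixed slice of batch time: global-norm clipping has to touch every parameter tensor twice, once to compute the norm and once to rescale. A plain-Python sketch of the idea (lists standing in for gradient tensors, `max_norm` arbitrary):

```python
import math

def clip_grad_norm(grads, max_norm: float) -> float:
    """Global-norm clipping over a list of gradient 'tensors'
    (plain Python lists here). Two full passes over all parameters:
    one to compute the global norm, one to rescale in place."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for grad in grads:
            for i in range(len(grad)):
                grad[i] *= scale
    return total_norm

grads = [[3.0, 4.0], [0.0, 12.0]]  # global norm = sqrt(9+16+144) = 13
norm = clip_grad_norm(grads, max_norm=1.0)
print(norm)  # 13.0
```

Those two extra passes over tens of millions of parameters every batch are where the reported 25–40% of batch time goes.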
Alexander Veysov is a Data Scientist at Silero, a small company building NLP / Speech / CV enabled products, and the author of Open STT. Silero has recently shipped its own Russian STT engine. Previously he worked at a then Moscow-based VC firm and at Ponominalu.ru, a ticketing startup acquired by MTS (a major Russian telco). He received his BA and MA in Economics from the Moscow State Institute of International Relations (MGIMO). You can follow his Telegram channel (@snakers41).
For attribution in academic contexts or books, please cite this work as
@article{veysov2020speechtotext,
author = {Veysov, Alexander},
title = {Towards an ImageNet Moment for Speech-to-Text},
journal = {The Gradient},
year = {2020},
howpublished = {\url{https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/}},
}