Automatic speech recognition (ASR) is a crucial technology in many applications, from voice assistants to transcription services. In this tutorial, we aim to build an ASR pipeline capable of transcribing speech into text using pre-trained models from Hugging Face. We will use a lightweight dataset for efficiency and employ Wav2Vec2, a powerful self-supervised model for speech recognition.
Our system will:
- Load and preprocess a speech dataset
- Fine-tune a pre-trained Wav2Vec2 model
- Evaluate the model’s performance using word error rate (WER)
- Deploy the model for real-time speech-to-text inference
To keep our model lightweight and efficient, we will use a small speech dataset rather than large datasets like Common Voice.
Step 1: Installing Dependencies
Before we start, we need to install the necessary libraries. These libraries will allow us to load datasets, process audio files, and fine-tune our model.
pip install torch torchaudio transformers datasets soundfile jiwer
Here is the main purpose of each library:
- transformers: Provides pre-trained Wav2Vec2 models for speech recognition
- datasets: Loads and processes speech datasets
- torchaudio: Handles audio processing and manipulation
- soundfile: Reads and writes .wav files
- jiwer: Computes the WER for evaluating ASR performance
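With everything installed, a quick version check (entirely optional) confirms the environment is ready:
from importlib.metadata import version
# Print the installed version of each dependency
for pkg in ("torch", "torchaudio", "transformers", "datasets", "soundfile", "jiwer"):
    print(pkg, version(pkg))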
Step 2: Loading a Lightweight Speech Dataset
Instead of using large datasets like Common Voice, we use SUPERB KS, a small dataset ideal for quick experimentation. This dataset consists of short spoken commands like “yes,” “no,” and “stop.”
from datasets import load_dataset
dataset = load_dataset("superb", "ks", split="train[:1%]") # Load only 1% of the data for quick testing
print(dataset)
This loads a tiny subset of the dataset to reduce computational cost while still allowing us to fine-tune the model. Warning: the dataset still requires storage space, so be mindful of disk usage when working with larger splits.
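If you want to peek at what a single record contains (the field names below follow the SUPERB KS schema, which stores an audio dictionary plus an integer class label), a quick inspection looks like this:
sample = dataset[0]
print(sample["audio"]["path"])  # path to the underlying .wav file
print(sample["label"], dataset.features["label"].int2str(sample["label"]))  # class id and its spoken word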
Step 3: Preprocessing the Audio Data
To train our ASR model, we need to ensure that the audio data is in the correct format. The Wav2Vec2 model requires:
- 16 kHz sample rate
- No padding or truncation (handled dynamically)
We define a function to process the audio and extract relevant features.
import torchaudio
def preprocess_audio(batch):
    # Load the waveform from disk
    speech_array, sampling_rate = torchaudio.load(batch["audio"]["path"])
    batch["speech"] = speech_array.squeeze().numpy()
    batch["sampling_rate"] = sampling_rate
    # "label" is an integer class id; map it to its string name ("yes", "no", ...) to use as target text
    batch["target_text"] = dataset.features["label"].int2str(batch["label"])
    return batch
dataset = dataset.map(preprocess_audio)
Each example now carries the raw waveform, its sampling rate, and the spoken word as target text, ready for the steps that follow.
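The keyword-spotting clips are already recorded at 16 kHz, so no resampling is needed here. If you ever swap in audio captured at a different rate, a small resampling step along these lines (an optional sketch) keeps the input compatible with Wav2Vec2:
import torch
import torchaudio
def resample_if_needed(batch, target_rate=16000):
    # Resample only when the source rate differs from the 16 kHz Wav2Vec2 expects
    if batch["sampling_rate"] != target_rate:
        resampler = torchaudio.transforms.Resample(batch["sampling_rate"], target_rate)
        batch["speech"] = resampler(torch.tensor(batch["speech"])).numpy()
        batch["sampling_rate"] = target_rate
    return batch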
Step 4: Loading a Pre-trained Wav2Vec2 Model
We use a pre-trained Wav2Vec2 model from Hugging Face’s model hub. This model has already been trained on a large dataset and can be fine-tuned for our specific task.
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
Here we load both the processor, which converts raw audio into model-friendly features, and the model itself, a Wav2Vec2 checkpoint pre-trained on 960 hours of LibriSpeech audio.
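When fine-tuning on a dataset this small, it is common practice to freeze the convolutional feature encoder so that only the transformer layers get updated; this optional step usually makes training more stable:
# Freeze the CNN feature encoder; only the transformer layers will be fine-tuned
# (older transformers releases expose this as model.freeze_feature_extractor())
model.freeze_feature_encoder()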
Step 5: Preparing Data for the Model
We must convert the raw audio into the numerical input values the model expects.
def preprocess_for_model(batch):
    # Turn the raw waveform into the normalized input values Wav2Vec2 consumes
    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    batch["input_values"] = inputs.input_values[0]
    return batch
dataset = dataset.map(preprocess_for_model, remove_columns=["speech", "sampling_rate", "audio"])
This step ensures that our dataset is compatible with the Wav2Vec2 model.
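One detail the snippet above leaves out: for CTC training, the Trainer also expects a labels column of token ids. Here is a minimal sketch to add it, assuming target_text holds the spoken word as a string (note that the facebook/wav2vec2-base-960h vocabulary is uppercase):
def encode_labels(batch):
    # Encode the target text into token ids for the CTC loss
    batch["labels"] = processor.tokenizer(batch["target_text"].upper()).input_ids
    return batch
dataset = dataset.map(encode_labels)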
Step 6: Defining Training Arguments
Before training, we need to set up our training configuration. This includes batch size, learning rate, and optimization steps.
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./wav2vec2",
    per_device_train_batch_size=4,
    evaluation_strategy="no",  # no separate eval set is passed to the Trainer below
    save_strategy="epoch",
    logging_dir="./logs",
    learning_rate=1e-4,
    warmup_steps=500,
    max_steps=4000,
    save_total_limit=2,
    gradient_accumulation_steps=2,
    fp16=True,  # requires a GPU; set to False when training on CPU
    push_to_hub=False,
)
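Because the clips and their label sequences vary in length, the Trainer also needs a collator that pads each batch and masks the label padding so the CTC loss ignores it. Below is a minimal sketch adapted from the usual Wav2Vec2 fine-tuning recipe; the class name and details are illustrative:
from dataclasses import dataclass
@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor
    def __call__(self, features):
        # Pad the audio inputs and the label ids separately
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, padding=True, return_tensors="pt")
        labels_batch = self.processor.tokenizer.pad(label_features, padding=True, return_tensors="pt")
        # Replace padding token ids with -100 so the CTC loss skips them
        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch["attention_mask"].ne(1), -100)
        return batch
data_collator = DataCollatorCTCWithPadding(processor)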
Step 7: Training the Model
Using Hugging Face’s Trainer, we fine-tune our Wav2Vec2 model.
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,  # the padding collator sketched above
    tokenizer=processor,
)
trainer.train()
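Once training finishes, it is worth saving both the fine-tuned weights and the processor so the model can be reloaded later without rerunning the job (the directory name below is just an example):
# Persist the fine-tuned model and its processor for later inference
trainer.save_model("./wav2vec2-finetuned")
processor.save_pretrained("./wav2vec2-finetuned")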
Step 8: Evaluating the Model
To measure how well our model transcribes speech, we compute the WER.
import torch
from jiwer import wer
def transcribe(batch):
    inputs = processor(batch["input_values"], sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    batch["predicted_text"] = processor.batch_decode(predicted_ids)[0]
    return batch
model.eval()  # disable dropout for inference
results = dataset.map(transcribe)
# The model decodes uppercase text, so normalize case before scoring
wer_score = wer([t.upper() for t in results["target_text"]], results["predicted_text"])
print(f"Word Error Rate: {wer_score:.2f}")
A lower WER score indicates better performance.
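To build intuition for the metric, here is a toy example with jiwer: one substituted word out of a two-word reference yields a WER of 0.5.
from jiwer import wer
print(wer("hello world", "hello word"))  # 0.5: one substitution over two reference words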
Step 9: Running Inference on New Audio
Finally, we can use our trained model to transcribe real-world speech.
import torchaudio
# Load a recording to transcribe (replace "example.wav" with your own 16 kHz WAV file)
speech_array, sampling_rate = torchaudio.load("example.wav")
inputs = processor(speech_array.squeeze().numpy(), sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription[0])
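If you plan to transcribe several files, the same logic fits neatly into a small helper that also resamples recordings made at other sample rates (a convenience sketch, not part of the original pipeline):
def transcribe_file(path):
    # Load the file and resample to 16 kHz if necessary
    waveform, rate = torchaudio.load(path)
    if rate != 16000:
        waveform = torchaudio.transforms.Resample(rate, 16000)(waveform)
    inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    return processor.batch_decode(torch.argmax(logits, dim=-1))[0]
print(transcribe_file("example.wav"))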
Conclusion
And that’s it. You’ve successfully built an ASR system using PyTorch & Hugging Face with a lightweight dataset.
Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and is currently working in the data science field applied to human mobility. He is a part-time content creator focused on data science and technology. Josep writes on all things AI, covering applications of the ongoing explosion in the field.