In the development of AI technology, many impressive closed-source models remain locked behind company doors, accessible only to those working inside those companies.
In contrast, the community has tried to match the level of closed-source models by building open-source alternatives that everyone can use and improve. One of the projects worth knowing about is Hugging Face’s Speech-to-Speech.
What is Hugging Face’s Speech-to-Speech project, and why should you know about it?
Let’s discuss.
Hugging Face’s Speech-to-Speech Project
Hugging Face’s Speech-to-Speech is a modular project that uses the Transformers library to integrate several open-source models into a single speech-to-speech pipeline.
The project aims to match GPT-4o’s capabilities by leveraging open-source models, and it is designed to be easily modified to support many developer needs.
The pipeline consists of several model components working in a cascading manner:
- Voice Activity Detection (VAD)
- Speech to Text (STT)
  - Any Whisper model
  - Lightning Whisper MLX
  - Paraformer – FunASR
- Language Model (LM)
  - Any instruction model on the Hugging Face Hub
  - mlx-lm
  - OpenAI API
- Text to Speech (TTS)
  - Parler-TTS
  - MeloTTS
  - ChatTTS
This doesn’t mean you need to use every model listed above; the pipeline only requires one model from each of the four components to run correctly.
The main objective of the pipeline is to transform incoming speech into another kind of speech, such as speech in a different language or tone.
Let’s set up the project in your environment to test the pipeline.
Project Setup
First, we need to clone the GitHub repository into your environment. The following code will help you do that.
git clone https://github.com/huggingface/speech-to-speech.git
cd speech-to-speech
With the repository cloned, you can install the required packages. The recommended method is to use uv, but pip works as well.
pip install -r requirements.txt
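Alternatively, if you already have uv installed, something like the following should also work, since uv provides a pip-compatible install command.
uv pip install -r requirements.txt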
If you are using a Mac, use the following code.
pip install -r requirements_mac.txt
Ensure that your installation is finished before we proceed. It’s also recommended that you use a virtual environment so the installation does not interfere with your main environment.
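If you need a quick way to create one, Python’s built-in venv module is enough; this is just a minimal sketch, and any environment manager you prefer will do.
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate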
Project Usage
There are several recommended ways to run the pipeline. The first one is the server/client approach.
To do that, you can run the following code to start the pipeline on your server.
python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0
Then, run the following code locally to stream microphone input and play back the generated audio output.
python listen_and_play.py --host <IP address of your server>
Additionally, you can use the following code if you are using a Mac and want to run the pipeline locally.
python s2s_pipeline.py --local_mac_optimal_settings
If you prefer a containerized setup, you can also use Docker. However, you will need the NVIDIA Container Toolkit to run it. With the environment ready, you only need to run a single command.
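The repository provides a Docker Compose setup, so assuming Docker Compose is available alongside the toolkit, the following should start the pipeline:
docker compose up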
That’s how you can run the pipeline. Now, let’s look at some additional arguments you can explore with the Hugging Face Speech-to-Speech pipeline.
Additional Arguments
Each of the STT (Speech-to-Text), LM (Language Model), and TTS (Text-to-Speech) components has pipeline arguments with the prefix stt, lm, or tts, respectively.
For example, this is how to run the pipeline using CUDA.
python s2s_pipeline.py --lm_model_name microsoft/Phi-3-mini-4k-instruct --stt_compile_mode reduce-overhead --tts_compile_mode default --recv_host 0.0.0.0 --send_host 0.0.0.0
In the code above, we explicitly decide which Language Model (LM) to use while controlling how the STT and TTS models are compiled.
The pipeline also supports multi-language use cases, which include English, French, Spanish, Chinese, Japanese, and Korean.
We can add the language argument with the following code for automatic language detection.
python s2s_pipeline.py --stt_model_name large-v3 --language auto --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct
Enforcing a specific language (e.g., Chinese) is also possible with the following code.
python s2s_pipeline.py --stt_model_name large-v3 --language zh --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct
You can check the repository for the full list of arguments to see which ones suit your use case.
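If you prefer to inspect them from the command line, the pipeline script is built on a standard argument parser, so the usual help flag should print everything it accepts:
python s2s_pipeline.py -h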
Conclusion
In pursuit of matching closed-source model capabilities, Hugging Face has built a project called Speech-to-Speech. The project utilizes open-source models from the Hugging Face Hub, tied together with the Transformers library, to create a pipeline that can perform speech-to-speech tasks. In this article, we explored how the project is structured and how to set it up.
I hope this has helped!
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.