ailia AI Voice is a library that performs speech synthesis using GPT-SoVITS, while ailia AI Speech is a library that performs speech recognition using Whisper.
Previously, these libraries provided bindings for C++, C#, and Flutter, and we just added Python bindings.
ailia AI Voice and ailia AI Speech have very few dependencies and run on ONNX without using PyTorch, enabling stable operation without relying on framework versions. Additionally, after prototyping in Python, you can seamlessly deploy to mobile devices like iOS or Android using bindings for Unity or Flutter.
Both modules can be installed via pip:
pip3 install ailia_voice
pip3 install ailia_speech
Using the Python bindings for ailia AI Voice and ailia AI Speech, speech synthesis and speech recognition can be achieved in just a few lines of code. The models are also downloaded automatically.
Speech synthesis with ailia AI Voice
As shown in the sample below, we download the reference_audio_girl.wav file, perform speech synthesis based on the voice in this file, and save the result.
import ailia_voice
import librosa
import soundfile
import os
import urllib.request
# Load reference audio
ref_text = "水をマレーシアから買わなくてはならない。"
ref_file_path = "reference_audio_girl.wav"
if not os.path.exists(ref_file_path):
    urllib.request.urlretrieve(
        "https://github.com/axinc-ai/ailia-models/raw/refs/heads/master/audio_processing/gpt-sovits/reference_audio_captured_by_ax.wav",
        ref_file_path
    )
audio_waveform, sampling_rate = librosa.load(ref_file_path, mono=True)
# Infer
voice = ailia_voice.GPTSoVITS()
voice.initialize_model(model_path = "./models/")
voice.set_reference_audio(ref_text, ailia_voice.AILIA_VOICE_G2P_TYPE_GPT_SOVITS_JA, audio_waveform, sampling_rate)
buf, sampling_rate = voice.synthesize_voice("こんにちは。今日はいい天気ですね。", ailia_voice.AILIA_VOICE_G2P_TYPE_GPT_SOVITS_JA)
# Save result
soundfile.write("output.wav", buf, sampling_rate)
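The same pipeline can also synthesize English text. Below is a minimal sketch, assuming the Python binding exposes an English G2P constant AILIA_VOICE_G2P_TYPE_GPT_SOVITS_EN mirroring the Japanese one used above.
# Reuse the initialized model and reference audio from the sample above.
# AILIA_VOICE_G2P_TYPE_GPT_SOVITS_EN is assumed to be available.
buf, sampling_rate = voice.synthesize_voice(
    "Hello. It is a beautiful day today.",
    ailia_voice.AILIA_VOICE_G2P_TYPE_GPT_SOVITS_EN
)
soundfile.write("output_en.wav", buf, sampling_rate)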
Speech recognition with ailia AI Speech
As shown below, we download the demo.wav file and perform speech recognition on it. Since the return value is a generator, you can sequentially obtain recognition results even for long audio files.
import ailia_speech
import librosa
import os
import urllib.request
# Load target audio
input_file_path = "demo.wav"
if not os.path.exists(input_file_path):
    urllib.request.urlretrieve(
        "https://github.com/axinc-ai/ailia-models/raw/refs/heads/master/audio_processing/whisper/demo.wav",
        input_file_path
    )
audio_waveform, sampling_rate = librosa.load(input_file_path, mono=True)
# Infer
speech = ailia_speech.Whisper()
speech.initialize_model(model_path = "./models/", model_type = ailia_speech.AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_SMALL)
recognized_text = speech.transcribe(audio_waveform, sampling_rate)
for text in recognized_text:
    print(text)
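Because transcribe returns a generator, each recognized segment can be handled the moment it is produced. Here is a minimal sketch that appends segments to a transcript file as they arrive (the file name is arbitrary).
# Each segment is written out as soon as it is yielded, so partial
# transcripts are available before a long audio file finishes.
with open("transcript.txt", "w", encoding="utf-8") as f:
    for text in speech.transcribe(audio_waveform, sampling_rate):
        f.write(text + "\n")
        f.flush()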
Various parameters can be passed to the constructors. For example, if you want to use the GPU, you can configure it as shown below.
import ailia
import ailia_voice
import ailia_speech

env_id = ailia.get_gpu_environment_id()
voice = ailia_voice.GPTSoVITS(env_id = env_id)
speech = ailia_speech.Whisper(env_id = env_id)
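To see which compute environments are available on a given machine, the ailia SDK can enumerate them. A minimal sketch, assuming the get_environment_count / get_environment helpers of the ailia Python API:
import ailia

# List every available inference environment (CPU, GPU backends, ...)
for idx in range(ailia.get_environment_count()):
    env = ailia.get_environment(idx)
    print(env.id, env.name)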
If the AI model files already exist in the model_path, both speech synthesis and speech recognition operate completely offline.
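In other words, the first call to initialize_model downloads the files into model_path, and later runs reuse them without any network access. A minimal sketch inspecting the cached layout (the actual contents depend on the selected model):
import os

model_dir = "./models/"
# After the first run, the downloaded model files are cached here;
# subsequent runs load them directly and work fully offline.
if os.path.exists(model_dir):
    print(sorted(os.listdir(model_dir)))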
By passing a function as Whisper's callback, you can obtain intermediate results during speech recognition.
import ailia_speech

def f_callback(text):
    print(text)

speech = ailia_speech.Whisper(callback = f_callback)
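With the callback registered, intermediate hypotheses are delivered while transcription is running; the final segments are still returned by the generator. A minimal usage sketch, reusing the audio loaded in the recognition example above:
# f_callback prints intermediate results as they are produced;
# the loop below still receives the finalized segments.
speech.initialize_model(model_path = "./models/", model_type = ailia_speech.AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_SMALL)
for text in speech.transcribe(audio_waveform, sampling_rate):
    print(text)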