Addition of Python API to ailia AI Voice and ailia AI Speech



ailia AI Voice is a library that performs speech synthesis using GPT-SoVITS, while ailia AI Speech is a library that performs speech recognition using Whisper.

These libraries previously provided bindings for C++, C#, and Flutter; we have now added Python bindings.

ailia AI Voice and ailia AI Speech have very few dependencies and run on ONNX without using PyTorch, enabling stable operation that does not depend on framework versions. Additionally, after prototyping in Python, you can seamlessly deploy to mobile devices such as iOS or Android using the bindings for Unity or Flutter.

Both modules can be installed via pip:

pip3 install ailia_voice
pip3 install ailia_speech

Using the Python bindings for ailia AI Voice and ailia AI Speech, speech synthesis and speech recognition can be achieved in just a few lines of code. The models are also downloaded automatically.

Speech synthesis with ailia AI Voice

The sample below downloads the reference_audio_girl.wav file, performs speech synthesis based on the voice in that file, and saves the result.

import ailia_voice

import librosa
import time
import soundfile

import os
import urllib.request

# Load reference audio
ref_text = "水をマレーシアから買わなくてはならない。" # transcript of the reference audio ("We have to buy water from Malaysia.")
ref_file_path = "reference_audio_girl.wav"
if not os.path.exists(ref_file_path):
    urllib.request.urlretrieve(
        "https://github.com/axinc-ai/ailia-models/raw/refs/heads/master/audio_processing/gpt-sovits/reference_audio_captured_by_ax.wav",
        ref_file_path
    )
audio_waveform, sampling_rate = librosa.load(ref_file_path, mono=True)

# Infer
voice = ailia_voice.GPTSoVITS()
voice.initialize_model(model_path = "./models/")
voice.set_reference_audio(ref_text, ailia_voice.AILIA_VOICE_G2P_TYPE_GPT_SOVITS_JA, audio_waveform, sampling_rate)
buf, sampling_rate = voice.synthesize_voice("こんにちは。今日はいい天気ですね。", ailia_voice.AILIA_VOICE_G2P_TYPE_GPT_SOVITS_JA) # "Hello. The weather is nice today."

# Save result
soundfile.write("output.wav", buf, sampling_rate)

Speech recognition with ailia AI Speech

The sample below downloads the demo.wav file and performs speech recognition on it. Since transcribe returns a generator, recognition results can be obtained sequentially even for long audio files.

import ailia_speech

import librosa

import os
import urllib.request

# Load target audio
input_file_path = "demo.wav"
if not os.path.exists(input_file_path):
    urllib.request.urlretrieve(
        "https://github.com/axinc-ai/ailia-models/raw/refs/heads/master/audio_processing/whisper/demo.wav",
        input_file_path
    )
audio_waveform, sampling_rate = librosa.load(input_file_path, mono=True)

# Infer
speech = ailia_speech.Whisper()
speech.initialize_model(model_path = "./models/", model_type = ailia_speech.AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_SMALL)
recognized_text = speech.transcribe(audio_waveform, sampling_rate)
for text in recognized_text:
    print(text)

Various parameters can be passed to the ailia SDK constructors. For example, if you want to use the GPU, you can configure it as shown below.

import ailia
import ailia_voice
import ailia_speech

# Select a GPU-capable environment and pass it to both constructors
env_id = ailia.get_gpu_environment_id()
voice = ailia_voice.GPTSoVITS(env_id = env_id)
speech = ailia_speech.Whisper(env_id = env_id)

If the AI model files already exist in model_path, both speech synthesis and speech recognition run completely offline.
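
As a minimal sketch using the same calls as the samples above: the first run with a network connection downloads the model files into model_path, and subsequent runs load them from disk.

import ailia_voice

# First run (online): initialize_model downloads the model files into ./models/
# Later runs (offline): the cached files in ./models/ are loaded directly
voice = ailia_voice.GPTSoVITS()
voice.initialize_model(model_path = "./models/")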

By passing a callback function to the Whisper constructor, it is possible to obtain intermediate results during speech recognition.

import ailia_speech

def f_callback(text):
    print(text)

speech = ailia_speech.Whisper(callback = f_callback)
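
Combining this with the earlier transcription sample, the callback prints intermediate results while recognition is in progress, and the final results are read from the generator. This is a sketch that reuses only the calls shown above.

import librosa
import ailia_speech

def f_callback(text):
    print(text) # intermediate result

speech = ailia_speech.Whisper(callback = f_callback)
speech.initialize_model(model_path = "./models/", model_type = ailia_speech.AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_SMALL)

audio_waveform, sampling_rate = librosa.load("demo.wav", mono=True)
for text in speech.transcribe(audio_waveform, sampling_rate):
    print(text) # final result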
