Introduction
I am going to set up Home Assistant for local home automation on a VM, then create a voice pipeline to build a local AI voice assistant.
I am also going to train a custom voice model for the voice assistant by cloning the voice of Yoda (from Star Wars).
Setting up Home Assistant
- Install Home Assistant: https://www.home-assistant.io/installation
- Access the WebUI: http://homeassistant.local:8123
- Create an account, then set up any or all of your IoT devices under Settings -> Devices & services.
- Additionally, it is possible to install the HA Companion App on mobile and create Automations.
- It is also possible to access HA remotely with Tailscale or ZeroTier by installing their respective add-ons from the Add-on store.
Setting up Voice Pipeline
- Under Settings -> Add-ons, install the following add-ons:
- Whisper (STT)
- Piper (TTS)
- Assist Microphone
- openWakeWord
- Start all the add-ons, then under Settings -> Devices & services, add the 4 Wyoming Protocol devices.
- Also, add the Ollama integration to connect an LLM as the conversation agent for the assistant, making it smarter and able to remember context. Ollama can be set up on the HA host or on any other PC with a GPU for better speed. I am using llama3.2.
Note: If running Ollama on the same machine, the IP of the host is usually the gateway address of the VM's subnet (if not using a bridged adapter). Also, use this command so that Ollama accepts connections from remote services:
```bash
OLLAMA_HOST=0.0.0.0 ollama serve
```
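If the host IP is not obvious, a quick sanity check from a shell inside the VM (assuming a typical NAT setup with the `ip` tool available) is to print the default gateway:

```bash
# Print the VM's default gateway, which is usually the machine hosting Ollama under NAT networking
ip route | awk '/^default/ {print $3}'
```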
- Under Settings -> Voice assistants, click Add Assistant.
- Change the name, then under Conversation agent select the LLM added through the Ollama integration, which is far smarter than the default Home Assistant agent. Then, in the Ollama conversation agent's settings, turn Assist on under Control Home Assistant.
- For Speech-to-Text, select faster-whisper, and for Text-to-Speech, select piper.
- Also, to use a wake word, click Add streaming wake word, then under Wake word engine select openwakeword and choose a wake word.
Note: To activate the assistant through a wake word, a USB speaker and microphone need to be used (passed through to the HAOS VM, if applicable) and selected under the Assist Microphone add-on configuration.
- To access the assistant, use the Assist icon on the HA dashboard. In the WebUI it is only possible to talk to the assistant with text; to use voice, an SSL certificate needs to be added to HA for an HTTPS connection.
- The easier way to converse with the assistant by voice is the Home Assistant Companion App on mobile.
- It is also possible to set up a Wyoming Satellite on another device to access the voice assistant with a wake word trigger (a minimal run sketch follows below): https://github.com/rhasspy/wyoming-satellite.git
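As a rough sketch of what running a satellite looks like (flags taken from the wyoming-satellite README; the name, port, and audio commands are placeholders to adapt to your hardware):

```bash
# Run from inside a clone of the wyoming-satellite repo; adjust mic/speaker commands to your setup
script/run \
  --name 'my-satellite' \
  --uri 'tcp://0.0.0.0:10700' \
  --mic-command 'arecord -r 16000 -c 1 -f S16_LE -t raw' \
  --snd-command 'aplay -r 22050 -c 1 -f S16_LE -t raw'
```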
Note: It is also possible to offload Whisper (STT) and Piper (TTS) to another device with a GPU to improve the assistant's response speed, for example with the containers from https://github.com/rhasspy/wyoming-addons (a sketch follows below). Then add each service in HA using the Wyoming Protocol integration.
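For instance, the two services could be started on the GPU machine with Docker and then added in HA by pointing the Wyoming Protocol integration at that machine's IP and ports. This is a sketch based on the rhasspy Wyoming images; the model and voice choices here are my own:

```bash
# Remote Whisper (STT) on port 10300 and Piper (TTS) on port 10200
docker run -d -p 10300:10300 -v "$PWD/whisper-data:/data" rhasspy/wyoming-whisper \
    --model small-int8 --language en
docker run -d -p 10200:10200 -v "$PWD/piper-data:/data" rhasspy/wyoming-piper \
    --voice en_US-lessac-medium
```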
- It is also possible to train a custom wake word: https://www.home-assistant.io/voice_control/create_wake_word/
Training a custom voice clone model
I will be training the voice model using Piper, which produces the model in a standard format (.onnx) that can be used in many TTS services as well as for the voice assistant. This process requires Linux and an Nvidia GPU. I used WSL with Ubuntu-22.04.
Step 1: Prerequisites
Piper works best with Python 3.10 and CUDA 11.7.
```bash
sudo apt update && sudo apt upgrade
sudo apt install python3-venv python3-dev gcc
sudo apt-get install g++ freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libglu1-mesa libglu1-mesa-dev

## Adding the GPU drivers repo
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
# Install the compatible GPU drivers if not preinstalled

## Installing CUDA 11.7
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
sudo apt-get update
sudo apt install cuda-11-7

## Add CUDA to the environment
echo 'export PATH=/usr/local/cuda-11.7/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
sudo ldconfig
sudo reboot

## To verify the install
nvcc -V
nvidia-smi
```
Step 2: Data Prep
The key to any AI project is getting CLEAN DATA.
There are a couple of options for getting the data. I pulled audio data from YouTube videos:
- Download YouTube videos: around 1 hour of clips should be good enough.
```bash
cd ~
mkdir dataprep
cd dataprep
python3 -m venv .venv
source .venv/bin/activate
pip install yt-dlp
yt-dlp -x --audio-format wav "YT-URL"
```
- Clean up your audio samples: Piper requires audio samples of around 10 seconds each. Use Audacity to trim the clips (or batch-convert with ffmpeg as sketched after this list). When exporting from Audacity, Piper requires the audio in this format:
- Format: wav
- Channels: mono
- Sample Rate: 22050 Hz
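If you'd rather not re-export every clip from Audacity, roughly the same result can be obtained by batch-converting with ffmpeg (a sketch; the converted/ folder name is arbitrary):

```bash
# Convert every clip to mono, 22050 Hz WAV (16-bit PCM is ffmpeg's default for .wav output)
mkdir -p converted
for file in *.wav; do
    ffmpeg -i "$file" -ac 1 -ar 22050 "converted/$file"
done
```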
- Remove silence and cut up your files:
```bash
for file in ./*.wav; do
    ffmpeg -i "$file" -af "silenceremove=stop_periods=-1:stop_duration=3:stop_threshold=-25dB" "./${file%.wav}_nosilence.wav"
done
mkdir wav
for file in *.wav; do
    ffmpeg -i "$file" -f segment -segment_time 15 -c copy "./wav/split_${file%.*}_%03d.wav"
done
```
- Transcribe your audio samples with Whisper (using a Python script):
```bash
pip install git+https://github.com/openai/whisper.git
nano transcribe.py
```
```python
import os
import whisper

# Transcribe every wav clip into an LJSpeech-style metadata.csv (one "id|text" line per file)
model = whisper.load_model("base")
audio_dir = "./wav"
output_csv = "./metadata.csv"

audio_files = [f for f in os.listdir(audio_dir) if f.endswith(".wav")]
audio_files.sort()

with open(output_csv, "w") as f:
    for audio_file in audio_files:
        audio_path = os.path.join(audio_dir, audio_file)
        result = model.transcribe(audio_path)
        transcription = result["text"].strip()
        file_id = os.path.splitext(audio_file)[0]
        f.write(f"{file_id}|{transcription}\n")

print(f"Metadata saved to {output_csv}")
```
```bash
python3 transcribe.py
deactivate
cd ~/
```
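Before moving on, it is worth sanity-checking the output: each wav clip should have exactly one id|transcription line in metadata.csv. A quick check:

```bash
# The two counts below should match; then eyeball a few transcriptions
ls ~/dataprep/wav/*.wav | wc -l
wc -l < ~/dataprep/metadata.csv
head -n 3 ~/dataprep/metadata.csv
```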
Step 3: Training Prep
```bash
mkdir training
cd training
git clone https://github.com/rhasspy/piper.git
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install pip==23.3.1
pip install numpy==1.24.4
pip install torchmetrics==0.11.4
cd piper/src/python
python3 -m pip install --upgrade wheel setuptools
pip3 install -e .
./build_monotonic_align.sh
```
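To confirm the editable install worked before preprocessing, printing the module's help is a quick smoke test:

```bash
# Should print usage information if piper_train was installed correctly
python3 -m piper_train.preprocess --help
```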
- Process the dataset into the format Piper requires:
```bash
python3 -m piper_train.preprocess \
  --language en \
  --input-dir ~/dataprep/ \
  --output-dir ~/training/dataset \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 22050
```
- Download a pretrained checkpoint to fine-tune from (resuming from an existing voice is much faster than training from scratch):

```bash
wget https://huggingface.co/datasets/rhasspy/piper-checkpoints/resolve/main/en/en_US/lessac/medium/epoch%3D2164-step%3D1355540.ckpt
```
Step 4: Training the model
```bash
python3 -m piper_train \
  --dataset-dir ~/training/dataset/ \
  --accelerator 'gpu' \
  --gpus 1 \
  --batch-size 16 \
  --validation-split 0.0 \
  --num-test-examples 0 \
  --max_epochs 6000 \
  --resume_from_checkpoint "~/training/epoch=2164-step=1355540.ckpt" \
  --checkpoint-epochs 1 \
  --precision 16 \
  --max-phoneme-ids 400 \
  --quality medium
```
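Training progress can optionally be watched with TensorBoard, since piper_train writes PyTorch Lightning logs under the dataset directory (this assumes TensorBoard is installed into the training venv):

```bash
# Optional: view loss curves in a browser at http://localhost:6006
pip install tensorboard
tensorboard --logdir ~/training/dataset/lightning_logs
```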
- Pausing and resuming:
- The training can be interrupted with Ctrl-C and picked up where it left off.
- To resume training, use the same command, but change the checkpoint file to the most recent checkpoint created under dataset/lightning_logs.
- Exporting the voice model: after the final epoch is reached, training is done and the model can be exported:
```bash
# Export the model
python3 -m piper_train.export_onnx \
  "~/training/dataset/lightning_logs/version_0/checkpoints/epoch=5999-step=xxxxxx.ckpt" \
  ~/output/directory/voicemodel.onnx
```
```bash
# Copy the training config json file to your model file directory
cp ~/training/dataset/config.json ~/output/directory/voicemodel.onnx.json
```
Testing the Voice Model
```bash
pip install piper-tts
echo "Completed, the training is" | piper -m voicemodel.onnx --output_file test.wav
```
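To hear the result, play the generated file with any audio player, for example (assuming ALSA's aplay is available):

```bash
aplay test.wav
```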
Using the Voice Model for HA voice assistant
- The voice model can be transferred to Home Assistant using the Samba share add-on.
- Inside the share folder, create a new folder named piper, and paste the voicemodel.onnx and voicemodel.onnx.json files into it.
- To edit the name this voice model shows up as inside HA, edit the dataset and quality fields inside the voicemodel.onnx.json file.
- After restarting Home Assistant, this voice model should be ready for use. Select it by editing the voice assistant settings: under Text-to-speech, add the language (English), and under Voice the new voice model should appear.
- It is also possible to change the system prompt of the LLM in its settings to match the persona of the character whose voice was cloned.