I've been tinkering with local speech capabilities recently, and along the way I put together a minimal, runnable set of sherpa-onnx demos.

This post doesn't dwell on concepts; it uses my local project to run through 5 scenarios directly:

  1. Offline speech recognition (ASR)
  2. Streaming speech recognition (ASR)
  3. Speech synthesis (TTS)
  4. Voice activity detection (VAD)
  5. A combined VAD + ASR pipeline

The project path used in this post's examples:

C:\Users\wlf18\data\codes\gitcode\ai-voice-platform\sherpa-demo

If your goal right now is just two things:

  1. Verify that sherpa-onnx runs on a local Windows machine
  2. Find a sample project small and clear enough to build on

then this demo set is basically all you need.

The sherpa-demo project structure

The project is essentially a collection of minimal demos: each executable handles exactly one scenario, which makes individual verification and troubleshooting straightforward.

The directory layout looks roughly like this:

sherpa-demo/
├── build.ps1
├── CMakeLists.txt
├── README.md
└── src/
    ├── demo_offline_asr.cpp
    ├── demo_online_asr.cpp
    ├── demo_tts.cpp
    ├── demo_vad.cpp
    └── demo_vad_asr.cpp

The dependencies are also direct:

  1. Windows x64
  2. Visual Studio 2022
  3. CMake
  4. The prebuilt binaries under ../sherpa-onnx/win_x64/
  5. The model directories under ../models/

The models I actually use here:

  1. sherpa-onnx-sense-voice: offline ASR
  2. sherpa-onnx-streaming-zipformer-bilingual-zh-en: streaming ASR
  3. vits-zh-aishell3: TTS
  4. silero_vad.onnx: VAD

Build the project first

The simplest way is to run the project's build.ps1 directly:

.\build.ps1

Its logic is simple:

  1. Look for cmake.exe on PATH first
  2. Fall back to the CMake bundled with Visual Studio if that fails
  3. Automatically pick an available Visual Studio generator
  4. Run configure and build

The script:

# build.ps1 - configure + build sherpa-demo (Release x64)
$ErrorActionPreference = 'Stop'

$root = $PSScriptRoot
$build = Join-Path $root 'build'

# Prefer cmake from PATH; otherwise fall back to the copy bundled with Visual Studio.
$cmake = Get-Command cmake.exe -ErrorAction SilentlyContinue
if ($null -eq $cmake) {
    $bundled = "${env:ProgramFiles}\Microsoft Visual Studio\18\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe"
    if (-not (Test-Path $bundled)) {
        $bundled = "${env:ProgramFiles}\Microsoft Visual Studio\2022\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe"
    }
    if (-not (Test-Path $bundled)) {
        throw "cmake.exe not found. Install cmake on PATH, or adjust the bundled path in this script."
    }
    $cmakeExe = $bundled
} else {
    $cmakeExe = $cmake.Path
}

# Pick the first available Visual Studio generator
$generator = $null
foreach ($g in @('Visual Studio 18 2026', 'Visual Studio 17 2022')) {
    $help = & $cmakeExe --help 2>&1 | Out-String
    if ($help -match [regex]::Escape($g)) {
        $generator = $g
        break
    }
}
if (-not $generator) { throw "No usable Visual Studio generator found" }

Write-Host "Using cmake: $cmakeExe"
Write-Host "Using generator: $generator"

& $cmakeExe -S $root -B $build -G $generator -A x64
if ($LASTEXITCODE -ne 0) { throw "cmake configure failed" }

& $cmakeExe --build $build --config Release
if ($LASTEXITCODE -ne 0) { throw "cmake build failed" }

Write-Host ""
Write-Host "Build OK. Executables:" -ForegroundColor Green
Get-ChildItem (Join-Path $build 'Release') -Filter 'demo_*.exe' | ForEach-Object {
    Write-Host " $($_.FullName)"
}

If you'd rather type the commands yourself, you can also run:

cmake -S . -B build -G "Visual Studio 17 2022" -A x64
cmake --build build --config Release

Once the build finishes, all the exes and runtime DLLs land in:

build\Release\

What the CMakeLists does

This project's CMakeLists.txt works well as a minimal template.

It does three key things:

  1. Sets SHERPA_ONNX_ROOT to wire in the headers and library files
  2. Links sherpa-onnx and onnxruntime uniformly for all 5 demos
  3. Copies the runtime DLLs next to each exe in a POST_BUILD step

The core configuration:

cmake_minimum_required(VERSION 3.15)
project(sherpa_demo CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)

if(MSVC)
  add_compile_options(/utf-8 /W3 /wd4305)
  add_definitions(-D_CRT_SECURE_NO_WARNINGS -DNOMINMAX)
endif()

# Points at the sherpa-onnx Windows prebuilt package already downloaded into the repo
set(SHERPA_ONNX_ROOT "${CMAKE_SOURCE_DIR}/../sherpa-onnx/win_x64" CACHE PATH
    "Path to sherpa-onnx prebuilt root containing bin/include/lib")

if(NOT EXISTS "${SHERPA_ONNX_ROOT}/include/sherpa-onnx/c-api/cxx-api.h")
  message(FATAL_ERROR
      "Cannot find sherpa-onnx cxx-api header at ${SHERPA_ONNX_ROOT}. "
      "Set -DSHERPA_ONNX_ROOT=<path> or place the prebuilt under ../sherpa-onnx/win_x64.")
endif()

include_directories("${SHERPA_ONNX_ROOT}/include")

set(SHERPA_ONNX_CXX_LIB "${SHERPA_ONNX_ROOT}/lib/sherpa-onnx-cxx-api.lib")
set(SHERPA_ONNX_C_LIB "${SHERPA_ONNX_ROOT}/lib/sherpa-onnx-c-api.lib")
set(ONNXRUNTIME_LIB "${SHERPA_ONNX_ROOT}/lib/onnxruntime.lib")

# These DLLs must sit right next to the .exe at runtime
set(SHERPA_RUNTIME_DLLS
    "${SHERPA_ONNX_ROOT}/bin/onnxruntime.dll"
    "${SHERPA_ONNX_ROOT}/bin/onnxruntime_providers_shared.dll"
    "${SHERPA_ONNX_ROOT}/lib/sherpa-onnx-c-api.dll"
    "${SHERPA_ONNX_ROOT}/lib/sherpa-onnx-cxx-api.dll"
)

set(DEMO_TARGETS
    demo_offline_asr
    demo_online_asr
    demo_tts
    demo_vad
    demo_vad_asr
)

foreach(name ${DEMO_TARGETS})
  add_executable(${name} src/${name}.cpp)
  target_link_libraries(${name} PRIVATE
      "${SHERPA_ONNX_CXX_LIB}"
      "${SHERPA_ONNX_C_LIB}"
      "${ONNXRUNTIME_LIB}"
  )
  if(WIN32)
    target_link_libraries(${name} PRIVATE ws2_32 winmm)
  endif()
  add_custom_command(TARGET ${name} POST_BUILD
      COMMAND ${CMAKE_COMMAND} -E copy_if_different
          ${SHERPA_RUNTIME_DLLS}
          "$<TARGET_FILE_DIR:${name}>"
      COMMENT "Copy sherpa-onnx runtime DLLs next to ${name}.exe"
  )
endforeach()

The payoff of this configuration is very direct:

  1. You never have to copy DLLs around by hand
  2. Every demo has exactly the same dependencies
  3. Adding another demo_xxx.cpp later is easy; a skeleton sketch follows below
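To add a new demo, you append its name to DEMO_TARGETS in CMakeLists.txt and drop a matching file into src/. As a minimal sketch (demo_hello is a hypothetical name, not part of the repo), the new file only needs the shared boilerplate:

// src/demo_hello.cpp (hypothetical) - skeleton for a new demo target.
// Build it by appending demo_hello to DEMO_TARGETS in CMakeLists.txt.

#ifndef NOMINMAX
#define NOMINMAX
#endif

#include <iostream>

#include "sherpa-onnx/c-api/cxx-api.h"

namespace cxx = sherpa_onnx::cxx;

int main(int argc, char *argv[]) {
  // Follow the same pattern as the other demos: resolve model paths,
  // fill in a config struct, create the engine, then feed audio or text.
  std::cout << "demo_hello: wired up and ready for sherpa-onnx calls\n";
  return 0;
}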

Demo 1: Offline ASR

Start with the most basic case: offline recognition.

How to run it:

cd .\build\Release\
.\demo_offline_asr.exe

The program reads the wav files under:

..\..\..\models\sherpa-onnx-sense-voice\test_wavs\

and prints, for each one:

  1. File name
  2. Detected language
  3. Decode time
  4. RTF
  5. Recognized text

The source:

// demo_offline_asr.cpp
//
// Runs SenseVoice (models/sherpa-onnx-sense-voice) over every wav under test_wavs/
// and prints file name / detected language / recognized text / decode time (ms) / RTF.
//
// By default it runs from sherpa-demo/build/Release/ and locates the models via the
// relative path ..\..\..\models. You can also pass one argument as the models root:
//   demo_offline_asr.exe D:\path\to\models

#ifndef NOMINMAX
#define NOMINMAX
#endif

#include <algorithm>  // std::sort (missing in the original listing)
#include <chrono>
#include <cstdio>
#include <filesystem>
#include <iostream>
#include <string>
#include <vector>

#include "sherpa-onnx/c-api/cxx-api.h"

namespace fs = std::filesystem;
namespace cxx = sherpa_onnx::cxx;

namespace {

fs::path ResolveModelsDir(int argc, char *argv[]) {
  if (argc >= 2) {
    return fs::path(argv[1]);
  }
  // By default, assume the executable lives in sherpa-demo/build/Release/:
  // the repo root is three levels up, plus models.
  fs::path here = fs::current_path();
  fs::path candidate = here / ".." / ".." / ".." / "models";
  if (fs::exists(candidate / "sherpa-onnx-sense-voice")) {
    return fs::weakly_canonical(candidate);
  }
  // Fallback: look directly under the current directory
  return fs::weakly_canonical(here / "models");
}

}  // namespace

int main(int argc, char *argv[]) {
  fs::path models_dir = ResolveModelsDir(argc, argv);
  fs::path model_dir = models_dir / "sherpa-onnx-sense-voice";
  fs::path model_file = model_dir / "model.int8.onnx";
  fs::path tokens_file = model_dir / "tokens.txt";
  fs::path wavs_dir = model_dir / "test_wavs";

  std::cout << "[demo_offline_asr] models_dir = " << models_dir.string() << "\n";
  std::cout << "[demo_offline_asr] model = " << model_file.string() << "\n";
  std::cout << "[demo_offline_asr] tokens = " << tokens_file.string() << "\n";
  std::cout << "[demo_offline_asr] wavs_dir = " << wavs_dir.string() << "\n";

  if (!fs::exists(model_file) || !fs::exists(tokens_file) || !fs::exists(wavs_dir)) {
    std::cerr << "Required model/wavs not found, please check paths above.\n";
    return 1;
  }

  cxx::OfflineRecognizerConfig config;
  config.model_config.sense_voice.model = model_file.string();
  config.model_config.sense_voice.language = "auto";
  config.model_config.sense_voice.use_itn = true;
  config.model_config.tokens = tokens_file.string();
  config.model_config.num_threads = 1;
  config.model_config.provider = "cpu";
  config.model_config.debug = false;

  auto load_start = std::chrono::steady_clock::now();
  auto recognizer = cxx::OfflineRecognizer::Create(config);
  if (recognizer.Get() == nullptr) {
    std::cerr << "Failed to create OfflineRecognizer (SenseVoice)\n";
    return 2;
  }
  auto load_ms = std::chrono::duration<double, std::milli>(
                     std::chrono::steady_clock::now() - load_start)
                     .count();
  std::cout << "[demo_offline_asr] recognizer loaded in " << load_ms << " ms\n\n";

  // Collect the wavs and sort by file name for stable output
  std::vector<fs::path> wavs;
  for (const auto &entry : fs::directory_iterator(wavs_dir)) {
    if (entry.is_regular_file() && entry.path().extension() == ".wav") {
      wavs.push_back(entry.path());
    }
  }
  std::sort(wavs.begin(), wavs.end());

  if (wavs.empty()) {
    std::cerr << "No .wav files in " << wavs_dir.string() << "\n";
    return 3;
  }

  std::printf("%-18s %-6s %-8s %-7s %s\n",
              "file", "lang", "elapsed", "rtf", "text");
  std::printf("%-18s %-6s %-8s %-7s %s\n",
              "----", "----", "--------", "-----", "----");

  for (const auto &wav_path : wavs) {
    cxx::Wave wave = cxx::ReadWave(wav_path.string());
    if (wave.samples.empty()) {
      std::cerr << " [skip] cannot read " << wav_path.filename().string() << "\n";
      continue;
    }

    auto stream = recognizer.CreateStream();
    stream.AcceptWaveform(wave.sample_rate, wave.samples.data(),
                          static_cast<int32_t>(wave.samples.size()));

    auto t0 = std::chrono::steady_clock::now();
    recognizer.Decode(&stream);
    auto result = recognizer.GetResult(&stream);
    auto elapsed_ms = std::chrono::duration<double, std::milli>(
                          std::chrono::steady_clock::now() - t0)
                          .count();

    double audio_seconds = static_cast<double>(wave.samples.size()) /
                           static_cast<double>(wave.sample_rate);
    double rtf = audio_seconds > 0 ? (elapsed_ms / 1000.0) / audio_seconds : 0;

    std::printf("%-18s %-6s %7.1fms %-7.3f %s\n",
                wav_path.filename().string().c_str(),
                result.lang.empty() ? "?" : result.lang.c_str(),
                elapsed_ms, rtf, result.text.c_str());
  }

  return 0;
}

Two things in this code deserve the most attention:

  1. config.model_config.sense_voice.language = "auto" enables automatic language detection
  2. config.model_config.sense_voice.use_itn = true applies inverse text normalization (ITN) to the results

If all you need is a minimal offline recognition capability, this is essentially your template.
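Two common variants of that config, shown as a minimal sketch (the field names are the same ones used above; the values are illustrative): pin the language instead of auto-detecting, and keep the raw text without ITN.

// Sketch: config variants for demo_offline_asr.cpp (illustrative values).
cxx::OfflineRecognizerConfig config;
config.model_config.sense_voice.model = model_file.string();
config.model_config.sense_voice.language = "zh";  // pin to Chinese instead of "auto"
config.model_config.sense_voice.use_itn = false;  // keep raw, unnormalized text
config.model_config.tokens = tokens_file.string();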

Demo 2: Streaming ASR

The second demo is streaming recognition.

How to run it:

cd .\build\Release\
.\demo_online_asr.exe

You can also specify the models root and an input audio file:

.\demo_online_asr.exe ..\..\..\models D:\some\other.wav

The idea behind this demo is practical:

  1. Treat a wav file as "microphone input"
  2. Feed 0.1 s of audio at a time
  3. Print a partial result whenever the text changes
  4. Print FINAL at the end

The source:

// demo_online_asr.cpp
//
// Runs streaming recognition with streaming-zipformer-bilingual-zh-en, treating
// models/sherpa-onnx-sense-voice/test_wavs/zh.wav as "microphone input":
// audio is fed in 0.1 s chunks, and a partial is printed whenever the text changes.
//
// By default, run from sherpa-demo/build/Release/:
//   demo_online_asr.exe                           // default wav
//   demo_online_asr.exe <models_dir>              // custom models root
//   demo_online_asr.exe <models_dir> <input_wav>  // custom input wav

#ifndef NOMINMAX
#define NOMINMAX
#endif

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <filesystem>
#include <iostream>
#include <string>
#include <vector>

#include "sherpa-onnx/c-api/cxx-api.h"

namespace fs = std::filesystem;
namespace cxx = sherpa_onnx::cxx;

namespace {

fs::path ResolveModelsDir(int argc, char *argv[]) {
  if (argc >= 2) {
    return fs::path(argv[1]);
  }
  fs::path candidate =
      fs::current_path() / ".." / ".." / ".." / "models";
  if (fs::exists(candidate / "sherpa-onnx-streaming-zipformer-bilingual-zh-en")) {
    return fs::weakly_canonical(candidate);
  }
  return fs::weakly_canonical(fs::current_path() / "models");
}

}  // namespace

int main(int argc, char *argv[]) {
  fs::path models_dir = ResolveModelsDir(argc, argv);
  fs::path model_dir =
      models_dir / "sherpa-onnx-streaming-zipformer-bilingual-zh-en";
  fs::path encoder = model_dir / "encoder-epoch-99-avg-1.onnx";
  fs::path decoder = model_dir / "decoder-epoch-99-avg-1.onnx";
  fs::path joiner = model_dir / "joiner-epoch-99-avg-1.onnx";
  fs::path tokens = model_dir / "tokens.txt";

  fs::path input_wav;
  if (argc >= 3) {
    input_wav = fs::path(argv[2]);
  } else {
    input_wav = models_dir / "sherpa-onnx-sense-voice" / "test_wavs" / "zh.wav";
  }

  std::cout << "[demo_online_asr] encoder = " << encoder.string() << "\n";
  std::cout << "[demo_online_asr] decoder = " << decoder.string() << "\n";
  std::cout << "[demo_online_asr] joiner = " << joiner.string() << "\n";
  std::cout << "[demo_online_asr] tokens = " << tokens.string() << "\n";
  std::cout << "[demo_online_asr] input = " << input_wav.string() << "\n";

  for (const auto &p : {encoder, decoder, joiner, tokens, input_wav}) {
    if (!fs::exists(p)) {
      std::cerr << "missing file: " << p.string() << "\n";
      return 1;
    }
  }

  cxx::OnlineRecognizerConfig config;
  config.model_config.transducer.encoder = encoder.string();
  config.model_config.transducer.decoder = decoder.string();
  config.model_config.transducer.joiner = joiner.string();
  config.model_config.tokens = tokens.string();
  config.model_config.num_threads = 1;
  config.model_config.provider = "cpu";
  config.model_config.debug = false;
  config.decoding_method = "greedy_search";
  config.enable_endpoint = true;
  config.rule1_min_trailing_silence = 2.4f;
  config.rule2_min_trailing_silence = 1.2f;
  config.rule3_min_utterance_length = 20.0f;

  auto load_start = std::chrono::steady_clock::now();
  auto recognizer = cxx::OnlineRecognizer::Create(config);
  if (recognizer.Get() == nullptr) {
    std::cerr << "Failed to create OnlineRecognizer (zipformer)\n";
    return 2;
  }
  auto load_ms = std::chrono::duration<double, std::milli>(
                     std::chrono::steady_clock::now() - load_start)
                     .count();
  std::cout << "[demo_online_asr] recognizer loaded in " << load_ms << " ms\n";

  cxx::Wave wave = cxx::ReadWave(input_wav.string());
  if (wave.samples.empty()) {
    std::cerr << "Failed to read wav: " << input_wav.string() << "\n";
    return 3;
  }
  std::cout << "[demo_online_asr] wav samples=" << wave.samples.size()
            << " sr=" << wave.sample_rate
            << " duration=" << (double)wave.samples.size() / wave.sample_rate
            << "s\n";

  auto stream = recognizer.CreateStream();

  const int32_t chunk_samples = wave.sample_rate / 10;  // 0.1s
  std::string last_text;
  int partial_count = 0;

  auto t0 = std::chrono::steady_clock::now();

  for (size_t off = 0; off < wave.samples.size(); off += chunk_samples) {
    int32_t n = static_cast<int32_t>(
        std::min<size_t>(chunk_samples, wave.samples.size() - off));
    stream.AcceptWaveform(wave.sample_rate, wave.samples.data() + off, n);

    while (recognizer.IsReady(&stream)) {
      recognizer.Decode(&stream);
    }

    auto result = recognizer.GetResult(&stream);
    if (!result.text.empty() && result.text != last_text) {
      ++partial_count;
      double ms_now = std::chrono::duration<double, std::milli>(
                          std::chrono::steady_clock::now() - t0)
                          .count();
      std::printf(" partial #%-3d t=%6.0fms %s\n",
                  partial_count, ms_now, result.text.c_str());
      last_text = result.text;
    }

    if (recognizer.IsEndpoint(&stream)) {
      auto end_result = recognizer.GetResult(&stream);
      std::printf(" endpoint at offset=%zu, final_so_far: %s\n",
                  off, end_result.text.c_str());
      recognizer.Reset(&stream);
      last_text.clear();
    }
  }

  // Signal end of input and drain the decoder
  stream.InputFinished();
  while (recognizer.IsReady(&stream)) {
    recognizer.Decode(&stream);
  }
  auto final_result = recognizer.GetResult(&stream);
  auto total_ms = std::chrono::duration<double, std::milli>(
                      std::chrono::steady_clock::now() - t0)
                      .count();

  double audio_seconds =
      static_cast<double>(wave.samples.size()) / wave.sample_rate;
  double rtf = audio_seconds > 0 ? (total_ms / 1000.0) / audio_seconds : 0;

  std::cout << "\n[demo_online_asr] FINAL: " << final_result.text << "\n";
  std::printf("[demo_online_asr] total=%.1fms audio=%.2fs rtf=%.3f partials=%d\n",
              total_ms, audio_seconds, rtf, partial_count);
  return 0;
}

The most reusable part of this code is the main loop:

  1. stream.AcceptWaveform(...) feeds audio chunk by chunk
  2. while (recognizer.IsReady(&stream)) keeps decoding
  3. recognizer.GetResult(&stream) fetches the current partial
  4. recognizer.IsEndpoint(&stream) decides whether an utterance has ended
  5. recognizer.Reset(&stream) resets the stream state

If you later hook up real-time microphone input, this demo's structure carries over almost unchanged; a rough sketch follows.
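Here ReadMicChunk is a hypothetical blocking capture function standing in for whatever audio API you use (WASAPI, PortAudio, etc.), and stream/recognizer are the same objects as in the demo; this is a sketch, not part of the project.

// Hypothetical capture loop (sketch). ReadMicChunk() is assumed to block
// until ~0.1 s of 16 kHz mono float samples are available and to return
// false when capture stops.
std::vector<float> chunk(1600);  // 0.1 s at 16 kHz
while (ReadMicChunk(chunk.data(), chunk.size())) {
  stream.AcceptWaveform(16000, chunk.data(),
                        static_cast<int32_t>(chunk.size()));
  while (recognizer.IsReady(&stream)) {
    recognizer.Decode(&stream);
  }
  auto result = recognizer.GetResult(&stream);
  // ...print partials, check IsEndpoint(), Reset() exactly as in the wav demo.
}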

Demo 3: TTS

The third demo is speech synthesis.

How to run it:

cd .\build\Release\
.\demo_tts.exe

Pass custom text:

.\demo_tts.exe "今天天气真不错"

Going a step further, you can also specify the models root and a speaker_id:

.\demo_tts.exe "<text>" ..\..\..\models 0

The source:

// demo_tts.cpp
//
// Synthesizes Chinese speech with vits-zh-aishell3 and writes tts-out.wav.
//
// Usage:
//   demo_tts.exe                                     // default text
//   demo_tts.exe "<text>"                            // custom text (cp936 and utf-8 command lines both work)
//   demo_tts.exe "<text>" <models_dir>               // custom models root
//   demo_tts.exe "<text>" <models_dir> <speaker_id>
//
// Note: the sources are compiled with /utf-8, so string literals are utf-8.

#ifndef NOMINMAX
#define NOMINMAX
#endif

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <filesystem>
#include <iostream>
#include <string>

#include "sherpa-onnx/c-api/cxx-api.h"

namespace fs = std::filesystem;
namespace cxx = sherpa_onnx::cxx;

namespace {

fs::path ResolveModelsDir(const std::string &arg) {
  if (!arg.empty()) {
    return fs::path(arg);
  }
  fs::path candidate =
      fs::current_path() / ".." / ".." / ".." / "models";
  if (fs::exists(candidate / "vits-zh-aishell3")) {
    return fs::weakly_canonical(candidate);
  }
  return fs::weakly_canonical(fs::current_path() / "models");
}

}  // namespace

int main(int argc, char *argv[]) {
  std::string text =
      "你好,欢迎使用 sherpa-onnx 语音合成。今天是个测试 demo 的好日子。";
  std::string models_arg;
  int32_t speaker_id = 0;

  if (argc >= 2) text = argv[1];
  if (argc >= 3) models_arg = argv[2];
  if (argc >= 4) speaker_id = std::atoi(argv[3]);

  fs::path models_dir = ResolveModelsDir(models_arg);
  fs::path model_dir = models_dir / "vits-zh-aishell3";
  fs::path model_file = model_dir / "vits-aishell3.onnx";
  fs::path tokens = model_dir / "tokens.txt";
  fs::path lexicon = model_dir / "lexicon.txt";

  std::cout << "[demo_tts] model = " << model_file.string() << "\n";
  std::cout << "[demo_tts] tokens = " << tokens.string() << "\n";
  std::cout << "[demo_tts] lexicon = " << lexicon.string() << "\n";
  std::cout << "[demo_tts] sid = " << speaker_id << "\n";
  std::cout << "[demo_tts] text = " << text << "\n";

  for (const auto &p : {model_file, tokens, lexicon}) {
    if (!fs::exists(p)) {
      std::cerr << "missing file: " << p.string() << "\n";
      return 1;
    }
  }

  cxx::OfflineTtsConfig config;
  config.model.vits.model = model_file.string();
  config.model.vits.tokens = tokens.string();
  config.model.vits.lexicon = lexicon.string();
  config.model.num_threads = 1;
  config.model.provider = "cpu";
  config.model.debug = false;
  config.max_num_sentences = 1;

  auto load_start = std::chrono::steady_clock::now();
  auto tts = cxx::OfflineTts::Create(config);
  if (tts.Get() == nullptr) {
    std::cerr << "Failed to create OfflineTts\n";
    return 2;
  }
  auto load_ms = std::chrono::duration<double, std::milli>(
                     std::chrono::steady_clock::now() - load_start)
                     .count();
  std::cout << "[demo_tts] tts loaded in " << load_ms
            << " ms, sample_rate=" << tts.SampleRate()
            << ", num_speakers=" << tts.NumSpeakers() << "\n";

  if (tts.NumSpeakers() > 0 && speaker_id >= tts.NumSpeakers()) {
    std::cerr << "speaker_id " << speaker_id << " out of range, fallback to 0\n";
    speaker_id = 0;
  }

  auto t0 = std::chrono::steady_clock::now();
  auto audio = tts.Generate(text, speaker_id, /*speed=*/1.0f);
  auto gen_ms = std::chrono::duration<double, std::milli>(
                    std::chrono::steady_clock::now() - t0)
                    .count();

  if (audio.samples.empty()) {
    std::cerr << "TTS generated empty audio\n";
    return 3;
  }

  fs::path out = fs::current_path() / "tts-out.wav";
  cxx::Wave wave;
  wave.samples = audio.samples;
  wave.sample_rate = audio.sample_rate;
  if (!cxx::WriteWave(out.string(), wave)) {
    std::cerr << "Failed to write " << out.string() << "\n";
    return 4;
  }

  double audio_seconds =
      static_cast<double>(audio.samples.size()) / audio.sample_rate;
  double rtf = audio_seconds > 0 ? (gen_ms / 1000.0) / audio_seconds : 0;

  std::cout << "[demo_tts] wrote " << out.string() << "\n";
  std::printf("[demo_tts] generated %.2fs audio in %.1fms (rtf=%.3f), sr=%d\n",
              audio_seconds, gen_ms, rtf, audio.sample_rate);
  return 0;
}

The core of this demo is very clear:

  1. Load model / tokens / lexicon
  2. Call tts.Generate(text, speaker_id, 1.0f)
  3. Write the result to tts-out.wav

If you just need to bolt an offline Chinese TTS capability onto a business system, this entry point is simple enough. To hear the result immediately, see the playback sketch below.
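A minimal playback sketch (my addition, not part of the demo set) can use winmm, which the project's CMakeLists already links:

// play_tts_out.cpp (sketch): play tts-out.wav synchronously via winmm.
#include <windows.h>
#include <mmsystem.h>  // PlaySoundA; requires linking winmm.lib

int main() {
  // SND_FILENAME: first argument is a file path; SND_SYNC blocks until playback ends.
  PlaySoundA("tts-out.wav", nullptr, SND_FILENAME | SND_SYNC);
  return 0;
}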

Demo 4: VAD

The fourth demo is pure VAD.

How to run it:

cd .\build\Release\
.\demo_vad.exe

You can also pass a custom wav:

.\demo_vad.exe ..\..\..\models C:\path\to\your.wav

The program cuts the input audio into speech segments and prints, for each segment:

  1. Start time
  2. End time
  3. Duration

The source:

// demo_vad.cpp
//
// Runs voice activity detection with silero_vad.onnx on zh_vad.wav and prints
// each speech segment's [start_ms, end_ms] plus its duration.
//
// Usage:
//   demo_vad.exe                          // default wav
//   demo_vad.exe <models_dir>             // custom models root
//   demo_vad.exe <models_dir> <input_wav>

#ifndef NOMINMAX
#define NOMINMAX
#endif

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <filesystem>
#include <iostream>
#include <string>
#include <vector>

#include "sherpa-onnx/c-api/cxx-api.h"

namespace fs = std::filesystem;
namespace cxx = sherpa_onnx::cxx;

namespace {

fs::path ResolveModelsDir(int argc, char *argv[]) {
  if (argc >= 2) {
    return fs::path(argv[1]);
  }
  fs::path candidate =
      fs::current_path() / ".." / ".." / ".." / "models";
  if (fs::exists(candidate / "silero_vad.onnx")) {
    return fs::weakly_canonical(candidate);
  }
  return fs::weakly_canonical(fs::current_path() / "models");
}

}  // namespace

int main(int argc, char *argv[]) {
  fs::path models_dir = ResolveModelsDir(argc, argv);
  fs::path vad_model = models_dir / "silero_vad.onnx";
  fs::path input_wav = argc >= 3
      ? fs::path(argv[2])
      : models_dir / "sherpa-onnx-sense-voice" / "test_wavs" / "zh_vad.wav";

  std::cout << "[demo_vad] vad_model = " << vad_model.string() << "\n";
  std::cout << "[demo_vad] input_wav = " << input_wav.string() << "\n";

  for (const auto &p : {vad_model, input_wav}) {
    if (!fs::exists(p)) {
      std::cerr << "missing file: " << p.string() << "\n";
      return 1;
    }
  }

  cxx::Wave wave = cxx::ReadWave(input_wav.string());
  if (wave.samples.empty()) {
    std::cerr << "Failed to read wav\n";
    return 2;
  }
  std::cout << "[demo_vad] samples=" << wave.samples.size()
            << " sr=" << wave.sample_rate
            << " duration=" << (double)wave.samples.size() / wave.sample_rate
            << "s\n";

  cxx::VadModelConfig config;
  config.silero_vad.model = vad_model.string();
  config.silero_vad.threshold = 0.5f;
  config.silero_vad.min_silence_duration = 0.5f;
  config.silero_vad.min_speech_duration = 0.25f;
  config.silero_vad.window_size = 512;
  config.silero_vad.max_speech_duration = 20.0f;
  config.sample_rate = wave.sample_rate;
  config.num_threads = 1;
  config.provider = "cpu";

  // buffer_size_in_seconds: pick a value larger than a single window
  // that can hold the entire clip
  float buffer_seconds =
      std::max(60.0f, static_cast<float>(wave.samples.size()) / wave.sample_rate + 5.0f);

  auto vad = cxx::VoiceActivityDetector::Create(config, buffer_seconds);
  if (vad.Get() == nullptr) {
    std::cerr << "Failed to create VoiceActivityDetector\n";
    return 3;
  }

  const int32_t window = config.silero_vad.window_size;
  std::vector<std::pair<double, double>> segments_ms;
  int seg_index = 0;

  auto t0 = std::chrono::steady_clock::now();

  // Feed one window at a time and drain any segments detected so far
  for (size_t off = 0; off + window <= wave.samples.size(); off += window) {
    vad.AcceptWaveform(wave.samples.data() + off, window);

    while (!vad.IsEmpty()) {
      auto seg = vad.Front();
      double start_ms = 1000.0 * seg.start / wave.sample_rate;
      double end_ms =
          1000.0 * (seg.start + static_cast<int32_t>(seg.samples.size())) /
          wave.sample_rate;
      segments_ms.emplace_back(start_ms, end_ms);
      std::printf(" seg #%-3d [%8.0f - %8.0f] ms (%.2fs)\n",
                  ++seg_index, start_ms, end_ms,
                  (end_ms - start_ms) / 1000.0);
      vad.Pop();
    }
  }

  // Handle whatever remains at the tail
  vad.Flush();
  while (!vad.IsEmpty()) {
    auto seg = vad.Front();
    double start_ms = 1000.0 * seg.start / wave.sample_rate;
    double end_ms =
        1000.0 * (seg.start + static_cast<int32_t>(seg.samples.size())) /
        wave.sample_rate;
    segments_ms.emplace_back(start_ms, end_ms);
    std::printf(" seg #%-3d [%8.0f - %8.0f] ms (%.2fs) (post-flush)\n",
                ++seg_index, start_ms, end_ms,
                (end_ms - start_ms) / 1000.0);
    vad.Pop();
  }

  auto total_ms = std::chrono::duration<double, std::milli>(
                      std::chrono::steady_clock::now() - t0)
                      .count();
  std::printf("[demo_vad] %zu segments in %.1f ms\n",
              segments_ms.size(), total_ms);
  return 0;
}

The key knobs in this code live in VadModelConfig:

  1. threshold
  2. min_silence_duration
  3. min_speech_duration
  4. window_size
  5. max_speech_duration

These parameters largely determine the style of the segment boundaries; a variant sketch follows below.
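For intuition, here is a sketch of a more aggressive variant of the same config. The field names match the demo above; the values are illustrative, not tuned:

// Sketch: a "chattier" segmentation style (illustrative values, not tuned).
cxx::VadModelConfig config;
config.silero_vad.model = vad_model.string();
config.silero_vad.threshold = 0.3f;             // more sensitive to quiet speech
config.silero_vad.min_silence_duration = 0.2f;  // split on shorter pauses
config.silero_vad.min_speech_duration = 0.1f;   // keep very short bursts
config.silero_vad.max_speech_duration = 10.0f;  // force a cut on long utterances sooner
config.silero_vad.window_size = 512;            // fixed by the silero model at 16 kHz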

Demo 5: Combined VAD + ASR pipeline

If you've been through the previous 4 demos, this last one, demo_vad_asr.cpp, is the closest to a real business scenario.

How to run it:

cd .\build\Release\
.\demo_vad_asr.exe

You can also specify the audio:

.\demo_vad_asr.exe ..\..\..\models C:\path\to\your.wav

The program does two things:

  1. Segments the whole audio with silero_vad.onnx
  2. Hands each detected speech segment to SenseVoice for offline recognition

The output looks something like this:

seg #1 [   start -     end ] ms (zh) | recognized text
seg #2 [   start -     end ] ms (zh) | recognized text

The source:

// demo_vad_asr.cpp
//
// silero_vad + sherpa-onnx-sense-voice combined:
// VAD first cuts the whole audio into speech segments, then each segment is
// sent to SenseVoice for offline recognition, printing [start_ms, end_ms] and
// the segment text. This is the minimal pipeline closest to a real call scenario.
//
// Usage:
//   demo_vad_asr.exe                          // default wav
//   demo_vad_asr.exe <models_dir>
//   demo_vad_asr.exe <models_dir> <input_wav>

#ifndef NOMINMAX
#define NOMINMAX
#endif

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <filesystem>
#include <iostream>
#include <string>
#include <vector>

#include "sherpa-onnx/c-api/cxx-api.h"

namespace fs = std::filesystem;
namespace cxx = sherpa_onnx::cxx;

namespace {

fs::path ResolveModelsDir(int argc, char *argv[]) {
  if (argc >= 2) {
    return fs::path(argv[1]);
  }
  fs::path candidate =
      fs::current_path() / ".." / ".." / ".." / "models";
  if (fs::exists(candidate / "silero_vad.onnx")) {
    return fs::weakly_canonical(candidate);
  }
  return fs::weakly_canonical(fs::current_path() / "models");
}

}  // namespace

int main(int argc, char *argv[]) {
  fs::path models_dir = ResolveModelsDir(argc, argv);

  fs::path vad_model = models_dir / "silero_vad.onnx";
  fs::path asr_dir = models_dir / "sherpa-onnx-sense-voice";
  fs::path asr_model = asr_dir / "model.int8.onnx";
  fs::path asr_tokens = asr_dir / "tokens.txt";

  fs::path input_wav = argc >= 3
      ? fs::path(argv[2])
      : asr_dir / "test_wavs" / "zh_vad.wav";

  std::cout << "[demo_vad_asr] vad = " << vad_model.string() << "\n";
  std::cout << "[demo_vad_asr] asr = " << asr_model.string() << "\n";
  std::cout << "[demo_vad_asr] input = " << input_wav.string() << "\n";

  for (const auto &p : {vad_model, asr_model, asr_tokens, input_wav}) {
    if (!fs::exists(p)) {
      std::cerr << "missing file: " << p.string() << "\n";
      return 1;
    }
  }

  cxx::Wave wave = cxx::ReadWave(input_wav.string());
  if (wave.samples.empty()) {
    std::cerr << "Failed to read wav\n";
    return 2;
  }
  std::cout << "[demo_vad_asr] samples=" << wave.samples.size()
            << " sr=" << wave.sample_rate << "\n";

  // ---------- VAD ----------
  cxx::VadModelConfig vad_config;
  vad_config.silero_vad.model = vad_model.string();
  vad_config.silero_vad.threshold = 0.5f;
  vad_config.silero_vad.min_silence_duration = 0.5f;
  vad_config.silero_vad.min_speech_duration = 0.25f;
  vad_config.silero_vad.window_size = 512;
  vad_config.silero_vad.max_speech_duration = 20.0f;
  vad_config.sample_rate = wave.sample_rate;
  vad_config.num_threads = 1;
  vad_config.provider = "cpu";

  float buffer_seconds =
      std::max(60.0f, static_cast<float>(wave.samples.size()) / wave.sample_rate + 5.0f);
  auto vad = cxx::VoiceActivityDetector::Create(vad_config, buffer_seconds);
  if (vad.Get() == nullptr) {
    std::cerr << "Failed to create VoiceActivityDetector\n";
    return 3;
  }

  // ---------- ASR (SenseVoice) ----------
  cxx::OfflineRecognizerConfig asr_config;
  asr_config.model_config.sense_voice.model = asr_model.string();
  asr_config.model_config.sense_voice.language = "auto";
  asr_config.model_config.sense_voice.use_itn = true;
  asr_config.model_config.tokens = asr_tokens.string();
  asr_config.model_config.num_threads = 1;
  asr_config.model_config.provider = "cpu";
  auto recognizer = cxx::OfflineRecognizer::Create(asr_config);
  if (recognizer.Get() == nullptr) {
    std::cerr << "Failed to create OfflineRecognizer\n";
    return 4;
  }

  auto recognize_segment = [&](const cxx::SpeechSegment &seg) {
    auto stream = recognizer.CreateStream();
    stream.AcceptWaveform(wave.sample_rate, seg.samples.data(),
                          static_cast<int32_t>(seg.samples.size()));
    recognizer.Decode(&stream);
    return recognizer.GetResult(&stream);
  };

  // ---------- Main loop ----------
  const int32_t window = vad_config.silero_vad.window_size;
  int seg_index = 0;
  auto t0 = std::chrono::steady_clock::now();

  auto consume = [&](bool post_flush) {
    while (!vad.IsEmpty()) {
      auto seg = vad.Front();
      double start_ms = 1000.0 * seg.start / wave.sample_rate;
      double end_ms =
          1000.0 * (seg.start + static_cast<int32_t>(seg.samples.size())) /
          wave.sample_rate;

      auto result = recognize_segment(seg);
      std::printf(" seg #%-3d [%8.0f - %8.0f] ms %s%s | %s\n",
                  ++seg_index, start_ms, end_ms,
                  result.lang.empty() ? "" : ("(" + result.lang + ")").c_str(),
                  post_flush ? " [flush]" : "",
                  result.text.c_str());
      vad.Pop();
    }
  };

  for (size_t off = 0; off + window <= wave.samples.size(); off += window) {
    vad.AcceptWaveform(wave.samples.data() + off, window);
    consume(false);
  }
  vad.Flush();
  consume(true);

  auto total_ms = std::chrono::duration<double, std::milli>(
                      std::chrono::steady_clock::now() - t0)
                      .count();
  double audio_seconds =
      static_cast<double>(wave.samples.size()) / wave.sample_rate;
  std::printf(
      "[demo_vad_asr] %d segments, total=%.1fms audio=%.2fs rtf=%.3f\n",
      seg_index, total_ms, audio_seconds,
      audio_seconds > 0 ? (total_ms / 1000.0) / audio_seconds : 0.0);
  return 0;
}

In my view, this demo is the most valuable part of the whole post, because it already strings together a typical speech pipeline:

  1. Read the wav
  2. Run VAD
  3. Create one ASR stream per segment
  4. Decode to get the text
  5. Output timestamps and content

The minimal pipeline behind a lot of telephony audio, recording transcription, and speech QA is essentially this structure.
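One small extension I find handy when debugging this pipeline: dump each VAD segment to its own wav so you can listen to exactly what went into the recognizer. A minimal sketch, reusing the cxx::WriteWave helper that demo_tts.cpp already uses (seg, wave, and seg_index refer to the variables inside consume() above):

// Sketch: inside consume(), write each segment out for inspection.
cxx::Wave seg_wave;
seg_wave.samples = seg.samples;
seg_wave.sample_rate = wave.sample_rate;
char name[64];
std::snprintf(name, sizeof(name), "seg-%03d.wav", seg_index);
cxx::WriteWave(name, seg_wave);  // e.g. seg-001.wav, seg-002.wav, ...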

Command-line argument conventions

To make quick experiments easy, the demos keep a consistent argument style:

demo               arg 1                arg 2               arg 3
demo_offline_asr   models root (opt.)   -                   -
demo_online_asr    models root (opt.)   input wav (opt.)    -
demo_tts           text (opt.)          models root (opt.)  speaker_id (opt.)
demo_vad           models root (opt.)   input wav (opt.)    -
demo_vad_asr       models root (opt.)   input wav (opt.)    -

What this convention buys you:

  1. By default you can just double-click or run them directly
  2. When troubleshooting you can still point at a specific models directory
  3. Batch testing is easy to script

Common runtime issues

1. onnxruntime.dll not found

If startup complains that onnxruntime.dll is missing, first look in:

build\Release\

and check that these DLLs are present:

onnxruntime.dll
onnxruntime_providers_shared.dll
sherpa-onnx-c-api.dll
sherpa-onnx-cxx-api.dll

Normally the POST_BUILD step in CMakeLists.txt copies them automatically.

2. Failed to create OfflineRecognizer / OnlineRecognizer / OfflineTts

For this class of problem, check two things first:

  1. Whether the model paths are correct
  2. Whether the model files are complete

Start with the absolute paths each program prints at startup, then confirm file by file that everything exists.

3. Garbled Chinese output

If Chinese output is garbled in PowerShell, run these first:

chcp 65001
$OutputEncoding = [System.Text.UTF8Encoding]::new()

The demo sources themselves already handle source-file encoding via the /utf-8 compiler option.

4. High RTF

If you see RTF > 1, CPU inference on your machine is probably struggling.

A first thing to try is changing these lines in the sources:

config.model_config.num_threads = 1;
config.model.num_threads = 1;
config.num_threads = 1;

to a larger value such as 2 to 4, then rebuild and retest.

What this demo set is good for

I think it's best suited to three things:

  1. Verifying that sherpa-onnx + the models + your local environment all work together
  2. Building minimal reproductions of speech issues from larger projects
  3. Serving as a minimal template before further development

Its current boundaries are just as clear:

  1. Windows only
  2. Input comes mainly from wav files
  3. The goal is stable reproduction, not a final product

If you want to keep going, the next steps usually fall into these directions:

  1. Hook up real-time microphone input
  2. Wrap it as an HTTP / gRPC service
  3. Integrate it into MRCP, telephony, or robot voice pipelines
  4. Compare accuracy, latency, and RTF across models

Summary

If your goal is not to "study everything sherpa-onnx can do" but to "get something running first", this sherpa-demo set is a good fit.

It splits the most common speech pipelines (offline ASR, streaming ASR, TTS, VAD, VAD+ASR) into 5 minimal programs. The structure is clear, troubleshooting is direct, and it works well for experiments and further development.

If I continue with this, I'll most likely follow this route:

  1. Replace the wav input with a real-time audio stream
  2. Wrap the single-machine demos as a service interface
  3. Then plug it into actual business scenarios

Get it working first, then optimize: that's usually the most time-efficient approach for projects like this.