I've been tinkering with local speech capabilities lately, and along the way I put together a minimal, runnable set of sherpa-onnx demos.
This article skips most of the theory and uses my local project to walk through five scenarios:
Offline speech recognition (ASR)
Streaming speech recognition (ASR)
Text-to-speech (TTS)
Voice activity detection (VAD)
A combined VAD + ASR pipeline
The project path used in the examples is:

```
C:\Users\wlf18\data\codes\gitcode\ai-voice-platform\sherpa-demo
```
If your current goal is just two things:
verify that sherpa-onnx actually runs on your local Windows machine
find a sample project that is small and clear enough to build on
then this demo set is basically all you need.
sherpa-demo project structure

This project is essentially a collection of minimal demos: each executable covers exactly one scenario, which makes it easy to verify and troubleshoot them independently.
The directory layout looks roughly like this:

```
sherpa-demo/
├── build.ps1
├── CMakeLists.txt
├── README.md
└── src/
    ├── demo_offline_asr.cpp
    ├── demo_online_asr.cpp
    ├── demo_tts.cpp
    ├── demo_vad.cpp
    └── demo_vad_asr.cpp
```
The dependencies are straightforward:
Windows x64
Visual Studio 2022
CMake
the prebuilt binaries under ../sherpa-onnx/win_x64/
the model directories under ../models/
The models I actually use here are:
sherpa-onnx-sense-voice: offline ASR
sherpa-onnx-streaming-zipformer-bilingual-zh-en: streaming ASR
vits-zh-aishell3: TTS
silero_vad.onnx: VAD
Building the project

The simplest way is to run the build.ps1 script that ships with the project.
Its logic is simple:
look for cmake.exe on PATH first
fall back to the CMake bundled with Visual Studio if that fails
automatically pick an available Visual Studio generator
run configure and build
The script looks like this:
```powershell
$ErrorActionPreference = 'Stop'
$root = $PSScriptRoot
$build = Join-Path $root 'build'

$cmake = Get-Command cmake.exe -ErrorAction SilentlyContinue
if ($null -eq $cmake) {
  $bundled = "${env:ProgramFiles}\Microsoft Visual Studio\18\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe"
  if (-not (Test-Path $bundled)) {
    $bundled = "${env:ProgramFiles}\Microsoft Visual Studio\2022\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe"
  }
  if (-not (Test-Path $bundled)) {
    throw "cmake.exe not found. Install cmake on PATH, or adjust the bundled path in this script."
  }
  $cmakeExe = $bundled
} else {
  $cmakeExe = $cmake.Path
}

$generator = $null
foreach ($g in @('Visual Studio 18 2026', 'Visual Studio 17 2022')) {
  $help = & $cmakeExe --help 2>&1 | Out-String
  if ($help -match [regex]::Escape($g)) {
    $generator = $g
    break
  }
}
if (-not $generator) {
  throw "No usable Visual Studio generator found"
}

Write-Host "Using cmake: $cmakeExe"
Write-Host "Using generator: $generator"

& $cmakeExe -S $root -B $build -G $generator -A x64
if ($LASTEXITCODE -ne 0) { throw "cmake configure failed" }

& $cmakeExe --build $build --config Release
if ($LASTEXITCODE -ne 0) { throw "cmake build failed" }

Write-Host ""
Write-Host "Build OK. Executables:" -ForegroundColor Green
Get-ChildItem (Join-Path $build 'Release') -Filter 'demo_*.exe' | ForEach-Object {
  Write-Host "  $($_.FullName)"
}
```
If you prefer typing commands yourself, you can also run:

```powershell
cmake -S . -B build -G "Visual Studio 17 2022" -A x64
cmake --build build --config Release
```
After the build finishes, all of the exes and runtime DLLs land in build\Release\.
What the CMakeLists does

This project's CMakeLists.txt makes a good minimal template.
It does three key things:
points SHERPA_ONNX_ROOT at the prebuilt package and wires up the headers and libraries
links sherpa-onnx and onnxruntime uniformly for all 5 demos
copies the runtime DLLs next to each exe in a POST_BUILD step
The core configuration:
```cmake
cmake_minimum_required(VERSION 3.15)
project(sherpa_demo CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)

if(MSVC)
  add_compile_options(/utf-8 /W3 /wd4305)
  add_definitions(-D_CRT_SECURE_NO_WARNINGS -DNOMINMAX)
endif()

set(SHERPA_ONNX_ROOT "${CMAKE_SOURCE_DIR}/../sherpa-onnx/win_x64" CACHE PATH
    "Path to sherpa-onnx prebuilt root containing bin/include/lib")

if(NOT EXISTS "${SHERPA_ONNX_ROOT}/include/sherpa-onnx/c-api/cxx-api.h")
  message(FATAL_ERROR
      "Cannot find sherpa-onnx cxx-api header at ${SHERPA_ONNX_ROOT}. "
      "Set -DSHERPA_ONNX_ROOT=<path> or place the prebuilt under ../sherpa-onnx/win_x64.")
endif()

include_directories("${SHERPA_ONNX_ROOT}/include")

set(SHERPA_ONNX_CXX_LIB "${SHERPA_ONNX_ROOT}/lib/sherpa-onnx-cxx-api.lib")
set(SHERPA_ONNX_C_LIB "${SHERPA_ONNX_ROOT}/lib/sherpa-onnx-c-api.lib")
set(ONNXRUNTIME_LIB "${SHERPA_ONNX_ROOT}/lib/onnxruntime.lib")

set(SHERPA_RUNTIME_DLLS
    "${SHERPA_ONNX_ROOT}/bin/onnxruntime.dll"
    "${SHERPA_ONNX_ROOT}/bin/onnxruntime_providers_shared.dll"
    "${SHERPA_ONNX_ROOT}/lib/sherpa-onnx-c-api.dll"
    "${SHERPA_ONNX_ROOT}/lib/sherpa-onnx-cxx-api.dll")

set(DEMO_TARGETS
    demo_offline_asr
    demo_online_asr
    demo_tts
    demo_vad
    demo_vad_asr)

foreach(name ${DEMO_TARGETS})
  add_executable(${name} src/${name}.cpp)
  target_link_libraries(${name} PRIVATE
      "${SHERPA_ONNX_CXX_LIB}"
      "${SHERPA_ONNX_C_LIB}"
      "${ONNXRUNTIME_LIB}")
  if(WIN32)
    target_link_libraries(${name} PRIVATE ws2_32 winmm)
  endif()
  add_custom_command(TARGET ${name} POST_BUILD
      COMMAND ${CMAKE_COMMAND} -E copy_if_different
              ${SHERPA_RUNTIME_DLLS} "$<TARGET_FILE_DIR:${name}>"
      COMMENT "Copy sherpa-onnx runtime DLLs next to ${name}.exe")
endforeach()
```
The payoff of this configuration is immediate:
you never have to copy DLLs around by hand
every demo has exactly the same dependencies
adding a new demo_xxx.cpp later is trivial
Demo 1: offline ASR

Let's start with the most basic case: offline recognition.
How to run it:

```powershell
cd .\build\Release\
.\demo_offline_asr.exe
```
The program reads the wav files under:

```
..\..\..\models\sherpa-onnx-sense-voice\test_wavs\
```

and prints, for each file:
file name
detected language
decoding time
RTF
recognized text
The source:
```cpp
#ifndef NOMINMAX
#define NOMINMAX
#endif

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <filesystem>
#include <iostream>
#include <string>
#include <vector>

#include "sherpa-onnx/c-api/cxx-api.h"

namespace fs = std::filesystem;
namespace cxx = sherpa_onnx::cxx;

namespace {

fs::path ResolveModelsDir(int argc, char *argv[]) {
  if (argc >= 2) {
    return fs::path(argv[1]);
  }
  fs::path here = fs::current_path();
  fs::path candidate = here / ".." / ".." / ".." / "models";
  if (fs::exists(candidate / "sherpa-onnx-sense-voice")) {
    return fs::weakly_canonical(candidate);
  }
  return fs::weakly_canonical(here / "models");
}

}  // namespace

int main(int argc, char *argv[]) {
  fs::path models_dir = ResolveModelsDir(argc, argv);
  fs::path model_dir = models_dir / "sherpa-onnx-sense-voice";
  fs::path model_file = model_dir / "model.int8.onnx";
  fs::path tokens_file = model_dir / "tokens.txt";
  fs::path wavs_dir = model_dir / "test_wavs";

  std::cout << "[demo_offline_asr] models_dir = " << models_dir.string() << "\n";
  std::cout << "[demo_offline_asr] model      = " << model_file.string() << "\n";
  std::cout << "[demo_offline_asr] tokens     = " << tokens_file.string() << "\n";
  std::cout << "[demo_offline_asr] wavs_dir   = " << wavs_dir.string() << "\n";

  if (!fs::exists(model_file) || !fs::exists(tokens_file) ||
      !fs::exists(wavs_dir)) {
    std::cerr << "Required model/wavs not found, please check paths above.\n";
    return 1;
  }

  cxx::OfflineRecognizerConfig config;
  config.model_config.sense_voice.model = model_file.string();
  config.model_config.sense_voice.language = "auto";
  config.model_config.sense_voice.use_itn = true;
  config.model_config.tokens = tokens_file.string();
  config.model_config.num_threads = 1;
  config.model_config.provider = "cpu";
  config.model_config.debug = false;

  auto load_start = std::chrono::steady_clock::now();
  auto recognizer = cxx::OfflineRecognizer::Create(config);
  if (recognizer.Get() == nullptr) {
    std::cerr << "Failed to create OfflineRecognizer (SenseVoice)\n";
    return 2;
  }
  auto load_ms = std::chrono::duration<double, std::milli>(
                     std::chrono::steady_clock::now() - load_start)
                     .count();
  std::cout << "[demo_offline_asr] recognizer loaded in " << load_ms
            << " ms\n\n";

  std::vector<fs::path> wavs;
  for (const auto &entry : fs::directory_iterator(wavs_dir)) {
    if (entry.is_regular_file() && entry.path().extension() == ".wav") {
      wavs.push_back(entry.path());
    }
  }
  std::sort(wavs.begin(), wavs.end());
  if (wavs.empty()) {
    std::cerr << "No .wav files in " << wavs_dir.string() << "\n";
    return 3;
  }

  std::printf("%-18s %-6s %-8s %-7s %s\n", "file", "lang", "elapsed", "rtf",
              "text");
  std::printf("%-18s %-6s %-8s %-7s %s\n", "----", "----", "--------", "-----",
              "----");

  for (const auto &wav_path : wavs) {
    cxx::Wave wave = cxx::ReadWave(wav_path.string());
    if (wave.samples.empty()) {
      std::cerr << "  [skip] cannot read " << wav_path.filename().string()
                << "\n";
      continue;
    }

    auto stream = recognizer.CreateStream();
    stream.AcceptWaveform(wave.sample_rate, wave.samples.data(),
                          static_cast<int32_t>(wave.samples.size()));

    auto t0 = std::chrono::steady_clock::now();
    recognizer.Decode(&stream);
    auto result = recognizer.GetResult(&stream);
    auto elapsed_ms = std::chrono::duration<double, std::milli>(
                          std::chrono::steady_clock::now() - t0)
                          .count();

    double audio_seconds = static_cast<double>(wave.samples.size()) /
                           static_cast<double>(wave.sample_rate);
    double rtf = audio_seconds > 0 ? (elapsed_ms / 1000.0) / audio_seconds : 0;

    std::printf("%-18s %-6s %7.1fms %-7.3f %s\n",
                wav_path.filename().string().c_str(),
                result.lang.empty() ? "?" : result.lang.c_str(), elapsed_ms,
                rtf, result.text.c_str());
  }

  return 0;
}
```
Two details in this code are worth calling out:
config.model_config.sense_voice.language = "auto" enables automatic language detection
config.model_config.sense_voice.use_itn = true applies inverse text normalization (ITN) to the recognition result
If all you need is a minimal offline recognition capability, this is essentially your template.
Demo 2: streaming ASR

The second demo is streaming recognition.
How to run it:

```powershell
cd .\build\Release\
.\demo_online_asr.exe
```

You can also specify the model root directory and an input wav:

```powershell
.\demo_online_asr.exe ..\..\..\models D:\some\other.wav
```
The idea behind this demo is quite practical:
treat a wav file as if it were microphone input
feed 0.1s of audio at a time
print a partial result whenever the text changes
print the FINAL result at the end
The source:
```cpp
#ifndef NOMINMAX
#define NOMINMAX
#endif

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <filesystem>
#include <iostream>
#include <string>
#include <vector>

#include "sherpa-onnx/c-api/cxx-api.h"

namespace fs = std::filesystem;
namespace cxx = sherpa_onnx::cxx;

namespace {

fs::path ResolveModelsDir(int argc, char *argv[]) {
  if (argc >= 2) {
    return fs::path(argv[1]);
  }
  fs::path candidate = fs::current_path() / ".." / ".." / ".." / "models";
  if (fs::exists(candidate /
                 "sherpa-onnx-streaming-zipformer-bilingual-zh-en")) {
    return fs::weakly_canonical(candidate);
  }
  return fs::weakly_canonical(fs::current_path() / "models");
}

}  // namespace

int main(int argc, char *argv[]) {
  fs::path models_dir = ResolveModelsDir(argc, argv);
  fs::path model_dir =
      models_dir / "sherpa-onnx-streaming-zipformer-bilingual-zh-en";
  fs::path encoder = model_dir / "encoder-epoch-99-avg-1.onnx";
  fs::path decoder = model_dir / "decoder-epoch-99-avg-1.onnx";
  fs::path joiner = model_dir / "joiner-epoch-99-avg-1.onnx";
  fs::path tokens = model_dir / "tokens.txt";

  fs::path input_wav;
  if (argc >= 3) {
    input_wav = fs::path(argv[2]);
  } else {
    input_wav = models_dir / "sherpa-onnx-sense-voice" / "test_wavs" / "zh.wav";
  }

  std::cout << "[demo_online_asr] encoder = " << encoder.string() << "\n";
  std::cout << "[demo_online_asr] decoder = " << decoder.string() << "\n";
  std::cout << "[demo_online_asr] joiner  = " << joiner.string() << "\n";
  std::cout << "[demo_online_asr] tokens  = " << tokens.string() << "\n";
  std::cout << "[demo_online_asr] input   = " << input_wav.string() << "\n";

  for (const auto &p : {encoder, decoder, joiner, tokens, input_wav}) {
    if (!fs::exists(p)) {
      std::cerr << "missing file: " << p.string() << "\n";
      return 1;
    }
  }

  cxx::OnlineRecognizerConfig config;
  config.model_config.transducer.encoder = encoder.string();
  config.model_config.transducer.decoder = decoder.string();
  config.model_config.transducer.joiner = joiner.string();
  config.model_config.tokens = tokens.string();
  config.model_config.num_threads = 1;
  config.model_config.provider = "cpu";
  config.model_config.debug = false;
  config.decoding_method = "greedy_search";
  config.enable_endpoint = true;
  config.rule1_min_trailing_silence = 2.4f;
  config.rule2_min_trailing_silence = 1.2f;
  config.rule3_min_utterance_length = 20.0f;

  auto load_start = std::chrono::steady_clock::now();
  auto recognizer = cxx::OnlineRecognizer::Create(config);
  if (recognizer.Get() == nullptr) {
    std::cerr << "Failed to create OnlineRecognizer (zipformer)\n";
    return 2;
  }
  auto load_ms = std::chrono::duration<double, std::milli>(
                     std::chrono::steady_clock::now() - load_start)
                     .count();
  std::cout << "[demo_online_asr] recognizer loaded in " << load_ms << " ms\n";

  cxx::Wave wave = cxx::ReadWave(input_wav.string());
  if (wave.samples.empty()) {
    std::cerr << "Failed to read wav: " << input_wav.string() << "\n";
    return 3;
  }
  std::cout << "[demo_online_asr] wav samples=" << wave.samples.size()
            << " sr=" << wave.sample_rate << " duration="
            << (double)wave.samples.size() / wave.sample_rate << "s\n";

  auto stream = recognizer.CreateStream();

  const int32_t chunk_samples = wave.sample_rate / 10;  // 0.1s per chunk
  std::string last_text;
  int partial_count = 0;
  auto t0 = std::chrono::steady_clock::now();

  for (size_t off = 0; off < wave.samples.size(); off += chunk_samples) {
    int32_t n = static_cast<int32_t>(
        std::min<size_t>(chunk_samples, wave.samples.size() - off));
    stream.AcceptWaveform(wave.sample_rate, wave.samples.data() + off, n);

    while (recognizer.IsReady(&stream)) {
      recognizer.Decode(&stream);
    }

    auto result = recognizer.GetResult(&stream);
    if (!result.text.empty() && result.text != last_text) {
      ++partial_count;
      double ms_now = std::chrono::duration<double, std::milli>(
                          std::chrono::steady_clock::now() - t0)
                          .count();
      std::printf("  partial #%-3d t=%6.0fms  %s\n", partial_count, ms_now,
                  result.text.c_str());
      last_text = result.text;
    }

    if (recognizer.IsEndpoint(&stream)) {
      auto end_result = recognizer.GetResult(&stream);
      std::printf("  endpoint at offset=%zu, final_so_far: %s\n", off,
                  end_result.text.c_str());
      recognizer.Reset(&stream);
      last_text.clear();
    }
  }

  stream.InputFinished();
  while (recognizer.IsReady(&stream)) {
    recognizer.Decode(&stream);
  }
  auto final_result = recognizer.GetResult(&stream);

  auto total_ms = std::chrono::duration<double, std::milli>(
                      std::chrono::steady_clock::now() - t0)
                      .count();
  double audio_seconds =
      static_cast<double>(wave.samples.size()) / wave.sample_rate;
  double rtf = audio_seconds > 0 ? (total_ms / 1000.0) / audio_seconds : 0;

  std::cout << "\n[demo_online_asr] FINAL: " << final_result.text << "\n";
  std::printf(
      "[demo_online_asr] total=%.1fms audio=%.2fs rtf=%.3f partials=%d\n",
      total_ms, audio_seconds, rtf, partial_count);

  return 0;
}
```
The main loop is the most useful reference here:
stream.AcceptWaveform(...) feeds audio chunk by chunk
while (recognizer.IsReady(&stream)) keeps decoding
recognizer.GetResult(&stream) fetches the current partial result
recognizer.IsEndpoint(&stream) decides whether an utterance has ended
recognizer.Reset(&stream) resets the stream state
If you later hook up real-time microphone input, this demo's structure carries over almost unchanged.
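The feed-in-chunks pattern above comes down to simple offset arithmetic. A standalone sketch of how the 0.1s chunk sizes fall out (no sherpa-onnx calls; the helper name is mine):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split a buffer of samples into 0.1s chunks, mirroring the demo's loop.
// The last chunk may be shorter than the rest.
std::vector<int32_t> ChunkSizes(size_t total_samples, int32_t sample_rate) {
  const int32_t chunk_samples = sample_rate / 10;  // 0.1s per chunk
  std::vector<int32_t> sizes;
  for (size_t off = 0; off < total_samples; off += chunk_samples) {
    sizes.push_back(static_cast<int32_t>(
        std::min<size_t>(chunk_samples, total_samples - off)));
  }
  return sizes;
}
```

With a real microphone the loop body stays the same; only the source of each chunk changes from a file offset to a capture callback.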
Demo 3: TTS

The third demo is speech synthesis.
How to run it:

```powershell
cd .\build\Release\
.\demo_tts.exe
```

Pass custom text:

```powershell
.\demo_tts.exe "今天天气真不错"
```

Going one step further, you can also specify the models root directory and a speaker_id:

```powershell
.\demo_tts.exe "文本" ..\..\..\models 0
```

The source:
```cpp
#ifndef NOMINMAX
#define NOMINMAX
#endif

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <filesystem>
#include <iostream>
#include <string>

#include "sherpa-onnx/c-api/cxx-api.h"

namespace fs = std::filesystem;
namespace cxx = sherpa_onnx::cxx;

namespace {

fs::path ResolveModelsDir(const std::string &arg) {
  if (!arg.empty()) {
    return fs::path(arg);
  }
  fs::path candidate = fs::current_path() / ".." / ".." / ".." / "models";
  if (fs::exists(candidate / "vits-zh-aishell3")) {
    return fs::weakly_canonical(candidate);
  }
  return fs::weakly_canonical(fs::current_path() / "models");
}

}  // namespace

int main(int argc, char *argv[]) {
  std::string text =
      "你好,欢迎使用 sherpa-onnx 语音合成。今天是个测试 demo 的好日子。";
  std::string models_arg;
  int32_t speaker_id = 0;

  if (argc >= 2) text = argv[1];
  if (argc >= 3) models_arg = argv[2];
  if (argc >= 4) speaker_id = std::atoi(argv[3]);

  fs::path models_dir = ResolveModelsDir(models_arg);
  fs::path model_dir = models_dir / "vits-zh-aishell3";
  fs::path model_file = model_dir / "vits-aishell3.onnx";
  fs::path tokens = model_dir / "tokens.txt";
  fs::path lexicon = model_dir / "lexicon.txt";

  std::cout << "[demo_tts] model   = " << model_file.string() << "\n";
  std::cout << "[demo_tts] tokens  = " << tokens.string() << "\n";
  std::cout << "[demo_tts] lexicon = " << lexicon.string() << "\n";
  std::cout << "[demo_tts] sid     = " << speaker_id << "\n";
  std::cout << "[demo_tts] text    = " << text << "\n";

  for (const auto &p : {model_file, tokens, lexicon}) {
    if (!fs::exists(p)) {
      std::cerr << "missing file: " << p.string() << "\n";
      return 1;
    }
  }

  cxx::OfflineTtsConfig config;
  config.model.vits.model = model_file.string();
  config.model.vits.tokens = tokens.string();
  config.model.vits.lexicon = lexicon.string();
  config.model.num_threads = 1;
  config.model.provider = "cpu";
  config.model.debug = false;
  config.max_num_sentences = 1;

  auto load_start = std::chrono::steady_clock::now();
  auto tts = cxx::OfflineTts::Create(config);
  if (tts.Get() == nullptr) {
    std::cerr << "Failed to create OfflineTts\n";
    return 2;
  }
  auto load_ms = std::chrono::duration<double, std::milli>(
                     std::chrono::steady_clock::now() - load_start)
                     .count();
  std::cout << "[demo_tts] tts loaded in " << load_ms
            << " ms, sample_rate=" << tts.SampleRate()
            << ", num_speakers=" << tts.NumSpeakers() << "\n";

  if (tts.NumSpeakers() > 0 && speaker_id >= tts.NumSpeakers()) {
    std::cerr << "speaker_id " << speaker_id
              << " out of range, fallback to 0\n";
    speaker_id = 0;
  }

  auto t0 = std::chrono::steady_clock::now();
  auto audio = tts.Generate(text, speaker_id, 1.0f);
  auto gen_ms = std::chrono::duration<double, std::milli>(
                    std::chrono::steady_clock::now() - t0)
                    .count();

  if (audio.samples.empty()) {
    std::cerr << "TTS generated empty audio\n";
    return 3;
  }

  fs::path out = fs::current_path() / "tts-out.wav";
  cxx::Wave wave;
  wave.samples = audio.samples;
  wave.sample_rate = audio.sample_rate;
  if (!cxx::WriteWave(out.string(), wave)) {
    std::cerr << "Failed to write " << out.string() << "\n";
    return 4;
  }

  double audio_seconds =
      static_cast<double>(audio.samples.size()) / audio.sample_rate;
  double rtf = audio_seconds > 0 ? (gen_ms / 1000.0) / audio_seconds : 0;

  std::cout << "[demo_tts] wrote " << out.string() << "\n";
  std::printf("[demo_tts] generated %.2fs audio in %.1fms (rtf=%.3f), sr=%d\n",
              audio_seconds, gen_ms, rtf, audio.sample_rate);

  return 0;
}
```
The core of this demo is very clear:
load model / tokens / lexicon
call tts.Generate(text, speaker_id, 1.0f)
write the result out as tts-out.wav
If you just need to bolt an offline Chinese TTS capability onto a business system, this entry point is about as simple as it gets.
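The demo leans on cxx::WriteWave for output, but a 16-bit PCM WAV is only a 44-byte header plus raw samples, so it's easy to write yourself if you need a different container later. A minimal sketch (my own helper, not part of the demo), assuming mono float samples in [-1, 1] and a little-endian machine:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Write mono float samples as a 16-bit PCM WAV file. Returns true on success.
// Assumes a little-endian host (true on x64 Windows).
bool WritePcm16Wav(const std::string &path, const std::vector<float> &samples,
                   int32_t sample_rate) {
  std::ofstream f(path, std::ios::binary);
  if (!f) return false;

  auto put_u32 = [&](uint32_t v) { f.write(reinterpret_cast<char *>(&v), 4); };
  auto put_u16 = [&](uint16_t v) { f.write(reinterpret_cast<char *>(&v), 2); };

  uint32_t data_bytes = static_cast<uint32_t>(samples.size() * 2);
  f.write("RIFF", 4);
  put_u32(36 + data_bytes);                         // RIFF chunk size
  f.write("WAVEfmt ", 8);
  put_u32(16);                                      // fmt chunk size
  put_u16(1);                                       // PCM format
  put_u16(1);                                       // mono
  put_u32(static_cast<uint32_t>(sample_rate));
  put_u32(static_cast<uint32_t>(sample_rate) * 2);  // byte rate
  put_u16(2);                                       // block align
  put_u16(16);                                      // bits per sample
  f.write("data", 4);
  put_u32(data_bytes);

  for (float s : samples) {
    float c = s < -1.0f ? -1.0f : (s > 1.0f ? 1.0f : s);  // clip to range
    int16_t v = static_cast<int16_t>(c * 32767.0f);
    f.write(reinterpret_cast<char *>(&v), 2);
  }
  return static_cast<bool>(f);
}
```

The resulting file is always 44 + 2 * num_samples bytes, which is also a cheap sanity check in tests.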
Demo 4: VAD

The fourth demo is pure VAD.
How to run it:

```powershell
cd .\build\Release\
.\demo_vad.exe
```

You can also pass your own wav:

```powershell
.\demo_vad.exe ..\..\..\models C:\path\to\your.wav
```
The program slices the input audio into speech segments and prints, for each one:
start time
end time
duration
The source:
```cpp
#ifndef NOMINMAX
#define NOMINMAX
#endif

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <filesystem>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

#include "sherpa-onnx/c-api/cxx-api.h"

namespace fs = std::filesystem;
namespace cxx = sherpa_onnx::cxx;

namespace {

fs::path ResolveModelsDir(int argc, char *argv[]) {
  if (argc >= 2) {
    return fs::path(argv[1]);
  }
  fs::path candidate = fs::current_path() / ".." / ".." / ".." / "models";
  if (fs::exists(candidate / "silero_vad.onnx")) {
    return fs::weakly_canonical(candidate);
  }
  return fs::weakly_canonical(fs::current_path() / "models");
}

}  // namespace

int main(int argc, char *argv[]) {
  fs::path models_dir = ResolveModelsDir(argc, argv);
  fs::path vad_model = models_dir / "silero_vad.onnx";
  fs::path input_wav =
      argc >= 3 ? fs::path(argv[2])
                : models_dir / "sherpa-onnx-sense-voice" / "test_wavs" /
                      "zh_vad.wav";

  std::cout << "[demo_vad] vad_model = " << vad_model.string() << "\n";
  std::cout << "[demo_vad] input_wav = " << input_wav.string() << "\n";

  for (const auto &p : {vad_model, input_wav}) {
    if (!fs::exists(p)) {
      std::cerr << "missing file: " << p.string() << "\n";
      return 1;
    }
  }

  cxx::Wave wave = cxx::ReadWave(input_wav.string());
  if (wave.samples.empty()) {
    std::cerr << "Failed to read wav\n";
    return 2;
  }
  std::cout << "[demo_vad] samples=" << wave.samples.size()
            << " sr=" << wave.sample_rate << " duration="
            << (double)wave.samples.size() / wave.sample_rate << "s\n";

  cxx::VadModelConfig config;
  config.silero_vad.model = vad_model.string();
  config.silero_vad.threshold = 0.5f;
  config.silero_vad.min_silence_duration = 0.5f;
  config.silero_vad.min_speech_duration = 0.25f;
  config.silero_vad.window_size = 512;
  config.silero_vad.max_speech_duration = 20.0f;
  config.sample_rate = wave.sample_rate;
  config.num_threads = 1;
  config.provider = "cpu";

  float buffer_seconds =
      std::max(60.0f, static_cast<float>(wave.samples.size()) /
                              wave.sample_rate +
                          5.0f);
  auto vad = cxx::VoiceActivityDetector::Create(config, buffer_seconds);
  if (vad.Get() == nullptr) {
    std::cerr << "Failed to create VoiceActivityDetector\n";
    return 3;
  }

  const int32_t window = config.silero_vad.window_size;
  std::vector<std::pair<double, double>> segments_ms;
  int seg_index = 0;
  auto t0 = std::chrono::steady_clock::now();

  for (size_t off = 0; off + window <= wave.samples.size(); off += window) {
    vad.AcceptWaveform(wave.samples.data() + off, window);
    while (!vad.IsEmpty()) {
      auto seg = vad.Front();
      double start_ms = 1000.0 * seg.start / wave.sample_rate;
      double end_ms =
          1000.0 * (seg.start + static_cast<int32_t>(seg.samples.size())) /
          wave.sample_rate;
      segments_ms.emplace_back(start_ms, end_ms);
      std::printf("  seg #%-3d [%8.0f - %8.0f] ms (%.2fs)\n", ++seg_index,
                  start_ms, end_ms, (end_ms - start_ms) / 1000.0);
      vad.Pop();
    }
  }

  vad.Flush();
  while (!vad.IsEmpty()) {
    auto seg = vad.Front();
    double start_ms = 1000.0 * seg.start / wave.sample_rate;
    double end_ms =
        1000.0 * (seg.start + static_cast<int32_t>(seg.samples.size())) /
        wave.sample_rate;
    segments_ms.emplace_back(start_ms, end_ms);
    std::printf("  seg #%-3d [%8.0f - %8.0f] ms (%.2fs) (post-flush)\n",
                ++seg_index, start_ms, end_ms, (end_ms - start_ms) / 1000.0);
    vad.Pop();
  }

  auto total_ms = std::chrono::duration<double, std::milli>(
                      std::chrono::steady_clock::now() - t0)
                      .count();
  std::printf("[demo_vad] %zu segments in %.1f ms\n", segments_ms.size(),
              total_ms);

  return 0;
}
```
The key piece in this code is VadModelConfig:
threshold
min_silence_duration
min_speech_duration
window_size
max_speech_duration
These parameters largely determine how the segment boundaries behave.
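To build intuition for the values: at 16 kHz, a window_size of 512 samples is 32 ms per speech/non-speech decision, and min_silence_duration = 0.5s corresponds to roughly 16 consecutive non-speech windows before a segment is closed. A small sketch of that conversion (helper names are mine, not from sherpa-onnx):

```cpp
#include <cmath>
#include <cstdint>

// Duration of one VAD analysis window, in milliseconds.
double WindowMs(int32_t window_size, int32_t sample_rate) {
  return 1000.0 * window_size / sample_rate;
}

// How many whole windows are needed to cover a duration in seconds.
int32_t WindowsFor(double seconds, int32_t window_size, int32_t sample_rate) {
  return static_cast<int32_t>(std::ceil(seconds * sample_rate / window_size));
}
```

If you shrink min_silence_duration, pauses inside a sentence start splitting it; if you grow it, back-to-back utterances start merging.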
Demo 5: the combined VAD + ASR pipeline

If you've gone through the first four demos, this last one, demo_vad_asr.cpp, is the version that looks most like a real business scenario.
How to run it:

```powershell
cd .\build\Release\
.\demo_vad_asr.exe
```

You can also specify the audio:

```powershell
.\demo_vad_asr.exe ..\..\..\models C:\path\to\your.wav
```
The program does two things:
use silero_vad.onnx to segment the whole audio file
hand each speech segment to SenseVoice for offline recognition
The output looks something like this:

```
seg #1 [ start -   end ] ms (zh) | recognized text
seg #2 [ start -   end ] ms (zh) | recognized text
```

The source:
```cpp
#ifndef NOMINMAX
#define NOMINMAX
#endif

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <filesystem>
#include <iostream>
#include <string>
#include <vector>

#include "sherpa-onnx/c-api/cxx-api.h"

namespace fs = std::filesystem;
namespace cxx = sherpa_onnx::cxx;

namespace {

fs::path ResolveModelsDir(int argc, char *argv[]) {
  if (argc >= 2) {
    return fs::path(argv[1]);
  }
  fs::path candidate = fs::current_path() / ".." / ".." / ".." / "models";
  if (fs::exists(candidate / "silero_vad.onnx")) {
    return fs::weakly_canonical(candidate);
  }
  return fs::weakly_canonical(fs::current_path() / "models");
}

}  // namespace

int main(int argc, char *argv[]) {
  fs::path models_dir = ResolveModelsDir(argc, argv);
  fs::path vad_model = models_dir / "silero_vad.onnx";
  fs::path asr_dir = models_dir / "sherpa-onnx-sense-voice";
  fs::path asr_model = asr_dir / "model.int8.onnx";
  fs::path asr_tokens = asr_dir / "tokens.txt";
  fs::path input_wav =
      argc >= 3 ? fs::path(argv[2]) : asr_dir / "test_wavs" / "zh_vad.wav";

  std::cout << "[demo_vad_asr] vad   = " << vad_model.string() << "\n";
  std::cout << "[demo_vad_asr] asr   = " << asr_model.string() << "\n";
  std::cout << "[demo_vad_asr] input = " << input_wav.string() << "\n";

  for (const auto &p : {vad_model, asr_model, asr_tokens, input_wav}) {
    if (!fs::exists(p)) {
      std::cerr << "missing file: " << p.string() << "\n";
      return 1;
    }
  }

  cxx::Wave wave = cxx::ReadWave(input_wav.string());
  if (wave.samples.empty()) {
    std::cerr << "Failed to read wav\n";
    return 2;
  }
  std::cout << "[demo_vad_asr] samples=" << wave.samples.size()
            << " sr=" << wave.sample_rate << "\n";

  cxx::VadModelConfig vad_config;
  vad_config.silero_vad.model = vad_model.string();
  vad_config.silero_vad.threshold = 0.5f;
  vad_config.silero_vad.min_silence_duration = 0.5f;
  vad_config.silero_vad.min_speech_duration = 0.25f;
  vad_config.silero_vad.window_size = 512;
  vad_config.silero_vad.max_speech_duration = 20.0f;
  vad_config.sample_rate = wave.sample_rate;
  vad_config.num_threads = 1;
  vad_config.provider = "cpu";

  float buffer_seconds =
      std::max(60.0f, static_cast<float>(wave.samples.size()) /
                              wave.sample_rate +
                          5.0f);
  auto vad = cxx::VoiceActivityDetector::Create(vad_config, buffer_seconds);
  if (vad.Get() == nullptr) {
    std::cerr << "Failed to create VoiceActivityDetector\n";
    return 3;
  }

  cxx::OfflineRecognizerConfig asr_config;
  asr_config.model_config.sense_voice.model = asr_model.string();
  asr_config.model_config.sense_voice.language = "auto";
  asr_config.model_config.sense_voice.use_itn = true;
  asr_config.model_config.tokens = asr_tokens.string();
  asr_config.model_config.num_threads = 1;
  asr_config.model_config.provider = "cpu";

  auto recognizer = cxx::OfflineRecognizer::Create(asr_config);
  if (recognizer.Get() == nullptr) {
    std::cerr << "Failed to create OfflineRecognizer\n";
    return 4;
  }

  auto recognize_segment = [&](const cxx::SpeechSegment &seg) {
    auto stream = recognizer.CreateStream();
    stream.AcceptWaveform(wave.sample_rate, seg.samples.data(),
                          static_cast<int32_t>(seg.samples.size()));
    recognizer.Decode(&stream);
    return recognizer.GetResult(&stream);
  };

  const int32_t window = vad_config.silero_vad.window_size;
  int seg_index = 0;
  auto t0 = std::chrono::steady_clock::now();

  auto consume = [&](bool post_flush) {
    while (!vad.IsEmpty()) {
      auto seg = vad.Front();
      double start_ms = 1000.0 * seg.start / wave.sample_rate;
      double end_ms =
          1000.0 * (seg.start + static_cast<int32_t>(seg.samples.size())) /
          wave.sample_rate;
      auto result = recognize_segment(seg);
      std::printf("  seg #%-3d [%8.0f - %8.0f] ms %s%s | %s\n", ++seg_index,
                  start_ms, end_ms,
                  result.lang.empty() ? "" : ("(" + result.lang + ")").c_str(),
                  post_flush ? " [flush]" : "", result.text.c_str());
      vad.Pop();
    }
  };

  for (size_t off = 0; off + window <= wave.samples.size(); off += window) {
    vad.AcceptWaveform(wave.samples.data() + off, window);
    consume(false);
  }
  vad.Flush();
  consume(true);

  auto total_ms = std::chrono::duration<double, std::milli>(
                      std::chrono::steady_clock::now() - t0)
                      .count();
  double audio_seconds =
      static_cast<double>(wave.samples.size()) / wave.sample_rate;
  std::printf(
      "[demo_vad_asr] %d segments, total=%.1fms audio=%.2fs rtf=%.3f\n",
      seg_index, total_ms, audio_seconds,
      audio_seconds > 0 ? (total_ms / 1000.0) / audio_seconds : 0.0);

  return 0;
}
```
In my view this demo is the most valuable part of the article, because it already strings together a typical speech pipeline:
read a wav
run VAD
create one ASR stream per segment
decode to text
output timestamps and content
The minimal pipeline behind a lot of telephony audio, recording transcription, and speech QA systems is essentially this structure.
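If you plan to store or display these segments downstream, the timestamp arithmetic is worth internalizing: seg.start is a sample index, so converting to milliseconds is just a division by the sample rate. A standalone sketch (the struct and helper are mine, mirroring what the VAD hands back):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// A speech segment as a start sample index plus a sample count,
// mirroring the fields the demo reads off cxx::SpeechSegment.
struct Segment {
  int32_t start;       // first sample index of the segment
  size_t num_samples;  // segment length in samples
};

// Convert a segment to (start_ms, end_ms), as the demo prints.
std::pair<double, double> SegmentToMs(const Segment &seg,
                                      int32_t sample_rate) {
  double start_ms = 1000.0 * seg.start / sample_rate;
  double end_ms =
      1000.0 * (seg.start + static_cast<int32_t>(seg.num_samples)) /
      sample_rate;
  return {start_ms, end_ms};
}
```

At 16 kHz, a segment starting at sample 16000 with 8000 samples spans 1000 ms to 1500 ms.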
Command-line argument conventions

To make quick experiments easy, the demos share a consistent argument style:

| demo | arg 1 | arg 2 | arg 3 |
|------|-------|-------|-------|
| demo_offline_asr | models root (optional) | - | - |
| demo_online_asr | models root (optional) | input wav (optional) | - |
| demo_tts | text (optional) | models root (optional) | speaker_id (optional) |
| demo_vad | models root (optional) | input wav (optional) | - |
| demo_vad_asr | models root (optional) | input wav (optional) | - |
The benefits of this convention:
by default you can double-click or run the exe directly
when troubleshooting you can still point it at a specific model directory
batch-test commands are easy to compose
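The shared convention (arg 1 = models root, arg 2 = input wav, both optional) is easy to factor into one helper if you add more demos. A sketch under that assumption (names are mine, not from the repo; demo_tts would need its own variant since its first argument is the text):

```cpp
#include <string>

// Parsed form of the common "models root + optional wav" convention.
struct DemoArgs {
  std::string models_root;  // empty means "auto-detect relative to cwd"
  std::string input_wav;    // empty means "use the default test wav"
};

// argv[1] -> models root, argv[2] -> input wav; both optional.
DemoArgs ParseDemoArgs(int argc, const char *const argv[]) {
  DemoArgs args;
  if (argc >= 2) args.models_root = argv[1];
  if (argc >= 3) args.input_wav = argv[2];
  return args;
}
```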
Common runtime problems

1. onnxruntime.dll not found

If the program complains about a missing onnxruntime.dll at startup, first check whether these DLLs exist next to the exe:

```
onnxruntime.dll
onnxruntime_providers_shared.dll
sherpa-onnx-c-api.dll
sherpa-onnx-cxx-api.dll
```

Normally the POST_BUILD step in CMakeLists.txt copies them automatically.
2. Failed to create OfflineRecognizer / OnlineRecognizer / OfflineTts

For this class of failures, check two things first:
whether the model paths are correct
whether the model files are complete
Start from the absolute paths the program prints at startup, then confirm each file actually exists.
3. Garbled Chinese output

If Chinese output comes out garbled in PowerShell, run this first:

```powershell
chcp 65001
$OutputEncoding = [System.Text.UTF8Encoding]::new()
```

The demo sources themselves already handle source-file encoding via the /utf-8 compiler option.
4. High RTF

If you see RTF > 1, your machine's CPU is probably struggling with inference.
As a first step, find these lines in the sources:

```cpp
config.model_config.num_threads = 1;
config.model.num_threads = 1;
config.num_threads = 1;
```

change them to 2 or 4, then rebuild and retest.
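If you'd rather not hard-code the thread count at all, you can derive it from the machine. A sketch (the heuristic and helper name are mine, not from sherpa-onnx), capping at 4 since small speech models rarely scale much beyond that in my experience:

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>

// Pick a num_threads value: half the hardware threads, clamped to [1, 4].
int32_t PickNumThreads() {
  unsigned hw = std::thread::hardware_concurrency();  // may report 0
  if (hw == 0) return 1;  // unknown hardware: stay conservative
  return static_cast<int32_t>(std::min(4u, std::max(1u, hw / 2)));
}
```

You would then assign the result to the num_threads fields above instead of the literal 1.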
What this demo set is good for

I think it's best suited to three things:
verifying that sherpa-onnx + models + your local environment work at all
producing minimal reproductions of speech issues from larger projects
serving as a minimal template before second-stage development
Its current boundaries are equally clear:
Windows only
input comes mainly from wav files
the goal is stable reproduction, not a final product
If you want to push further, the next steps are usually these:
hook up real-time microphone input
wrap it as an HTTP / gRPC service
plug it into MRCP, telephony audio, or robot voice pipelines
compare accuracy, latency, and RTF across different models
Summary

If your goal is not to explore every capability of sherpa-onnx but simply to get something running first, this sherpa-demo set is a good fit.
It splits the most common speech pipelines (offline ASR, streaming ASR, TTS, VAD, VAD+ASR) into 5 minimal programs with a clear structure and straightforward troubleshooting, which makes it handy for both experiments and further development.
If I keep going with this, I'll most likely follow this route:
replace the wav input with a real-time audio stream
wrap the single-machine demos behind a service interface
then connect them to real business scenarios
Get it running first, optimize later: for projects like this, that's usually the approach that saves the most time.