UNlawrence/hermes-weixin-voice

GitHub: UNlawrence/hermes-weixin-voice

# Hermes Weixin Voice

`hermes-weixin-voice` is the bidirectional voice I/O layer for the Hermes WeChat agent: it lets the agent **hear** incoming WeChat voice messages and **speak** replies back into a chat.

It is a local Python package that combines neural TTS, neural STT, the Tencent SILK_V3 codec, and the iLink bot transport protocol into a single pipeline. It also documents, with evidence, where the public iLink bot path stops being able to render outbound *native* voice bubbles, and what the remaining options are.

## What it does

    ┌──────────────────────── inbound (STT) ────────────────────────┐
      WeChat voice message → SILK_V3 → 16 kHz PCM → faster-whisper → text
    └───────────────────────────────────────────────────────────────┘

    ┌──────────────────────── outbound (TTS) ───────────────────────┐
      agent reply text → Piper (local) | edge-tts (cloud, free)
                       → 24 kHz PCM → SILK_V3 (Tencent)
                            ↓ AES-128-ECB
                            ↓ iLink CDN upload
                            ↓ ITEM_VOICE / ITEM_FILE send
    └───────────────────────────────────────────────────────────────┘

### End-to-end demo (real output)

    $ python -c "..."   # synthesize → encode → transcribe round trip
    IN:  今天天气真好,我们一起去公园散步吧
    OUT: 今天天氣真好,我們一起去公園散步吧。
    duration_ms: 3900

Same TTS path produces a structurally valid Tencent SILK_V3 file (`\x02#!SILK_V3` magic, ~24 kHz, duration matches playback), which the iLink layer can upload and send to a target wxid.
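The `\x02#!SILK_V3` magic mentioned above can be checked without any audio tooling. The following is a minimal sketch, assuming the Tencent variant is the standard `#!SILK_V3` header prefixed with a single `0x02` byte. The helper name `describe_silk_header` is hypothetical and is not part of this package.

```python
# Header prefixes to look for at the very start of the payload.
TENCENT_MAGIC = b"\x02#!SILK_V3"   # Tencent variant: 0x02 byte + standard magic
STANDARD_MAGIC = b"#!SILK_V3"      # plain SILK v3 header, no prefix byte


def describe_silk_header(data: bytes) -> str:
    """Classify an audio payload by its leading bytes (hypothetical helper)."""
    if data.startswith(TENCENT_MAGIC):
        return "tencent-silk-v3"
    if data.startswith(STANDARD_MAGIC):
        return "standard-silk-v3"
    return "not-silk"
```

Passing the first 16 bytes of a generated file is enough, since only the prefix is inspected. A file that reports `standard-silk-v3` would be missing the leading `0x02` byte that the Tencent variant carries.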
## Status

| Capability | State | Notes |
|---|---|---|
| **TTS** (text → speech) | ✅ verified | Piper (local, ONNX) or edge-tts (Microsoft cloud, free) → 24 kHz mono → SILK_V3 Tencent variant |
| **STT** (speech → text) | ✅ verified | SILK / WAV / MP3 → faster-whisper (`base`, int8, CPU) |
| **WeChat file attachment send** | ✅ verified | `ITEM_FILE` reaches the target chat end-to-end |
| **WeChat native voice bubble** | ⚠️ unresolved | `ITEM_VOICE` API call succeeds, but the personal WeChat client does not render it; see [analysis](#engineering-findings) |
| **Doctor / local diagnostics** | ✅ verified | ffmpeg, base URL, token, context token preflight |

Test suite: `27 passed, 1 skipped` (the skip is a network-gated TTS test that runs with `HV_RUN_NETWORK_TESTS=1`).

## Quick start

macOS:

    git clone https://github.com/UNlawrence/hermes-weixin-voice-clean.git hermes-voice
    cd hermes-voice
    ./install.command   # or: ./scripts/install.sh

The installer:

- installs `uv` if missing
- installs `ffmpeg` via Homebrew if available
- installs the `hermes-voice`, `hermes-voice-doctor`, and `hermes-voice-stt` commands
- copies the Hermes skill into `~/.hermes/skills/hermes-voice`

During install you'll be prompted to pick a Piper TTS voice model (default: `zh_CN-huayan-medium`, ~63 MB). The first STT call downloads the faster-whisper `base` model (~145 MB) from HuggingFace. Both caches are local: once both models have been downloaded, Piper TTS and STT run **fully offline**.

To run setup again or change the voice later:

    UV_CACHE_DIR=.uv-cache uv run hermes-voice-setup

## Commands

### TTS: send a synthesized reply

    UV_CACHE_DIR=.uv-cache uv run hermes-voice wxid_xxx "今天天气真好"

### STT: transcribe an audio file

    UV_CACHE_DIR=.uv-cache uv run hermes-voice-stt /path/to/voice.silk --language zh

The command auto-detects SILK by its header bytes; any other input is handed to ffmpeg (WAV/MP3/M4A/OGG all work).
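The dispatch described above (SILK by header bytes, everything else through ffmpeg) might look like the following. This is a sketch under stated assumptions, not the package's actual code: `plan_decode` is a hypothetical name, and the ffmpeg arguments simply request the 16 kHz mono 16-bit PCM that the inbound pipeline feeds to faster-whisper.

```python
from typing import List, Optional

# Both the Tencent-prefixed and the plain SILK v3 magic count as SILK here.
SILK_MAGICS = (b"\x02#!SILK_V3", b"#!SILK_V3")


def plan_decode(path: str, head: bytes) -> Optional[List[str]]:
    """Return None for SILK input, otherwise an ffmpeg argv (hypothetical helper).

    None means "use a SILK decoder"; ffmpeg is not asked to handle SILK here.
    """
    if head.startswith(SILK_MAGICS):
        return None
    return [
        "ffmpeg", "-nostdin", "-y",
        "-i", path,
        "-ac", "1",       # downmix to mono
        "-ar", "16000",   # resample to 16 kHz
        "-f", "s16le",    # raw signed 16-bit little-endian PCM
        "pipe:1",         # write the PCM stream to stdout
    ]
```

The returned list can be passed to `subprocess.run(..., capture_output=True)` to collect the PCM bytes. Deciding on the file's leading bytes rather than its extension means a SILK payload saved with the wrong suffix is still routed correctly.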
### Generate a real `.silk` test file

    UV_CACHE_DIR=.uv-cache uv run python scripts/generate_test_silk.py \
      --text "这是一条测试语音" --keep-wav

Prints `md5`, `first16_hex`, and `duration_ms`. Useful for debugging the SILK encoder independently of the network path.

### Doctor: check local prerequisites

    UV_CACHE_DIR=.uv-cache uv run hermes-voice-doctor
    UV_CACHE_DIR=.uv-cache uv run hermes-voice-doctor wxid_xxx

Checks ffmpeg, the iLink base URL, the token, the Hermes account config, and optionally whether a context token for the target wxid is present.

### File-attachment fallback (verified reliable)

    UV_CACHE_DIR=.uv-cache uv run hermes-voice wxid_xxx \
      --send-audio-file /tmp/voice.wav

Sends the audio as `ITEM_FILE`. This is currently the only delivery shape verified to reach the recipient; see the findings below.

## Programmatic API

    import asyncio

    from hermes_voice import (
        send_voice_from_text,  # TTS → SILK → iLink send
        transcribe,            # SILK/WAV/MP3 bytes → text
    )

    async def main():
        # Outbound: synthesize the text and send it to the target wxid
        result = await send_voice_from_text("你好,我是 Hermes", "wxid_xxx")
        print(result.msg_id, result.duration_ms)

        # Inbound: read the raw voice bytes, then transcribe them
        with open("/path/to/wechat_voice.silk", "rb") as f:
            voice_bytes = f.read()
        text = await transcribe(voice_bytes, language="zh")
        print(text)

    asyncio.run(main())

## Engineering findings

This project was used to trace the full outbound voice path for personal WeChat accounts through the public iLink bot infrastructure. The full write-up is in [WEIXIN_VOICE_ANALYSIS.md](WEIXIN_VOICE_ANALYSIS.md); the short version:

1. **Local encoding is correct.** Generated `.silk` files have valid Tencent `SILK_V3` headers, expected duration, and stable size/md5.
2. **The native voice payload is correct.** `ITEM_VOICE` requests with `voice_item.media` AES-128-ECB ciphertext upload through the iLink CDN cleanly; `sendmessage` returns `ret=0`.
3. **The personal WeChat client still doesn't render it.** A controlled A/B test (marker text + immediate voice send) showed the text arriving and the voice not appearing in the recipient's client.
4. **`ITEM_FILE` audio attachments do arrive.** Same media bytes, same target, different `item.type`: delivered every time.

In summary: `ITEM_VOICE` over the public iLink bot path is accepted at the API layer but is not currently rendered by personal WeChat clients. The audio-file attachment path is the only shape verified to deliver reliably today.

## Roadmap / open hypotheses

These are the remaining hypotheses worth testing:

1. Public iLink bot outbound `ITEM_VOICE` is accepted by the API infrastructure but filtered before personal-client rendering.
2. Public reference implementations expose the voice payload schema without guaranteeing personal-client delivery support.
3. Additional private/internal fields may be required for true outbound voice bubble support.
4. If the product requirement is strictly "WeChat voice bubble," the most credible remaining path is WeChat client automation: drive the official client to record and send the audio itself.

## Configuration

Priority order:

1. `HV_*` values in `.env`
2. local Hermes Weixin account config in `~/.hermes/weixin/accounts/*.json`
3. fallback defaults

Example `.env`:

    HV_ILINK_BASE_URL=http://127.0.0.1:8080
    HV_ILINK_TOKEN=

    # TTS engine: "piper" (local, offline) or "edge" (Microsoft cloud, free, no key).
    HV_TTS_ENGINE=piper

    # Piper config (when HV_TTS_ENGINE=piper). Set by `hermes-voice-setup`.
    HV_TTS_MODEL_PATH=

    # edge-tts config (when HV_TTS_ENGINE=edge)
    HV_TTS_VOICE=zh-CN-XiaoxiaoNeural
    HV_TTS_RATE=+0%
    HV_TTS_PITCH=+0Hz

    # STT (local faster-whisper, fully offline)
    HV_STT_MODEL=base              # tiny / base / small / medium / large-v3
    HV_STT_COMPUTE_TYPE=int8       # int8 / int8_float16 / float16 / float32
    HV_STT_DEVICE=cpu              # cpu / cuda / auto

If you already use Hermes Weixin, the iLink fields are auto-loaded from the existing account config; you typically don't need to fill them in manually.
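The priority order above can be sketched as a three-layer lookup. This is an illustration only: `load_account` and `resolve_setting` are hypothetical helpers, the account JSON key names (`base_url`, `token`) are assumed rather than taken from Hermes, the defaults shown are placeholders, and real code would first load `.env` into the environment mapping.

```python
import glob
import json
import os
from typing import Mapping, Optional

# Placeholder defaults for illustration; the package's real defaults may differ.
DEFAULTS = {"HV_TTS_ENGINE": "piper", "HV_STT_MODEL": "base"}

# ASSUMED mapping from HV_* names to keys in the account JSON (hypothetical).
ACCOUNT_KEYS = {"HV_ILINK_BASE_URL": "base_url", "HV_ILINK_TOKEN": "token"}


def load_account(pattern: str = "~/.hermes/weixin/accounts/*.json") -> dict:
    """Return the first readable account config as a dict, or {} if none exists."""
    for path in sorted(glob.glob(os.path.expanduser(pattern))):
        try:
            with open(path, encoding="utf-8") as f:
                data = json.load(f)
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or malformed file: try the next one
        if isinstance(data, dict):
            return data
    return {}


def resolve_setting(
    name: str, env: Mapping[str, str], account: Mapping[str, object]
) -> Optional[str]:
    """Resolve one setting: HV_* value, then account config, then default."""
    value = env.get(name)
    if value:  # assumption: an empty value such as HV_ILINK_TOKEN= falls through
        return value
    key = ACCOUNT_KEYS.get(name)
    if key and account.get(key):
        return str(account[key])
    return DEFAULTS.get(name)
```

A call such as `resolve_setting("HV_ILINK_TOKEN", os.environ, load_account())` would then prefer an explicit `HV_ILINK_TOKEN`, fall back to the account file, and return `None` when neither layer nor the defaults supply a value.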
## Development

    UV_CACHE_DIR=.uv-cache uv run pytest

The `slow` marker covers the STT round-trip test (TTS → SILK → STT), which downloads the Whisper model on first run.

## Scope

This repository is:

- a working bidirectional voice I/O layer for the Hermes WeChat agent
- a reproducible SILK / iLink experiment harness
- an evidence-backed analysis of where public-iLink-bot outbound voice delivery currently stops working for personal WeChat clients

It is **not** a turnkey native-voice-bubble sender: that path remains unresolved, and the sections above explain why.

## License

MIT; see [LICENSE](LICENSE).