Blaizzy/mlx-audio

GitHub: Blaizzy/mlx-audio

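An all-in-one speech processing library built on the Apple MLX framework. It combines text-to-speech, speech-to-text, and speech-to-speech for efficient local inference on Apple Silicon.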

Stars: 7367 | Forks: 633

# MLX-Audio

An audio processing library built on Apple's MLX framework, providing fast and efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) on Apple Silicon.

## Features

- Fast inference optimized for Apple Silicon (M-series chips)
- Multiple model architectures for TTS, STT, and STS
- Multilingual support across models
- Voice customization and cloning
- Adjustable speech speed
- Interactive web interface with 3D audio visualization
- OpenAI-compatible REST API
- Quantization support (3-bit, 4-bit, 6-bit, 8-bit, and more) for optimized performance
- Swift package for iOS/macOS integration

## Installation

### Using pip

```
pip install mlx-audio
```

### Using uv to install only the command line tools

Latest release from PyPI:

```
uv tool install --force mlx-audio --prerelease=allow
```

Latest code from GitHub:

```
uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow
```

### For development or the web interface

```
git clone https://github.com/Blaizzy/mlx-audio.git
cd mlx-audio
pip install -e ".[dev]"
```

## Quick Start

### Command Line

```
# Basic TTS generation
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text 'Hello, world!' --lang_code a

# With voice selection and speed adjustment
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text 'Hello!' --voice af_heart --speed 1.2 --lang_code a

# Play the audio immediately
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text 'Hello!' --play --lang_code a

# Save to a specific directory
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text 'Hello!' --output_path ./my_audio --lang_code a
```

### Python API

```
from mlx_audio.tts.utils import load_model

# Load the model
model = load_model("mlx-community/Kokoro-82M-bf16")

# Generate speech
for result in model.generate("Hello from MLX-Audio!", voice="af_heart"):
    print(f"Generated {result.audio.shape[0]} samples")
    # result.audio contains the waveform as mx.array
```

## Supported Models

### Text-to-Speech (TTS)

| Model | Description | Languages | Repo |
|-------|-------------|-----------|------|
| **Kokoro** | Fast, high-quality multilingual TTS | EN, JA, ZH, FR, ES, IT, PT, HI | [mlx-community/Kokoro-82M-bf16](https://huggingface.co/mlx-community/Kokoro-82M-bf16) |
| **Qwen3-TTS** | Alibaba's multilingual TTS with voice design | ZH, EN, JA, KO, + more | [mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16](https://huggingface.co/mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16) |
| **CSM** | Conversational speech model with voice cloning | EN | [mlx-community/csm-1b](https://huggingface.co/mlx-community/csm-1b) |
| **Dia** | Dialogue-focused TTS | EN | [mlx-community/Dia-1.6B-fp16](https://huggingface.co/mlx-community/Dia-1.6B-fp16) |
| **OuteTTS** | Efficient TTS model | EN | [mlx-community/OuteTTS-1.0-0.6B-fp16](https://huggingface.co/mlx-community/OuteTTS-1.0-0.6B-fp16) |
| **Spark** | SparkTTS model | EN, ZH | [mlx-community/Spark-TTS-0.5B-bf16](https://huggingface.co/mlx-community/Spark-TTS-0.5B-bf16) |
| **Chatterbox** | Expressive multilingual TTS | EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO | [mlx-community/chatterbox-fp16](https://huggingface.co/mlx-community/chatterbox-fp16) |
| **Soprano** | High-quality TTS | EN | [mlx-community/Soprano-1.1-80M-bf16](https://huggingface.co/mlx-community/Soprano-1.1-80M-bf16) |
| **Ming Omni TTS (BailingMM)** | Multimodal generation with voice cloning, style control, and speech/music/event generation | EN, ZH | [mlx-community/Ming-omni-tts-16.8B-A3B-bf16](https://huggingface.co/mlx-community/Ming-omni-tts-16.8B-A3B-bf16) |
| **Ming Omni TTS (Dense)** | Lightweight dense Ming Omni variant for voice cloning and style control | EN, ZH | [mlx-community/Ming-omni-tts-0.5B-bf16](https://huggingface.co/mlx-community/Ming-omni-tts-0.5B-bf16) |

### Speech-to-Text (STT)

| Model | Description | Languages | Repo |
|-------|-------------|-----------|------|
| **Whisper** | OpenAI's robust STT model | 99+ languages | [mlx-community/whisper-large-v3-turbo-asr-fp16](https://huggingface.co/mlx-community/whisper-large-v3-turbo-asr-fp16) |
| **Qwen3-ASR** | Alibaba's multilingual ASR | ZH, EN, JA, KO, + more | [mlx-community/Qwen3-ASR-1.7B-8bit](https://huggingface.co/mlx-community/Qwen3-ASR-1.7B-8bit) |
| **Qwen3-ForcedAligner** | Word-level audio alignment | ZH, EN, JA, KO, + more | [mlx-community/Qwen3-ForcedAligner-0.6B-8bit](https://huggingface.co/mlx-community/Qwen3-ForcedAligner-0.6B-8bit) |
| **Parakeet** | NVIDIA's high-accuracy STT | EN (v2), 25 European languages (v3) | [mlx-community/parakeet-tdt-0.6b-v3](https://huggingface.co/mlx-community/parakeet-tdt-0.6b-v3) |
| **Voxtral** | Mistral's speech model | Multiple | [mlx-community/Voxtral-Mini-3B-2507-bf16](https://huggingface.co/mlx-community/Voxtral-Mini-3B-2507-bf16) |
| **Voxtral Realtime** | Mistral's 4B streaming STT | Multiple | [4bit](https://huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit), [fp16](https://huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-fp16) |
| **VibeVoice-ASR** | Microsoft's 9B-parameter ASR with speaker diarization and timestamps | Multiple | [mlx-community/VibeVoice-ASR-bf16](https://huggingface.co/mlx-community/VibeVoice-ASR-bf16) |
| **Canary** | NVIDIA's multilingual ASR with translation | 25 EU languages + RU, UK | [README](mlx_audio/stt/models/canary/README.md) |
| **Moonshine** | Lightweight ASR from Useful Sensors | EN | [README](mlx_audio/stt/models/moonshine/README.md) |

### Voice Activity Detection / Speaker Diarization (VAD)

| Model | Description | Languages | Repo |
|-------|-------------|-----------|------|
| **Sortformer v1** | NVIDIA's end-to-end speaker diarization (up to 4 speakers) | Language-agnostic | [mlx-community/diar_sortformer_4spk-v1-fp32](https://huggingface.co/mlx-community/diar_sortformer_4spk-v1-fp32) |
| **Sortformer v2.1** | NVIDIA's streaming speaker diarization with AOSC compression | Language-agnostic | [mlx-community/diar_streaming_sortformer_4spk-v2.1-fp32](https://huggingface.co/mlx-community/diar_streaming_sortformer_4spk-v2.1-fp32) |

For API details, streaming examples, and model conversion, see the [Sortformer README](mlx_audio/vad/models/sortformer/README.md).

### Speech-to-Speech (STS)

| Model | Description | Use Case | Repo |
|-------|-------------|----------|------|
| **SAM-Audio** | Text-guided source separation | Extract specific sounds | [mlx-community/sam-audio-large](https://huggingface.co/mlx-community/sam-audio-large) |
| **Liquid2.5-Audio** | Speech-to-speech, text-to-speech, and speech-to-text | Voice interaction | [mlx-community/LFM2.5-Audio-1.5B-8bit](https://huggingface.co/mlx-community/LFM2.5-Audio-1.5B-8bit) |
| **MossFormer2 SE** | Speech enhancement | Noise removal | [starkdmi/MossFormer2_SE_48K_MLX](https://huggingface.co/starkdmi/MossFormer2_SE_48K_MLX) |

## Model Examples

### Kokoro TTS

Kokoro is a fast multilingual TTS model with 54 preset voices.

```
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Kokoro-82M-bf16")

# Generate with a specific voice
for result in model.generate(
    text="Welcome to MLX-Audio!",
    voice="af_heart",  # American female
    speed=1.0,
    lang_code="a"  # American English
):
    audio = result.audio
```

**Available voices:**

- American English: `af_heart`, `af_bella`, `af_nova`, `af_sky`, `am_adam`, `am_echo`, and others
- British English: `bf_alice`, `bf_emma`, `bm_daniel`, `bm_george`, and others
- Japanese: `jf_alpha`, `jm_kumo`, and others
- Chinese: `zf_xiaobei`, `zm_yunxi`, and others

**Language codes:**

| Code | Language | Note |
|------|----------|------|
| `a` | American English | Default |
| `b` | British English | |
| `j` | Japanese | Requires `pip install misaki[ja]` |
| `z` | Mandarin Chinese | Requires `pip install misaki[zh]` |
| `e` | Spanish | |
| `f` | French | |

### Qwen3-TTS

Alibaba's multilingual TTS model with voice cloning, emotion control, and voice design.

```
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16")
results = list(model.generate(
    text="Hello, welcome to MLX-Audio!",
    voice="Chelsie",
    language="English",
))

audio = results[0].audio  # mx.array
```

For voice cloning, CustomVoice, VoiceDesign, and all available models, see the [Qwen3-TTS README](mlx_audio/tts/models/qwen3_tts/README.md).

### Ming Omni TTS (BailingMM)

```
mlx_audio.tts.generate \
    --model mlx-community/Ming-omni-tts-16.8B-A3B-bf16 \
    --prompt "Please generate speech based on the following description.\n" \
    --text "This is a quick Ming Omni test." \
    --lang_code en \
    --output_path audio_io \
    --file_prefix ming_basic \
    --verbose
```

For CLI and Python cookbook examples, see the [Ming Omni TTS README](mlx_audio/tts/models/bailingmm/README.md). For workflows with `mlx-community/Ming-omni-tts-0.5B-bf16`, see the [Ming Omni Dense README](mlx_audio/tts/models/dense/README.md).

### CSM (Voice Cloning)

Clone a voice from a reference audio sample:

```
mlx_audio.tts.generate \
    --model mlx-community/csm-1b \
    --text "Hello from Sesame." \
    --ref_audio ./reference_voice.wav \
    --play
```

### Whisper STT

```
from mlx_audio.stt.generate import generate_transcription

result = generate_transcription(
    model="mlx-community/whisper-large-v3-turbo-asr-fp16",
    audio="audio.wav",
)
print(result.text)
```

### Qwen3-ASR & ForcedAligner

Alibaba's multilingual speech models for transcription and word-level alignment.

```
from mlx_audio.stt import load

# Speech recognition
model = load("mlx-community/Qwen3-ASR-0.6B-8bit")
result = model.generate("audio.wav", language="English")
print(result.text)

# Word-level forced alignment
aligner = load("mlx-community/Qwen3-ForcedAligner-0.6B-8bit")
result = aligner.generate("audio.wav", text="I have a dream", language="English")
for item in result:
    print(f"[{item.start_time:.2f}s - {item.end_time:.2f}s] {item.text}")
```

For CLI usage, all models, and more examples, see the [Qwen3-ASR README](mlx_audio/stt/models/qwen3_asr/README.md).

### VibeVoice-ASR

Microsoft's 9B-parameter speech-to-text model with speaker diarization and timestamps. It handles long-form audio (up to 60 minutes) and outputs structured JSON.

```
from mlx_audio.stt.utils import load

model = load("mlx-community/VibeVoice-ASR-bf16")

# Basic transcription
result = model.generate(audio="meeting.wav", max_tokens=8192, temperature=0.0)
print(result.text)
# [{"Start":0,"End":5.2,"Speaker":0,"Content":"Hello everyone, let's begin."},
#  {"Start":5.5,"End":9.8,"Speaker":1,"Content":"Thanks for joining today."}]

# Access the parsed segments
for seg in result.segments:
    print(f"[{seg['start_time']:.1f}-{seg['end_time']:.1f}] Speaker {seg['speaker_id']}: {seg['text']}")
```

**Streaming transcription:**

```
# Stream tokens as they are generated
for text in model.stream_transcribe(audio="speech.wav", max_tokens=4096):
    print(text, end="", flush=True)
```

**With context (hotwords/metadata):**

```
result = model.generate(
    audio="technical_talk.wav",
    context="MLX, Apple Silicon, PyTorch, Transformer",
    max_tokens=8192,
    temperature=0.0,
)
```

**CLI usage:**

```
# Basic transcription
python -m mlx_audio.stt.generate \
    --model mlx-community/VibeVoice-ASR-bf16 \
    --audio meeting.wav \
    --output-path output \
    --format json \
    --max-tokens 8192 \
    --verbose

# With context/hotwords
python -m mlx_audio.stt.generate \
    --model mlx-community/VibeVoice-ASR-bf16 \
    --audio technical_talk.wav \
    --output-path output \
    --format json \
    --max-tokens 8192 \
    --context "MLX, Apple Silicon, PyTorch, Transformer" \
    --verbose
```

### Parakeet (Multilingual STT)

NVIDIA's high-accuracy speech-to-text model. Parakeet v3 supports 25 European languages.

```
from mlx_audio.stt.utils import load

# Load the multilingual v3 model
model = load("mlx-community/parakeet-tdt-0.6b-v3")

# Transcribe audio
result = model.generate("audio.wav")
print(f"Text: {result.text}")

# Print each sentence with its start and end timestamps
for sentence in result.sentences:
    print(f"[{sentence.start:.2f}s - {sentence.end:.2f}s] {sentence.text}")
```

**Streaming transcription:**

```
for chunk in model.generate("long_audio.wav", stream=True):
    print(chunk.text, end="", flush=True)
```

**Supported languages (v3):** Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Russian, Ukrainian

**CLI usage:**

```
python -m mlx_audio.stt.generate \
    --model mlx-community/parakeet-tdt-0.6b-v3 \
    --audio speech.wav \
    --output-path output \
    --format json \
    --verbose
```

### Voxtral Realtime

Mistral's 4B-parameter streaming speech-to-text model, optimized for low-latency transcription.

Available versions: [4bit](https://huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit) (smaller/faster) | [fp16](https://huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-fp16) (full precision)

```
from mlx_audio.stt.utils import load

# Use 4bit for faster inference, fp16 for full precision
model = load("mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit")

# Transcribe audio
result = model.generate("audio.wav")
print(result.text)

# Streaming transcription
for chunk in model.generate("audio.wav", stream=True):
    print(chunk, end="", flush=True)

# Adjust the transcription delay (lower = faster but less accurate)
result = model.generate("audio.wav", transcription_delay_ms=240)
```

### MedASR (Medical Transcription)

A model specialized for medical terminology and dictation.

```
from mlx_audio.stt.utils import load, transcribe

model = load("mlx-community/medasr")
result = transcribe("medical_dictation.wav", model=model)
print(result["text"])
```

**Live transcription example:**

```
# Continuous live transcription with VAD
python examples/medasr_live.py
```

### SAM-Audio (Source Separation)

Separate specific sounds from audio using text prompts:

```
from mlx_audio.sts import SAMAudio, SAMAudioProcessor, save_audio

model = SAMAudio.from_pretrained("mlx-community/sam-audio-large")
processor = SAMAudioProcessor.from_pretrained("mlx-community/sam-audio-large")

batch = processor(
    descriptions=["A person speaking"],
    audios=["mixed_audio.wav"],
)

result = model.separate_long(
    batch.audios,
    descriptions=batch.descriptions,
    anchors=batch.anchor_ids,
    chunk_seconds=10.0,
    overlap_seconds=3.0,
    ode_opt={"method": "midpoint", "step_size": 2/32},
)

save_audio(result.target[0], "voice.wav")
save_audio(result.residual[0], "background.wav")
```

### MossFormer2 (Speech Enhancement)

Remove noise from speech recordings:

```
from mlx_audio.sts import MossFormer2SEModel, save_audio

model = MossFormer2SEModel.from_pretrained("starkdmi/MossFormer2_SE_48K_MLX")
enhanced = model.enhance("noisy_speech.wav")
save_audio(enhanced, "clean.wav", 48000)
```

## Web Interface & API Server

MLX-Audio includes a web interface and an OpenAI-compatible API.

### Starting the Server

```
# Start the API server
mlx_audio.server --host 0.0.0.0 --port 8000

# Start the web UI (in another terminal)
cd mlx_audio/ui
npm install && npm run dev
```

### API Endpoints

**Text-to-Speech** (OpenAI-compatible):

```
curl -X POST http://localhost:8000/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"model": "mlx-community/Kokoro-82M-bf16", "input": "Hello!", "voice": "af_heart"}' \
    --output speech.wav
```

**Speech-to-Text**:

```
curl -X POST http://localhost:8000/v1/audio/transcriptions \
    -F "file=@audio.wav" \
    -F "model=mlx-community/whisper-large-v3-turbo-asr-fp16"
```

## Quantization

Use the conversion script to reduce model size and improve performance through quantization:

```
# Convert and quantize to 4-bit
# (--upload-repo is optional: include it only to upload the model to Hugging Face)
python -m mlx_audio.convert \
    --hf-path prince-canuma/Kokoro-82M \
    --mlx-path ./Kokoro-82M-4bit \
    --quantize \
    --q-bits 4 \
    --upload-repo username/Kokoro-82M-4bit

# Convert with MXFP4 quantization
python -m mlx_audio.convert \
    --hf-path prince-canuma/Kokoro-82M \
    --mlx-path ./Kokoro-82M-mxfp4 \
    --quantize \
    --q-mode mxfp4

# Convert with a specific dtype (bfloat16)
# (--upload-repo is optional: include it only to upload the model to Hugging Face)
python -m mlx_audio.convert \
    --hf-path prince-canuma/Kokoro-82M \
    --mlx-path ./Kokoro-82M-bf16 \
    --dtype bfloat16 \
    --upload-repo username/Kokoro-82M-bf16
```

**Options:**

| Flag | Description |
|------|-------------|
| `--hf-path` | Source Hugging Face model or local path |
| `--mlx-path` | Output directory for the converted model |
| `-q, --quantize` | Enable quantization |
| `--q-bits` | Bits per weight (optional; the default depends on `--q-mode`) |
| `--q-group-size` | Quantization group size (optional; the default depends on `--q-mode`) |
| `--q-mode` | Quantization mode: `affine`, `mxfp4`, `mxfp8`, `nvfp4` |
| `--dtype` | Weight dtype: `float16`, `bfloat16`, `float32` |
| `--upload-repo` | Upload the converted model to the HF Hub |

## Swift

Looking for Swift/iOS support? See [mlx-audio-swift](https://github.com/Blaizzy/mlx-audio-swift) for on-device TTS with MLX on macOS and iOS.

## Requirements

- Python 3.10+
- Apple Silicon Mac (M1/M2/M3/M4)
- MLX framework
- **ffmpeg** (required for MP3/FLAC audio encoding)

### Installing ffmpeg

ffmpeg is required to save audio in MP3 or FLAC format. Install it with:

```
# macOS (using Homebrew)
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg
```

WAV output works without ffmpeg.

## License

[MIT License](LICENSE)

## Citation

```
@misc{mlx-audio,
  author = {Canuma, Prince},
  title = {MLX Audio},
  year = {2025},
  howpublished = {\url{https://github.com/Blaizzy/mlx-audio}},
  note = {Audio processing library for Apple Silicon with TTS, STT, and STS capabilities.}
}
```

## Acknowledgements

- [Apple MLX Team](https://github.com/ml-explore/mlx) for the MLX framework
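
## Appendix: calling the speech endpoint from Python

The text-to-speech `curl` request to `/v1/audio/speech` can also be issued from a script. The following is a minimal sketch using only the Python standard library, not an official client. It assumes the server is running at `http://localhost:8000` and returns the audio bytes in the response body, as the `--output speech.wav` flag in the `curl` example suggests. The helper names `build_speech_request` and `synthesize` are hypothetical and are not part of mlx-audio.

```python
import json
import urllib.request


def build_speech_request(
    text,
    voice="af_heart",
    model="mlx-community/Kokoro-82M-bf16",
    base_url="http://localhost:8000",  # assumption: local mlx_audio.server instance
):
    """Build a POST request mirroring the curl example's JSON body."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="speech.wav", **kwargs):
    """Send the request and write the returned bytes to out_path.

    Assumes the response body is the raw audio file.
    """
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        audio_bytes = response.read()
    with open(out_path, "wb") as f:
        f.write(audio_bytes)
    return out_path


# Example (requires a running server):
# synthesize("Hello!", out_path="speech.wav")
```

Building the request separately from sending it lets you inspect or log the payload without contacting the server.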
