NVIDIA/personaplex

GitHub: NVIDIA/personaplex

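PersonaPlex is a real-time, full-duplex speech conversation model built on the Moshi architecture. It supports precise control of voice style and role through text and audio prompts.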

Stars: 8954 | Forks: 1261

# PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models

[![Weights](https://img.shields.io/badge/🤗-Weights-yellow)](https://huggingface.co/nvidia/personaplex-7b-v1)
[![Paper](https://img.shields.io/badge/📄-Paper-blue)](https://arxiv.org/abs/2602.06053)
[![Demo](https://img.shields.io/badge/🎮-Demo-green)](https://research.nvidia.com/labs/adlr/personaplex/)
[![Discord](https://img.shields.io/badge/Discord-Join-purple?logo=discord)](https://discord.gg/5jAXrrbwRb)

PersonaPlex is a real-time, full-duplex speech-to-speech conversational model that enables persona control through text-based role prompts and audio-based voice conditioning. Trained on a combination of synthetic and real conversations, it produces natural, low-latency spoken interactions with a consistent persona. PersonaPlex is based on the [Moshi](https://arxiv.org/abs/2410.00037) architecture and weights.

Figure: PersonaPlex model architecture.

## Usage

### Prerequisites

Install the [Opus audio codec](https://github.com/xiph/opus) development library:

```
# Ubuntu/Debian
sudo apt install libopus-dev

# Fedora/RHEL
sudo dnf install opus-devel
```

### Installation

Download this repository and install with:

```
pip install moshi/.
```

For Blackwell-based GPUs, run the following extra step as suggested in https://github.com/NVIDIA/personaplex/issues/2:

```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
```

### Accept Model License

Log in to your Huggingface account and accept the PersonaPlex model license [here](https://huggingface.co/nvidia/personaplex-7b-v1).
Then set up your Huggingface authentication:

```
export HF_TOKEN=
```

### Launch Server

Launch the server for live interaction (it uses temporary SSL certificates for https):

```
SSL_DIR=$(mktemp -d); python -m moshi.server --ssl "$SSL_DIR"
```

**CPU Offload:** If your GPU has insufficient memory, use the `--cpu-offload` flag to offload model layers to the CPU. This requires the `accelerate` package (`pip install accelerate`):

```
SSL_DIR=$(mktemp -d); python -m moshi.server --ssl "$SSL_DIR" --cpu-offload
```

If you are running locally, open the Web UI in your browser at `localhost:8998`. Otherwise, look for the access link printed by the script:

```
Access the Web UI directly at https://11.54.401.33:8998
```

### Offline Evaluation

For offline evaluation, use the offline script, which streams in an input wav file and produces an output wav file from the captured output stream. The output file has the same duration as the input file.

If your GPU has insufficient memory, add `--cpu-offload` to any of the commands below (requires the `accelerate` package). Alternatively, install a CPU-only build of PyTorch to run offline evaluation on the CPU alone.

**Assistant example:**

```
HF_TOKEN= \
python -m moshi.offline \
  --voice-prompt "NATF2.pt" \
  --input-wav "assets/test/input_assistant.wav" \
  --seed 42424242 \
  --output-wav "output.wav" \
  --output-text "output.json"
```

**Service example:**

```
HF_TOKEN= \
python -m moshi.offline \
  --voice-prompt "NATM1.pt" \
  --text-prompt "$(cat assets/test/prompt_service.txt)" \
  --input-wav "assets/test/input_service.wav" \
  --seed 42424242 \
  --output-wav "output.wav" \
  --output-text "output.json"
```

## Voices

PersonaPlex supports a wide range of voices. We pre-package embeddings for voices that sound more natural and conversational (NAT) and for others that are more varied (VAR). The fixed set of voices is labeled:

```
Natural(female): NATF0, NATF1, NATF2, NATF3
Natural(male):   NATM0, NATM1, NATM2, NATM3
Variety(female): VARF0, VARF1, VARF2, VARF3, VARF4
Variety(male):   VARM0, VARM1, VARM2, VARM3, VARM4
```

## Prompting Guide

The model is trained on synthetic conversations for a fixed assistant role and for varying customer service roles.

### Assistant Role

The assistant role has the prompt:

```
You are a wise and friendly teacher. Answer questions or provide advice in a clear and engaging way.
```

Use this prompt for the QA-assistant-focused "User Interruption" evaluation category in [FullDuplexBench](https://arxiv.org/abs/2503.04721).

### Customer Service Roles

The customer service roles support a variety of prompts. Here are some examples for prompting style reference:

```
You work for CitySan Services which is a waste management and your name is Ayelen Lucero. Information: Verify customer name Omar Torres. Current schedule: every other week. Upcoming pickup: April 12th. Compost bin service available for $8/month add-on.
```

```
You work for Jerusalem Shakshuka which is a restaurant and your name is Owen Foster. Information: There are two shakshuka options: Classic (poached eggs, $9.50) and Spicy (scrambled eggs with jalapenos, $10.25). Sides include warm pita ($2.50) and Israeli salad ($3). No combo offers. Available for drive-through until 9 PM.
```

```
You work for AeroRentals Pro which is a drone rental company and your name is Tomaz Novak. Information: AeroRentals Pro has the following availability: PhoenixDrone X ($65/4 hours, $110/8 hours), and the premium SpectraDrone 9 ($95/4 hours, $160/8 hours). Deposit required: $150 for standard models, $300 for premium.
```

### Casual Conversations

The model is also trained on real conversations from the [Fisher English Corpus](https://catalog.ldc.upenn.edu/LDC2004T19) with LLM-labeled prompts for open-ended conversations. Here are some example prompts for casual conversations:

```
You enjoy having a good conversation.
```

```
You enjoy having a good conversation. Have a casual discussion about eating at home versus dining out.
```

```
You enjoy having a good conversation. Have an empathetic discussion about the meaning of family amid uncertainty.
```

```
You enjoy having a good conversation. Have a reflective conversation about career changes and feeling of home. You have lived in California for 21 years and consider San Francisco your home. You work as a teacher and have traveled a lot. You dislike meetings.
```

```
You enjoy having a good conversation. Have a casual conversation about favorite foods and cooking experiences. You are David Green, a former baker now living in Boston. You enjoy cooking diverse international dishes and appreciate many ethnic restaurants.
```

Use the prompt `You enjoy having a good conversation.` for the "Pause Handling", "Backchannel", and "Smooth Turn Taking" evaluation categories of FullDuplexBench.

## Generalization

PersonaPlex finetunes Moshi and benefits from the generalization capabilities of the underlying [Helium](https://kyutai.org/blog/2025-04-30-helium) LLM. Because the backbone was trained on a broad corpus, we find that the model responds plausibly to out-of-distribution prompts and can lead to unexpected or fun conversations. We encourage experimenting with different prompts to test the model's emergent ability to handle scenarios outside its training distribution. As inspiration, we feature the following astronaut prompt in the WebUI:

```
You enjoy having a good conversation. Have a technical discussion about fixing a reactor core on a spaceship to Mars. You are an astronaut on a Mars mission. Your name is Alex. You are already dealing with a reactor core meltdown on a Mars mission. Several ship systems are failing, and continued instability will lead to catastrophic failure. You explain what is happening and you urgently ask for help thinking through how to stabilize the reactor.
```

## License

The present code is provided under the MIT license. The weights for the models are released under the NVIDIA Open Model license.

## Citation

If you use PersonaPlex in your research, please cite our paper:

```
@misc{roy2026personaplexvoicerolecontrol,
      title={PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models},
      author={Rajarshi Roy and Jonathan Raiman and Sang-gil Lee and Teodor-Dumitru Ene and Robert Kirby and Sungwon Kim and Jaehyeon Kim and Bryan Catanzaro},
      year={2026},
      eprint={2602.06053},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.06053},
}
```
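
## Example: Scripting an Offline Run

The customer service prompts in the prompting guide follow a recognizable pattern, and the offline evaluation commands all use the same flags. The sketch below combines the two so that several runs can be scripted. It is an illustration only, not part of the repository: `service_prompt` and `offline_command` are hypothetical helper names, and the only things taken from this README are the prompt wording, the voice labels, and the `moshi.offline` flags shown in the offline evaluation examples.

```python
import shlex
from typing import List, Optional

# Voice labels copied from the "Voices" section of this README.
VOICES = (
    [f"NATF{i}" for i in range(4)]
    + [f"NATM{i}" for i in range(4)]
    + [f"VARF{i}" for i in range(5)]
    + [f"VARM{i}" for i in range(5)]
)


def service_prompt(company: str, business: str, agent: str, facts: List[str]) -> str:
    """Hypothetical helper: format a customer service prompt in the README's style."""
    # Each fact becomes one sentence ending in exactly one period.
    info = " ".join(fact.rstrip(".") + "." for fact in facts)
    return (
        f"You work for {company} which is a {business} "
        f"and your name is {agent}. Information: {info}"
    )


def offline_command(
    voice: str,
    input_wav: str,
    text_prompt: Optional[str] = None,
    seed: int = 42424242,
    output_wav: str = "output.wav",
    output_text: str = "output.json",
) -> List[str]:
    """Hypothetical helper: build the argv list for `python -m moshi.offline`.

    Only flags that appear in this README's examples are used. HF_TOKEN is
    expected to be set in the environment and is not part of the argv.
    """
    if voice not in VOICES:
        raise ValueError(f"unknown voice label: {voice!r}")
    cmd = ["python", "-m", "moshi.offline", "--voice-prompt", f"{voice}.pt"]
    if text_prompt is not None:
        cmd += ["--text-prompt", text_prompt]
    cmd += [
        "--input-wav", input_wav,
        "--seed", str(seed),
        "--output-wav", output_wav,
        "--output-text", output_text,
    ]
    return cmd


if __name__ == "__main__":
    prompt = service_prompt(
        "AeroRentals Pro",
        "drone rental company",
        "Tomaz Novak",
        ["Deposit required: $150 for standard models, $300 for premium"],
    )
    cmd = offline_command("NATM1", "assets/test/input_service.wav", text_prompt=prompt)
    # Print a shell-quoted command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning an argv list rather than a shell string avoids quoting problems with prompts that contain `$` or parentheses, which the customer service examples do.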