marin-community/marin

GitHub: marin-community/marin

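Marin is an end-to-end reproducible framework for foundation model research and development, covering the full pipeline from data curation to model training and evaluation.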

Stars: 1227 | Forks: 147

# Marin

[Marin](https://marin.community) is an open-source framework for the research and development of [foundation models](https://en.wikipedia.org/wiki/Foundation_model).

A key property of Marin is **reproducibility**: every step from raw data to the final model is recorded, not just the end result. This includes failed experiments, so the entire research process is transparent.

Marin's primary use case is training language models such as Llama, DeepSeek, and Qwen. Notably, this covers data curation, transformation, filtering, tokenization, training, and evaluation.

We used Marin to train the first open-source 8B-parameter model to outperform Llama 3.1 8B. You can see the [training script](https://github.com/marin-community/marin/blob/main/experiments/tootsie/exp600_tootsie.py) or read the [retrospective](docs/reports/marin-8b-retro.md).

The documentation for Marin is available on [ReadTheDocs](https://marin.readthedocs.io/en/latest/) or in the [`docs/`](docs/) folder.

To get started with Marin:

- [Install](docs/tutorials/installation.md) Marin.
- Train a [tiny language model](docs/tutorials/first-experiment.md) using Marin.
- See how to run a larger [DCLM 1B/1x](docs/tutorials/train-an-lm.md) experiment using Marin.
- See a [summary of the experiments](docs/reports/index.md) we have run.
- Join the [Marin Discord](https://discord.gg/J9CTk7pqcM) to talk with the community.

## Example

Marin experiments are defined as a set of steps that can depend on each other and are executed in topological order, like a Makefile.

As a brief example of how to use Marin, here is a complete script for training a tiny model on [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories). You can check out the [full script](https://github.com/marin-community/marin/blob/main/experiments/tutorials/train_tiny_model_cpu.py) for more details.

```python
from fray.cluster import ResourceConfig

from experiments.defaults import default_train
from experiments.llama import llama3_tokenizer, llama_nano
from experiments.simple_train_config import SimpleTrainConfig
from experiments.tokenization import default_tokenize
from marin.execution.executor import executor_main

# 1. Choose a dataset
tinystories_hf_id = "roneneldan/TinyStories"

# 2. Tokenize the dataset
tinystories_tokenized = default_tokenize(
    name=tinystories_hf_id,  # path to write tokenized files (tokenized/ will be prepended)
    dataset=tinystories_hf_id,  # HF dataset id
    tokenizer=llama3_tokenizer,
)

# 3. Define the training configuration
nano_train_config = SimpleTrainConfig(
    # Here we define the hardware resources we need.
    resources=ResourceConfig.with_cpu(),
    train_batch_size=4,
    num_train_steps=100,
    # set hyperparameters
    learning_rate=6e-4,
    weight_decay=0.1,
    # keep eval quick for tutorial
    max_eval_batches=4,
)

# 4. Train the model
nano_tinystories_model = default_train(
    name="marin-nano-tinystories",
    # Steps can depend on other steps: nano_tinystories_model depends on tinystories_tokenized
    tokenized=tinystories_tokenized,
    model_config=llama_nano,
    train_config=nano_train_config,
    # wandb tags
    tags=["llama", "nano", "tinystories", "tutorial"],
    # We can run many eval_harness (https://github.com/EleutherAI/lm-evaluation-harness) tasks in the loop
    # during training, but there's no point in running evals on such a tiny model
    eval_harness_tasks=[],
    # to keep tutorial fast, skip default validation sets
    use_default_validation=False,
)

if __name__ == "__main__":
    executor_main(
        steps=[
            nano_tinystories_model,
        ]
    )
```

Here, we create two [steps](docs/explanations/executor.md#steps): one for tokenizing the dataset and one for training the model. The training step depends on the tokenized dataset step, so it runs only after the tokenization step has completed.

With slight modifications, you can extend this to train a [larger model](docs/tutorials/train-an-lm.md) on a [larger dataset](docs/tutorials/train-an-lm.md), use a [mixture of datasets](docs/tutorials/train-an-lm.md#mixture-of-sources), or even scale to very large TPU pods (or multislice TPUs, with multi-node GPU support coming soon!).

## Agent Skills

- See `.agents/skills/` (and `.claude/skills/`) for loadable agent skills. For example, `.agents/skills/add-dataset/` provides a step-by-step guide for adding a new dataset.
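
The Makefile-like behavior described in the example section (steps that depend on each other and run in topological order) can be illustrated with a small standalone sketch. This is not Marin's executor: `Step` and `execution_order` are hypothetical names introduced here, and the ordering relies on Python's standard-library `graphlib.TopologicalSorter` (Python 3.9+) rather than on any Marin API.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


# Hypothetical stand-in for a pipeline step; NOT Marin's own step class.
@dataclass(frozen=True)
class Step:
    name: str
    deps: tuple = ()  # other Step objects this step depends on


def execution_order(targets):
    """Return step names ordered so each step comes after its dependencies."""
    graph = {}  # step name -> set of names it depends on
    stack = list(targets)
    while stack:
        step = stack.pop()
        if step.name in graph:
            continue  # already visited
        graph[step.name] = {dep.name for dep in step.deps}
        stack.extend(step.deps)  # walk transitive dependencies
    # TopologicalSorter expects a mapping of node -> predecessors.
    return list(TopologicalSorter(graph).static_order())


# Mirror the two-step tutorial: training depends on tokenization.
tokenize = Step("tokenize-tinystories")
train = Step("train-nano", deps=(tokenize,))

order = execution_order([train])
print(order)  # ['tokenize-tinystories', 'train-nano']
```

Only the final target is passed in, as with `executor_main(steps=[nano_tinystories_model])` in the tutorial script; the tokenization step is discovered by following dependencies. How Marin itself resolves, caches, or schedules steps is described in `docs/explanations/executor.md`, not by this sketch.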

Tags: large language models, data engineering, machine learning frameworks, model training, deep learning, reverse engineering tools