文章目录一、前言二、DeepSeekMath2. 数学预训练2.1. 数据收集与去污问题1解释一下1. 问题诊断为什么第一次迭代多样性不足2. 域分析Domain Analysis找出数学相关的网站3. 人工标注 URL 路径把漏网之鱼加入种子库4. 重新训练 fastText进入下一轮迭代5. 什么时候停止总结成一张图通俗类比回到原文2.2. 验证DeepSeekMath语料库的质量2.2.1. 训练设置2.2.2. 评估结果2.3. DeepSeekMath-Base 7B 的训练与评估一、前言仅供参考未经实验验证。DeepSeekMath是GRPO算法的原文但是由于这篇文章把算法部分放在了后面这一篇并不涉及GRPO算法但是意外收获是他的数据收集的方法也很有启发当我们需要收集一个领域内的数据集的时候一个直接的思路是先收集一些高质量的数据然后训练分类器但是这导致这个分类器只认识跟你收集的高质量的数据相似的数据而实际上可能存在很多并不相似但是也是你需要的多样化数据这时候我们就需要思考一个问题就是该如何找到更多的领域内但是分类器识别不到的数据一个简单的思路就是看来源就像你找到一篇论文是哪个团队的那这个团队有可能就是这个研究方向的就能找到这个团队的所有这个研究方向的论文一样你收集数据的网站来源就可能存在更多的同领域数据。二、DeepSeekMath论文标题DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language ModelsDeepSeekMath推动开放语言模型中数学推理的极限作者Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo机构DeepSeek-AI主导、清华大学、北京大学发表时间2024年2月6日arXiv:2402.03300GitHubhttps://github.com/deepseek-ai/DeepSeek-Math2. 数学预训练2.1. 数据收集与去污In this section, we will outline the process of constructing the DeepSeekMath Corpus from Common Crawl. As depicted in Figure 2, we present an iterative pipeline that demonstrates how to systematically gather a large-scale mathematical corpus from Common Crawl, starting with a seed corpus (e.g., a small but high-quality collection of math-related dataset).It’s worth noting that this approach is also applicable to other domains, such as coding.在本节中我们将概述从Common Crawl构建DeepSeekMath语料库的过程。如图2所示我们提出了一个迭代流程展示了如何从Common Crawl系统地收集大规模数学语料库起始于一个种子语料库例如一个小型但高质量的数学相关数据集集合。值得注意的是这种方法也适用于其他领域例如编码。First, we choose OpenWebMath (Paster et al., 2023), a collection of high-quality mathematical web texts, as our initial seed corpus. Using this corpus, we train a fastText model (Joulin et al., 2016) to recall more OpenWebMath-like mathematical web pages. Specifically, we randomly select 500,000 data points from the seed corpus as positive training examples and another 500,000 web pages from Common Crawl as negative ones.首先我们选择 OpenWebMath (Paster et al., 2023)) 作为初始种子语料库这是一个高质量的数学网络文本集合。利用该语料库我们训练了一个 fastText 模型 (Joulin et al., 2016) 来检索更多类似 OpenWebMath 的数学网络页面。具体而言我们从种子语料库中随机抽取 500,000 个数据点作为正训练样本并从 Common Crawl 中抽取另外 500,000 个网页作为负样本。We employ an open-source library1for training, configuring the vector dimension to 256, learning rate to 0.1, the maximum length of word n-gram to 3, the minimum number of word occurrences to 3, and the number of training epochs to 3. To reduce the size of the original Common Crawl, we employ URL-based deduplication and near-deduplication techniques, resulting in 40B HTML web pages.我们使用一个开源库1进行训练将向量维度配置为256学习率为0.1最大长度为3的词n-gram最小词出现次数为3训练轮数为3。为了减小原始Common Crawl的规模我们采用了基于URL的去重和近乎去重技术得到了400亿个HTML网页。1https://fasttext.ccWe then recall mathematical web pages from deduplicated Common Crawl with the fastText model. To filter out low-quality mathematical content, we rank the collected pages according to their scores predicted by the fastText model, and only preserve the top-ranking ones.我们随后从经过去重处理的Common Crawl数据集中使用fastText模型检索数学网页。为了过滤掉低质量的数学内容我们根据fastText模型预测的分数对收集到的网页进行排序并只保留排名靠前的网页。The volume of data preserved is assessed through pre-training experiments on the top 40B, 80B, 120B, and 160B tokens. In the first iteration, we choose to keep the top 40B tokens.通过在top 40B、80B、120B和160B个token上进行预训练实验来评估保留的数据量。在第一次迭代中我们选择保留top 40B个token。Figure 2 An iterative pipeline that collects mathematical web pages from Common Crawl.图 2 一个从 Common Crawl 收集数学网页的迭代流程。After the first iteration of data collection, numerous mathematical web pages remain uncollected, mainly because the fastText model is trained on a set of positive examples that lacks sufficient diversity. We therefore identify additional mathematical web sources to enrich the seed corpus, so that we can optimize the fastText model.在第一轮数据收集后仍有大量数学网页未被收集这主要是因为 fastText 模型是在一组多样性不足的正面示例上训练的。因此我们识别了额外的数学网络来源来丰富种子语料库以便优化 fastText 模型。Specifically, we first organize the entire Common Crawl into disjoint domains; a domain is defined as web pages sharing the same base URL. For each domain, we calculate the percentage of web pages that are collected in the first iteration.具体而言我们首先将整个Common Crawl组织成不相交的域域被定义为共享相同基本URL的网页。对于每个域我们计算在第一次迭代中收集的网页的百分比。Domains where over 10% of the web pages have been collected are classified as math-related (e.g., mathoverflow.net). Subsequently, we manually annotate the URLs associated with mathematical content within these identified domains (e.g., mathoverflow.net/questions). Web pages linked to these URLs, yet uncollected, will be added to the seed corpus.将超过10%网页已被收集的域名归类为数学相关例如mathoverflow.net。随后我们手动标注这些已识别域名中与数学内容相关的URL例如mathoverflow.net/questions。与这些URL相关但尚未被收集的网页将被添加到种子语料库中。This approach enables us to gather more positive examples, thereby training an improved fastText model capable of recalling more mathematical data in the subsequent iteration. After four iterations of data collection, we end up with 35.5M mathematical web pages, totaling 120B tokens.该方法使我们能够收集更多正样本从而训练一个改进的fastText模型该模型能够在后续迭代中回忆起更多数学数据。经过四轮数据收集我们最终获得了35.5M个数学网页共计120B个词元。In the fourth iteration, we notice that nearly 98% of the data has already been collected in the third iteration, so we decide to cease data collection.在第四次迭代中我们注意到在第三次迭代中已收集了近98%的数据因此我们决定停止数据收集。To avoid benchmark contamination, we follow Guo et al. (2024) to filter out web pages containing questions or answers from English mathematical benchmarks such as GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) and Chinese benchmarks such as CMATH (Wei et al., 2023) and AGIEval (Zhong et al., 2023).为避免基准测试污染我们遵循 Guo 等人 (2024) 的方法从英文数学基准测试如 GSM8K (Cobbe et al., 2021) 和 MATH (Hendrycks et al., 2021)以及中文基准测试如 CMATH (Wei et al., 2023) 和 AGIEval (Zhong et al., 2023)中过滤掉包含问题或答案的网页。The filtering criteria are as follows: any text segment containing a 10-gram string that matches exactly with any sub-string from the evaluation benchmarks is removed from our math training corpus. For benchmark texts that are shorter than 10 grams but have at least 3 grams, we employ exact matching to filter out contaminated web pages.过滤标准如下任何包含与评估基准中的任何子字符串完全匹配的10-gram字符串的文本片段将从我们的数学训练语料库中移除。对于长度小于10-gram但至少包含3-gram的基准文本我们采用精确匹配来过滤受污染的网页。问题1解释一下解释一下我理解他是先用高质量的数据训练一个分类模型fastText然后再去爬虫就能利用fastText识别出更多高质量的数据但是他又想找比fastText更多样化的数据之后他是怎么做的呢你的理解是对的而且已经抓住了核心逻辑先用高质量种子训练分类器 → 分类器召回更多数据 → 发现召回的数据多样性不够 → 想办法补充多样性。论文中后续的操作本质上可以概括为一句话让分类器回头看自己漏掉了哪些数学网站然后把这些漏掉的网页手动抓回来重新训练分类器再去新一轮召回。具体步骤如下1. 问题诊断为什么第一次迭代多样性不足第一次迭代用的正样本全部来自OpenWebMath。这个种子库虽然质量高但覆盖面有限——它可能主要来自某些特定的数学网站比如 Math StackExchange、某些数学论坛等。因此训练出来的 fastText 分类器会形成一种偏见它只认得这些特定网站的数学文本风格遇到其他数学网站比如中文数学论坛、某个大学的数学课程页、数学竞赛网站可能就认不出来了。结果就是第一轮召回的 40B tokens虽然都是数学相关的但来源高度同质化。2. 域分析Domain Analysis找出数学相关的网站论文的做法很聪明——他们不直接让模型去猜哪里还有数学内容而是让模型坦白自己是从哪些网站抓的数据步骤把整个 Common Crawl 按域名base URL分组。比如mathoverflow.net是一个域zhihu.com是另一个域。统计每个域里有多少网页被第一轮分类器抓走了。如果某个域被抓走了超过 10% 的网页就把这个域标记为数学相关域。原理如果mathoverflow.net里有大量网页被分类器判定为数学内容那说明这个整个网站很可能就是一个数学网站。分类器只是抓了这个网站的一部分同一网站下还有很多漏网之鱼。3. 人工标注 URL 路径把漏网之鱼加入种子库找到数学相关域之后下一步是人工介入在这些已标记的数学域中人工标注出包含数学内容的具体 URL 路径。例如第一轮可能只抓到了mathoverflow.net/questions/12345但人工发现mathoverflow.net/questions/这个路径下全是数学问答。把这些已经被识别为数学域、但第一轮没被分类器抓到的网页即这些 URL 下的其他页面全部加入种子语料库。关键作用这些新加入的网页来自同样的数学网站但可能是不同的子板块、不同的讨论主题、不同的文本风格。这就极大地丰富了正样本的多样性。4. 重新训练 fastText进入下一轮迭代现在种子库变大了原来的 OpenWebMath 新发现的漏网之鱼用这个更多样化的种子库重新训练 fastText 分类器。新的分类器见多识广了就能在下一轮召回中认出更多不同风格的数学网页发现更多新的数学域然后重复步骤 2-4一共做了4 轮迭代。5. 什么时候停止第四轮迭代时他们发现新增的数据量只占前一轮的2%左右说明已经趋于收敛于是停止。最终得到35.5M 个数学网页共 120B tokens。总结成一张图第1轮OpenWebMath(种子) → 训练fastText → 召回40B → 发现多样性不够 ↓ 分析被抓网页来自哪些域名 → 标记数学相关域 ↓ 人工标注这些域的数学URL路径 → 把漏掉的网页加入种子库 ↓ 种子库变大了、更多样了 ↓ 第2轮用新种子重新训练fastText → 召回更多、更多样 → 重复... ↓ ...直到第4轮几乎无新增 → 停止 → 最终120B tokens通俗类比想象你在一个巨大的图书馆Common Crawl里找数学书第一轮你手头只有 50 本数学书OpenWebMath你训练了一个助手fastText帮你找类似的书。助手只找回来了某个书架上的书而且都是关于微积分的。第二轮你发现助手找回来的书很多都来自3 楼 A 区。你推断3 楼 A 区整个区域可能都是数学书。你亲自去 3 楼 A 区逛了一圈发现除了微积分还有线性代数、概率论的书架但助手之前漏掉了。你把这部分书补充进样本重新训练助手。第三轮助手现在认识线性代数和概率论的风格了又帮你找到了 4 楼 B 区的数学书…第四轮助手已经逛遍了所有数学区域几乎找不到新书了任务完成。这就是论文中解决多样性不足的核心思路不是盲目扩大搜索范围而是通过域分析精准定位被遗漏的数学聚集地人工补充种子再迭代训练。回到原文2.2. 验证DeepSeekMath语料库的质量We run pre-training experiments to investigate how the DeepSeekMath Corpus is compared with the recently released math-training corpora:我们进行了预训练实验以研究DeepSeekMath语料库与最近发布的数学训练语料库相比如何MathPile(Wang et al., 2023c): a multi-source corpus (8.9B tokens) aggregated from textbooks, Wikipedia, ProofWiki, CommonCrawl, StackExchange, and arXiv, with the majority (over 85%) sourced from arXiv;MathPile (Wang et al., 2023c)一个多来源语料库89亿 tokens汇总自教科书、维基百科、ProofWiki、CommonCrawl、StackExchange 和 arXiv其中大部分超过 85%来自 arXivOpenWebMath(Paster et al., 2023): CommonCrawl data filtered for mathematical content, totaling 13.6B tokens;OpenWebMathPaster et al., 2023)为数学内容过滤的 CommonCrawl 数据总计 136 亿个 tokensProof-Pile-2(Azerbayev et al., 2023): a mathematical corpus consisting of OpenWebMath, AlgebraicStack (10.3B tokens of mathematical code), and arXiv papers (28.0B tokens). When experimenting on Proof-Pile-2, we follow Azerbayev et al. (2023) to use an arXiv:Web:Code ratio of 2:4:1.Proof-Pile-2 (Azerbayev et al., 2023)一个数学语料库由 OpenWebMath、AlgebraicStack103 亿个数学代码 tokens和 arXiv 论文280 亿个 tokens组成。当在 Proof-Pile-2 上进行实验时我们遵循 Azerbayev et al. (2023) 的方法使用 arXiv:Web:Code 的比例为 2:4:1。2.2.1. 训练设置We apply math training to a general pre-trained language model with 1.3B parameters, which shares the same framework as the DeepSeek LLMs (DeepSeek-AI, 2024), denoted as DeepSeek-LLM 1.3B. We separately train a model on each mathematical corpus for 150B tokens. All experiments are conducted using the efficient and light-weight HAI-LLM (High-flyer, 2023) training framework.我们将数学训练应用于一个拥有1.3B参数的通用预训练语言模型该模型与DeepSeek LLMsDeepSeek-AI, [NT0]2024共享相同的框架记为DeepSeekLLM 1.3B。我们针对每个数学语料库分别训练模型150B个token。所有实验均在高效轻量级的HAI-LLMHigh-flyer, [NT1]2023训练框架下进行。Following the training practice of DeepSeek LLMs, we use the AdamW optimizer (Loshchilov and Hutter, 2017) withβ 1 0.9 \beta_1 0.9β1​0.9,β 2 0.95 \beta_2 0.95β2​0.95, and weight_decay 0.1, along with a multi-step learning rate schedule where the learning rate reaches the peak after 2,000 warmup steps, decreases to its 31.6% after 80% of the training process, and further decreases to 10.0% of the peak after 90% of the training process. We set the maximum value of learning rate to5.3 e − 4 5.3e-45.3e−4, and use a batch size of 4M tokens with a 4K context length.遵循DeepSeek大模型的训练实践我们采用AdamW优化器Loshchilov and Hutter, 2017其参数设置为β 1 0.9 \beta_1 0.9β1​0.9β 2 0.95 \beta_2 0.95β2​0.95以及weight_decay 0.1。同时我们使用多步学习率衰减策略学习率在2000步预热后达到峰值在训练过程的80%时衰减至峰值的31.6%并在训练过程的90%时进一步衰减至峰值的10.0%。We set the maximum value of learning rate to 5.3e-4, and use a batch size of 4M tokens with a 4K context length.我们将学习率的最大值设置为 5.3e-4并使用 4M 词元tokens的批次大小和 4K 的上下文长度。2.2.2. 评估结果The DeepSeekMath Corpus is of high quality, covers multilingual mathematical content, and is the largest in size.DeepSeekMath语料库质量高涵盖多语种数学内容且规模最大。High-quality:We evaluate downstream performance on 8 mathematical benchmarks using few-shot chain-of-thought prompting Wei et al. (2022). As shown in Table 1, there is a clear performance lead of the model trained on the DeepSeekMath Corpus. Figure 3 shows that the model trained on the DeepSeekMath Corpus demonstrates better performance than Proof-Pile-2 at 50B tokens (1 full epoch of Proof-Pile-2), indicating the average quality of DeepSeekMath Corpus is higher.高质量我们使用少样本思维链提示Wei et al., 2022在8个数学基准上评估下游性能。如表1所示基于DeepSeekMath语料库训练的模型表现出明显的性能领先优势。图3显示在训练数据量为500亿个token即Proof-Pile-2的一个完整训练周期时基于DeepSeekMath语料库训练的模型表现优于Proof-Pile-2这表明DeepSeekMath语料库的平均质量更高。Table 1 Performance of DeepSeek-LLM 1.3B trained on different mathematical corpora, evaluated using few-shot chain-of-thought prompting. Corpus sizes are calculated using our tokenizer with a vocabulary size of 100K.表 1 在不同数学语料库上训练的 DeepSeek-LLM 1.3B 的性能使用少量样本的思维链提示进行评估。语料库大小使用我们词汇量为 10 万的 tokenizer 计算。Multilingual:The DeepSeekMath Corpus encompasses data in multiple languages, predominantly featuring English and Chinese as the two most represented languages. As shown in Table 1, training on the DeepSeekMath Corpus enhances mathematical reasoning performance in both English and Chinese. In contrast, existing mathematical corpora, which are primarily English-centric, show limited improvement and may even hinder performance in Chinese mathematical reasoning.多语言性DeepSeekMath语料库包含多种语言的数据其中英语和中文是两种最具代表性的语言。如表1所示在DeepSeekMath语料库上进行训练可以提高英语和中文的数学推理性能。相比之下现有的数学语料库主要以英语为中心其改进有限甚至可能阻碍中文数学推理的性能。Large-scale:The DeepSeekMath Corpus is several times larger than existing mathematical corpora. As depicted in Figure 3, DeepSeek-LLM 1.3B, when trained on the DeepSeekMath Corpus, shows a steeper learning curve along with more lasting improvements. In contrast, the baseline corpora are much smaller, and have already been repeated multiple rounds during training, with the resulting model performance quickly reaching a plateau.大规模性DeepSeekMath语料库比现有的数学语料库大数倍。如图3所示DeepSeek-LLM 1.3B在DeepSeekMath语料库上训练时表现出更陡峭的学习曲线以及更持久的改进。相比之下基线语料库要小得多并且在训练过程中已经重复了多轮导致模型性能迅速达到平台期。Figure 3 Benchmark curves of DeepSeek-LLM 1.3B trained on different mathematical corpora.图 3 在不同数学语料库上训练的 DeepSeek-LLM 1.3B 的基准曲线。2.3. DeepSeekMath-Base 7B 的训练与评估In this section, we introduce DeepSeekMath-Base 7B, a base model with strong reasoning abilities, especially in mathematics. Our model is initialized with DeepSeek-Coder-Base-v1.5 7B在本节中我们将介绍DeepSeekMath-Base 7B这是一个具有强大推理能力的基础模型尤其是在数学方面。我们的模型使用DeepSeek-Coder-Base-v1.5 7B进行初始化In this section, we introduce DeepSeekMath-Base 7B, a base model with strong reasoning abilities, especially in mathematics. Our model is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024) and trained for 500B tokens. The distribution of the data is as follows: 56% is from the DeepSeekMath Corpus, 4% from AlgebraicStack, 10% from arXiv, 20% is Github code, and the remaining 10% is natural language data from Common Crawl in both English and Chinese.在本节中我们介绍了DeepSeekMath-Base 7B一个具有强大推理能力的基础模型尤其是在数学领域。我们的模型初始化自DeepSeek-Coder-Base-v1.5 7BGuo等人2024并进行了500B个token的训练。数据的分布如下56%来自DeepSeekMath语料库4%来自AlgebraicStack10%来自arXiv20%是Github代码其余10%是来自Common Crawl的英文和中文自然语言数据。We mainly adopt the training setting specified in Section 2.2.1, except that we set the maximum value of the learning rate to 4.2e-4 and use a batch size of 10M tokens.我们主要采用第2.2.1节中指定的训练设置但我们将学习率的最大值设置为4.2e-4并使用10M token的批处理大小。We conduct a comprehensive assessment of the mathematical capabilities of DeepSeekMathBase 7B, focusing on its ability to produce self-contained mathematical solutions without relying on external tools, solve mathematical problems using tools, and conduct formal theorem proving. Beyond mathematics, we also provide a more general profile of the base model, including its performance of natural language understanding, reasoning, and programming skills.我们对DeepSeekMathBase 7B的数学能力进行了全面评估重点关注其在不依赖外部工具的情况下生成独立数学解题方案、使用工具解决数学问题以及进行形式化定理证明的能力。除了数学之外我们还提供了该基础模型的更通用概况包括其在自然语言理解、推理和编程技能方面的表现。Mathematical Problem Solving with Step-by-Step ReasoningWe evaluate DeepSeekMathBase’s performance of solving mathematical problems using few-shot chain-of-thought prompting (Wei et al., 2022), across eight benchmarks in English and Chinese. These benchmarks encompass quantitative reasoning (e.g., GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and CMATH (Wei et al., 2023)) and multiple-choice problems (e.g., MMLU-STEM (Hendrycks et al., 2020) and Gaokao-MathQA (Zhong et al., 2023)), covering diverse fields of mathematics from elementary to college-level complexity.采用逐步推理的数学问题求解我们评估了DeepSeekMathBase在解决数学问题方面的性能使用了少量样本的思维链提示Wei et al., 2022)涵盖了英语和中文的八个基准测试。这些基准测试包括定量推理例如GSM8K (Cobbe et al., 2021))MATH (Hendrycks et al., 2021))和CMATH (Wei et al., 2023))和多项选择题例如MMLU-STEM (Hendrycks et al., 2020)和Gaokao-MathQA (Zhong et al., 2023)))涵盖了从小学到大学水平的各种数学领域。As shown in Table 2, DeepSeekMath-Base 7B leads in performance across all eight benchmarks among the open-source base models (including the widely-used general model Mistral 7B (Jiang et al., 2023) and the recently released Llemma 34B (Azerbayev et al., 2023) which underwent math training on Proof-Pile-2 (Azerbayev et al., 2023)).如表 2 所示在所有八个基准测试中DeepSeekMath-Base 7B 在开源基础模型中处于领先地位包括广泛使用的通用模型 Mistral 7B (Jiang et al., 2023) 和最近发布的 Llemma 34B (Azerbayev et al., 2023)后者在 Proof-Pile-2 (Azerbayev et al., 2023) 上进行了数学训练。Table 2 | Comparisons between DeepSeekMath-Base 7B and strong base models on English and Chinese mathematical benchmarks. Models are evaluated with chain-of-thought prompting. Minerva results are quoted from Lewkowycz et al. (2022a).表2 DeepSeekMath-Base 7B与强大的基础模型在英语和中文数学基准测试上的比较。模型通过思维链提示进行评估。Minerva的结果引自Lewkowycz等人(2022a)。Notably, on the competitionlevel MATH dataset, DeepSeekMath-Base surpasses existing open-source base models by over 10% absolute, and outperforms Minerva 540B (Lewkowycz et al., 2022a), a closed-source base model 77 times larger which builds on PaLM (Lewkowycz et al., 2022b) and is further trained on mathematical texts.值得注意的是在竞赛级别的MATH数据集上DeepSeekMath-Base的性能超越现有开源基础模型超过10个百分点并且优于Minerva 540BLewkowycz等人2022a)。Minerva 540B是一个闭源基础模型其规模是DeepSeekMath-Base的77倍并且在数学文本上进行了进一步训练Lewkowycz等人2022b。Mathematical Problem Solving with Tool UseWe evaluate program-aided mathematical reasoning on GSM8K and MATH using few-shot program-of-thought prompting (Chen et al., 2022; Gao et al., 2023). Models are prompted to solve each problem by writing a Python program where libraries such as math and sympy can be utilized for intricate computations.使用工具进行数学问题求解 我们在GSM8K和MATH数据集上使用少样本思维程序提示Chen et al., 2022; Gao et al., 2023来评估程序辅助的数学推理能力。模型被提示通过编写Python程序来解决每个问题其中可以利用math和sympy等库进行复杂的计算。The execution result of the program is evaluated as the answer. As shown in Table 3, DeepSeekMath-Base 7B outperforms the prior state-of-the-art Llemma 34B.程序的执行结果被评估为答案。如表3所示DeepSeekMath-Base 7B的性能优于先前最先进的Llemma 34B。Formal MathematicsFormal proof automation is beneficial to ensure the accuracy and reliability of mathematical proofs and enhance efficiency, with increasing attention in recent years. We evaluate DeepSeekMath-Base 7B on the task of informal-to-formal proving from (Jiang et al., 2022) which is to generate a formal proof based on an informal statement, a formal counterpart of the statement, and an informal proof.形式化数学形式化证明的自动化有助于确保数学证明的准确性和可靠性并提高效率近年来受到越来越多的关注。我们在 (Jiang et al., 2022) 的非形式化到形式化证明任务上评估了 DeepSeekMath-Base 7B该任务是根据非形式化陈述、陈述的形式化对应物和非形式化证明来生成形式化证明。We evaluate on miniF2F (Zheng et al., 2021), a benchmark for formal Olympiad-level mathematics, and generate a formal proof in Isabelle for each problem with few-shot prompting. Following Jiang et al. (2022), we leverage models to generate proof sketches, and execute the off-the-shelf automated prover Sledgehammer (Paulson, 2010) to fill in the missing details.我们在 miniF2F (Zheng et al., 2021) 上进行评估这是一个面向奥林匹克数学的基准测试并通过少样本提示为每个问题生成 Isabelle 的形式化证明。遵循 Jiang et al. (2022), 我们利用模型生成证明草图并执行现成的自动证明器 Sledgehammer (Paulson, 2010) 来填充缺失的细节。As shown in Table 3, DeepSeekMath-Base 7B demonstrates strong performance in proof autoformalization.如表3所示DeepSeekMath-Base 7B在证明自动形式化方面表现出强大的性能。Table 3 Few-shot evaluation of base models’ ability to solve mathematical problems using tools and the ability to conduct informal-to-formal theorem proving in Isabelle.表 3 基础模型在使用工具解决数学问题以及在 Isabelle 中进行非形式化到形式化定理证明的能力的少样本评估。Natural Language Understanding, Reasoning, and CodeWe evaluate model performance of natural language understanding on MMLU (Hendrycks et al., 2020), reasoning on BBH (Suzgun et al., 2022), and coding capabilities on HumanEval (Chen et al., 2021) and MBPP (Austin et al.,2021).自然语言理解、推理和代码我们评估模型在以下方面的性能MMLU上的自然语言理解Hendrycks et al., 2020)BBH上的推理Suzgun et al., 2022)以及HumanEvalChen et al., 2021和MBPPAustin et al.,上的编码能力。As shown in Table 4, DeepSeekMath-Base 7B exhibits significant enhancements in performance on MMLU and BBH over its precursor, DeepSeek-Coder-Base-v1.5 (Guo et al., 2024), illustrating the positive impact of math training on language understanding and reasoning. Additionally, by including code tokens for continual training, DeepSeekMath-Base 7B effectively maintains the performance of DeepSeek-Coder-Base-v1.5 on the two coding benchmarks.如表 4 所示DeepSeekMath-Base 7B 在 MMLU 和 BBH 上的性能相比其前身 DeepSeek-Coder-Base-v1.5Guo 等人2024)有了显著提升2021这说明数学训练对语言理解和推理产生了积极影响。此外通过包含用于持续训练的代码标记DeepSeekMath-Base 7B 在两个代码基准测试中有效保持了 DeepSeek-Coder-Base-v1.5 的性能。Table 4 Evaluation on natural language understanding, reasoning, and code benchmarks. DeepSeek-Coder-Base-v1.5† is the checkpoint right before learning rate decay, which is used to train DeepSeekMath-Base. On MMLU and BBH, we use few-shot chain-of-thought prompting. On HumanEval and MBPP, we evaluate model performance under the zero-shot setting and a few-shot setting, respectively.表 4 在自然语言理解、推理和代码基准测试上的评估。DeepSeek-Coder-Base-v1.5† 是学习率衰减之前的检查点用于训练 DeepSeekMath-Base。在 MMLU 和 BBH 上我们使用少量样本的思维链提示。在 HumanEval 和 MBPP 上我们分别在零样本设置和少量样本设置下评估模型性能。Overall, DeepSeekMath-Base 7B significantly outperforms the general model Mistral 7B (Jiang et al., 2023) on the three reasoning and coding benchmarks.总体而言DeepSeekMath-Base 7B 在三个推理和编码基准测试上显著优于通用模型 Mistral 7B (Jiang et al., 2023)。