DeepSeek-R1: A Paradigm Shift in Large Language Model Development and Transparency [Paper]

Introduction

DeepSeek-R1 is a foundation model developed by the Chinese company DeepSeek. The paper detailing the model and its training was published as the cover article in Nature (Vol. 645, Issue 8081, September 17, 2025). This publication helped foster a shift toward public research contributions in the field of Artificial Intelligence. In fact, it is the first frontier-scale LLM to complete three rounds of peer review at a prestigious journal (DeepSeek-AI, 2025). This is a paradigm shift toward openness and greater scientific scrutiny in an industry where developers of foundation models tend either to withhold the techniques they use to train their models or to publish only non-reviewed preprints.

The paper introduces two models: R1-Zero and R1. R1-Zero is trained from DeepSeek-V3-Base, a 671B-parameter mixture-of-experts (MoE) model, using pure reinforcement learning with no supervised fine-tuning on reasoning data (DeepSeek-AI, 2025). This approach allows the model to learn "thinking" behaviors, such as long chains of thought and self-correction, by relying on rule-based rewards that incentivize correct answers and solutions to various problems. The full R1 model takes this a step further and uses a four-stage pipeline consisting of cold-start Supervised Fine-Tuning (SFT), reasoning reinforcement learning (RL), rejection sampling plus SFT, and a secondary RL stage.

The primary contribution of the R1 work is to demonstrate that reasoning behaviors can arise from pure Reinforcement Learning with verifiable rewards. The model does not need a human in the loop to rank its answers, or a complex framework to guide its logic; it can rely on objective, automatically checkable results. This report examines the foundations of the models, the significance of emergent reasoning in Large Language Models, the methodology used to train them, and their performance and limitations.

Method

DeepSeek-R1 is built on top of DeepSeek-V3-Base, a 671B-parameter model with a Mixture-of-Experts (MoE) architecture. This architecture lets a model grow its total capacity without a proportional increase in per-token compute or runtime cost. Unlike traditional dense models, where every parameter is used to process every input, an MoE model activates only the specific parts of the network (experts) needed for a given token. This configuration uses roughly 37 billion active parameters for each token. The model inherits its attention mechanism and memory-efficient design from earlier DeepSeek models (DeepSeek-AI, 2025).
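The sparse activation described above can be illustrated with a minimal sketch of generic top-k expert routing. This is not DeepSeek's router, whose details differ; the function names, the toy scalar experts, and the choice of k=2 are all assumptions made for illustration.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    # Indices of the k largest logits; every other expert stays inactive for this token.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (subtract the max for numerical stability).
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: exps[i] / total for i in chosen}

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts for one token."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Toy example (hypothetical): four scalar "experts", only two run for this token.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 2.0]
active = top_k_route(router_logits, k=2)
output = moe_forward(3.0, experts, router_logits, k=2)
```

Only the selected experts are evaluated, which is why the per-token cost tracks the active parameter count rather than the total.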

Group Relative Policy Optimization (GRPO) is arguably R1's hallmark technical feature. Traditional Proximal Policy Optimization (PPO) requires a separate critic (a learned value function) for baseline estimation, which increases memory usage. In contrast, GRPO uses group-based reward normalization. For each prompt, GRPO samples a group of outputs from the current policy, scores each one, and uses the group's mean reward as the baseline: an output's advantage is its reward minus the group mean, divided by the group's standard deviation.
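The group-based normalization can be sketched in a few lines. This is a minimal illustration, not the paper's training code: the function name, the small `eps` stabilizer, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same prompt."""
    baseline = mean(rewards)   # group mean stands in for a learned critic
    spread = pstdev(rewards)   # group standard deviation (population form, assumed)
    # eps (an assumption) avoids division by zero when all rewards are equal.
    return [(r - baseline) / (spread + eps) for r in rewards]

# Hypothetical rewards for four sampled outputs to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that beat their group's average get a positive advantage and are reinforced; the rest are pushed down. No value network is needed because the baseline comes from the samples themselves.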

The reward for each completed output is based on correctness and compliance with the required format. The group-based baseline helps the model identify strong reasoning paths without requiring a learned value model (DeepSeek-AI, 2025). Using rule-based rewards rather than a learned neural reward model also reduces exposure to reward hacking, where an LLM finds a shortcut to a high score without actually performing the task it was asked to complete.
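A rule-based reward of this kind might look like the sketch below. The `<think>`/`<answer>` tag layout, the exact-match comparison, and the 0.5 format weight are illustrative assumptions, not the values used in the paper.

```python
import re

# Assumed output layout: reasoning in <think> tags, then the result in <answer> tags.
PATTERN = re.compile(r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL)

def rule_based_reward(completion, reference_answer, format_weight=0.5):
    """Score a completion using only deterministic checks (no learned reward model)."""
    match = PATTERN.match(completion.strip())
    if match is None:
        return 0.0  # malformed output earns nothing in this sketch
    answer = match.group(2).strip()
    # Exact string match stands in for a real verifier (math checker, unit tests).
    accuracy = 1.0 if answer == reference_answer.strip() else 0.0
    return format_weight + accuracy

good = rule_based_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4")
wrong = rule_based_reward("<think>2 + 2 = 5</think><answer>5</answer>", "4")
malformed = rule_based_reward("4", "4")
```

Because every check is deterministic, the only way to earn the accuracy term is to produce the reference answer in the expected format.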

During RL training, R1-Zero develops reasoning abilities, including extended thought processes, self-reflection, verification, and dynamic strategy adaptation, that help it derive solutions. The model learns that producing intermediate thinking tokens before generating a final answer increases the likelihood of earning higher correctness rewards. Note that R1 comes in two flavors: the full version and the smaller distilled versions.

R1 Full

The full R1 version is trained with a four-stage pipeline: (1) cold-start Supervised Fine-Tuning (SFT) on thousands of long chain-of-thought (long-CoT) examples, which sets the reasoning format; (2) reasoning-oriented RL with GRPO on math and code; (3) rejection sampling plus SFT to increase task coverage; and (4) "Reinforcement Learning for all Scenarios," a secondary reinforcement learning stage emphasizing helpfulness and harmlessness across many different prompt distributions (DeepSeek-AI, 2025).
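Stage (3) can be illustrated with a minimal sketch of rejection sampling: draw several completions per prompt, keep only those that pass a check, and reuse the survivors as SFT data. The function names, the sample count, and the toy generator and checker are hypothetical stand-ins for a real model and verifier.

```python
import random

def rejection_sample(prompt, reference, generate, is_correct, num_samples=8):
    """Keep only the sampled completions that pass the correctness check."""
    kept = []
    for _ in range(num_samples):
        completion = generate(prompt)
        if is_correct(completion, reference):
            kept.append({"prompt": prompt, "completion": completion})
    return kept

# Toy stand-ins (assumptions): a "model" that guesses and an exact-match checker.
rng = random.Random(0)

def toy_generate(prompt):
    return rng.choice(["4", "5"])

def toy_is_correct(completion, reference):
    return completion == reference

sft_examples = rejection_sample("What is 2 + 2?", "4", toy_generate, toy_is_correct)
```

The surviving prompt and completion pairs form a filtered dataset for the next supervised fine-tuning pass.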

Cold-start Supervised Fine-Tuning provides a "nudge" in the right direction before the model starts learning through trial and error. Without it, a model asked to reason tends to ramble or get stuck in a loop because it doesn't know what a chain of thought looks like. To address this, researchers feed the model a small, high-quality dataset of long, step-by-step solutions or reasoning paths, written by humans or generated by other models. This "warms up" the model by teaching it the basic format of reasoning: how to start a problem, break it into steps, and draw a conclusion. Once the model has learned this format, it is ready for a more rigorous Reinforcement Learning phase.

Long Chain-of-Thought (Long-CoT) refers to the model's ability to "think" for a long time before providing a final answer. Instead of immediately providing a result, the model generates an internal monologue to explore different angles and check its own work. It can catch its own mistakes, course-correct, and try a different approach if it is going down the wrong path. As it spends more time thinking, its accuracy increases.

R1 Distilled

The R1 family includes six smaller models produced with a technique called knowledge distillation. Distillation is the process of transferring the capabilities of a massive, complex model into a much smaller, faster, or cheaper one using a teacher-student dynamic. The process creates a "student" model that performs almost as well as the large "teacher" model while using a fraction of the hardware. Distilled models inherit reasoning patterns despite being trained only on the teacher's outputs, not through the Reinforcement Learning process (DeepSeek-AI, 2025). The base models fine-tuned to produce the distilled R1 models come from the Qwen2.5 and Llama-3 families.
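Training "only on outputs" means the student is fitted with ordinary next-token supervised learning on text written by the teacher, with no reward signal involved. A minimal sketch of that objective follows; the function name and the probabilities are hypothetical, and a real setup would compute them from the student's logits.

```python
import math

def sft_loss(student_token_probs):
    """Average negative log-likelihood the student assigns to the teacher's tokens."""
    return -sum(math.log(p) for p in student_token_probs) / len(student_token_probs)

# Hypothetical probabilities the student gives to each token of a teacher-written trace.
loss = sft_loss([0.5, 0.25])
perfect = sft_loss([1.0, 1.0])
```

Lowering this loss makes the student more likely to reproduce the teacher's reasoning traces token by token, which is how the reasoning format is inherited without running RL on the student.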

Results

R1's benchmark results are impressive. It is at near-parity with OpenAI's o1-1217 on AIME 2024 (79.8% vs 79.2%), MATH-500 (97.3% vs 96.4%), and Codeforces Elo (2,029 vs 2,061) (DeepSeek-AI, 2025). On the Massive Multitask Language Understanding (MMLU) benchmark, R1 scores 90.8%; on GPQA-Diamond (graduate science), it scores 71.5%, comfortably outperforming o1-mini (60.0%) and trailing o1-1217 (75.7%) by only a few points. These results suggest that RL-trained reasoning transfers beyond math and code into general domains.

The distilled versions of R1 also perform well on various benchmarks. For example, Qwen-Distill-32B scores 72.6% Pass@1 on AIME 2024 versus QwQ-32B-Preview's 50.0%. Qwen-Distill-7B reaches 92.8% on MATH-500 and 55.5% Pass@1 on AIME 2024. These are respectable scores on standard math, but they still trail the full R1 model on more difficult reasoning benchmarks (DeepSeek-AI, 2025).
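For readers unfamiliar with the metric, one common way to estimate Pass@1 is to sample several responses per question and average their correctness. The sketch below assumes that estimator; the function name and the correctness flags are made up for illustration and are not benchmark data.

```python
def pass_at_1(correct_flags_per_question):
    """Mean per-question fraction of sampled responses that were correct."""
    per_question = [sum(flags) / len(flags) for flags in correct_flags_per_question]
    return sum(per_question) / len(per_question)

# Hypothetical results: three questions, four sampled responses each.
results = [
    [True, True, False, True],
    [False, False, False, False],
    [True, True, True, True],
]
score = pass_at_1(results)
```

Averaging over several samples reduces the variance of judging a model on a single sampled response per question.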

Strengths

One of R1's strengths is its use of pure Reinforcement Learning as a core technical innovation to improve model performance. R1-Zero develops self-reflection, verification, and dynamic strategy adaptation with only rule-based correctness rewards and no supervised reasoning data (DeepSeek-AI, 2025). Instead of using human feedback to guide the AI, it uses objective, verifiable rewards. This allows the model to develop its own reasoning skills through trial and error.

The efficiency of GRPO should also be noted, as it addresses a very real hardware bottleneck in RL training. By removing the value network critic, arguably one of the most expensive parts of RL, GRPO substantially reduces the memory and compute overhead associated with policy optimization. It lets training focus on verifiable rewards by comparing a group of reasoning paths and identifying which of them lead to a correct answer.

DeepSeek AI's open-weight publishing democratizes the field of Artificial Intelligence by allowing anyone who can host and retrain the model to use it freely. Unlike closed-source models such as Opus 4.7, Sonnet 4.6, and GPT 5.5, which are only available via APIs, distilled models like Qwen-Distill-7B give practitioners direct visibility into and access to state-of-the-art reasoning models. This makes it easier for everyday practitioners to fine-tune and create bespoke models that can be added to a harness to achieve greater performance gains.

Limitations and Open Questions

The distillation gap is probably R1's most significant issue. Distilled models like Qwen-Distill-7B inherit reasoning formats from R1 and close the gap on standard math tests like MATH-500, but perform worse on more difficult benchmarks, such as AIME 2024. Distillation effectively preserves the reasoning format (e.g., Chain of Thought) and performs well on routine problems, but the smaller student appears to lack the capacity for sustained, multistep logic on more difficult problems.

Language mixing is another issue. RL with correctness rewards is effective for logic but does not by itself enforce linguistic consistency. Without an SFT cold start that anchors the model in a specific language, as in R1-Zero, the model tends to mix English and Chinese mid-thought (DeepSeek-AI, 2025). Human demonstrations remain important for readability and alignment here, even if they are not required for pure reasoning.

Using automatic "right or wrong" rewards also has a limit: it only works when the answer is unambiguous, like in math or coding. It becomes more difficult to use this method for tasks like creative writing or ethics, where there is no obvious correct answer.

R1 performs well in zero-shot settings but can perform worse with few-shot examples (DeepSeek-AI, 2025). One plausible explanation is that RL training optimizes for a particular reasoning format that human-provided examples disrupt, causing a distribution shift the model handles poorly at inference time.

Conclusion

DeepSeek-R1 moves AI research from a black box of proprietary models toward peer-reviewed science and open methodology by releasing its weights and detailing the training process in a peer-reviewed paper. The paper, which passed three rounds of review at Nature, presents evidence that reasoning can emerge from Group Relative Policy Optimization (GRPO) combined with verifiable rewards (DeepSeek-AI, 2025). The technique detailed in the paper produces a memory-efficient training loop without a critic, an important advancement for training reasoning models at the scale of hundreds of billions of parameters. R1-Zero in particular shows that reasoning behavior can be induced in LLMs without human-written reasoning demonstrations.

References

DeepSeek-AI. (2025). DeepSeek-R1 incentivizes LLMs' reasoning through reinforcement learning. Nature, 645(8081), 633–638. https://doi.org/10.1038/s41586-025-09422-z

DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.