Add Understanding DeepSeek R1

2025-02-10 00:15:15 +08:00 · 2025-02-10 00:15:15 +08:00 · 268b3952d8
commit 268b3952d8
parent 0ff370157a
1 changed files with 92 additions and 0 deletions
--- a/Understanding-DeepSeek-R1.md
+++ b/Understanding-DeepSeek-R1.md
@@ -0,0 +1,92 @@
 <br>DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible manner.<br>
 <br>What makes DeepSeek-R1 particularly exciting is its openness. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
 The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).<br>
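 <br>To put those prices in perspective, here is a small back-of-the-envelope sketch. The workload (1M input tokens and 1M output tokens) is a made-up example, and the helper name `cost_usd` is hypothetical; only the per-million prices come from the text above.<br>

```python
# Prices in USD per million tokens, as quoted in the paragraph above.
R1_INPUT_LOW, R1_INPUT_HIGH = 0.14, 0.55
R1_OUTPUT = 2.19
O1_INPUT, O1_OUTPUT = 15.00, 60.00


def cost_usd(input_tokens: int, output_tokens: int,
             input_price: float, output_price: float) -> float:
    """Cost of a workload given per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 1M input tokens and 1M output tokens.
# The upper R1 input price is used so the comparison is conservative.
r1_worst = cost_usd(1_000_000, 1_000_000, R1_INPUT_HIGH, R1_OUTPUT)
o1 = cost_usd(1_000_000, 1_000_000, O1_INPUT, O1_OUTPUT)

print(f"R1 (upper input price): ${r1_worst:.2f}")
print(f"o1: ${o1:.2f}")
print(f"o1 / R1 ratio: {o1 / r1_worst:.1f}x")
```

 <br>Even at R1's highest quoted input price, this example workload comes out more than an order of magnitude cheaper than on o1.<br>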
 <br>Until around GPT-4, the conventional wisdom was that better models required more data and compute. While that's still valid, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.<br>
 <br>The Essentials<br>
 <br>The DeepSeek-R1 paper presented multiple models, but the main ones among them were R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.<br>
 <br>DeepSeek-R1 uses two major ideas:<br>
 <br>1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
 2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.<br>
 <br>R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag, before answering with a final summary.<br>
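 <br>As a rough illustration of that output shape, the sketch below separates the reasoning from the final summary. It assumes the reasoning is wrapped in a single `<think>...</think>` block at the start of the response; the names `ParsedResponse` and `split_reasoning` are hypothetical helpers, not part of any DeepSeek tooling.<br>

```python
import re
from typing import NamedTuple


class ParsedResponse(NamedTuple):
    reasoning: str
    answer: str


# Assumes at most one <think>...</think> block at the start of the response.
_THINK_RE = re.compile(r"\s*<think>(.*?)</think>\s*(.*)", re.DOTALL)


def split_reasoning(text: str) -> ParsedResponse:
    """Separate the chain-of-thought from the final summary."""
    match = _THINK_RE.fullmatch(text)
    if match is None:
        # No think block found: treat the whole text as the answer.
        return ParsedResponse(reasoning="", answer=text.strip())
    return ParsedResponse(match.group(1).strip(), match.group(2).strip())


sample = "<think>2 + 2 is 4, times 3 is 12.</think>\nThe result is 12."
parsed = split_reasoning(sample)
print("reasoning:", parsed.reasoning)
print("answer:", parsed.answer)
```

 <br>Keeping the two parts separate is handy when you only want to show the summary to a user but still log the reasoning for inspection.<br>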
 <br>R1-Zero vs R1<br>
 <br>R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base without any supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
 R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.<br>
 <br>It is interesting how some languages may express certain ideas better, which leads the model to choose the most expressive language for the task.<br>
 <br>Training Pipeline<br>
 <br>The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they created such strong reasoning models, and what you can expect from each phase. This includes the problems that the resulting models from each phase have, and how they solved them in the next phase.<br>
 <br>It's interesting that their training pipeline differs from the usual:<br>
 <br>The usual training strategy: Pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
 R1-Zero: Pretrained → RL
 R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages<br>
 <br>1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point.
 2. First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
 3. Rejection Sampling + General Data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
 4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
 5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.<br>
 <br>They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.<br>
 <br>Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
 The teacher is typically a larger model than the student.<br>
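 <br>The data-generation half of distillation can be sketched in a few lines. Everything here is illustrative: `build_distillation_set`, the `keep` filter, and `toy_teacher` are hypothetical names, and the teacher is a stand-in function rather than a real model call.<br>

```python
from typing import Callable


def build_distillation_set(
    teacher: Callable[[str], str],
    prompts: list[str],
    keep: Callable[[str, str], bool] = lambda prompt, output: True,
) -> list[dict[str, str]]:
    """Collect (prompt, completion) pairs generated by the teacher."""
    dataset = []
    for prompt in prompts:
        completion = teacher(prompt)
        # Optional quality filter, e.g. drop malformed or wrong outputs.
        if keep(prompt, completion):
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset


# Toy stand-in for a large teacher model (hypothetical).
def toy_teacher(prompt: str) -> str:
    return f"<think>working on: {prompt}</think> done"


sft_data = build_distillation_set(toy_teacher, ["1+1?", "2+2?"])
print(len(sft_data), "training examples for the student")
```

 <br>The student is then fine-tuned on `sft_data` with ordinary supervised learning; that step is omitted here.<br>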
 <br>Group Relative Policy Optimization (GRPO)<br>
 <br>The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
 They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.<br>
 <br>In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
 Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.<br>
 <br>What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
 Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected `<think>`/`<answer>` format, and if the language of the answer matches that of the prompt.
 Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.<br>
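 <br>A minimal sketch of what such a rule-based reward could look like follows. It assumes a `<think>...</think><answer>...</answer>` template and exact-match answers; the weights (0.5 for format, 1.0 for accuracy) and the name `rule_based_reward` are arbitrary choices for illustration, not values from the paper. The language-consistency check is left out for brevity.<br>

```python
import re

# Assumed template: reasoning in <think> tags followed by an <answer> block.
_FORMAT_RE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)


def rule_based_reward(completion: str, reference: str) -> float:
    """Score a completion with simple, model-free rules."""
    match = _FORMAT_RE.fullmatch(completion)
    if match is None:
        return 0.0  # wrong format: no reward at all
    reward = 0.5  # format reward (weight is an arbitrary choice)
    if match.group(1).strip() == reference.strip():
        reward += 1.0  # accuracy reward (exact match, also an assumption)
    return reward


good = "<think>2 + 2 = 4</think><answer>4</answer>"
wrong = "<think>2 + 2 = 5</think><answer>5</answer>"
unformatted = "The answer is 4."

for text in (good, wrong, unformatted):
    print(rule_based_reward(text, "4"))
```

 <br>Because the function is just string checks, it costs almost nothing to evaluate compared with running a learned reward model.<br>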
 <br>GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:<br>
 <br>1. For each input prompt, the model generates several different responses.
 2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
 3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
 4. The model updates its strategy slightly to favor responses with higher relative advantages. It only makes slight adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its original behavior.<br>
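 <br>Steps 2 to 4 can be sketched numerically as below. This is a simplified illustration, not DeepSeek's implementation: it works on one scalar log-probability per response instead of per-token values, uses the population standard deviation for normalization (an assumption), and leaves out the KL penalty. The function names are hypothetical.<br>

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(logp_new: float, logp_old: float, advantage: float,
                 clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one response (KL penalty omitted)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum keeps the update pessimistic when the ratio drifts.
    return min(ratio * advantage, clipped_ratio * advantage)


# Rewards for four sampled responses to the same prompt (made-up values).
rewards = [1.5, 0.5, 0.0, 1.5]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.3f}")
```

 <br>Because the baseline is the group's own mean, no separate critic network is needed: responses that beat their siblings get a positive advantage, the rest get a negative one.<br>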
 <br>A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a bonus when the model correctly uses the `<think>` syntax, to guide the training.<br>
 <br>While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).<br>
 <br>For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
 Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.<br>
 <br>Is RL on LLMs the path to AGI?<br>
 <br>As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.<br>
 <br>These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.<br>
 <br>In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely present in the pretrained model.<br>
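 <br>A toy calculation makes the point concrete. Assume, purely for illustration, that samples are independent and that RL raises the per-sample success rate on some problem from 10% to 50%; both numbers and the helper `pass_at_k` are made up for this sketch and are not results from either paper.<br>

```python
def pass_at_k(p_correct: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** k


# Made-up per-sample success rates for one problem, before and after RL.
P_BASE, P_RL = 0.10, 0.50

for k in (1, 64):
    base, tuned = pass_at_k(P_BASE, k), pass_at_k(P_RL, k)
    print(f"k={k:>2}: base={base:.3f}  rl={tuned:.3f}  gap={tuned - base:.3f}")
```

 <br>With a single sample the gap between the two models is large, but with many samples it nearly vanishes: the base model could already find the answer, it just ranked it lower.<br>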
 <br>This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
 Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.<br>
 <br>It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!<br>
 <br>Running DeepSeek-R1<br>
 <br>I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.<br>
 <br>Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.<br>
 <br>I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
 The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.<br>
 <br>671B via Llama.cpp<br>
 <br>DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:<br>
 <br>29 layers seemed to be the sweet spot given this configuration.<br>
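 <br>A quick size estimate shows why only part of the model can live on the GPU. The sketch treats 1.58 bits per weight as a uniform nominal figure (the actual quantization mixes bit widths, so the real file size will differ somewhat) and assumes the 80 GB H100 variant; `quantized_size_gb` is a hypothetical helper.<br>

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


weights_gb = quantized_size_gb(671e9, 1.58)
H100_VRAM_GB = 80      # assumption: the 80 GB H100 variant
SYSTEM_RAM_GB = 214.7  # from the rented machine described above

print(f"~{weights_gb:.0f} GB of weights vs {H100_VRAM_GB} GB of VRAM")
print(f"fits in VRAM alone: {weights_gb <= H100_VRAM_GB}")
print(f"fits in VRAM + RAM: {weights_gb <= H100_VRAM_GB + SYSTEM_RAM_GB}")
```

 <br>The weights alone exceed the GPU's memory, so the remaining layers have to run from system RAM on the CPU, which is what partial offloading does.<br>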
 <br>Performance:<br>
 <br>An r/localllama user described that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
 Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.<br>
 <br>As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.<br>
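 <br>To get a feel for those rates, the sketch below converts them into wall-clock time for a single response. The 2,000-token response length is a hypothetical figure chosen for illustration; reasoning traces can be shorter or much longer.<br>

```python
def generation_minutes(n_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock minutes needed to generate n_tokens at a steady rate."""
    return n_tokens / tokens_per_second / 60


RESPONSE_TOKENS = 2000  # hypothetical length of one reasoning-heavy answer

for rate in (2.0, 3.5, 4.25):
    minutes = generation_minutes(RESPONSE_TOKENS, rate)
    print(f"{rate:>4} tok/s -> {minutes:.1f} min")
```

 <br>Waiting several minutes per answer is fine for an experiment, but it quickly gets in the way of interactive use.<br>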
 <br>What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than other models, but their usefulness is also usually higher.
 We need to both maximize usefulness and minimize time-to-usefulness.<br>
 <br>70B via Ollama<br>
 <br>70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:<br>
 <br>GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.<br>
 <br>Resources<br>
 <br>DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
 [2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
 DeepSeek R1 - Notion (a fully local "deep researcher" with DeepSeek-R1 - YouTube).
 DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
 The Illustrated DeepSeek-R1 - by Jay Alammar.
 Explainer: What's R1 & Everything Else? - Tim Kellogg.
 DeepSeek R1 Explained to your grandma - YouTube<br>
 <br>DeepSeek<br>
 <br>- Try R1 at chat.deepseek.com.
 GitHub - deepseek-ai/DeepSeek-R1.
 deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
 DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques.
 DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
 DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
 DeepSeek-Coder: When the Large Language Model Meets Programming-The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and use a fill-in-the-blank task to enhance code generation and infilling.
 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
 DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.<br>
 <br>Interesting events<br>
 <br>- Hong Kong University replicates R1 results (Jan 25, '25).
 - Hugging Face announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
 - OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1<br>