# Understanding DeepSeek-R1

DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.

What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cost-efficient, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

Until ~GPT-4, the common wisdom was that better models required more data and compute. While that's still valid, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.

## The Essentials

The DeepSeek-R1 paper presented multiple models, but the main ones are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 relies on two major ideas:

1. A multi-stage pipeline in which a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt, avoiding the need for a separate critic.

R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking inside a `<think>` tag before answering with a final summary.

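As a small illustration of what that looks like in practice, here is a sketch of separating the reasoning block from the final answer in an R1-style completion. The example completion is made up; only the `<think>` tag convention comes from the R1 models themselves.

```python
import re

# Example R1-style completion: reasoning inside <think> tags, followed by the final answer.
completion = (
    "<think>The user asks for 17 * 24. 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.</think>"
    "17 * 24 = 408."
)

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from an R1-style completion."""
    match = re.search(r"<think>(.*?)</think>(.*)", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()          # no reasoning block emitted
    return match.group(1).strip(), match.group(2).strip()

reasoning, answer = split_reasoning(completion)
print("Reasoning:", reasoning)
print("Answer:", answer)
```
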
## R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.

R1-Zero attains excellent accuracy but often produces confusing outputs, such as mixing several languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.

It is intriguing how some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.

## Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It shows how they produced such strong reasoning models and what you can expect from each stage, including the problems the resulting model from each stage has and how they fixed them in the next stage.

It's interesting that their training pipeline differs from the usual one:

- The typical training approach: pretraining on a big dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
- R1-Zero: pretrained → RL
- R1: pretrained → multi-stage training pipeline with several SFT and RL stages

1. Cold-Start Fine-Tuning: fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from.
2. First RL Stage: apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). Once they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.
3. Rejection Sampling + general data: create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
4. Second Fine-Tuning: fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
5. Second RL Stage: add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.

They also performed model distillation for several Qwen and Llama models on the reasoning traces to obtain the distilled-R1 models.

Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model. The teacher is typically a larger model than the student.

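As a rough sketch of what generating distillation data can look like in practice, here is a minimal example. This is not DeepSeek's pipeline: it assumes the OpenAI-compatible DeepSeek API with the `deepseek-reasoner` model id as the teacher, and it simply dumps prompt/completion pairs to a JSONL file for later SFT of a smaller student model.

```python
import json
from openai import OpenAI

# Teacher: DeepSeek-R1 behind an OpenAI-compatible API (assumed endpoint and model id;
# swap in whatever teacher you actually use).
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

prompts = [
    "Prove that the sum of two even numbers is even.",
    "What is the derivative of x**3 * sin(x)?",
]

with open("distill_sft.jsonl", "w") as f:
    for prompt in prompts:
        response = client.chat.completions.create(
            model="deepseek-reasoner",  # assumed to be R1 on the DeepSeek API
            messages=[{"role": "user", "content": prompt}],
        )
        completion = response.choices[0].message.content
        # Each line becomes one SFT training example for the student model.
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```
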
## Group Relative Policy Optimization (GRPO)

The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers. They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.

What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected `<think>`/`<answer>` format, and if the language of the answer matches that of the prompt. Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.

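As a rough illustration of what such rule-based rewards can look like, here is a toy sketch. It is my own example, not DeepSeek's reward code; the thresholds and the crude "language" check are arbitrary.

```python
import re

def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: correctness + format + language consistency.

    Illustrative sketch only, not DeepSeek's actual reward implementation.
    """
    reward = 0.0

    # 1. Accuracy: compare the text after the reasoning block with the reference answer.
    match = re.search(r"</think>(.*)", completion, flags=re.DOTALL)
    final_answer = (match.group(1) if match else completion).strip()
    if reference_answer in final_answer:
        reward += 1.0

    # 2. Format: the completion should contain exactly one <think>...</think> block.
    if len(re.findall(r"<think>.*?</think>", completion, flags=re.DOTALL)) == 1:
        reward += 0.5

    # 3. Language consistency: crude check that prompt and completion use the same script
    #    (here simply: both mostly ASCII, or both mostly non-ASCII).
    def mostly_ascii(text: str) -> bool:
        return sum(ch.isascii() for ch in text) >= 0.8 * max(len(text), 1)

    if mostly_ascii(prompt) == mostly_ascii(completion):
        reward += 0.5

    return reward

print(rule_based_reward("What is 2 + 2?", "<think>2 + 2 = 4</think>The answer is 4.", "4"))
```
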
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates a group of different responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is than the others.
4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its initial behavior (steps 3 and 4 are sketched in code below).

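Here is a stripped-down NumPy sketch of those two steps: rewards are normalized within each group of responses to the same prompt, and the resulting advantages weight a clipped policy-gradient objective. This mirrors the GRPO objective from the DeepSeekMath paper only in spirit; it omits per-token credit assignment, the KL penalty estimator, and batching, so treat it as an illustration rather than a reference implementation.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize scalar rewards within a group of responses to the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_policy_objective(logp_new: np.ndarray,
                             logp_old: np.ndarray,
                             advantages: np.ndarray,
                             clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate, averaged over the group (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)                       # likelihood ratio per response
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))

# Toy example: 4 sampled responses to one prompt, scored by a rule-based reward.
rewards = np.array([1.0, 0.0, 0.5, 0.0])
adv = group_relative_advantages(rewards)

# Log-probabilities of each full response under the old and updated policies (made up).
logp_old = np.array([-42.0, -40.0, -41.0, -39.5])
logp_new = np.array([-41.5, -40.2, -40.8, -39.9])

print("advantages:", adv)
print("objective:", clipped_policy_objective(logp_new, logp_old, adv))
```
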
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the `<think>` syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).

For those aiming to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource (a short example follows below). Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.

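For instance, a minimal GRPO fine-tune with TRL looks roughly like the snippet below. It is based on the TRL documentation at the time of writing; class and argument names can shift between versions, the dataset and model are just cheap placeholders, and the format-checking reward is my own illustration, not DeepSeek's.

```python
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Any prompt dataset with a "prompt" column works; this one comes from the TRL examples.
dataset = load_dataset("trl-lib/tldr", split="train")

def format_reward(completions, **kwargs):
    """Reward completions that wrap their reasoning in a single <think>...</think> block.

    Assumes plain-text completions (non-conversational dataset).
    """
    return [
        1.0 if len(re.findall(r"<think>.*?</think>", c, flags=re.DOTALL)) == 1 else 0.0
        for c in completions
    ]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # small model so the example stays cheap
    reward_funcs=format_reward,
    args=GRPOConfig(output_dir="qwen-grpo-demo", logging_steps=10),
    train_dataset=dataset,
)
trainer.train()
```
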
## Is RL on LLMs the path to AGI?

As a final note on explaining DeepSeek-R1 and the approaches they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

> These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.

In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the number of correct answers) is largely already present in the pretrained model.

This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities. Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there seems to be a fundamental ceiling determined by the underlying model's pretrained knowledge.

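One way to make that claim measurable (my own illustration, not from the paper) is to compare first-sample accuracy with pass@k over independently sampled answers: if a problem is solved within k samples from the base model, the capability was already there, and RL mainly raises the probability of surfacing it on the first try. Below is the standard unbiased pass@k estimator with made-up numbers.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples of which c are correct."""
    if n - c < k:
        return 1.0          # too few incorrect samples to fill a k-subset
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up numbers: the base model gets 3 of 16 samples right on some problem.
print(pass_at_k(n=16, c=3, k=1))   # ~0.19 -> unlikely to be right on the first sample
print(pass_at_k(n=16, c=3, k=8))   # 0.9   -> very likely right within 8 samples
```
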
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!

## Running DeepSeek-R1

I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's abilities.

### 671B via llama.cpp

DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

29 layers seemed to be the sweet spot given this configuration.

Performance:

A r/localllama user described being able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.

Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher. We need to both maximize usefulness and minimize time-to-usefulness.

### 70B via Ollama

70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:

GPU usage shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.

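For reference, once the model has been pulled with Ollama, you can also query it programmatically through Ollama's local HTTP API rather than the CLI. A minimal sketch, assuming Ollama's default port and the `deepseek-r1:70b` tag:

```python
import requests

# Ollama serves a local HTTP API on port 11434 by default.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:70b",   # assumes this tag has already been pulled
        "prompt": "How many times does the letter r appear in 'strawberry'?",
        "stream": False,              # return one JSON object instead of a stream
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])        # the model's reply (reasoning plus final answer)
```
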
## Resources

- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- [2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 - Notion (a fully local "deep researcher" with DeepSeek-R1 - YouTube)
- DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
- The Illustrated DeepSeek-R1 - by Jay Alammar
- Explainer: What's R1 & Everything Else? - Tim Kellogg
- DeepSeek R1 Explained to your grandma - YouTube

## DeepSeek

- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.

- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming-The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to improve code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.

## Interesting events

- Hong Kong University replicates R1 results (Jan 25, '25).
- Hugging Face announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1, completely open source (Jan 25, '25).
- An OpenAI researcher confirms the DeepSeek team independently discovered and used some of the core ideas the OpenAI team used on the way to o1.

Liked this post? Join the newsletter.