diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md
new file mode 100644
index 0000000..f75bf97
--- /dev/null
+++ b/Understanding-DeepSeek-R1.md
@@ -0,0 +1,92 @@
+
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, it also ships with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
+
What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less-open approaches of some industry leaders, DeepSeek has published a detailed training methodology in their paper.
The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~GPT-4, the common wisdom was that better models required more data and compute. While that still holds, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper introduced several models, but the main ones are R1 and R1-Zero. Alongside these is a series of distilled models that, while interesting, I won't discuss here.
+
DeepSeek-R1 relies on two key ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning approach that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking inside a `<think>` tag before answering with a final summary.
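To make the format concrete, here is a minimal sketch of how you might split such an output into its reasoning trace and final answer. The helper and the sample completion are my own illustration, not DeepSeek's code; I'm assuming the reasoning is wrapped in `<think>...</think>` tags as described in the paper.

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Split an R1-style completion into (reasoning, answer).

    Assumes the chain-of-thought is wrapped in <think>...</think> and
    everything after the closing tag is the final summary.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        return "", output.strip()          # no reasoning block found
    reasoning = match.group(1).strip()
    answer = output[match.end():].strip()  # text after </think>
    return reasoning, answer

# Toy example (illustrative string, not a real model completion):
completion = "<think>2 + 2 is 4, double it to get 8.</think>The answer is 8."
reasoning, answer = split_reasoning(completion)
print(reasoning)  # -> 2 + 2 is 4, double it to get 8.
print(answer)     # -> The answer is 8.
```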
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
R1-Zero achieves excellent accuracy but often produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by adding limited supervised fine-tuning and several RL passes, which improves both accuracy and readability.
+
It is interesting how some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems the model resulting from each stage has, and how they addressed them in the next stage.
+
It's fascinating that their training pipeline differs from the usual one:
+
The usual training approach: pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
R1-Zero: pretrained → RL
R1: pretrained → multistage training pipeline with several SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from.
First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general abilities, e.g., poor formatting and language mixing.
Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
They also did model distillation for several Qwen and Llama models on the reasoning traces to get the distilled-R1 models.
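As a rough mental model, the whole pipeline can be condensed into a few lines of pseudocode. The function names below are placeholders I invented for illustration; they don't correspond to any released DeepSeek training code.

```python
def sft(model, dataset):
    """Supervised fine-tuning pass (placeholder stub)."""
    ...

def grpo_rl(model, reward_signals):
    """RL stage using GRPO with the given reward signals (placeholder stub)."""
    ...

def rejection_sample(model, prompts):
    """Generate many candidates per prompt and keep only the good ones (stub)."""
    ...

model = "DeepSeek-V3-Base"  # pretrained base model

# Stage 1: cold-start SFT on a few thousand curated CoT samples.
sft(model, dataset="cold_start_cot")

# Stage 2: first RL stage with rule-based rewards; yields a strong reasoner
# that still has weak general abilities (formatting, language mixing).
grpo_rl(model, reward_signals=["accuracy", "format", "language_consistency"])

# Stage 3: rejection sampling on the RL checkpoint (~600k reasoning samples),
# combined with ~200k general supervised samples.
sft_data = rejection_sample(model, prompts="reasoning_prompts")

# Stage 4: second SFT pass on the combined ~800k samples.
sft(model, dataset=sft_data)

# Stage 5: second RL stage adding helpfulness/harmlessness rewards -> DeepSeek-R1.
grpo_rl(model, reward_signals=["accuracy", "format", "helpfulness", "harmlessness"])
```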
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student.
The teacher is typically a larger model than the student.
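Here is a hedged sketch of the data-generation half of distillation using the Hugging Face transformers API. The model identifier, prompts, and generation settings are placeholders of my own; in practice a 671B teacher would be served behind a proper inference stack rather than loaded like this.

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: any strong reasoning model can act as the teacher.
TEACHER = "deepseek-ai/DeepSeek-R1"  # assumed identifier for illustration
prompts = ["What is 17 * 24?", "Prove that sqrt(2) is irrational."]

tokenizer = AutoTokenizer.from_pretrained(TEACHER)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER, device_map="auto")

# Generate reasoning traces from the teacher and store them as SFT data.
with open("distill_sft.jsonl", "w") as f:
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(teacher.device)
        output_ids = teacher.generate(**inputs, max_new_tokens=1024)
        completion = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        # Each line becomes one supervised example for the student.
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")

# The student (e.g., a Qwen or Llama base model) is then fine-tuned with
# plain SFT on distill_sft.jsonl; no RL is needed for the distilled models.
```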
+
Group Relative Policy Optimization (GRPO)
+
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 relies on simple criteria: it might give a higher reward if the answer is correct, if it follows the expected `<think>`/`<answer>` format, and if the language of the answer matches that of the prompt.
Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
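DeepSeek hasn't released the reward code, so the following is only a minimal sketch of what such rule-based reward functions could look like; the exact checks, tags, and weights are my assumptions.

```python
import re

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """Reward 1.0 if the final answer contains the reference answer, else 0.0.

    Assumes the answer is whatever follows the closing </think> tag.
    """
    answer = completion.split("</think>")[-1].strip()
    return 1.0 if reference_answer in answer else 0.0

def format_reward(completion: str) -> float:
    """Reward a well-formed <think>...</think> block followed by an answer."""
    return 1.0 if re.match(r"(?s)\s*<think>.*?</think>\s*\S", completion) else 0.0

def language_consistency_reward(completion: str, prompt: str) -> float:
    """Crude proxy: penalize CJK characters when the prompt is ASCII-only."""
    if prompt.isascii() and re.search(r"[\u4e00-\u9fff]", completion):
        return 0.0
    return 1.0

def total_reward(completion: str, prompt: str, reference_answer: str) -> float:
    # A simple unweighted sum; the real weighting is an open detail.
    return (accuracy_reward(completion, reference_answer)
            + format_reward(completion)
            + language_consistency_reward(completion, prompt))
```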
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates a group of different responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its original behavior.
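The group-relative advantage step is simple enough to show directly. The sketch below standardizes rewards within one group of sampled responses and applies a PPO-style clipped surrogate with a KL penalty; the hyperparameters and toy numbers are illustrative, not DeepSeek's, and a real implementation works on per-token log-probabilities.

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style advantage: standardize each reward within its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_objective(logp_new, logp_old, advantages, kl, eps=0.2, beta=0.04):
    """Clipped surrogate objective with a KL penalty (to be maximized).

    logp_new / logp_old: log-probabilities of each sampled response under the
    current policy and the policy that generated it; kl: per-response estimate
    of KL(current || reference). eps and beta are illustrative values.
    """
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    return float(np.mean(surrogate - beta * kl))

# Toy example: four sampled responses to one prompt, scored by rule-based rewards.
rewards = [2.0, 0.0, 1.0, 3.0]
adv = group_advantages(rewards)  # responses above the group mean get a positive advantage
print(adv)

print(grpo_objective(
    logp_new=np.array([-1.0, -2.0, -1.5, -0.8]),
    logp_old=np.array([-1.1, -1.9, -1.5, -1.0]),
    advantages=adv,
    kl=np.array([0.01, 0.02, 0.015, 0.01]),
))
```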
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a reward when the model correctly uses the `<think>` syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the methodologies they have presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
Simply put, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the number of correct answers) is largely already present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
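One way to make this concrete is the standard pass@k estimator: if the base model already solves a problem in some fraction of its samples, its pass@k for large k can be high even when its pass@1 is poor, and RL mostly shifts probability mass onto answers that were already reachable. The counts below are made up purely for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: the base model gets 8 of 100 samples right,
# the RL-tuned model gets 60 of 100 right.
for name, correct in [("base", 8), ("after RL", 60)]:
    print(name,
          "pass@1 =", round(pass_at_k(100, correct, 1), 3),
          "pass@64 =", round(pass_at_k(100, correct, 64), 3))

# The base model's pass@64 is already close to 1.0: the correct answer was in
# its distribution all along; RL mainly raises the chance of sampling it first.
```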
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I have used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively evaluate the model's capabilities.
+
671B via Llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
+
29 layers appeared to be the sweet spot given this setup.
+
Performance:
+
A r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also generally higher.
We need to both maximize usefulness and minimize time-to-usefulness.
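As a rough back-of-the-envelope illustration of why time-to-usefulness matters for reasoning models (the token counts and speeds below are assumptions, not measurements):

```python
# At roughly 4 tokens/second, a model that "thinks" for 2,000 tokens before
# answering keeps you waiting over eight minutes before the useful part starts.
thinking_tokens = 2_000
for tok_per_s in (4, 25, 60):
    wait_minutes = thinking_tokens / tok_per_s / 60
    print(f"{tok_per_s:>3} tok/s -> {wait_minutes:.1f} min of thinking "
          "before the final answer begins")
```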
+
70B through Ollama
+
70.6B params, 4-bit KM-quantized DeepSeek-R1 running via Ollama:
+
GPU usage soars here, as expected when compared to the mostly CPU-powered run of the 671B model showcased above.
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
The Illustrated DeepSeek-R1 - by Jay Alammar
Explainer: What's R1 & Everything Else? - Tim Kellogg
DeepSeek R1 Explained to your grandma - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source settings. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and use a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University reproduces R1 results (Jan 25, '25).
- Hugging Face announces huggingface/open-r1, a fully open reproduction of DeepSeek-R1 (Jan 25, '25).
- An OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file