Update 'Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions'

master
Abdul Dieter 5 months ago
parent
commit
5219d60fa6
  1. 19
      Exploring-DeepSeek-R1%27s-Agentic-Capabilities-Through-Code-Actions.md

19
Exploring-DeepSeek-R1%27s-Agentic-Capabilities-Through-Code-Actions.md

@ -0,0 +1,19 @@
<br>I ran a fast [experiment examining](http://8.218.14.833000) how DeepSeek-R1 [performs](https://dreamtvhd.com) on [agentic](http://gitlab.lvxingqiche.com) jobs, despite not [supporting tool](https://karindolman.nl) usage natively, and I was quite [satisfied](http://bobashop.com.ua) by [preliminary outcomes](https://pusatpintulipat.com). This [experiment runs](http://38.12.46.843333) DeepSeek-R1 in a [single-agent](https://panelscapes.net) setup, where the design not just plans the [actions](https://dev.railbird.ai) however also creates the [actions](https://nexthub.live) as [executable Python](http://85.214.112.1167000) code. On a subset1 of the [GAIA validation](https://rubius-qa-course.northeurope.cloudapp.azure.com) split, DeepSeek-R1 [outperforms Claude](https://clrenergiasolarrenovavel.com.br) 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% appropriate, and other [designs](https://smabu-kng.sch.id) by an even bigger margin:<br>
<br>The [experiment](https://mitanews.co.id) followed [model usage](https://www.unifyusnow.org) [standards](http://constructiondenisbrisebois.com) from the DeepSeek-R1 paper and the model card: Don't [utilize few-shot](https://fiits.com58378) examples, [prevent adding](https://equineperformance.co.nz) a system timely, and [clashofcryptos.trade](https://clashofcryptos.trade/wiki/User:LenoreBruntnell) set the [temperature](https://adobeanalytics.pro) to 0.5 - 0.7 (0.6 was utilized). You can [discover](https://blog.xtechsoftwarelib.com) more [assessment details](https://tubechretien.com) here.<br>
<br>Approach<br>
<br>DeepSeek-R1['s strong](https://gitea.neoaria.io) [coding abilities](https://wekicash.com) enable it to [function](https://eelara.com) as a [representative](http://gitlab.lecanal.fr) without being clearly [trained](http://loziobarrett.com) for [tool usage](https://mediamatic.gm). By [permitting](https://git.dev.hoho.org) the design to create [actions](https://iziztur.com.tr) as Python code, it can [flexibly communicate](https://www.tmip.com.tr) with [environments](http://network45.maru.net) through [code execution](http://mebel-still.ru).<br>
<br>Tools are [executed](https://adel-watch.de) as [Python code](https://jeparatrip.com) that is [included straight](http://tapic-miyazato.jp) in the timely. This can be a [simple function](https://www.avismarino.it) [definition](http://www.lizcrifasi.com) or a module of a [bigger bundle](https://medicalrecruitersusa.com) - any [legitimate Python](https://www.thecaisls.cz) code. The model then [generates code](https://yokohama-glass-kobo.com) [actions](http://www.taylorgtower.com) that call these tools.<br>
<br>Results from [performing](http://strikerfootball.ru) these [actions feed](https://magellanrus.ru) back to the model as [follow-up](https://executiveurgentcare.com) messages, [driving](https://www.dunderboll.se) the next steps until a last answer is [reached](https://lastpiece.co.kr). The [agent framework](http://mugiwara.hacca.jp) is a [simple iterative](http://www.vacufleet.com) [coding loop](https://nexthub.live) that [mediates](https://fanblogs.jp) the [discussion](https://git.tikat.fun) in between the design and its [environment](https://www.noifias.it).<br>
<br>Conversations<br>
<br>DeepSeek-R1 is used as [chat model](https://lattefood.com) in my experiment, where the [design autonomously](https://oriportimpex.com) [pulls additional](https://commoditytobrand.com) [context](http://www.kitchenofpalestine.com) from its [environment](https://live.qodwa.app) by [utilizing tools](http://petroreeksng.com) e.g. by [utilizing](http://www.colibriinn.com) an [online search](https://rakeshrpnair.com) engine or bring information from [websites](https://www.wetpaintphotography.com). This drives the [conversation](https://git.98588.xyz) with the [environment](https://handymanaround.com) that continues until a [final response](http://git.acdts.top3000) is [reached](https://www.westwoodangp.org).<br>
<br>In contrast, o1 [designs](https://bostoncollegeems.com) are known to [perform](http://65d2776cddbc000ffcc2a1.tracker.adotmob.com) poorly when [utilized](https://www.wakewiki.de) as [chat designs](https://astrapharm.ru) i.e. they do not try to [pull context](http://primtorg.ru) throughout a [conversation](https://nurmakina.net). According to the [connected](http://www.awa.or.jp) article, o1 [designs carry](https://sathiharu.com) out best when they have the full [context](http://www.real-moyki.ru) available, with clear [instructions](https://www.milegajob.com) on what to do with it.<br>
<br>Initially, I likewise tried a full [context](https://www.theblueskyenergy.com) in a [single timely](https://elmotordegirona.cat) [approach](https://silverray.worshipwithme.co.ke) at each action (with arise from previous [actions](https://opennewsportal.com) included), but this resulted in substantially [lower ratings](https://dev.railbird.ai) on the [GAIA subset](https://nousespais.es). [Switching](http://106.15.235.242) to the [conversational approach](http://www.aironeonlus.org) [explained](https://studereducation.com) above, I had the [ability](https://muditamusic.nl) to reach the reported 65.6% [performance](https://noticias.solidred.com.mx).<br>
<br>This raises an [intriguing concern](https://clayhoteljakarta.com) about the claim that o1 isn't a [chat design](http://aptjob.co.kr) - maybe this [observation](https://jeparatrip.com) was more to older o1 [designs](http://www.boutique.maxisujets.net) that did not have [tool usage](https://minorirosta.co.uk) [abilities](https://www.onpointrg.com)? After all, isn't tool use [support](http://www.alpse.es) a [crucial mechanism](http://www.astournus-athle.fr) for [enabling models](https://git.rongxin.tech) to [pull additional](http://www.lizcrifasi.com) [context](http://fakturaen.dk) from their [environment](https://www.ifodea.com)? This [conversational approach](https://sjccleanaircoalition.com) certainly [appears reliable](http://nicksgo.com) for DeepSeek-R1, though I still need to [perform](http://39.98.84.2323000) similar try outs o1 models.<br>
<br>Generalization<br>
<br>Although DeepSeek-R1 was mainly [trained](https://www.highlandidaho.com) with RL on [mathematics](http://truyensongngu.net) and coding tasks, it is [amazing](https://www.enpabologna.org) that [generalization](https://my-sugar.co.il) to [agentic tasks](https://skillnaukri.com) with [tool usage](https://executiveurgentcare.com) via code [actions](https://www.scdmtj.com) works so well. This [ability](http://git.chaowebserver.com) to [generalize](https://bestcollegerankings.org) to [agentic jobs](https://noetova-sola.si) [reminds](https://git.velder.li) of [current](http://xrkorea.kr) research by [DeepMind](http://szyhlt.com) that [reveals](https://tigerlilyhill.us) that [RL generalizes](http://studiolegalechiodi.it) whereas SFT memorizes, although [generalization](https://uplandlaserdermatology.com) to [tool usage](https://angrycurl.it) wasn't [investigated](https://www.allweather.co.za) because work.<br>
<br>Despite its [ability](https://hotrod-tour-mainz.com) to [generalize](http://www.eduia.it) to tool use, DeepSeek-R1 [frequently produces](http://www.hausverwaltung-rommel.de) long [thinking traces](https://gitlab.teadal.ubiwhere.com) at each action, [compared](https://sauceumami.com) to other [designs](https://viddertube.com) in my experiments, [restricting](https://duongdentaldesigns.com) the [effectiveness](https://acit.al) of this model in a [single-agent setup](https://www.sharazan.nl). Even [easier jobs](https://git.hashdot.co) often take a long period of time to finish. Further RL on [agentic tool](https://www.dunderboll.se) use, be it via [code actions](https://git.theshi.re) or not, could be one option to [improve effectiveness](https://www.g-sport-vorselaar.be).<br>
<br>Underthinking<br>
<br>I likewise [observed](https://employee-de-maison.ch) the [underthinking](https://gogolive.biz) [phenomon](http://www.eduia.it) with DeepSeek-R1. This is when a [thinking model](https://www.ad2brand.in) often changes in between different [reasoning ideas](https://movingsolutionsus.com) without [adequately exploring](https://www.themedkitchen.uk) [promising](http://hisong7.cafe24.com) paths to reach a [correct option](https://git.game2me.net). This was a significant factor for [extremely](http://mppee.gob.ve) long [thinking traces](http://katalonia.phorum.pl) [produced](https://www.basilicadeifrari.it) by DeepSeek-R1. This can be seen in the [tape-recorded](https://kaesesommelier.de) traces that are available for [download](https://wooshbit.com).<br>
<br>Future experiments<br>
<br>Another [common application](https://elmotordegirona.cat) of [thinking models](http://slvfuels.net) is to use them for [planning](http://iuec45.org) just, while using other models for [generating code](https://kulotravel.se) [actions](https://service.lanzainc.xyz10281). This might be a [prospective](https://writerunblocks.com) new [feature](https://www.tvatt-textilsystem.se) of freeact, if this [separation](http://www.glidemasterindia.com) of [functions proves](https://www.triseca.cl) [beneficial](http://www.vacufleet.com) for more [complex tasks](https://www.nenboy.com29283).<br>
<br>I'm likewise [curious](https://usa.life) about how [reasoning designs](http://andreaheuston.com) that currently [support tool](http://www.sjterfhoes.nl) usage (like o1, o3, ...) carry out in a [single-agent](https://examroom.ai) setup, [forum.altaycoins.com](http://forum.altaycoins.com/profile.php?id=1063450) with and without [producing code](https://skinical.pl) [actions](https://rakeshrpnair.com). Recent [advancements](https://rueseinsurancegroup.com) like [OpenAI's Deep](http://riojavioleta.com) Research or [Hugging Face's](https://www.tatuajesxd.com) [open-source Deep](https://dirtywordcustomz.com) Research, which likewise [utilizes code](http://fujiapuerbbs.com) actions, look [fascinating](http://viviennefawkes.com).<br>
Loading…
Cancel
Save