kimmyseltzer

2 Exploring DeepSeek R1's Agentic Capabilities Through Code Actions

I ran a quick experiment investigating how DeepSeek-R1 performs on agentic tasks, in spite of not supporting tool use natively, and I was rather amazed by initial outcomes. This runs DeepSeek-R1 in a single-agent setup, where the design not just prepares the actions but likewise creates the actions as executable Python code. On a subset1 of the GAIA recognition split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5% outright, from 53.1% to 65.6% right, and prawattasao.awardspace.info other designs by an even bigger margin:

The experiment followed model usage guidelines from the DeepSeek-R1 paper and the design card: Don't utilize few-shot examples, avoid adding a system prompt, and set the temperature level to 0.5 - 0.7 (0.6 was utilized). You can discover further evaluation details here.

Approach

DeepSeek-R1's strong coding abilities enable it to serve as an agent without being explicitly trained for tool use. By allowing the design to generate actions as Python code, it can flexibly interact with environments through code execution.

Tools are implemented as Python code that is consisted of straight in the timely. This can be a basic function meaning or a module of a bigger package - any legitimate Python code. The model then generates code actions that call these tools.

Arise from executing these actions feed back to the model as follow-up messages, driving the next actions until a final answer is reached. The agent framework is a basic iterative coding loop that moderates the discussion in between the design and its environment.

Conversations

DeepSeek-R1 is used as chat design in my experiment, botdb.win where the model autonomously pulls additional context from its environment by utilizing tools e.g. by utilizing a search engine or fetching data from websites. This drives the conversation with the environment that continues up until a last answer is reached.

In contrast, o1 designs are known to perform badly when used as chat designs i.e. they don't attempt to pull context during a conversation. According to the connected short article, o1 designs carry out best when they have the complete context available, with clear instructions on what to do with it.

Initially, I also attempted a complete context in a single prompt technique at each action (with results from previous steps consisted of), however this caused significantly lower scores on the GAIA subset. Switching to the conversational technique explained above, funsilo.date I was able to reach the reported 65.6% performance.

This raises an interesting question about the claim that o1 isn't a chat design - maybe this observation was more appropriate to older o1 designs that did not have tool usage capabilities? After all, isn't tool usage support a crucial system for making it possible for models to pull additional context from their environment? This conversational approach certainly seems effective for DeepSeek-R1, though I still need to carry out comparable explores o1 designs.

Generalization

Although DeepSeek-R1 was mainly trained with RL on mathematics and coding tasks, it is impressive that generalization to agentic jobs with tool usage via code actions works so well. This capability to generalize to agentic jobs reminds of recent research study by DeepMind that reveals that RL generalizes whereas SFT memorizes, although generalization to tool usage wasn't investigated because work.

Despite its capability to generalize to tool use, DeepSeek-R1 frequently produces really long thinking traces at each step, compared to other designs in my experiments, limiting the usefulness of this design in a single-agent setup. Even simpler jobs in some cases take a very long time to complete. Further RL on agentic tool usage, be it through code actions or not, could be one choice to enhance efficiency.

Underthinking

I likewise observed the underthinking phenomon with DeepSeek-R1. This is when a reasoning design often switches in between different reasoning thoughts without sufficiently checking out promising paths to reach an appropriate solution. This was a significant factor for extremely long thinking traces produced by DeepSeek-R1. This can be seen in the tape-recorded traces that are available for download.

Future experiments

Another typical application of thinking models is to utilize them for planning just, while using other models for generating code actions. This might be a potential brand-new function of freeact, if this separation of functions proves helpful for more complex jobs.

I'm also curious about how thinking designs that already support tool use (like o1, o3, ...) carry out in a single-agent setup, with and without producing code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which likewise uses code actions, look fascinating.