Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions
I ran a quick experiment investigating how DeepSeek-R1 performs on agentic tasks, despite not supporting tool use natively, and I was quite pleased by the initial results. The experiment runs DeepSeek-R1 in a single-agent setup, where the model not only plans the actions but also formulates them as executable Python code. On a subset¹ of the GAIA validation split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% correct, and other models by an even larger margin:
The experiment followed the model usage recommendations from the DeepSeek-R1 paper and the model card: don't use few-shot examples, avoid adding a system prompt, and set the temperature to 0.5 - 0.7 (0.6 was used). You can find further evaluation details here.
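To illustrate these settings, here is a minimal sketch of a single model call, assuming an OpenAI-compatible endpoint serving DeepSeek-R1; the endpoint URL and model name are placeholders, not the actual setup used in the experiment.

```python
from openai import OpenAI

# Hypothetical OpenAI-compatible endpoint serving DeepSeek-R1 (placeholder URL and model name).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-r1",
    # No system prompt and no few-shot examples, per the usage recommendations.
    messages=[{"role": "user", "content": "Task description goes here ..."}],
    temperature=0.6,  # recommended range is 0.5 - 0.7
)
print(response.choices[0].message.content)
```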
Approach
DeepSeek-R1's strong coding capabilities enable it to act as an agent without being explicitly trained for tool use. By letting the model generate actions as Python code, it can flexibly interact with environments through code execution.
Tools are implemented as Python code that is included directly in the prompt. This can be a simple function definition or a module of a larger package - any valid Python code. The model then generates code actions that call these tools.
Results from executing these actions are fed back to the model as follow-up messages, driving the next steps until a final answer is reached. The agent framework is a simple iterative coding loop that mediates the conversation between the model and its environment, as sketched below.
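Here is a minimal sketch of such a loop. The tool source, prompt wording, and the `chat_model` and `execute_python` callables are illustrative assumptions, not the actual freeact implementation.

```python
import re

# Illustrative tool: its source is embedded in the prompt so the model can call it in code actions.
TOOL_SOURCE = '''
def search_web(query: str) -> list[str]:
    """Return a list of result snippets for the given query."""
    ...
'''

FENCE = "`" * 3  # avoids writing a literal code fence inside this snippet

def extract_code(reply: str) -> str | None:
    """Extract the first fenced Python code block from the model reply, if any."""
    match = re.search(FENCE + r"python\n(.*?)" + FENCE, reply, re.DOTALL)
    return match.group(1) if match else None

def run_agent(task: str, chat_model, execute_python, max_steps: int = 10) -> str:
    """Iterative coding loop: the model emits code actions, execution results are fed back."""
    messages = [{"role": "user", "content": f"{TOOL_SOURCE}\n\nTask: {task}\n"
                 "Respond with a fenced Python code block to act, or with FINAL ANSWER: ... when done."}]
    for _ in range(max_steps):
        reply = chat_model(messages)                      # one model turn
        messages.append({"role": "assistant", "content": reply})
        if "FINAL ANSWER:" in reply:
            return reply.split("FINAL ANSWER:", 1)[1].strip()
        code = extract_code(reply)
        if code is None:
            feedback = "No code action found. Emit a Python code block or a final answer."
        else:
            feedback = execute_python(code)               # sandboxed execution of the code action
        messages.append({"role": "user", "content": feedback})  # results drive the next step
    return "No answer within step budget."
```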
Conversations
DeepSeek-R1 is used as a chat model in my experiment, where the model autonomously pulls additional context from its environment by using tools, e.g. by querying a search engine or fetching information from web pages. This drives the conversation with the environment, which continues until a final answer is reached.
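For illustration, a code action emitted by the model might look like the following; the `search_web` helper is a hypothetical tool assumed to be defined in the prompt.

```python
# Hypothetical code action emitted by the model: pull extra context via a tool defined in the prompt.
snippets = search_web("GAIA benchmark validation split size")
for snippet in snippets[:3]:
    print(snippet)  # printed output is fed back to the model as a follow-up message
```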
In contrast, o1 models are known to perform poorly when used as chat models, i.e. they do not attempt to pull context during a conversation. According to the linked article, o1 models perform best when they have the full context available, with clear instructions on what to do with it.
Initially, I also tried a full-context-in-a-single-prompt approach at each step (with results from previous actions included), but this led to significantly lower scores on the GAIA subset. Switching to the conversational approach described above, I was able to reach the reported 65.6% performance.
This raises an interesting question about the claim that o1 isn't a chat model - perhaps this observation was more relevant to older o1 models that lacked tool use capabilities? After all, isn't tool use support an essential mechanism for enabling models to pull additional context from their environment? This conversational approach certainly seems effective for DeepSeek-R1, though I still need to run comparable experiments with o1 models.
Generalization
Although DeepSeek-R1 was mainly trained with RL on math and coding tasks, it is remarkable that generalization to agentic tasks with tool use via code actions works so well. This ability to generalize to agentic tasks is reminiscent of recent research by DeepMind showing that RL generalizes whereas SFT memorizes, although generalization to tool use wasn't examined in that work.
Despite its ability to generalize to tool use, DeepSeek-R1 often produces very long reasoning traces at each step, compared to other models in my experiments, limiting the usefulness of this model in a single-agent setup. Even simpler tasks sometimes take a long time to complete. Further RL on agentic tool use, whether via code actions or not, could be one option to improve efficiency.
Underthinking
I also observed the underthinking phenomenon with DeepSeek-R1. This is when a reasoning model frequently switches between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. It was a major cause of the overly long reasoning traces produced by DeepSeek-R1, and can be seen in the recorded traces that are available for download.
Future experiments
Another typical application of reasoning models is to use them for planning only, while other models generate the code actions. This could become a feature of freeact, if this separation of roles proves useful for more complex tasks.
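A rough sketch of what such a role separation could look like; the `planner_model` and `coder_model` callables and the prompt wording are assumptions, not an existing freeact API.

```python
def plan_then_act(task: str, planner_model, coder_model) -> str:
    """Hypothetical role separation: a reasoning model plans, another model writes the code action."""
    plan = planner_model(f"Outline the next step to solve this task:\n{task}")
    code_action = coder_model(f"Write Python code (using the tools defined above) for this step:\n{plan}")
    return code_action
```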
I'm also curious about how reasoning models that already support tool use (like o1, o3, ...) perform in a single-agent setup, with and without generating code actions. Recent developments like Deep Research, or Hugging Face's open-source Deep Research which also uses code actions, look interesting.