NanoRollout: Scale digital agent rollouts without pain

*Junli Wang Zhoujun Cheng*† Yuxuan Zhang* Shibo Hao Yao Tang**

**Zhiting Hu Prithviraj Ammanabrolu Hao Zhang†**

*: Equal Contribution; †: Corresponding Author

<aside> ⭐

TL;DR

The scaling of digital agents (e.g. coding and computer-use agents) is bottlenecked by environments. Evaluation, distillation, and reinforcement learning (RL) all necessitate executing digital agents within heavy, stateful, and heterogeneous environments at a large parallel scale.

We built NanoRollout, a lightweight infrastructure (900 lines core code) for scaling agent-environment interaction that:

Unifies agent RL, distillation, and evaluation with a shared rollout service;
Decouples agent harnesses and environments from the training logic, enabling independent scaling of GPU, CPU, and memory resources;
Ships with common digital agent benchmarks and agent harnesses out of the box, and integrates with existing RL frameworks including miles, veRL, and tunix.

NanoRollout aims to help researchers and open-source developers iterate rapidly on digital agent rollouts and training.

Key Results:

RL: We scale up the SWE agent RL batch size to 4,096 rollouts with stable training using NanoRollout and the resulting model Mocha-RL-Alpha-32B outperforms the strong open baseline DeepSWE-32B on SWE-Bench.
Distillation: We trained a coding agent model Mocha-Coder-32B by distilling 250K+ trajectories from frontier open-source models using NanoRollout, which achieves state-of-the-art performance among open-data models at the ≤ 32B scale and is competitive with Qwen3-Coder-480B-A3B.
Evaluation: NanoRollout validates fast evaluations on SWE-Bench Verified / Pro, Terminal-Bench, OSWorld, and CocoaBench with parallel environments. For example, SWE-Bench is evaluated in 18 mins via the official DeepSeek-V3.2 API.

💻 NanoRollout Code | 🤗 Mocha Collections | 📒 Wandb Log

</aside>

Background

Foundation models are not born as proficient digital agents. They become effective agents through training in environments — by observing environment feedback, learning how to act, and improving through iterative interactions. Recent work shows that environment-based training for agents spans multiple stages of the agent development. Taking coding agents as an example, ***Qwen3-Coder-Next*** utilizes large-scale executable and verifiable coding tasks not only during reinforcement learning (RL), but also during mid-training, bringing environment-grounded signals earlier into the stack. ***MiMo-V2-Flash*** illustrates the complementary trend on the RL side: as RL scales up the number of environments, agent performance continues to climb. Figure 1 below provides an overview of coding agent training pipeline.

Figure 1. A schematic training pipeline for a general-purpose coding agent, inspired by the discussion in this blog post. After general pre-training, later training stages increasingly rely on environments in different ways: Agentic CPT (continue pre-training) uses large volumes of agent trajectories; SFT learns from smaller amounts of high-quality trajectories; and on-policy RL learns directly from environment feedback.

However, scaling up digital agent environments is challenging in practice.

First, digital agents require different harnesses and environment providers. Within a domain**,** multiple harnesses (i.e., management of system prompt + action space + memory) coexist: for SWE agents alone, popular options include but not limited to OpenHands, mini-swe-agent, and R2E-Gym, and recent work like SWE-Universe demonstrates the effectiveness of mid-training across multiple harnesses, motivating infrastructure that makes harness switching easy. Across domains, environment providers diverge with what each domain demands: a coding agent typically needs a bash environment, which a local Docker setup or a managed sandbox provider (e.g., Modal, Daytona, or E2B) can scale well; a computer-use agent like OSWorld has a more complex sandbox requiring QEMU-backed VMs, for which general-purpose clouds such as AWS, Azure, or GCP become the more practical option to scale. Looking forward, this heterogeneity may grow further as unified agents that require coding, computer use, and deep search jointly (e.g., CocoaBench) emerge, motivating flexible environment provider assignment.

Second, model and environment workloads demand different resources. Model training and inference require GPUs, while environments require CPUs, memory, disk storage, network resources. Container images for thousands of tasks take up disk space and stress memory and I/O when loaded at sandbox startup, and tasks that fetch dependencies at runtime need network bandwidth as well. Coupling them is a common pattern in existing open implementations, but it becomes painful when practitioners want to scale resources separately. For example, as the bottleneck shifts between the two, one may need to add more data-parallel workers or more environment workers. This motivates a design that decouples model and environment workloads so each can be scaled independently.

NanoRollout is designed to scale agent-environment interactions in light of these challenges. We demonstrate its effectiveness across large-batch agent RL, massive-scale trajectory distillation, and large-scale parallel evaluation.

NanoRollout Design

Figure 2. Overview of NanoRollout. NanoRollout separates agent harnesses, domain-specific environment runtimes, and execution backends behind a unified rollout server. The same interface supports on-policy agentic RL, trajectory distillation, and parallel evaluation.

Figure 2. Overview of NanoRollout. NanoRollout separates agent harnesses, domain-specific environment runtimes, and execution backends behind a unified rollout server. The same interface supports on-policy agentic RL, trajectory distillation, and parallel evaluation.

The role of NanoRollout is simple:

Given a batch of tasks and a model inference endpoint, it executes the corresponding agent-environment rollouts and return trajectories with optional rewards.

Behind this simple contract, NanoRollout handles resource-aware orchestration and backend-agnostic rollout execution. As shown in Figure 2, NanoRollout exposes a rollout server as the entry point for all rollout workloads. It resolves tasks, agent harness and environment type for each run request, prepares the corresponding runtime, and admits rollout jobs under concurrency and resource limits. When a rollout finishes, NanoRollout collects the interaction trace, execution status, metadata, and optional reward into a normalized response.

Harness Layer: From Model to Agent

A harness is the model-facing protocol of an agent. It defines the system prompt, action space, tool schema, observation format, and message-history policy. During a rollout, the harness turns environment observations into model inputs, parses model outputs into actions, and decides what context is carried into the next turn.

NanoRollout ships common harnesses across coding, terminal, desktop, and AIO tasks, while keeping harness-specific changes isolated to this layer.