TB2 Environment (Terminal-Bench 2)#
OpenEnv wrapper for Terminal-Bench 2 tasks. Supports two execution modes:
| Mode | Description | Use Case |
|---|---|---|
| Local | Runs commands in the server process (no Docker) | Hugging Face Spaces, environments without Docker access |
| Docker | Runs each task in its own container | Full TB2.0 fidelity with custom task images |
Quick Start#
from tbench2_env import Tbench2Env, Tbench2Action
env = Tbench2Env(base_url="http://localhost:8000")
result = env.reset(task_id="headless-terminal")
print(result.observation.instruction)
result = env.step(Tbench2Action(action_type="exec", command="ls -la"))
print(result.observation.output)
result = env.step(Tbench2Action(action_type="evaluate"))
print(result.reward, result.done)
env.close()
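The Quick Start calls `env.close()` only if every step succeeds. A minimal sketch of a driver that always closes the client and stops once an episode reports `done` is shown below. The helper name `run_steps` is hypothetical and not part of `tbench2_env`; it relies only on the `step`, `close`, and `done` names used in the Quick Start.

```python
from contextlib import closing
from typing import Any, Iterable, List


def run_steps(env: Any, actions: Iterable[Any]) -> List[Any]:
    """Step `env` through `actions`, closing it even if a step raises.

    Hypothetical helper: `env` is any object exposing step(action) and
    close(); each result is expected to carry a `done` attribute.
    """
    results = []
    with closing(env):  # calls env.close() on normal exit and on error
        for action in actions:
            result = env.step(action)
            results.append(result)
            # Stop early once the environment reports the episode is over.
            if getattr(result, "done", False):
                break
    return results
```

With the client above, this would be called after `env.reset(...)`, passing the `Tbench2Env` instance and a list of `Tbench2Action` objects.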
Building the Docker Image#
Before using the environment, build the Docker image:
# From project root
docker build -t tbench2-env:latest -f envs/tbench2_env/server/Dockerfile .
Environment Details#
Action#
Tbench2Action: Controls interaction with the TB2 task session
| Field | Type | Default | Description |
|---|---|---|---|
| `action_type` | str | | Action to perform (…) |
| `command` | str | | Shell command or input to send |
| `session_id` | str \| None | | Session ID for streaming processes |
| `block` | bool | | Whether to block until command completes |
| | float \| None | | Time to wait (for …) |
| | str | | File path (for …) |
| | str | | Content to write (for …) |
Observation#
Tbench2Observation: Contains the environment response
| Field | Type | Description |
|---|---|---|
| `instruction` | str | Task instruction/prompt from the TB2 task |
| `output` | str | Command output (stdout/stderr) |
| | bool | Whether the action succeeded |
| | str | Error message if action failed |
| | str | Current task identifier |
| | str | Path to the task directory |
| | str \| None | Session ID for streaming processes |
| | str | The action type that produced this observation |
| | dict | Additional metadata |
State#
Tbench2State: Server-side state for the task session
| Field | Type | Description |
|---|---|---|
| | str | Current task identifier |
| | str | Path to the task directory |
| | str | Active session ID |
| | bool | Whether the terminal is ready for commands |
| | str | Last action type executed |
| | str | Last command executed |
| | str | Output from last command |
Execution Modes#
Local Mode (Default)#
Commands execute directly in the server process. Ideal for HF Spaces where Docker-in-Docker is unavailable.
# Default - local mode
python -m tbench2_env.server.app
# Or explicitly set mode
TB2_MODE=local python -m tbench2_env.server.app
Note: Local mode ignores Docker images specified in task.toml. Tasks requiring specific runtime environments may fail.
Docker Mode#
Each task runs in its own Docker container, using the image specified in the task's task.toml:
# Enable Docker mode
TB2_MODE=docker python -m tbench2_env.server.app
Requirements:
- Docker socket mounted at `/var/run/docker.sock`
- Sufficient disk space for container images
- Network access to pull images if not cached

Environment Variables for Docker Mode:
- `TB2_MODE=docker`: enable Docker-backed execution
- Docker socket must be accessible (mounted volume)
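Since Docker mode depends on the socket at `/var/run/docker.sock`, a quick preflight check before starting the server can save a confusing failure later. The sketch below is an illustration only: the function name `docker_socket_available` is hypothetical, and the check confirms that a Unix socket exists at the path, not that the current user is allowed to connect to it.

```python
import os
import stat

# Default socket path taken from the requirements above.
DOCKER_SOCKET = "/var/run/docker.sock"


def docker_socket_available(path: str = DOCKER_SOCKET) -> bool:
    """Return True if `path` exists and is a Unix domain socket."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing path, or not enough permission to stat it.
        return False
    return stat.S_ISSOCK(mode)


if __name__ == "__main__":
    if docker_socket_available():
        print("Docker socket found; TB2_MODE=docker may be usable.")
    else:
        print("No Docker socket found; consider TB2_MODE=local.")
```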
Action Types#
| Action | Description | Required Fields |
|---|---|---|
| `exec` | Run a shell command | |
| `write` | Send input to a running session | |
| `view` | Read pending output | |
| `wait` | Wait for output | |
| `kill` | Terminate a running session | |
| | Write content to a file | |
| `evaluate` | Run pytest tests, return reward | (none) |
| | Stop and cleanup | (none) |
Session IDs (Streaming Processes)#
session_id is only required when you start a non-blocking process and want to interact with it (write, view, wait, kill). For plain exec commands, you can omit it.
Example (Python):
# Start a long-running process
env.step(Tbench2Action(action_type="exec", command="python -i", block=False, session_id="sess1"))
# Send input to it
env.step(Tbench2Action(action_type="write", session_id="sess1", command="print(2+2)\n"))
# Read its output
env.step(Tbench2Action(action_type="view", session_id="sess1"))
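A non-blocking process may not have produced its output by the time the first `view` returns, so a client often needs to poll. The following is a minimal sketch of such a loop. The helper `read_until` is hypothetical, and it assumes each `view` call returns only output that has not been read yet; if the server instead returns the whole buffer, the accumulated text will contain repeats, although the marker check still works.

```python
import time
from typing import Callable


def read_until(
    view: Callable[[], str],
    marker: str,
    timeout: float = 5.0,
    interval: float = 0.1,
) -> str:
    """Call `view()` repeatedly, accumulating output until `marker` appears.

    Raises TimeoutError if `marker` has not appeared within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    collected = ""
    while True:
        collected += view()
        if marker in collected:
            return collected
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not seen within {timeout}s")
        time.sleep(interval)
```

For the session above, `view` could be a lambda wrapping `env.step(Tbench2Action(action_type="view", session_id="sess1")).observation.output`, with `">>> "` as the marker for the Python prompt.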
Environment Variables#
| Variable | Default | Description |
|---|---|---|
| `TB2_MODE` | `local` | Execution mode: `local` or `docker` |
| `TB2_TASKS_DIR` | (auto-download) | Path to local Terminal-Bench-2 repo checkout |
| | | Directory for session logs and cache |
| | | Where to extract TB2 repo |
| | (GitHub main.zip) | Repo zip URL for auto-download |
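`TB2_MODE` selects between the two execution modes, with local mode as the default. As a rough sketch of how a launcher script might read it, the function below resolves the mode from an environment mapping. The name `resolve_mode` is hypothetical, and rejecting unknown values is an assumption; the server's actual handling of unexpected values is not documented here.

```python
import os
from typing import Mapping

# The two modes described in this README.
VALID_MODES = ("local", "docker")


def resolve_mode(environ: Mapping[str, str] = os.environ) -> str:
    """Return the mode named by TB2_MODE, defaulting to 'local'."""
    mode = environ.get("TB2_MODE", "local")
    if mode not in VALID_MODES:
        # Assumption: fail loudly rather than silently fall back.
        raise ValueError(f"TB2_MODE must be one of {VALID_MODES}, got {mode!r}")
    return mode
```

Passing the mapping explicitly, for example `resolve_mode({"TB2_MODE": "docker"})`, keeps the function easy to test without touching the real process environment.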
Reward#
Binary reward on evaluate action:
- `1.0`: All pytest tests pass (exit code 0)
- `0.0`: Tests fail (non-zero exit code)
Intermediate steps return reward=None.
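The reward rule above can be written as a small pure function. This is an illustrative sketch, not the environment's actual implementation; the name `reward_for` and its signature are hypothetical.

```python
from typing import Optional


def reward_for(
    action_type: str, pytest_exit_code: Optional[int] = None
) -> Optional[float]:
    """Map an action to the reward described above.

    Only 'evaluate' yields a reward: 1.0 when pytest exits with code 0,
    otherwise 0.0. Every other action type yields None.
    """
    if action_type != "evaluate":
        return None
    if pytest_exit_code is None:
        raise ValueError("evaluate requires a pytest exit code")
    return 1.0 if pytest_exit_code == 0 else 0.0
```

Because intermediate rewards are `None` rather than `0.0`, training code that sums rewards over an episode needs to skip or replace those values first.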
Running the Server#
# Install dependencies
uv sync --all-extras
# Local mode (default, for Spaces)
python -m tbench2_env.server.app --port 8000
# Docker mode (full TB2.0 compatibility)
TB2_MODE=docker python -m tbench2_env.server.app --port 8000
# With local TB2 repo
TB2_TASKS_DIR=/path/to/terminal-bench-2 python -m tbench2_env.server.app
Project Structure#
tbench2_env/
├── __init__.py                      # Module exports (Tbench2Env, Tbench2Action, etc.)
├── README.md                        # This file
├── client.py                        # Tbench2Env client implementation
├── models.py                        # Tbench2Action, Tbench2Observation, Tbench2State
├── openenv.yaml                     # OpenEnv configuration
├── pyproject.toml                   # Package dependencies
└── server/
    ├── __init__.py                  # Server exports
    ├── app.py                       # FastAPI application
    ├── tbench2_env_environment.py   # Core environment logic
    └── Dockerfile                   # Container image definition