TB2 Environment (Terminal-Bench 2)#

OpenEnv wrapper for Terminal-Bench 2 tasks. Supports two execution modes:

| Mode | Description | Use Case |
| --- | --- | --- |
| Local | Runs commands in the server process (no Docker) | Hugging Face Spaces, environments without Docker access |
| Docker | Runs each task in its own container | Full TB2.0 fidelity with custom task images |

Quick Start#

from tbench2_env import Tbench2Env, Tbench2Action

env = Tbench2Env(base_url="http://localhost:8000")
result = env.reset(task_id="headless-terminal")
print(result.observation.instruction)

result = env.step(Tbench2Action(action_type="exec", command="ls -la"))
print(result.observation.output)

result = env.step(Tbench2Action(action_type="evaluate"))
print(result.reward, result.done)

env.close()

Building the Docker Image#

Before using the environment, build the Docker image:

# From project root
docker build -t tbench2-env:latest -f envs/tbench2_env/server/Dockerfile .

Environment Details#

Action#

Tbench2Action: Controls interaction with the TB2 task session

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| action_type | str | "exec" | Action to perform (exec, write, view, wait, kill, write_file, evaluate, close) |
| command | str | "" | Shell command or input to send |
| session_id | str \| None | None | Session ID for streaming processes |
| block | bool | True | Whether to block until command completes |
| wait_seconds | float \| None | None | Time to wait (for wait action) |
| file_path | str | "" | File path (for write_file action) |
| content | str | "" | Content to write (for write_file action) |

Observation#

Tbench2Observation: Contains the environment response

| Field | Type | Description |
| --- | --- | --- |
| instruction | str | Task instruction/prompt from the TB2 task |
| output | str | Command output (stdout/stderr) |
| success | bool | Whether the action succeeded |
| error | str | Error message if action failed |
| task_id | str | Current task identifier |
| task_path | str | Path to the task directory |
| session_id | str \| None | Session ID for streaming processes |
| action_type | str | The action type that produced this observation |
| info | dict | Additional metadata |

State#

Tbench2State: Server-side state for the task session

| Field | Type | Description |
| --- | --- | --- |
| task_id | str | Current task identifier |
| task_path | str | Path to the task directory |
| session_id | str | Active session ID |
| terminal_ready | bool | Whether the terminal is ready for commands |
| last_action_type | str | Last action type executed |
| last_command | str | Last command executed |
| last_output | str | Output from last command |

Execution Modes#

Local Mode (Default)#

Commands execute directly in the server process. Ideal for HF Spaces where Docker-in-Docker is unavailable.

# Default - local mode
python -m tbench2_env.server.app

# Or explicitly set mode
TB2_MODE=local python -m tbench2_env.server.app

Note: Local mode ignores Docker images specified in task.toml. Tasks requiring specific runtime environments may fail.

Docker Mode#

Each task runs in its own Docker container, using the image specified in the task's task.toml:

# Enable Docker mode
TB2_MODE=docker python -m tbench2_env.server.app

Requirements:

  • Docker socket mounted at /var/run/docker.sock

  • Sufficient disk space for container images

  • Network access to pull images if not cached

Environment Variables for Docker Mode:

  • TB2_MODE=docker - Enable Docker-backed execution

The Docker socket must also be accessible to the server process (for example, as a mounted volume).

Action Types#

| Action | Description | Required Fields |
| --- | --- | --- |
| exec | Run a shell command | command, optionally block, session_id |
| write | Send input to a running session | session_id, command |
| view | Read pending output | session_id |
| wait | Wait for output | session_id, optionally wait_seconds |
| kill | Terminate a running session | session_id |
| write_file | Write content to a file | file_path, content |
| evaluate | Run pytest tests, return reward | (none) |
| close | Stop and cleanup | (none) |
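
The required fields above can be checked on the client before an action is sent. The following is a minimal sketch, not part of the package: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and treating an empty string as "missing" is an assumption based on the `""` defaults in the Action table. The server may validate differently.

```python
# Required fields per action type, transcribed from the Action Types table.
# Optional fields (block, wait_seconds) are intentionally left out.
REQUIRED_FIELDS = {
    "exec": ("command",),
    "write": ("session_id", "command"),
    "view": ("session_id",),
    "wait": ("session_id",),
    "kill": ("session_id",),
    "write_file": ("file_path", "content"),
    "evaluate": (),
    "close": (),
}


def missing_fields(action_type, **fields):
    """Return required fields that are absent (None or empty string).

    Hypothetical client-side helper; it does not call the server.
    """
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [
        name
        for name in REQUIRED_FIELDS[action_type]
        if fields.get(name) in (None, "")
    ]


print(missing_fields("exec", command="ls -la"))          # []
print(missing_fields("write", command="print(2+2)\n"))   # ['session_id']
print(missing_fields("write_file", file_path="/tmp/a"))  # ['content']
```

Note that this sketch would flag an intentionally empty `content` for `write_file`; relax that check if writing empty files is needed.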

Session IDs (Streaming Processes)#

session_id is only required when you start a non-blocking process and want to interact with it (write, view, wait, kill). For plain exec commands, you can omit it.

Example (Python):

# Start a long-running process
env.step(Tbench2Action(action_type="exec", command="python -i", block=False, session_id="sess1"))

# Send input to it
env.step(Tbench2Action(action_type="write", session_id="sess1", command="print(2+2)\n"))

# Read its output
env.step(Tbench2Action(action_type="view", session_id="sess1"))
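
To round out the session lifecycle, the sketch below polls a non-blocking session with `wait` actions and then terminates it with `kill`. `drain_session` is a hypothetical helper, not part of the package. It assumes that `wait` returns any new output in `observation.output` and that `observation.success` turns false once the session can no longer be read; neither behavior is confirmed here.

```python
def drain_session(env, make_action, session_id, max_polls=5, wait_seconds=1.0):
    """Collect output from a non-blocking session, then kill it.

    `env` needs a step() method; `make_action` builds an action from
    keyword arguments (for example, the Tbench2Action class).
    """
    chunks = []
    for _ in range(max_polls):
        result = env.step(
            make_action(
                action_type="wait",
                session_id=session_id,
                wait_seconds=wait_seconds,
            )
        )
        obs = result.observation
        if obs.output:
            chunks.append(obs.output)
        if not obs.success:
            # Assumption: a failed wait means the session is gone.
            break
    # Always terminate the session so it does not linger on the server.
    env.step(make_action(action_type="kill", session_id=session_id))
    return "".join(chunks)
```

With the session started above, a call might look like `text = drain_session(env, Tbench2Action, "sess1")`.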

Environment Variables#

| Variable | Default | Description |
| --- | --- | --- |
| TB2_MODE | local | Execution mode: local or docker |
| TB2_TASKS_DIR | (auto-download) | Path to local Terminal-Bench-2 repo checkout |
| TB2_OUTPUT_DIR | /tmp/tbench2_env_runs | Directory for session logs and cache |
| TB2_CACHE_DIR | $TB2_OUTPUT_DIR/repo_cache | Where to extract TB2 repo |
| TB2_REPO_URL | (GitHub main.zip) | Repo zip URL for auto-download |
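
As an illustration of how these defaults combine, here is a small sketch that resolves the variables from a mapping. `resolve_tb2_settings` is a hypothetical function, not the server's actual configuration code. Unset `TB2_TASKS_DIR` and `TB2_REPO_URL` are returned as `None` to stand for "auto-download" and "the built-in default URL", which is not reproduced here.

```python
import os


def resolve_tb2_settings(environ=None):
    """Resolve TB2_* settings using the defaults from the table above.

    Hypothetical sketch; pass a dict to test without touching os.environ.
    """
    env = os.environ if environ is None else environ

    mode = env.get("TB2_MODE", "local")
    if mode not in ("local", "docker"):
        raise ValueError(f"TB2_MODE must be 'local' or 'docker', got {mode!r}")

    output_dir = env.get("TB2_OUTPUT_DIR", "/tmp/tbench2_env_runs")
    # The cache directory defaults to a subdirectory of the output directory.
    cache_dir = env.get("TB2_CACHE_DIR", f"{output_dir}/repo_cache")

    return {
        "mode": mode,
        "tasks_dir": env.get("TB2_TASKS_DIR"),  # None: auto-download
        "output_dir": output_dir,
        "cache_dir": cache_dir,
        "repo_url": env.get("TB2_REPO_URL"),    # None: built-in default
    }


print(resolve_tb2_settings({"TB2_OUTPUT_DIR": "/data"})["cache_dir"])
# /data/repo_cache
```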

Reward#

The evaluate action returns a binary reward:

  • 1.0 - All pytest tests pass (exit code 0)

  • 0.0 - Tests fail (non-zero exit code)

Intermediate steps return reward=None.
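
The reward rule can be summarized in a few lines. This is a sketch of the rule as described, not the server's implementation; `reward_for_step` and its parameters are hypothetical names.

```python
from typing import Optional


def reward_for_step(
    action_type: str, pytest_exit_code: Optional[int] = None
) -> Optional[float]:
    """Map a step to its reward following the rule described above."""
    if action_type != "evaluate":
        # Intermediate steps carry no reward.
        return None
    if pytest_exit_code is None:
        raise ValueError("evaluate needs the pytest exit code")
    # Exit code 0 means all tests passed; anything else counts as failure.
    return 1.0 if pytest_exit_code == 0 else 0.0


print(reward_for_step("exec"))         # None
print(reward_for_step("evaluate", 0))  # 1.0
print(reward_for_step("evaluate", 1))  # 0.0
```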

Running the Server#

# Install dependencies
uv sync --all-extras

# Local mode (default, for Spaces)
python -m tbench2_env.server.app --port 8000

# Docker mode (full TB2.0 compatibility)
TB2_MODE=docker python -m tbench2_env.server.app --port 8000

# With local TB2 repo
TB2_TASKS_DIR=/path/to/terminal-bench-2 python -m tbench2_env.server.app

Project Structure#

tbench2_env/
├── __init__.py              # Module exports (Tbench2Env, Tbench2Action, etc.)
├── README.md                # This file
├── client.py                # Tbench2Env client implementation
├── models.py                # Tbench2Action, Tbench2Observation, Tbench2State
├── openenv.yaml             # OpenEnv configuration
├── pyproject.toml           # Package dependencies
└── server/
    ├── __init__.py          # Server exports
    ├── app.py               # FastAPI application
    ├── tbench2_env_environment.py  # Core environment logic
    └── Dockerfile           # Container image definition