# REPL Environment for OpenEnv
A Python REPL environment for training language models on code execution tasks, based on the Recursive Language Models (RLM) paradigm.
## Overview
The RLM paradigm allows language models to:

- Execute Python code in a sandboxed REPL environment
- Make recursive calls to themselves or other LMs via `llm_query()` / `llm_query_batched()`
- Handle near-infinite context by programmatically decomposing and exploring data
- Terminate with explicit `FINAL(answer)` or `answer = {"content": ..., "ready": True}` signals
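
As an illustration, a single episode step might execute code like the sketch below. `context`, `llm_query_batched()`, and `FINAL()` are names the environment injects into the REPL namespace (see Features and the API Reference); the chunking scheme and prompt wording are hypothetical:

```python
# Illustrative sketch of code an LM might emit for one REPL step.
# `context`, `llm_query_batched`, and `FINAL` are injected by the
# environment; the chunk size and prompts here are made up.
chunks = [context[i:i + 2000] for i in range(0, len(context), 2000)]

# Recursive calls: have sub-LM calls scan each chunk
findings = llm_query_batched([f"List any dates mentioned in: {c}" for c in chunks])

# Terminate the episode with an explicit final answer
FINAL("; ".join(findings))
```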
## Features
- **Unified API**: Same `REPLEnv` class works for both local and remote execution
- **Sandboxed Python Execution**: Safe code execution with restricted builtins
- **Context Loading**: Load large contexts that agents can explore programmatically
- **Multiple Finalization Patterns**:
    - Direct call: `FINAL(answer)` - helper function injected into the namespace
    - Print pattern: `print('FINAL(answer)')` or `print('FINAL_VAR(var_name)')`
    - Prime Intellect style: `answer = {"content": "...", "ready": True}`
- **Iteration Limits**: Configurable maximum steps per episode
- **Reward Signals**: Customizable reward functions for RL training
- **Optional LLM Oracle**: Can enable `llm_query()` and `llm_query_batched()` for recursive calls
## Quick Start

### Local Mode (No Server Required)
```python
from repl_env import REPLEnv

# Create environment - runs locally by default
with REPLEnv() as env:
    result = env.reset(
        context="This is a large document with lots of text...",
        task_prompt="Find the word count"
    )

    # Execute code iteratively
    result = env.execute("words = context.split()")
    result = env.execute("count = len(words)")
    result = env.execute("print(f'FINAL({count})')")

    print(f"Done: {result.done}")
    print(f"Final Answer: {env.state().final_answer}")
```
### Remote Server Mode
```python
from repl_env import REPLEnv

# Connect to a running server - same API!
with REPLEnv(base_url="https://my-server.hf.space") as env:
    result = env.reset(context="...", task_prompt="...")
    result = env.execute("count = len(context)")
    result = env.execute("print(f'FINAL({count})')")
```
### Local Mode with LLM Support
```python
from repl_env import REPLEnv

def my_llm_query(prompt: str) -> str:
    return your_llm.generate(prompt)

def my_llm_query_batched(prompts: list[str]) -> list[str]:
    return [my_llm_query(p) for p in prompts]

# Pass LLM functions to enable recursive calls
with REPLEnv(llm_query_fn=my_llm_query, llm_batch_fn=my_llm_query_batched) as env:
    result = env.reset(context=large_document, task_prompt="Summarize this")

    # Now the executed code can use llm_query() and llm_query_batched()!
    result = env.execute("summary = llm_query('Summarize: ' + context[:1000])")
```
### From Docker or HuggingFace Hub
```python
from repl_env import REPLEnv

# Start from a Docker image
env = REPLEnv.from_docker_image("repl-env:latest")

# Or from the HuggingFace Hub
env = REPLEnv.from_hub("openenv/repl-env")
```
## API Reference

### REPLEnv
```python
class REPLEnv:
    def __init__(
        self,
        base_url: str | None = None,          # Server URL (None = local mode)
        *,
        # Local-only options
        llm_query_fn: Callable | None = None, # Function backing llm_query()
        llm_batch_fn: Callable | None = None, # Function backing llm_query_batched()
        max_output_length: int = 8192,        # Max stdout/stderr chars
        context_preview_length: int = 500,    # Chars in context preview
        reward_on_success: float = 1.0,       # Reward on FINAL()
        reward_on_iteration: float = 0.0,     # Reward per step
        reward_on_failure: float = -0.1,      # Reward on max iterations
        reward_on_error: float = -0.05,       # Reward on execution error
        # Remote-only options
        connect_timeout_s: float = 10.0,
        message_timeout_s: float = 60.0,
    ): ...

    def reset(
        self,
        *,
        context: str = "",              # Text to analyze (exposed as the `context` variable)
        task_prompt: str = "",          # Task description
        max_iterations: int = 30,       # Max code execution steps
        seed: int | None = None,        # Random seed
        episode_id: str | None = None,  # Custom episode ID
        hf_token: str | None = None,    # HF token for llm_query (remote mode)
        llm_model: str | None = None,   # Model for llm_query (remote mode)
    ) -> StepResult[REPLObservation]: ...

    def execute(self, code: str) -> StepResult[REPLObservation]: ...
    def step(self, action: REPLAction) -> StepResult[REPLObservation]: ...
    def submit_final_answer(self, answer: str) -> StepResult[REPLObservation]: ...
    def state(self) -> REPLState: ...
    def close(self) -> None: ...
```
### Action Space
```python
class REPLAction:
    code: str = ""                   # Python code to execute
    is_final: bool = False           # Whether this signals the final answer
    final_answer: str | None = None  # The final answer (if is_final=True)
```
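
`execute()` and `submit_final_answer()` cover the common cases; the sketch below shows how actions could also be built and passed to `step()` directly. It assumes `REPLAction` is importable from `repl_env` alongside `REPLEnv`:

```python
from repl_env import REPLEnv, REPLAction  # assumes REPLAction is exported here

with REPLEnv() as env:
    env.reset(context="alpha beta gamma", task_prompt="Count the words")

    # Equivalent to env.execute("n = len(context.split())")
    result = env.step(REPLAction(code="n = len(context.split())"))

    # Equivalent to env.submit_final_answer("3")
    result = env.step(REPLAction(is_final=True, final_answer="3"))
    assert result.done
```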
### Observation Space
```python
class REPLObservation:
    result: CodeBlockResult      # Execution result (stdout, stderr, etc.)
    context_preview: str | None  # First 500 chars of context
    context_length: int          # Total context length
    available_variables: list    # Variables in the namespace
    iteration: int               # Current iteration
    max_iterations: int          # Max iterations
    done: bool                   # Episode complete?
    reward: float                # Step reward
    metadata: dict               # Additional info (final_answer, etc.)
```
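
Each call to `execute()` returns a `StepResult` carrying one of these observations. A short sketch of reading its fields, reusing the local-mode setup from Quick Start:

```python
from repl_env import REPLEnv

with REPLEnv() as env:
    env.reset(context="alpha beta gamma", task_prompt="Count the words")
    result = env.execute("n = len(context.split()); print(n)")

    obs = result.observation
    print(obs.result.stdout)        # captured stdout of the executed code: "3"
    print(obs.available_variables)  # namespace variables, e.g. including 'n'
    print(f"{obs.iteration}/{obs.max_iterations} steps used")
    print(obs.done, obs.reward)     # episode status and step reward
```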
## Finalization Patterns

### Pattern 1: Direct `FINAL()` call (recommended)
```python
result = env.execute("answer = 42")
result = env.execute("FINAL(answer)")
# -> done=True, final_answer="42"
```
### Pattern 2: `FINAL()` via print
```python
result = env.execute("answer = 42")
result = env.execute("print(f'FINAL({answer})')")
# -> done=True, final_answer="42"
```
### Pattern 3: `FINAL_VAR()` for variable reference
```python
result = env.execute("my_result = 'The answer is 42'")

# Direct call (recommended) - pass the variable name as a string;
# FINAL_VAR looks up the variable and returns FINAL(value)
result = env.execute('FINAL_VAR("my_result")')
# -> done=True, final_answer="The answer is 42"

# Also works via print (for regex detection)
result = env.execute("print('FINAL_VAR(my_result)')")
# -> done=True, final_answer="The answer is 42"
```
### Pattern 4: Prime Intellect style answer dict
```python
result = env.execute("answer['content'] = '42'")
result = env.execute("answer['ready'] = True")
# -> done=True, final_answer="42"
```
## Prompts Module

The `prompts` module provides RLM-style prompts and parsing utilities:
```python
from repl_env.prompts import (
    # System prompts (from the official RLM repo)
    RLM_SYSTEM_PROMPT,        # Base prompt with llm_query_batched
    RLM_SYSTEM_PROMPT_QWEN,   # For Qwen models (adds a cost warning)
    # Prompt building
    QueryMetadata,            # Context metadata dataclass
    build_rlm_system_prompt,  # Build system messages with metadata
    build_user_prompt,        # Build the user prompt for each iteration
    build_initial_prompt,     # Convenience wrapper for iteration 0
    # Parsing utilities
    extract_code_blocks,      # Extract code from ```repl``` or ```python``` blocks
    format_observation,       # Format an execution result for the LLM
)

# Example: Build messages using the official RLM style
query_metadata = QueryMetadata(
    context_lengths=[len(context)],
    context_total_length=len(context),
    context_type="str",
)
messages = build_rlm_system_prompt(RLM_SYSTEM_PROMPT_QWEN, query_metadata)
messages.append(build_user_prompt(root_prompt="Count words in the context", iteration=0))

# Extract code from an LLM response (supports ```repl``` and ```python```)
response = "Here's my solution:\n```repl\ncount = len(context.split())\nFINAL(count)\n```"
code_blocks = extract_code_blocks(response)  # ["count = len(context.split())\nFINAL(count)"]
```
## Examples

See the `examples/` directory for complete working examples:
- `examples/repl_with_llm.py` - Full RLM loop with a local Qwen model
- `examples/repl_oolong_simple.py` - RLM on the Oolong benchmark with the HuggingFace Inference API
Run examples:
```bash
# Full RLM example with a local model (requires GPU)
python examples/repl_with_llm.py

# Oolong benchmark with the HF Inference API (requires HF_TOKEN)
python examples/repl_oolong_simple.py
```
## Model Usage

### Inference Loop
A typical model inference loop where the LLM generates code and the environment executes it:
```python
from repl_env import REPLEnv
from repl_env.prompts import RLM_SYSTEM_PROMPT, build_initial_prompt, extract_code_blocks, format_observation

# Works with both local and remote!
with REPLEnv(base_url="http://localhost:8000") as env:  # or REPLEnv() for local
    result = env.reset(
        context="The quick brown fox jumps over the lazy dog. " * 1000,
        task_prompt="Count how many times 'fox' appears"
    )

    messages = [
        {"role": "system", "content": RLM_SYSTEM_PROMPT},
        {"role": "user", "content": build_initial_prompt(
            task_prompt="Count how many times 'fox' appears",
            context_length=result.observation.context_length,
            context_preview=result.observation.context_preview,
            variables=result.observation.available_variables,
        )},
    ]

    while not result.done:
        # Get code from the LLM
        response = your_llm.chat(messages)
        code_blocks = extract_code_blocks(response)

        for code in code_blocks:
            result = env.execute(code)
            if result.done:
                break

        # Update the conversation
        messages.append({"role": "assistant", "content": response})
        messages.append({"role": "user", "content": format_observation(result.observation)})

    print(f"Final answer: {env.state().final_answer}")
```
### Recursive LLM Calls (RLM Paradigm)
The key insight of RLM is that models can make recursive calls to themselves or other LLMs from within the code:
```python
from repl_env import REPLEnv

def llm_query(prompt: str) -> str:
    """Single LLM call - the model can call this from executed code"""
    return your_llm.generate(prompt)

def llm_query_batched(prompts: list[str]) -> list[str]:
    """Batched LLM calls for efficiency (parallel in production)"""
    return [your_llm.generate(p) for p in prompts]

# Create environment with an LLM oracle (local mode)
with REPLEnv(llm_query_fn=llm_query, llm_batch_fn=llm_query_batched) as env:
    result = env.reset(
        context=massive_document,  # Could be 100K+ chars
        task_prompt="Summarize each section and find key themes"
    )

    # The model can now generate code like this:
    code = """
# Split the document into sections
sections = context.split('\\n\\n')

# Use the LLM to summarize each section (recursive call!)
summaries = llm_query_batched([f"Summarize: {s[:1000]}" for s in sections[:10]])

# Combine summaries
combined = '\\n'.join(summaries)

# Final synthesis using another LLM call
answer['content'] = llm_query(f"Find key themes in: {combined}")
answer['ready'] = True
"""
    result = env.execute(code)
    print(f"Done: {result.done}, Answer: {env.state().final_answer}")
```
### RL Training Integration
For RL training, integrate with frameworks like TRL, prime-rl, or verifiers:
```python
from repl_env import REPLEnv

def collect_trajectory(env, policy, context, task):
    """Collect a single trajectory for RL training"""
    result = env.reset(context=context, task_prompt=task)
    trajectory = []
    total_reward = 0

    while not result.done:
        # Policy generates code
        code = policy.generate(result.observation)

        # Step the environment
        next_result = env.execute(code)

        # Store the transition
        trajectory.append({
            "observation": result.observation,
            "action": code,
            "reward": next_result.reward,
            "next_observation": next_result.observation,
            "done": next_result.done,
        })

        total_reward += next_result.reward
        result = next_result

    return trajectory, total_reward

# Training loop
with REPLEnv(
    reward_on_success=1.0,
    reward_on_iteration=0.0,
    reward_on_error=-0.05,
    reward_on_failure=-0.1,
) as env:
    for epoch in range(num_epochs):
        for context, task, ground_truth in dataset:
            trajectory, reward = collect_trajectory(env, policy, context, task)

            # Verify answer correctness (optional external reward)
            if trajectory:
                final_answer = env.state().final_answer
                if final_answer == ground_truth:
                    reward += verification_bonus

            # Update the policy (use your RL framework - PPO, GRPO, DPO, etc.)
            policy.update(trajectory, reward)
```
## Reward Configuration
Configure rewards for different outcomes:
```python
env = REPLEnv(
    reward_on_success=1.0,    # When FINAL() is called
    reward_on_iteration=0.0,  # Per step (can be negative to encourage efficiency)
    reward_on_error=-0.05,    # When code execution fails
    reward_on_failure=-0.1,   # When max iterations are reached without an answer
)
```
## Environment Configuration
| Environment Variable | Description | Default |
|---|---|---|
| `REPL_CONTEXT` | Initial context to load | `""` |
| `REPL_TASK_PROMPT` | Task description | `""` |
| `REPL_MAX_ITERATIONS` | Max steps per episode | `30` |
| `HF_TOKEN` | HuggingFace token for `llm_query` (server fallback) | `None` |
| `LLM_MODEL` | Model for `llm_query`/`llm_query_batched` | `Qwen/Qwen3-Coder-480B-A35B-Instruct` |
## Running the Server

### Using UV
```bash
cd envs/repl_env
uv run --project . server
```
### Using Docker
```bash
docker build -t repl-env:latest -f server/Dockerfile .
docker run -p 8000:8000 repl-env:latest
```
## Testing
```bash
pytest tests/
```