# Reasoning Gym Environment
An OpenEnv environment that integrates the Reasoning Gym library to provide single-step reasoning tasks. Each episode presents one question from a configurable dataset; the agent submits an answer and receives a score.
## Quick Start
The simplest way to use the Reasoning Gym environment is through the `ReasoningGymEnv` class:
```python
from reasoning_gym_env import ReasoningGymAction, ReasoningGymEnv

# Create environment from Docker image
env = ReasoningGymEnv.from_docker_image("reasoning_gym-env:latest")
try:
    # Create a dataset with 10 leg_counting questions
    result = env.reset(
        dataset_name='leg_counting',
        seed=42,
        size=10
    )
    print(f"Question: {result.observation.question}")
    # Question: "How many legs does a cat have?"

    # Submit answer
    result = env.step(ReasoningGymAction(answer="4"))
    print(f"Score: {result.observation.score}")  # 1.0 (correct)
    print(f"Correct answer: {result.observation.correct_answer}")  # "4"
    print(f"Reward: {result.reward}")  # 1.0
    print(f"Done: {result.done}")  # True (single-step episodes)

    # Access dataset metadata if available
    if result.observation.dataset_metadata:
        print(f"Metadata: {result.observation.dataset_metadata}")  # Dataset-specific info

    # Get next question from the same dataset
    result = env.reset()  # No params = reuse dataset
    print(f"Next question: {result.observation.question}")
    # Note: the first reset() without params creates a default leg_counting
    # dataset with seed=42 and size=1000 for reproducible, out-of-the-box use
finally:
    # Always clean up
    env.close()
```
That's it! The `ReasoningGymEnv.from_docker_image()` method handles:

- Starting the Docker container
- Waiting for the server to be ready
- Connecting to the environment
- Container cleanup when you call `close()`
## Building the Docker Image

Before using the environment, you need to build the Docker image:

```bash
# From the reasoning_gym_env directory
docker build -t reasoning_gym-env:latest -f server/Dockerfile .
```
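To try the image outside of `from_docker_image()`, you can also run the container yourself. A minimal sketch, assuming the server listens on port 8000 (the port used by the `base_url` examples later in this README):

```bash
# Run the server container and expose the API on localhost:8000
# (the container port is an assumption based on the base_url examples below)
docker run --rm -p 8000:8000 reasoning_gym-env:latest
```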
## Deploying to Hugging Face Spaces

You can easily deploy your OpenEnv environment to Hugging Face Spaces using the `openenv push` command:

```bash
# From the environment directory (where openenv.yaml is located)
openenv push

# Or specify options
openenv push --repo-id my-org/reasoning-gym-env --private
```
The `openenv push` command will:

- Validate that the directory is an OpenEnv environment (checks for `openenv.yaml`)
- Prepare a custom build for a Hugging Face Docker Space (enables the web interface)
- Upload to Hugging Face (ensuring you're logged in)
### Prerequisites

- Authenticate with Hugging Face: the command will prompt for login if you are not already authenticated (or log in ahead of time, as shown below)
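If you prefer to authenticate ahead of time, the standard Hugging Face CLI works; this step is optional, since `openenv push` prompts when needed:

```bash
# Log in once so that openenv push can upload to your account
huggingface-cli login
```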
### Options

- `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to current directory)
- `--repo-id`, `-r`: Repository ID in the format `username/repo-name` (defaults to `username/env-name` from `openenv.yaml`)
- `--base-image`, `-b`: Base Docker image to use (overrides the Dockerfile `FROM`)
- `--private`: Deploy the space as private (default: public)
### Examples

```bash
# Push to your personal namespace (defaults to username/env-name from openenv.yaml)
openenv push

# Push to a specific repository
openenv push --repo-id my-org/reasoning-gym-env

# Push with a custom base image
openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest

# Push as a private space
openenv push --private

# Combine options
openenv push --repo-id my-org/reasoning-gym --base-image custom-base:latest --private
```
After deployment, your space will be available at:

`https://huggingface.co/spaces/<repo-id>`

The deployed space includes:

- Web Interface at `/web` - Interactive UI for exploring the environment
- API Documentation at `/docs` - Full OpenAPI/Swagger interface
- Health Check at `/health` - Container health monitoring (see the curl example below)
- WebSocket at `/ws` - Persistent session endpoint for low-latency interactions
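As a quick smoke test after deployment, you can probe the health endpoint. Spaces typically serve the app itself at `https://<owner>-<space-name>.hf.space`; the hostname and response format below are assumptions, so check your Space's settings for the real URL:

```bash
# Probe the deployed server's health endpoint
# (replace the hostname with your Space's actual URL)
curl https://my-org-reasoning-gym-env.hf.space/health
```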
## Environment Details

### Episode Structure

Each episode is single-step (see the loop sketch after this list):

- `reset()` returns a question
- `step(answer)` returns a score and marks the episode as done
- `reset()` without params gets the next question from the same dataset
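Putting those three rules together, a typical evaluation loop looks like the sketch below. It assumes a server is already running on localhost:8000, and `answer_question` is a hypothetical stand-in for whatever model or policy produces answers:

```python
from reasoning_gym_env import ReasoningGymAction, ReasoningGymEnv

def answer_question(question: str) -> str:
    # Hypothetical policy: replace with your model or agent
    return "4"

env = ReasoningGymEnv(base_url="http://localhost:8000")
result = env.reset(dataset_name='leg_counting', seed=42, size=10)

total = 0.0
for _ in range(10):
    answer = answer_question(result.observation.question)
    result = env.step(ReasoningGymAction(answer=answer))
    total += result.reward    # equal to the score, 0.0 to 1.0
    if result.done:           # always True for single-step episodes
        result = env.reset()  # next question from the same dataset

print(f"Mean score: {total / 10:.2f}")
env.close()
```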
### Action

`ReasoningGymAction`: Contains the agent's answer

- `answer` (str) - The agent's answer to the current question
### Observation

`ReasoningGymObservation`: Contains the question or result (guarded access is sketched below)

- `question` (Optional[str]) - The current question (only in reset)
- `score` (Optional[float]) - Score for the answer, 0.0 to 1.0 (only after step)
- `correct_answer` (Optional[str]) - The correct answer (only after step)
- `dataset_metadata` (Optional[Dict]) - Metadata from the reasoning gym dataset entry
- `done` (bool) - Always True after step (single-step episodes)
- `reward` (float) - Equal to score (0.0 to 1.0)
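Because several fields are Optional and only populated in one phase of the episode, it is worth guarding access. A minimal sketch, again assuming a server on localhost:8000:

```python
from reasoning_gym_env import ReasoningGymAction, ReasoningGymEnv

env = ReasoningGymEnv(base_url="http://localhost:8000")

# After reset(), only the question side is populated
obs = env.reset(dataset_name='leg_counting', seed=42, size=10).observation
if obs.question is not None:
    print(f"Question: {obs.question}")

# After step(), the scoring side is populated instead
obs = env.step(ReasoningGymAction(answer="4")).observation
if obs.score is not None:
    print(f"Score: {obs.score}, expected: {obs.correct_answer}")
if obs.dataset_metadata:
    print(f"Metadata: {obs.dataset_metadata}")
```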
### Reward

The reward equals the score returned by the dataset's scoring function:

- Correct answer → score 1.0
- Incorrect answer → score between 0.0 and 1.0 (dataset-dependent; some datasets award partial credit)
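Because some datasets award partial credit, you may want to report strict exact-match accuracy separately from the raw score. A small sketch, reusing an `env` with a pending question from earlier:

```python
# Only full credit counts as an exact match; the raw score may be partial
result = env.step(ReasoningGymAction(answer="4"))
score = result.observation.score  # 0.0 to 1.0
exact = score == 1.0
print(f"score={score:.2f} exact_match={exact}")
```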
## Dataset Configuration

### Simple Datasets

Use a single dataset with configuration:

```python
result = env.reset(
    dataset_name='leg_counting',
    seed=42,
    size=10
)
```

Available datasets from the reasoning_gym library (see the example after this list):

- `leg_counting` - Count legs of various objects
- `reverse_sort` - Sort lists in reverse order
- `chess_state_eval` - Chess position evaluation
- And more (see the Reasoning Gym documentation)
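For example, switching to `reverse_sort` with a dataset-specific configuration; the `dataset_config` parameter appears later in this README, and the config keys here are borrowed from the composite example below:

```python
# Same reset() API, different dataset; config keys taken from the
# composite example in the next section
result = env.reset(
    dataset_name='reverse_sort',
    dataset_config={"min_length": 3, "max_length": 5},
    seed=42,
    size=10
)
print(result.observation.question)
```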
### Composite Datasets

Mix multiple datasets with custom weights:

```python
result = env.reset(
    dataset_name='composite',
    dataset_specs=[
        {
            "name": "leg_counting",
            "weight": 3,  # 3x more likely
            "config": {}
        },
        {
            "name": "reverse_sort",
            "weight": 1,
            "config": {"min_length": 3, "max_length": 5}
        }
    ],
    seed=42,
    size=20
)
```
### Dataset Persistence

The dataset persists across resets until the configuration changes:

```python
# Create dataset
result = env.reset(dataset_name='leg_counting', seed=42, size=10)
question1 = result.observation.question

# Get next question from the SAME dataset
result = env.reset()  # No params = reuse dataset
question2 = result.observation.question

# Create a NEW dataset (different seed)
result = env.reset(dataset_name='leg_counting', seed=99, size=10)
question3 = result.observation.question  # From the new dataset
```
## Advanced Usage

### Connecting to an Existing Server

If you already have a Reasoning Gym environment server running, you can connect to it directly:
```python
from reasoning_gym_env import ReasoningGymAction, ReasoningGymEnv

# Connect to an existing server
env = ReasoningGymEnv(base_url="http://localhost:8000")

# Use as normal
result = env.reset(
    dataset_name='leg_counting',
    dataset_config={"min_animals": 5, "max_animals": 15},
    seed=42,
    size=10
)
result = env.step(ReasoningGymAction(answer="4"))
```

Note: when connecting to an existing server, `env.close()` will NOT stop the server.
### Using the Context Manager

The client supports context-manager usage for automatic connection management:

```python
from reasoning_gym_env import ReasoningGymAction, ReasoningGymEnv

# Connect with a context manager (auto-connects and closes)
with ReasoningGymEnv(base_url="http://localhost:8000") as env:
    result = env.reset(
        dataset_name='leg_counting',
        seed=42,
        size=5
    )
    print(f"Question: {result.observation.question}")

    # Multiple steps with low latency
    for i in range(5):
        result = env.step(ReasoningGymAction(answer="4"))
        print(f"Score: {result.observation.score}")
        if result.done:
            result = env.reset()  # Get next question
```

The client uses WebSocket connections for:

- Lower latency: no HTTP connection overhead per request
- Persistent session: the server maintains your environment state
- Efficient episodes: better for many sequential steps
### Concurrent WebSocket Sessions

The server is configured to support multiple concurrent WebSocket connections:

```python
from concurrent.futures import ThreadPoolExecutor

from reasoning_gym_env import ReasoningGymAction, ReasoningGymEnv

def run_episode(client_id: int):
    with ReasoningGymEnv(base_url="http://localhost:8000") as env:
        result = env.reset(
            dataset_name='leg_counting',
            seed=42 + client_id,  # Different seed per client
            size=10
        )
        total_score = 0.0
        for i in range(10):
            result = env.step(ReasoningGymAction(answer="4"))
            total_score += result.observation.score
            if result.done:
                result = env.reset()  # Next question
        return client_id, total_score

# Run 4 episodes concurrently
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_episode, range(4)))

for client_id, score in results:
    print(f"Client {client_id}: Total score = {score}")
```
## Development & Testing

### Direct Environment Testing

Test the environment logic directly, without starting the HTTP server:

```bash
# From the reasoning_gym_env directory
python3 server/reasoning_gym_environment.py
```

This verifies that:

- The environment resets correctly
- Step executes actions properly
- Dataset creation and iteration work
- Scoring is calculated correctly
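If you want the same checks inside your own test suite, a direct import can work. Everything below is a sketch: the class name `ReasoningGymEnvironment` and its method signatures are guesses, since this README does not show what `server/reasoning_gym_environment.py` actually exports:

```python
# Hypothetical direct test; check server/reasoning_gym_environment.py
# for the real class name and signatures before relying on this
from server.reasoning_gym_environment import ReasoningGymEnvironment

env = ReasoningGymEnvironment()
result = env.reset(dataset_name='leg_counting', seed=42, size=5)
assert result.observation.question  # reset() should yield a question
```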
### Running Locally

Run the server locally for development:

```bash
# From the reasoning_gym_env directory
uvicorn server.app:app --reload
```

Then test with:

```python
from reasoning_gym_env import ReasoningGymEnv

env = ReasoningGymEnv(base_url="http://localhost:8000")
result = env.reset(dataset_name='leg_counting', seed=42, size=5)
print(result.observation.question)
```
## Project Structure

```text
reasoning_gym_env/
├── .dockerignore                      # Docker build exclusions
├── __init__.py                        # Module exports
├── README.md                          # This file
├── openenv.yaml                       # OpenEnv manifest
├── pyproject.toml                     # Project metadata and dependencies
├── client.py                          # ReasoningGymEnv client
├── models.py                          # Action and Observation models
└── server/
    ├── __init__.py                    # Server module exports
    ├── reasoning_gym_environment.py   # Core environment logic
    ├── app.py                         # FastAPI application (HTTP + WebSocket endpoints)
    ├── requirements.txt               # Server dependencies
    └── Dockerfile                     # Container image definition
```
## Use Cases
The Reasoning Gym environment is ideal for:
- LLM Evaluation: Benchmark reasoning capabilities across diverse tasks
- Agent Training: Train RL agents on reasoning tasks with clear score signals
- Curriculum Learning: Mix datasets with composite configurations for progressive difficulty
- Research: Reproducible environments with seed control for scientific experiments
- Educational: Learn about environment design and agent-environment interaction