# RL Training with OpenEnv: 2048 Game

This tutorial covers training a language model to play the 2048 game using reinforcement learning with GRPO (Group Relative Policy Optimization).

```{note}
**Time**: ~45 minutes | **Difficulty**: Advanced | **GPU Required**: Yes (T4 or better)
```

## What You'll Learn

- **Model Setup**: Load and configure LLMs with Unsloth for efficient RL
- **Environment Connection**: Connect to the 2048 OpenEnv environment
- **Reward Design**: Create effective reward functions
- **GRPO Training**: Train models with reinforcement learning
- **Deployment**: Save and deploy trained models

## Prerequisites

Before starting this tutorial, you should have completed the [Getting Started](/auto_getting_started/index) series to understand:

- How OpenEnv environments work
- The reset/step/state API pattern
- How to connect to environments

You'll also need:

- A GPU (a free T4 on Google Colab works)
- A basic understanding of PyTorch
- ~30 minutes for training

## Part 1: Environment Setup

### Installation

```bash
# Install required packages (prefix each command with ! inside a notebook)
pip install -q unsloth openenv-core trl

# For Google Colab, also run:
pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
```

### Imports

```python
import torch
from dataclasses import dataclass
from typing import List, Optional

# Check GPU availability
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
```

## Part 2: Model Configuration

We use Unsloth for memory-efficient training with LoRA adapters.

### Configuration Classes

```python
@dataclass
class ModelConfig:
    """Configuration for loading LLM models."""
    model_name: str = "unsloth/Qwen2.5-1.5B"
    max_seq_length: int = 768
    load_in_4bit: bool = True
    dtype: Optional[str] = None  # Auto-detect


@dataclass
class LoRAConfig:
    """Configuration for LoRA fine-tuning."""
    r: int = 16
    lora_alpha: int = 32
    target_modules: Optional[List[str]] = None
    lora_dropout: float = 0.0

    def __post_init__(self):
        if self.target_modules is None:
            self.target_modules = [
                "q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj",
            ]
```

### Loading the Model

```python
from unsloth import FastLanguageModel

# Create configurations
model_config = ModelConfig()
lora_config = LoRAConfig()

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_config.model_name,
    max_seq_length=model_config.max_seq_length,
    load_in_4bit=model_config.load_in_4bit,
    dtype=model_config.dtype,
)

# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_config.r,
    target_modules=lora_config.target_modules,
    lora_alpha=lora_config.lora_alpha,
    lora_dropout=lora_config.lora_dropout,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Check parameter counts
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({trainable/total*100:.2f}%)")
```

## Part 3: The 2048 Environment

### Game Overview

2048 is a sliding puzzle game played on a 4x4 grid. Each move slides every tile in one direction; when two tiles of equal value collide, they merge into a single tile of twice the value, and a new tile spawns in a random empty cell.

**Actions:**

- `0` = UP
- `1` = RIGHT
- `2` = DOWN
- `3` = LEFT

**Goal:** Create a tile with value 2048 (or higher!)
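If you haven't played 2048, the merge rule is the only subtle part. Here is a minimal, self-contained sketch of what a single LEFT slide does to one row; it is purely illustrative, since the environment implements the real game server-side:

```python
def slide_row_left(row):
    """Slide one row to the left, merging equal neighbors once per move."""
    tiles = [v for v in row if v != 0]          # compact the non-empty tiles
    merged = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)         # an equal pair merges into one tile
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))  # pad with empty cells

print(slide_row_left([2, 2, 4, 0]))  # -> [4, 4, 0, 0]
```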
### Connecting to the Environment

```python
from envs.openspiel_env import OpenSpielEnv, OpenSpielAction

# Connect to the 2048 environment
# Option 1: From the Hub
env = OpenSpielEnv.from_hub("openenv/openspiel-env")

# Option 2: From a running server
# env = OpenSpielEnv(base_url="http://localhost:8000")

# Test the connection
with env:
    result = env.reset()
    print("Game started!")
    print(f"Legal actions: {result.observation.legal_actions}")

    # Take a test action
    action = OpenSpielAction(action_id=0, game_name="2048")
    result = env.step(action)
    print(f"After UP: reward={result.reward}, done={result.done}")
```

### Board Utilities

```python
import numpy as np
from typing import List


def info_state_to_board(info_state: List[int], size: int = 4) -> List[List[int]]:
    """Convert a flat info_state to a 2D board."""
    return np.array(info_state, dtype=int).reshape(size, size).tolist()


def render_board(board: List[List[int]]) -> str:
    """Render the board as an ASCII string."""
    lines = ["+------" * len(board[0]) + "+"]
    for row in board:
        cells = [f"{v:5d}" if v > 0 else "    ." for v in row]
        lines.append("|" + " |".join(cells) + " |")
        lines.append("+------" * len(row) + "+")
    return "\n".join(lines)


def get_max_tile(board: List[List[int]]) -> int:
    """Get the highest tile value."""
    return max(cell for row in board for cell in row)
```

## Part 4: Reward Function Design

The reward function is crucial in RL: it is the only training signal the model receives. Ours weighs three factors:

1. **Success**: Did we reach 2048?
2. **Progress**: What's the highest tile achieved?
3. **Code Quality**: Did the generated code execute correctly?

### Reward Implementation

```python
import math


def calculate_reward(
    max_tile: int,
    success: bool,
    code_error: bool = False
) -> float:
    """
    Calculate the reward for a 2048 game outcome.

    Args:
        max_tile: Highest tile achieved (2, 4, 8, ..., 2048)
        success: Whether we reached 2048
        code_error: Whether the generated code had errors

    Returns:
        Float reward value
    """
    if code_error:
        return -0.5  # Penalty for invalid code

    if success:
        return 1.0  # Full reward for winning

    # Progress reward: log scale from 0 to 0.9
    if max_tile > 0:
        progress = math.log2(max_tile) / math.log2(2048)
        return min(0.9, progress)

    return 0.0


# Test the reward function
test_cases = [
    (2048, True, False, "Won!"),
    (1024, False, False, "Got to 1024"),
    (512, False, False, "Got to 512"),
    (64, False, False, "Early game"),
]

for max_tile, success, error, desc in test_cases:
    reward = calculate_reward(max_tile, success, error)
    print(f"{desc:20s} -> Reward: {reward:+.3f}")
```

## Part 5: Strategy Generation

We'll train the model to generate Python strategy functions: given the current board, the model writes a `strategy` function, and we execute it to pick the next move.

### Prompt Template

````python
SYSTEM_PROMPT = """You are an expert at playing 2048.
Generate a Python function that takes a board state and returns
the best action (0=UP, 1=RIGHT, 2=DOWN, 3=LEFT).

The board is a 4x4 list of integers. Empty cells are 0.
Your function should analyze the board and return an optimal move.
"""


def create_prompt(board: List[List[int]]) -> str:
    """Create a prompt for strategy generation."""
    board_str = "\n".join(str(row) for row in board)
    return f"""{SYSTEM_PROMPT}

Current board:
{board_str}

Generate a strategy function:
```python
def strategy(board):
    # Your code here
    return action  # 0, 1, 2, or 3
```"""
````
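Before training, it helps to print one rendered prompt so you can see exactly what the model conditions on. A quick sanity check with a made-up board (the tile values below are illustrative, not from a real game):

```python
sample_board = [
    [2, 0, 0, 0],
    [4, 2, 0, 0],
    [8, 4, 2, 0],
    [16, 8, 4, 2],
]

print(create_prompt(sample_board))
```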
### Executing Generated Strategies

````python
import ast


def extract_and_execute_strategy(
    generated_code: str,
    board: List[List[int]],
) -> tuple[int, bool]:
    """
    Extract and execute a generated strategy function.

    Returns:
        (action, success): The action to take and whether execution succeeded
    """
    try:
        # Extract the code block
        if "```python" in generated_code:
            code = generated_code.split("```python")[1].split("```")[0]
        else:
            code = generated_code

        # Parse and validate the AST
        tree = ast.parse(code)

        # Execute in an isolated namespace (timeouts and memory limits
        # are added in "Preventing Reward Hacking" below)
        namespace = {"board": board}
        exec(compile(tree, "<strategy>", "exec"), namespace)

        # Call the strategy function
        if "strategy" in namespace:
            action = namespace["strategy"](board)
            if action in [0, 1, 2, 3]:
                return action, True

        return 0, False  # Default action on failure

    except Exception as e:
        print(f"Strategy execution error: {e}")
        return 0, False
````

## Part 6: GRPO Training

GRPO (Group Relative Policy Optimization) is a policy-gradient method designed for language models: for each prompt it samples a group of completions and scores each one relative to the group's mean reward, so no separate value model is needed.

### Training Configuration

```python
from trl import GRPOConfig, GRPOTrainer

grpo_config = GRPOConfig(
    # Learning rate
    learning_rate=2e-6,

    # Batch sizes
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,

    # Training duration
    max_steps=200,

    # Memory optimization
    bf16=True,
    gradient_checkpointing=True,

    # Logging
    logging_steps=1,
    output_dir="./2048_grpo_output",
    report_to="none",
)
```

### Training Loop

```python
def train_2048_agent(
    model,
    tokenizer,
    env,
    config: GRPOConfig,
    num_episodes: int = 100,
):
    """
    Roll out 2048 episodes and collect (prompt, response, reward)
    triples to train on with GRPO.
    """
    # Prepare the model for training
    FastLanguageModel.for_training(model)

    training_data = []

    for episode in range(num_episodes):
        # Reset the environment
        result = env.reset()
        board = info_state_to_board(result.observation.info_state)

        episode_reward = 0
        steps = 0

        while not result.done and steps < 1000:
            # Generate a strategy for the current board
            prompt = create_prompt(board)
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                temperature=0.7,
                do_sample=True,
            )
            generated = tokenizer.decode(outputs[0], skip_special_tokens=True)

            # Execute the strategy
            action, _ = extract_and_execute_strategy(generated, board)

            # Take the action in the environment
            env_action = OpenSpielAction(action_id=action, game_name="2048")
            result = env.step(env_action)

            # Update the board
            board = info_state_to_board(result.observation.info_state)
            episode_reward += result.reward if result.reward else 0
            steps += 1

        # Calculate the final reward for the episode
        max_tile = get_max_tile(board)
        final_reward = calculate_reward(max_tile, max_tile >= 2048)

        # Store the last prompt/response pair with the episode-level reward
        training_data.append({
            "prompt": prompt,
            "response": generated,
            "reward": final_reward,
        })

        if episode % 10 == 0:
            print(f"Episode {episode}: Max tile={max_tile}, "
                  f"Reward={final_reward:.3f}, Cumulative={episode_reward:.1f}")

    return training_data
```
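`train_2048_agent` collects scored rollouts, but the gradient step itself comes from TRL's `GRPOTrainer`. The sketch below shows one plausible wiring; it assumes a recent TRL version (where `GRPOTrainer` accepts `reward_funcs` and `processing_class`), reuses the `env`, `model`, `tokenizer`, and `grpo_config` objects defined earlier, and uses `strategy_reward`, a hypothetical helper that only checks that a sampled strategy executes rather than playing out full games:

```python
from datasets import Dataset
from trl import GRPOTrainer


def strategy_reward(prompts, completions, **kwargs):
    """Score each sampled strategy by executing it on a fresh board."""
    rewards = []
    for completion in completions:
        result = env.reset()
        board = info_state_to_board(result.observation.info_state)
        _, ok = extract_and_execute_strategy(completion, board)
        # Validity-based shaping only; a fuller setup would roll out the
        # whole game and score it with calculate_reward.
        rewards.append(0.1 if ok else -0.5)
    return rewards


# One prompt per starting position; GRPO samples a group of completions
# per prompt, so keep the effective batch size divisible by num_generations.
start = env.reset()
start_board = info_state_to_board(start.observation.info_state)
train_dataset = Dataset.from_dict({"prompt": [create_prompt(start_board)]})

trainer = GRPOTrainer(
    model=model,
    reward_funcs=strategy_reward,
    args=grpo_config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```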
## Part 7: Deployment

After training, save and deploy your model.

### Saving the Model

```python
# Save the LoRA adapters only
model.save_pretrained("./2048_strategy_model")
tokenizer.save_pretrained("./2048_strategy_model")

# Save a merged model for inference
model.save_pretrained_merged(
    "./2048_strategy_model_merged",
    tokenizer,
    save_method="merged_16bit",
)
```

### Push to Hugging Face Hub

```python
# Push the merged model to the Hub
model.push_to_hub_merged(
    "your-username/2048-strategy-model",
    tokenizer,
    save_method="merged_16bit",
    private=False,
)

print("Model deployed to: huggingface.co/your-username/2048-strategy-model")
```

### Using the Trained Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the trained model
model = AutoModelForCausalLM.from_pretrained("your-username/2048-strategy-model")
tokenizer = AutoTokenizer.from_pretrained("your-username/2048-strategy-model")


# Generate a strategy
def get_action(board: List[List[int]]) -> int:
    prompt = create_prompt(board)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256)
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    action, _ = extract_and_execute_strategy(generated, board)
    return action


# Play a game
with OpenSpielEnv.from_hub("openenv/openspiel-env") as env:
    result = env.reset()
    board = info_state_to_board(result.observation.info_state)

    while not result.done:
        action = get_action(board)
        result = env.step(OpenSpielAction(action_id=action, game_name="2048"))
        board = info_state_to_board(result.observation.info_state)

    print(f"Final max tile: {get_max_tile(board)}")
```

## Preventing Reward Hacking

Because the model writes arbitrary Python, be aware of potential reward hacking strategies:

1. **Code that modifies rewards** - Run it in a sandboxed environment
2. **Infinite loops** - Set execution timeouts
3. **Memory exhaustion** - Limit resource usage

```python
import resource
import signal


def safe_execute(code: str, board: List[List[int]], timeout: float = 5.0) -> int:
    """Execute a strategy with safety limits (POSIX-only: uses signal/resource)."""

    def handler(signum, frame):
        raise TimeoutError("Strategy timed out")

    # Set a timeout via SIGALRM (main thread only)
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(int(timeout))

    try:
        # Cap the address space at 100 MB; note this limit persists for the
        # whole process, so production setups run strategies in a subprocess
        resource.setrlimit(resource.RLIMIT_AS, (100 * 1024 * 1024, resource.RLIM_INFINITY))

        # Execute in a restricted namespace with a minimal set of builtins
        namespace = {"board": board, "__builtins__": {"len": len, "max": max, "min": min}}
        exec(code, namespace)
        return namespace.get("strategy", lambda b: 0)(board)

    finally:
        signal.alarm(0)
```

## Summary

In this tutorial, you learned:

1. **Model Setup**: Loading LLMs with Unsloth and LoRA
2. **Environment Connection**: Using OpenEnv's 2048 environment
3. **Reward Design**: Creating balanced reward functions
4. **GRPO Training**: Training with reinforcement learning
5. **Deployment**: Saving and sharing trained models

## Next Steps

- Try different model architectures
- Experiment with reward function designs
- Train on other OpenEnv environments
- Share your trained models on Hugging Face Hub!

## Related Resources

- [OpenEnv Getting Started](../auto_getting_started/index)
- [Building Custom Environments](../auto_getting_started/plot_03_building_environments)
- [GRPO Documentation](https://huggingface.co/docs/trl/grpo_trainer)
- [Unsloth Documentation](https://github.com/unslothai/unsloth)