# RL Training with OpenEnv: 2048 Game

This tutorial covers training a language model to play the 2048 game using reinforcement learning with GRPO (Group Relative Policy Optimization).

```{note}
**Time**: ~45 minutes | **Difficulty**: Advanced | **GPU Required**: Yes (T4 or better)
```

## What You'll Learn

- **Model Setup**: Load and configure LLMs with Unsloth for efficient RL
- **Environment Connection**: Connect to the 2048 OpenEnv environment
- **Reward Design**: Create effective reward functions
- **GRPO Training**: Train models with reinforcement learning
- **Deployment**: Save and deploy trained models

## Prerequisites

Before starting this tutorial, you should have completed the [Getting Started](/auto_getting_started/index) series to understand:

- How OpenEnv environments work
- The reset/step/state API pattern
- How to connect to environments

You'll also need:

- A GPU (a free T4 on Google Colab works)
- A basic understanding of PyTorch
- ~30 minutes for training

## Part 1: Environment Setup

### Installation

```bash
# Install required packages (prefix each command with ! inside a notebook)
pip install -q unsloth openenv-core trl

# For Google Colab, also run:
pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
```

### Imports

```python
import torch
from dataclasses import dataclass
from typing import List, Optional

# Check GPU availability
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
```

## Part 2: Model Configuration

We use Unsloth for memory-efficient training with LoRA adapters.

### Configuration Classes

```python
@dataclass
class ModelConfig:
    """Configuration for loading LLM models."""
    model_name: str = "unsloth/Qwen2.5-1.5B"
    max_seq_length: int = 768
    load_in_4bit: bool = True
    dtype: Optional[str] = None  # Auto-detect


@dataclass
class LoRAConfig:
    """Configuration for LoRA fine-tuning."""
    r: int = 16
    lora_alpha: int = 32
    target_modules: Optional[List[str]] = None
    lora_dropout: float = 0.0

    def __post_init__(self):
        if self.target_modules is None:
            self.target_modules = [
                "q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj",
            ]
```

### Loading the Model

```python
from unsloth import FastLanguageModel

# Create configurations
model_config = ModelConfig()
lora_config = LoRAConfig()

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_config.model_name,
    max_seq_length=model_config.max_seq_length,
    load_in_4bit=model_config.load_in_4bit,
    dtype=model_config.dtype,
)

# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_config.r,
    target_modules=lora_config.target_modules,
    lora_alpha=lora_config.lora_alpha,
    lora_dropout=lora_config.lora_dropout,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Check parameter counts
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({trainable/total*100:.2f}%)")
```

## Part 3: The 2048 Environment

### Game Overview

2048 is a sliding puzzle game played on a 4x4 grid. Each move slides every tile in one direction; when two tiles of equal value collide, they merge into a single tile of twice the value, and a new tile spawns in a random empty cell.

**Actions:**

- `0` = UP
- `1` = RIGHT
- `2` = DOWN
- `3` = LEFT

**Goal:** Create a tile with value 2048 (or higher!)
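If you haven't played 2048, the merge rule is the only subtle part. Here is a minimal, self-contained sketch of what a single LEFT slide does to one row; it is purely illustrative, since the environment implements the real game server-side:

```python
def slide_row_left(row):
    """Slide one row to the left, merging equal neighbors once per move."""
    tiles = [v for v in row if v != 0]          # compact the non-empty tiles
    merged = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)         # an equal pair merges into one tile
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))  # pad with empty cells

print(slide_row_left([2, 2, 4, 0]))  # -> [4, 4, 0, 0]
```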
### Connecting to the Environment

```python
from envs.openspiel_env import OpenSpielEnv, OpenSpielAction

# Connect to the 2048 environment
# Option 1: From the Hub
env = OpenSpielEnv.from_hub("openenv/openspiel-env")

# Option 2: From a running server
# env = OpenSpielEnv(base_url="http://localhost:8000")

# Test the connection
with env:
    result = env.reset()
    print("Game started!")
    print(f"Legal actions: {result.observation.legal_actions}")

    # Take a test action
    action = OpenSpielAction(action_id=0, game_name="2048")
    result = env.step(action)
    print(f"After UP: reward={result.reward}, done={result.done}")
```

### Board Utilities

```python
import numpy as np
from typing import List


def info_state_to_board(info_state: List[int], size: int = 4) -> List[List[int]]:
    """Convert a flat info_state to a 2D board."""
    return np.array(info_state, dtype=int).reshape(size, size).tolist()


def render_board(board: List[List[int]]) -> str:
    """Render the board as an ASCII string."""
    lines = ["+------" * len(board[0]) + "+"]
    for row in board:
        cells = [f"{v:5d}" if v > 0 else "    ." for v in row]
        lines.append("|" + " |".join(cells) + " |")
        lines.append("+------" * len(row) + "+")
    return "\n".join(lines)


def get_max_tile(board: List[List[int]]) -> int:
    """Get the highest tile value."""
    return max(cell for row in board for cell in row)
```

## Part 4: Reward Function Design

The reward function is crucial in RL: it is the only training signal the model receives. Ours weighs three factors:

1. **Success**: Did we reach 2048?
2. **Progress**: What's the highest tile achieved?
3. **Code Quality**: Did the generated code execute correctly?

### Reward Implementation

```python
import math


def calculate_reward(
    max_tile: int,
    success: bool,
    code_error: bool = False
) -> float:
    """
    Calculate the reward for a 2048 game outcome.

    Args:
        max_tile: Highest tile achieved (2, 4, 8, ..., 2048)
        success: Whether we reached 2048
        code_error: Whether the generated code had errors

    Returns:
        Float reward value
    """
    if code_error:
        return -0.5  # Penalty for invalid code

    if success:
        return 1.0  # Full reward for winning

    # Progress reward: log scale from 0 to 0.9
    if max_tile > 0:
        progress = math.log2(max_tile) / math.log2(2048)
        return min(0.9, progress)

    return 0.0


# Test the reward function
test_cases = [
    (2048, True, False, "Won!"),
    (1024, False, False, "Got to 1024"),
    (512, False, False, "Got to 512"),
    (64, False, False, "Early game"),
]

for max_tile, success, error, desc in test_cases:
    reward = calculate_reward(max_tile, success, error)
    print(f"{desc:20s} -> Reward: {reward:+.3f}")
```

## Part 5: Strategy Generation

We'll train the model to generate Python strategy functions: given the current board, the model writes a `strategy` function, and we execute it to pick the next move.

### Prompt Template

````python
SYSTEM_PROMPT = """You are an expert at playing 2048.
Generate a Python function that takes a board state and returns
the best action (0=UP, 1=RIGHT, 2=DOWN, 3=LEFT).

The board is a 4x4 list of integers. Empty cells are 0.
Your function should analyze the board and return an optimal move.
"""


def create_prompt(board: List[List[int]]) -> str:
    """Create a prompt for strategy generation."""
    board_str = "\n".join(str(row) for row in board)
    return f"""{SYSTEM_PROMPT}

Current board:
{board_str}

Generate a strategy function:
```python
def strategy(board):
    # Your code here
    return action  # 0, 1, 2, or 3
```"""
````
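Before training, it helps to print one rendered prompt so you can see exactly what the model conditions on. A quick sanity check with a made-up board (the tile values below are illustrative, not from a real game):

```python
sample_board = [
    [2, 0, 0, 0],
    [4, 2, 0, 0],
    [8, 4, 2, 0],
    [16, 8, 4, 2],
]

print(create_prompt(sample_board))
```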
### Executing Generated Strategies

````python
import ast


def extract_and_execute_strategy(
    generated_code: str,
    board: List[List[int]],
) -> tuple[int, bool]:
    """
    Extract and execute a generated strategy function.

    Returns:
        (action, success): The action to take and whether execution succeeded
    """
    try:
        # Extract the code block
        if "```python" in generated_code:
            code = generated_code.split("```python")[1].split("```")[0]
        else:
            code = generated_code

        # Parse and validate the AST
        tree = ast.parse(code)

        # Execute in an isolated namespace (timeouts and memory limits
        # are added in "Preventing Reward Hacking" below)
        namespace = {"board": board}
        exec(compile(tree, "<strategy>", "exec"), namespace)

        # Call the strategy function
        if "strategy" in namespace:
            action = namespace["strategy"](board)
            if action in [0, 1, 2, 3]:
                return action, True

        return 0, False  # Default action on failure

    except Exception as e:
        print(f"Strategy execution error: {e}")
        return 0, False
````

## Part 6: GRPO Training

GRPO (Group Relative Policy Optimization) is a policy-gradient method designed for language models: for each prompt it samples a group of completions and scores each one relative to the group's mean reward, so no separate value model is needed.

### Training Configuration

```python
from trl import GRPOConfig, GRPOTrainer

grpo_config = GRPOConfig(
    # Learning rate
    learning_rate=2e-6,

    # Batch sizes
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,

    # Training duration
    max_steps=200,

    # Memory optimization
    bf16=True,
    gradient_checkpointing=True,

    # Logging
    logging_steps=1,
    output_dir="./2048_grpo_output",
    report_to="none",
)
```

### Training Loop

```python
def train_2048_agent(
    model,
    tokenizer,
    env,
    config: GRPOConfig,
    num_episodes: int = 100,
):
    """
    Roll out 2048 episodes and collect (prompt, response, reward)
    triples to train on with GRPO.
    """
    # Prepare the model for training
    FastLanguageModel.for_training(model)

    training_data = []

    for episode in range(num_episodes):
        # Reset the environment
        result = env.reset()
        board = info_state_to_board(result.observation.info_state)

        episode_reward = 0
        steps = 0

        while not result.done and steps < 1000:
            # Generate a strategy for the current board
            prompt = create_prompt(board)
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                temperature=0.7,
                do_sample=True,
            )
            generated = tokenizer.decode(outputs[0], skip_special_tokens=True)

            # Execute the strategy
            action, _ = extract_and_execute_strategy(generated, board)

            # Take the action in the environment
            env_action = OpenSpielAction(action_id=action, game_name="2048")
            result = env.step(env_action)

            # Update the board
            board = info_state_to_board(result.observation.info_state)
            episode_reward += result.reward if result.reward else 0
            steps += 1

        # Calculate the final reward for the episode
        max_tile = get_max_tile(board)
        final_reward = calculate_reward(max_tile, max_tile >= 2048)

        # Store the last prompt/response pair with the episode-level reward
        training_data.append({
            "prompt": prompt,
            "response": generated,
            "reward": final_reward,
        })

        if episode % 10 == 0:
            print(f"Episode {episode}: Max tile={max_tile}, "
                  f"Reward={final_reward:.3f}, Cumulative={episode_reward:.1f}")

    return training_data
```
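`train_2048_agent` collects scored rollouts, but the gradient step itself comes from TRL's `GRPOTrainer`. The sketch below shows one plausible wiring; it assumes a recent TRL version (where `GRPOTrainer` accepts `reward_funcs` and `processing_class`), reuses the `env`, `model`, `tokenizer`, and `grpo_config` objects defined earlier, and uses `strategy_reward`, a hypothetical helper that only checks that a sampled strategy executes rather than playing out full games:

```python
from datasets import Dataset
from trl import GRPOTrainer


def strategy_reward(prompts, completions, **kwargs):
    """Score each sampled strategy by executing it on a fresh board."""
    rewards = []
    for completion in completions:
        result = env.reset()
        board = info_state_to_board(result.observation.info_state)
        _, ok = extract_and_execute_strategy(completion, board)
        # Validity-based shaping only; a fuller setup would roll out the
        # whole game and score it with calculate_reward.
        rewards.append(0.1 if ok else -0.5)
    return rewards


# One prompt per starting position; GRPO samples a group of completions
# per prompt, so keep the effective batch size divisible by num_generations.
start = env.reset()
start_board = info_state_to_board(start.observation.info_state)
train_dataset = Dataset.from_dict({"prompt": [create_prompt(start_board)]})

trainer = GRPOTrainer(
    model=model,
    reward_funcs=strategy_reward,
    args=grpo_config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```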
## Part 7: Deployment

After training, save and deploy your model.

### Saving the Model

```python
# Save the LoRA adapters only
model.save_pretrained("./2048_strategy_model")
tokenizer.save_pretrained("./2048_strategy_model")

# Save a merged model for inference
model.save_pretrained_merged(
    "./2048_strategy_model_merged",
    tokenizer,
    save_method="merged_16bit",
)
```

### Push to Hugging Face Hub

```python
# Push the merged model to the Hub
model.push_to_hub_merged(
    "your-username/2048-strategy-model",
    tokenizer,
    save_method="merged_16bit",
    private=False,
)

print("Model deployed to: huggingface.co/your-username/2048-strategy-model")
```

### Using the Trained Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the trained model
model = AutoModelForCausalLM.from_pretrained("your-username/2048-strategy-model")
tokenizer = AutoTokenizer.from_pretrained("your-username/2048-strategy-model")


# Generate a strategy
def get_action(board: List[List[int]]) -> int:
    prompt = create_prompt(board)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256)
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    action, _ = extract_and_execute_strategy(generated, board)
    return action


# Play a game
with OpenSpielEnv.from_hub("openenv/openspiel-env") as env:
    result = env.reset()
    board = info_state_to_board(result.observation.info_state)

    while not result.done:
        action = get_action(board)
        result = env.step(OpenSpielAction(action_id=action, game_name="2048"))
        board = info_state_to_board(result.observation.info_state)

    print(f"Final max tile: {get_max_tile(board)}")
```

## Preventing Reward Hacking

Because the model writes arbitrary Python, be aware of potential reward hacking strategies:

1. **Code that modifies rewards** - Run it in a sandboxed environment
2. **Infinite loops** - Set execution timeouts
3. **Memory exhaustion** - Limit resource usage

```python
import resource
import signal


def safe_execute(code: str, board: List[List[int]], timeout: float = 5.0) -> int:
    """Execute a strategy with safety limits (POSIX-only: uses signal/resource)."""

    def handler(signum, frame):
        raise TimeoutError("Strategy timed out")

    # Set a timeout via SIGALRM (main thread only)
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(int(timeout))

    try:
        # Cap the address space at 100 MB; note this limit persists for the
        # whole process, so production setups run strategies in a subprocess
        resource.setrlimit(resource.RLIMIT_AS, (100 * 1024 * 1024, resource.RLIM_INFINITY))

        # Execute in a restricted namespace with a minimal set of builtins
        namespace = {"board": board, "__builtins__": {"len": len, "max": max, "min": min}}
        exec(code, namespace)
        return namespace.get("strategy", lambda b: 0)(board)

    finally:
        signal.alarm(0)
```

## Summary

In this tutorial, you learned:

1. **Model Setup**: Loading LLMs with Unsloth and LoRA
2. **Environment Connection**: Using OpenEnv's 2048 environment
3. **Reward Design**: Creating balanced reward functions
4. **GRPO Training**: Training with reinforcement learning
5. **Deployment**: Saving and sharing trained models

## Next Steps

- Try different model architectures
- Experiment with reward function designs
- Train on other OpenEnv environments
- Share your trained models on Hugging Face Hub!

## Related Resources

- [OpenEnv Getting Started](../auto_getting_started/index)
- [Building Custom Environments](../auto_getting_started/plot_03_building_environments)
- [GRPO Documentation](https://huggingface.co/docs/trl/grpo_trainer)
- [Unsloth Documentation](https://github.com/unslothai/unsloth)