Generator#
The Generator (Policy) is the core inference engine in TorchForge, built on top of vLLM. It manages model serving, text generation, and weight updates for reinforcement learning workflows.
Generator#
- class forge.actors.generator.Generator(engine_args=<factory>, sampling_params=<factory>, prefetch_weights_to_shm=True, n_fetcher_procs=8)[source]#
vLLM-based generator using AsyncLLM with Monarch distributed execution.
Wraps vLLM’s AsyncLLM engine and uses MonarchExecutor for multi-GPU inference. See MonarchExecutor docstring for architecture diagram.
- Parameters:
engine_args – vLLM EngineArgs for model configuration. Accepts an EngineArgs instance or a dict of the same fields.
sampling_params – Default SamplingParams for generation. Accepts a SamplingParams instance or a dict of the same fields.
prefetch_weights_to_shm – Whether to prefetch weights to shared memory for faster weight updates. When enabled, weight fetchers download weights in parallel to shared memory while generation is still running. Defaults to True.
n_fetcher_procs – Number of fetcher processes for parallel weight downloading. Only used when prefetch_weights_to_shm is True. Defaults to 8.
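Since both configuration parameters accept plain dicts, a minimal sketch of the dict-based style looks like the following (the field names mirror vLLM's EngineArgs and SamplingParams; the specific model id and values here are illustrative assumptions, not requirements):

```python
# Dict-based configuration for Generator (a sketch; keys follow vLLM's
# EngineArgs / SamplingParams field names).
engine_args = {
    "model": "meta-llama/Llama-3-8B",  # Hugging Face model id to serve
    "tensor_parallel_size": 2,         # shard the model across 2 GPUs
}
sampling_params = {
    "max_tokens": 128,    # cap on generated tokens per request
    "temperature": 0.7,   # softmax temperature for sampling
}
```

Equivalently, `vllm.EngineArgs` and `vllm.SamplingParams` instances with the same fields can be passed directly.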
Example
>>> generator = await Generator.options(procs=1, with_gpus=True).as_service(
...     engine_args={"model": "meta-llama/Llama-3-8B", "tensor_parallel_size": 2},
...     sampling_params={"max_tokens": 128, "temperature": 0.7},
... )
>>> completions = await generator.generate("Tell me a joke")
>>> await generator.shutdown()
- engine_args#
- generate#
- n_fetcher_procs = 8#
- prefetch_weights_to_shm = True#
- sampling_params#
- save_model_params#
- setup#
- async classmethod shutdown(actor)[source]#
Shutdown the generator and cleanup all resources.
Cleanup order:
1. Stop AsyncLLM (triggers MonarchExecutor.shutdown(), which destroys process groups and stops proc_mesh)
2. Stop generator_proc
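Because shutdown releases GPUs and process meshes, it should run even when generation raises. A minimal sketch of that try/finally pattern, using a hypothetical stand-in class rather than a real Generator handle:

```python
import asyncio

class FakeGenerator:
    """Stand-in for a Generator actor handle (hypothetical, for illustration)."""
    def __init__(self):
        self.stopped = False

    async def generate(self, prompt: str) -> str:
        return f"completion for {prompt!r}"

    async def shutdown(self):
        # In TorchForge this stops AsyncLLM first, then generator_proc.
        self.stopped = True

async def main() -> bool:
    gen = FakeGenerator()
    try:
        await gen.generate("Tell me a joke")
    finally:
        # Always release resources, even if generation failed.
        await gen.shutdown()
    return gen.stopped

stopped = asyncio.run(main())
```

The same guard applies to a real Generator service handle: wrap the generation calls and invoke `shutdown()` in the `finally` block.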
- stop#
- update_weights#
- validate_model_params#