Batched reward model inference and Best-of-N sampling
Posted by retrolione | LocalLLaMA
Quick blog post on reward model inference with dynamic batching (for LLM-as-a-judge, Best-of-N sampling, preference tuning, and other RL use cases).
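To make the use case concrete, here is a minimal sketch of the Best-of-N side: score N candidate responses for one prompt in a single padded batch with a reward model, then keep the highest-scoring one. It assumes a HuggingFace-style sequence-classification reward model; the model name is a placeholder, not necessarily the one used in the blog post.

```python
# Sketch: batched reward scoring + Best-of-N selection.
# Assumes a sequence-classification reward model that scores
# (prompt, response) pairs with a single scalar logit.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # placeholder choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def best_of_n(prompt: str, candidates: list[str]) -> tuple[str, float]:
    """Score all (prompt, candidate) pairs as one padded batch and
    return the candidate with the highest scalar reward."""
    inputs = tokenizer(
        [prompt] * len(candidates),   # repeat the prompt for each candidate
        candidates,                   # encoded as the paired second segment
        padding=True,                 # pad to the longest pair in the batch
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        rewards = model(**inputs).logits.squeeze(-1)  # shape: (N,)
    best = int(torch.argmax(rewards))
    return candidates[best], rewards[best].item()

prompt = "Explain dynamic batching in one sentence."
candidates = [
    "Dynamic batching groups concurrent requests into one forward pass.",
    "It is a kind of sampling.",
    "Batching means using a bigger model.",
]
answer, score = best_of_n(prompt, candidates)
print(f"best (reward={score:.3f}): {answer}")
```

Note this only shows static batching of a fixed candidate set; the dynamic batching the post refers to would additionally collect requests arriving from many concurrent callers (e.g. off a queue, up to a max batch size or timeout) before each forward pass, which is what keeps GPU utilization high under real serving load.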