Batched reward model inference and Best-of-N sampling
Posted by retrolione | LocalLLaMA
Quick blog post on reward model inference with dynamic batching (for LLM-as-a-judge, Best-of-N sampling, preference tuning, and other RL use cases).
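To make the use case concrete, here is a minimal sketch of the Best-of-N side: score N candidate responses for one prompt in a single padded batch with a reward model, then keep the highest-scoring one. It assumes a HuggingFace-style sequence-classification reward model; the model name is a placeholder, not necessarily the one used in the blog post.

```python
# Sketch: batched reward scoring + Best-of-N selection.
# Assumes a sequence-classification reward model that scores
# (prompt, response) pairs with a single scalar logit.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # placeholder choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def best_of_n(prompt: str, candidates: list[str]) -> tuple[str, float]:
    """Score all (prompt, candidate) pairs as one padded batch and
    return the candidate with the highest scalar reward."""
    inputs = tokenizer(
        [prompt] * len(candidates),   # repeat the prompt for each candidate
        candidates,                   # encoded as the paired second segment
        padding=True,                 # pad to the longest pair in the batch
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        rewards = model(**inputs).logits.squeeze(-1)  # shape: (N,)
    best = int(torch.argmax(rewards))
    return candidates[best], rewards[best].item()

prompt = "Explain dynamic batching in one sentence."
candidates = [
    "Dynamic batching groups concurrent requests into one forward pass.",
    "It is a kind of sampling.",
    "Batching means using a bigger model.",
]
answer, score = best_of_n(prompt, candidates)
print(f"best (reward={score:.3f}): {answer}")
```

Note this only shows static batching of a fixed candidate set; the dynamic batching the post refers to would additionally collect requests arriving from many concurrent callers (e.g. off a queue, up to a max batch size or timeout) before each forward pass, which is what keeps GPU utilization high under real serving load.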