A model trained to score outputs, providing the reward signal that guides reinforcement-learning fine-tuning.
← All terms