Policy Optimization: PPO and GRPO

The recent publication of the DeepSeek R1 paper in Nature shows that reinforcement learning (RL) has become a centerpiece of LLM development. In this post we discuss some of the more prominent RL algorithms.

Proximal Policy Optimization (PPO)

PPO is one of the most well-studied RL algorithms.

Group Relative Policy Optimization (GRPO)

GRPO was first introduced in the DeepSeekMath paper. Instead of training a critic network, it samples multiple outputs (the group) for each prompt, computes a group-relative advantage for each output, and performs a gradient update, essentially mimicking a ‘sample and eliminate’ process.
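
To make the group-relative advantage concrete, here is a minimal sketch of one common way to compute it: each output's reward is normalized by the mean and standard deviation of the rewards in its own group. The function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are all assumptions made for illustration, not details taken from this post.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output relative to the rest of its group.

    rewards: one scalar reward per output sampled for the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every output in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one sample")
    mean = fmean(rewards)
    # Population standard deviation; using the sample standard deviation
    # instead would be an equally reasonable choice for this sketch.
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four outputs sampled from one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group mean get a positive advantage and are reinforced, while those below the mean get a negative advantage and are discouraged. The group itself serves as the baseline, which is why no separate critic network is needed.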