The Basics of Reinforcement Learning from Human Feedback
From Zero to PPO: Understanding the Path to Helpful AI Models
Tags: RLHF, PPO, DPO