RLHF
Before starting, it’s advisable to first complete David Silver’s course on RL and read Lilian’s notes on RL, which follow David’s course in sequential order. In simple problems, we start with an arbitrary value function and then update it incrementally using algorithms such as Monte Carlo (which collects the reward over the whole episode) and temporal difference, aka TD(0) (which bootstraps, i.e. it uses only the immediate reward and approximates the remaining rewards with the value function of the next state, $r + \gamma V(s')$, where $s'$ is the next state and $\gamma$ is the discount factor), among other algorithms.
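To make the difference between the two updates concrete, here is a minimal tabular sketch. The function names `mc_update` and `td0_update`, the step size `alpha`, and the toy episode `A -> B -> END` are all assumptions made for illustration; they are not taken from the course or the notes.

```python
from typing import Dict, List, Tuple

State = str
Transition = Tuple[State, float, State]  # (state, reward, next_state)


def mc_update(V: Dict[State, float], episode: List[Transition],
              alpha: float = 0.1, gamma: float = 1.0) -> None:
    """Every-visit Monte Carlo: move V(s) toward the full observed return G."""
    G = 0.0
    # Walk the episode backwards so G accumulates the discounted return.
    for state, reward, _ in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])


def td0_update(V: Dict[State, float], transition: Transition,
               alpha: float = 0.1, gamma: float = 1.0) -> None:
    """TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    state, reward, next_state = transition
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])


# Hypothetical two-step episode: A -> B -> END, with reward 1 only at the end.
episode = [("A", 0.0, "B"), ("B", 1.0, "END")]

# Monte Carlo waits for the episode to finish, then updates every visited state.
V_mc = {"A": 0.0, "B": 0.0, "END": 0.0}
mc_update(V_mc, episode, alpha=0.5)

# TD(0) updates after each step, using the current estimate of the next state.
V_td = {"A": 0.0, "B": 0.0, "END": 0.0}
for transition in episode:
    td0_update(V_td, transition, alpha=0.5)

print("MC :", V_mc)
print("TD0:", V_td)
```

After this single episode, Monte Carlo has already moved `V["A"]` toward the final reward, while TD(0) leaves `V["A"]` unchanged because its target relied on the still-zero estimate of `V["B"]`. With more episodes, the reward information would propagate back to `A` through the bootstrapped estimates.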