A rover moves on a 1‑D line of states \(s_1,\dots,s_7\). Rewards can be set per state below (default \(R_1=1\), \(R_7=10\), \(R_i=0\) otherwise). The policy chooses Right with probability \(p\) and Left with probability \(1-p\). The value function is initialized to zero at every state and updated iteratively.
Expectation update used by the Bellman Sweep (terminals are absorbing: \(V(s_1)=R_1\), \(V(s_7)=R_7\); the reward is earned for occupying the current state):
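\[
V(s_i) \;\leftarrow\; R_i + \gamma\,\bigl[\,p\,V(s_{i+1}) + (1-p)\,V(s_{i-1})\,\bigr], \qquad i = 2,\dots,6.
\]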
The Simulate Step (TD) button moves the rover one step by sampling Left or Right from the policy; the reward is the value of \(R\) at the next state (earned upon entry):
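\[
V(s) \;\leftarrow\; V(s) + \alpha\,\bigl[\,R(s') + \gamma\,V(s') - V(s)\,\bigr],
\]
where \(s'\) is the sampled next state (Right with probability \(p\), Left with probability \(1-p\)) and \(R(s')\) is its reward.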
This board shows editable rewards \(R_i\) and the value function estimates \(V(s_i)\). Terminals are absorbing with \(V(s_1)=R_1\) and \(V(s_7)=R_7\). The Bellman Sweep updates all non‑terminal states using the expectation under the current policy \(p\). The TD step samples a single transition and applies a TD(0) update. Change \(\gamma\), \(p\), and \(\alpha\) at any time.
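For readers who want to reproduce the numbers outside the widget, here is a minimal Python sketch of both updates under the defaults above. The names (`bellman_sweep`, `td_step`), the synchronous-sweep convention, and the terminal-handling in the TD step are illustrative assumptions, not the widget's actual implementation.

```python
import random

# Illustrative sketch of the two updates (assumed names/conventions, not the widget's code).
N = 7                               # states s_1..s_7, stored at indices 0..6
R = [1, 0, 0, 0, 0, 0, 10]          # defaults: R_1 = 1, R_7 = 10, 0 elsewhere
V = [0.0] * N                       # value estimates, initialized to zero
gamma, p, alpha = 0.9, 0.5, 0.1     # discount, P(Right), TD step size (all editable in the widget)

def bellman_sweep():
    """One expectation update over the non-terminal states (assumes a synchronous sweep)."""
    V[0], V[-1] = R[0], R[-1]       # terminals absorb: V(s_1) = R_1, V(s_7) = R_7
    old = V[:]                      # update every state from the pre-sweep values
    for i in range(1, N - 1):
        V[i] = R[i] + gamma * (p * old[i + 1] + (1 - p) * old[i - 1])

def td_step(s):
    """Sample one move from state index s, apply a TD(0) update, return the next index."""
    if s in (0, N - 1):             # assumed convention: terminals absorb, no further updates
        return s
    s_next = s + 1 if random.random() < p else s - 1
    r = R[s_next]                   # reward is the value of R at the next state (upon entry)
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return s_next
```

Repeated calls to `bellman_sweep()` converge to the policy's value function, while repeated `td_step` calls (restarting from a non-terminal state whenever a terminal is reached) estimate the same quantity from sampled transitions.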