Perseverance as a Special MDP (7 States)

A rover moves on a 1‑D line of states \(s_1,\dots,s_7\). Rewards can be set per state below (default \(R_1=1\), \(R_7=10\), \(R_i=0\) otherwise). The policy moves Right with probability \(p\) and Left with probability \(1-p\). The value function is initialized to zeros and updated iteratively.
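
For concreteness, here is a minimal Python sketch of that setup. The variable names and the default values \(p=0.5\), \(\gamma=0.95\), \(\alpha=0.1\) are illustrative assumptions, not the page's actual code.

```python
# Hypothetical setup mirroring the board described above (defaults are assumed).
N = 7                                       # states s_1..s_7 -> indices 0..6
R = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0]    # per-state rewards (editable on the page)
V = [0.0] * N                               # value estimates, initialized to zeros
V[0], V[6] = R[0], R[6]                     # terminals absorb: V(s_1)=R_1, V(s_7)=R_7
p, gamma, alpha = 0.5, 0.95, 0.1            # Right-probability, discount, TD step size
```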

Expectation update used by the Bellman Sweep (terminals are absorbing: \(V(s_1)=R_1\), \(V(s_7)=R_7\); the reward is received for being in the current state):

\( V(s_i) \leftarrow R_i + \gamma\,[\,(1-p)V(s_{i-1}) + pV(s_{i+1})\,],\quad i=2,\dots,6 \)
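
Continuing the sketch above, one such sweep could look like the following; the synchronous update (all new values computed from the old ones) is an assumption about how the page sweeps.

```python
def bellman_sweep(V, R, p, gamma):
    """One expectation sweep over the non-terminal states s_2..s_6 (indices 1..5);
    the terminals stay clamped at V[0] = R[0] and V[6] = R[6]."""
    out = V[:]                       # synchronous: new values computed from old ones
    out[0], out[6] = R[0], R[6]
    for i in range(1, 6):
        out[i] = R[i] + gamma * ((1 - p) * V[i - 1] + p * V[i + 1])
    return out

# e.g. with the setup above: V = bellman_sweep(V, R, p, gamma), repeated until stable
```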

The Simulate Step (TD) button moves the rover one step by sampling Left/Right; the reward is the value of \(R\) at the next state (received upon entry):

\( V(s_t) \leftarrow V(s_t) + \alpha\,[\,R(s_{t+1}) + \gamma V(s_{t+1}) - V(s_t)\,],\qquad V(s_1)=R_1,\;V(s_7)=R_7 \)
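
A sketch of one simulated step under the same assumptions; what the page does with a rover that is already at a terminal (restart, stop, etc.) is not specified, so the code simply leaves it there.

```python
import random

def td_step(s, V, R, p, gamma, alpha):
    """Sample one Left/Right move from state index s, apply the TD(0) update
    in place, and return the next state index. Terminals (0 and 6) absorb."""
    if s in (0, 6):
        return s                                   # restart behavior is up to the page
    s_next = s + 1 if random.random() < p else s - 1
    target = R[s_next] + gamma * V[s_next]         # reward R(s_{t+1}) received on entry
    V[s] += alpha * (target - V[s])
    return s_next

# e.g. start at s_4 (index 3) and repeat: s = 3; s = td_step(s, V, R, p, gamma, alpha)
```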

Goal‑reach Probabilities

Straight‑right path (exactly 3 steps): 0.125000
Eventually reach \(s_7\) before \(s_1\) (Gambler’s Ruin): 0.500000
Success within ≤ N steps (starting \(s_4\)): shown as the within‑N success curve

Plots: \(V(s)\) bars and the within‑N success curve; the probabilities above are checked in the sketch below.
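
The displayed values are consistent with the default \(p=0.5\) and a start at \(s_4\): the straight‑right path needs three consecutive Right moves, \(p^3 = 0.125\); for \(p=0.5\) the gambler’s‑ruin probability of hitting \(s_7\) before \(s_1\) equals the fraction of the gap already covered, \(3/6 = 0.5\); and the within‑N curve follows from iterating the one‑step recursion. A small Python sketch of those checks (function names are illustrative; the general‑\(p\) formula \((1-r^k)/(1-r^n)\) with \(r=(1-p)/p\) covers the biased case):

```python
def p_exact_right_path(p, steps=3):
    """Probability of moving Right on every one of `steps` consecutive moves."""
    return p ** steps

def p_reach_goal_first(p, k=3, n=6):
    """Gambler's-ruin probability of hitting s_7 before s_1, starting k gaps
    above s_1 out of n total gaps (k = 3, n = 6 for a start at s_4)."""
    if abs(p - 0.5) < 1e-12:
        return k / n
    r = (1 - p) / p
    return (1 - r ** k) / (1 - r ** n)

def p_success_within(p, N, start=3, goal=6, fail=0):
    """P(reach `goal` before `fail` in at most N steps), by iterating
    f_N(s) = p * f_{N-1}(s+1) + (1-p) * f_{N-1}(s-1) for interior s."""
    f = [1.0 if s == goal else 0.0 for s in range(7)]
    for _ in range(N):
        nxt = f[:]
        for s in range(1, 6):
            nxt[s] = p * f[s + 1] + (1 - p) * f[s - 1]
        nxt[goal], nxt[fail] = 1.0, 0.0
        f = nxt
    return f[start]

print(p_exact_right_path(0.5))    # 0.125
print(p_reach_goal_first(0.5))    # 0.5
print(p_success_within(0.5, 3))   # 0.125 (within 3 steps only the straight path fits)
```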

What you’re seeing

This board shows editable rewards \(R_i\) and the value function estimates \(V(s_i)\). Terminals are absorbing with \(V(s_1)=R_1\) and \(V(s_7)=R_7\). The Bellman Sweep updates all non‑terminal states using the expectation under the current policy \(p\). The TD step samples a single transition and applies a TD(0) update. Change \(\gamma\), \(p\), and \(\alpha\) at any time.

Everything runs in a single standalone HTML file. Tip: set \(p=0.8\), \(\gamma=0.95\), then auto‑run some sweeps.
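
For comparison, a self‑contained Python driver mimicking that tip, auto‑running sweeps until the estimates stop changing; the convergence tolerance and sweep cap are arbitrary choices.

```python
# Hypothetical offline check of the tip above: p = 0.8, gamma = 0.95.
p, gamma = 0.8, 0.95
R = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0]
V = [R[0], 0.0, 0.0, 0.0, 0.0, 0.0, R[6]]
for sweep in range(1, 1001):
    new_V = V[:]
    for i in range(1, 6):                          # expectation update for s_2..s_6
        new_V[i] = R[i] + gamma * ((1 - p) * V[i - 1] + p * V[i + 1])
    if max(abs(a - b) for a, b in zip(new_V, V)) < 1e-9:
        break
    V = new_V
print(f"settled after {sweep} sweeps:", [round(v, 2) for v in V])
```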