Perseverance as a Special MDP (7 States)

A rover moves on a 1‑D line of states \(s_1,\dots,s_7\). Rewards can be set per state below (default \(R_1=1\), \(R_7=10\), \(R_i=0\) otherwise). The policy moves Right with probability \(p\) and Left with probability \(1-p\). The value function is initialized to zeros and updated iteratively.
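
For concreteness, here is a minimal Python sketch of that setup. The variable names and the default values \(p=0.5\), \(\gamma=0.95\), \(\alpha=0.1\) are illustrative assumptions, not the page's actual code.

```python
# Hypothetical setup mirroring the board described above (defaults are assumed).
N = 7                                       # states s_1..s_7 -> indices 0..6
R = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0]    # per-state rewards (editable on the page)
V = [0.0] * N                               # value estimates, initialized to zeros
V[0], V[6] = R[0], R[6]                     # terminals absorb: V(s_1)=R_1, V(s_7)=R_7
p, gamma, alpha = 0.5, 0.95, 0.1            # Right-probability, discount, TD step size
```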

Expectation update used by the Bellman Sweep (terminals are absorbing: \(V(s_1)=R_1\), \(V(s_7)=R_7\); the reward is received for being in the current state):

\( V(s_i) \leftarrow R_i + \gamma\,[\,(1-p)V(s_{i-1}) + pV(s_{i+1})\,],\quad i=2,\dots,6 \)
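
Continuing the sketch above, one such sweep could look like the following; the synchronous update (all new values computed from the old ones) is an assumption about how the page sweeps.

```python
def bellman_sweep(V, R, p, gamma):
    """One expectation sweep over the non-terminal states s_2..s_6 (indices 1..5);
    the terminals stay clamped at V[0] = R[0] and V[6] = R[6]."""
    out = V[:]                       # synchronous: new values computed from old ones
    out[0], out[6] = R[0], R[6]
    for i in range(1, 6):
        out[i] = R[i] + gamma * ((1 - p) * V[i - 1] + p * V[i + 1])
    return out

# e.g. with the setup above: V = bellman_sweep(V, R, p, gamma), repeated until stable
```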

The Simulate Step (TD) button moves the rover one step by sampling Left/Right; the reward is the value of \(R\) at the next state (received upon entry):

\( V(s_t) \leftarrow V(s_t) + \alpha\,[\,R(s_{t+1}) + \gamma V(s_{t+1}) - V(s_t)\,],\qquad V(s_1)=R_1,\;V(s_7)=R_7 \)
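
A sketch of one simulated step under the same assumptions; what the page does with a rover that is already at a terminal (restart, stop, etc.) is not specified, so the code simply leaves it there.

```python
import random

def td_step(s, V, R, p, gamma, alpha):
    """Sample one Left/Right move from state index s, apply the TD(0) update
    in place, and return the next state index. Terminals (0 and 6) absorb."""
    if s in (0, 6):
        return s                                   # restart behavior is up to the page
    s_next = s + 1 if random.random() < p else s - 1
    target = R[s_next] + gamma * V[s_next]         # reward R(s_{t+1}) received on entry
    V[s] += alpha * (target - V[s])
    return s_next

# e.g. start at s_4 (index 3) and repeat: s = 3; s = td_step(s, V, R, p, gamma, alpha)
```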

Goal‑reach Probabilities

Straight‑right path (exactly 3 steps): 0.125000
Eventually reach \(s_7\) before \(s_1\) (Gambler’s Ruin): 0.500000
Success within ≤ N steps (starting \(s_4\)): shown as the within‑N success curve

Plots: \(V(s)\) bars and the within‑N success curve; the probabilities above are checked in the sketch below.
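
The displayed values are consistent with the default \(p=0.5\) and a start at \(s_4\): the straight‑right path needs three consecutive Right moves, \(p^3 = 0.125\); for \(p=0.5\) the gambler’s‑ruin probability of hitting \(s_7\) before \(s_1\) equals the fraction of the gap already covered, \(3/6 = 0.5\); and the within‑N curve follows from iterating the one‑step recursion. A small Python sketch of those checks (function names are illustrative; the general‑\(p\) formula \((1-r^k)/(1-r^n)\) with \(r=(1-p)/p\) covers the biased case):

```python
def p_exact_right_path(p, steps=3):
    """Probability of moving Right on every one of `steps` consecutive moves."""
    return p ** steps

def p_reach_goal_first(p, k=3, n=6):
    """Gambler's-ruin probability of hitting s_7 before s_1, starting k gaps
    above s_1 out of n total gaps (k = 3, n = 6 for a start at s_4)."""
    if abs(p - 0.5) < 1e-12:
        return k / n
    r = (1 - p) / p
    return (1 - r ** k) / (1 - r ** n)

def p_success_within(p, N, start=3, goal=6, fail=0):
    """P(reach `goal` before `fail` in at most N steps), by iterating
    f_N(s) = p * f_{N-1}(s+1) + (1-p) * f_{N-1}(s-1) for interior s."""
    f = [1.0 if s == goal else 0.0 for s in range(7)]
    for _ in range(N):
        nxt = f[:]
        for s in range(1, 6):
            nxt[s] = p * f[s + 1] + (1 - p) * f[s - 1]
        nxt[goal], nxt[fail] = 1.0, 0.0
        f = nxt
    return f[start]

print(p_exact_right_path(0.5))    # 0.125
print(p_reach_goal_first(0.5))    # 0.5
print(p_success_within(0.5, 3))   # 0.125 (within 3 steps only the straight path fits)
```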

What you’re seeing

This board shows editable rewards \(R_i\) and the value function estimates \(V(s_i)\). Terminals are absorbing with \(V(s_1)=R_1\) and \(V(s_7)=R_7\). The Bellman Sweep updates all non‑terminal states using the expectation under the current policy \(p\). The TD step samples a single transition and applies a TD(0) update. Change \(\gamma\), \(p\), and \(\alpha\) at any time.

Everything runs in a single standalone HTML file. Tip: set \(p=0.8\), \(\gamma=0.95\), then auto‑run some sweeps.
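
For comparison, a self‑contained Python driver mimicking that tip, auto‑running sweeps until the estimates stop changing; the convergence tolerance and sweep cap are arbitrary choices.

```python
# Hypothetical offline check of the tip above: p = 0.8, gamma = 0.95.
p, gamma = 0.8, 0.95
R = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0]
V = [R[0], 0.0, 0.0, 0.0, 0.0, 0.0, R[6]]
for sweep in range(1, 1001):
    new_V = V[:]
    for i in range(1, 6):                          # expectation update for s_2..s_6
        new_V[i] = R[i] + gamma * ((1 - p) * V[i - 1] + p * V[i + 1])
    if max(abs(a - b) for a, b in zip(new_V, V)) < 1e-9:
        break
    V = new_V
print(f"settled after {sweep} sweeps:", [round(v, 2) for v in V])
```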