Task context: max_score, student submission, pipeline history (live transcript).

Action: submit an action matching the environment schema. The routing decision matters only in the validator stage.
Human-in-the-loop tester
Play the role of the RL agent: step through arbiter → scrutinizer → validator → mentor, inspect observation state, and submit actions.
Choose a difficulty (or run all three). Auto-run calls the same LLM loop as inference.py via /api/llm/complete, which reads the server environment: HF_TOKEN / API_KEY, API_BASE_URL, MODEL_NAME.
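If auto-run fails, the usual suspects are the server-side environment variables named above. A sketch of setting them before launching the server (all values below are placeholders, not real credentials or endpoints):

```shell
export HF_TOKEN="hf_..."                          # or API_KEY, depending on provider
export API_BASE_URL="https://api.example.com/v1"  # placeholder endpoint
export MODEL_NAME="my-model"                      # placeholder model id
```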
Rewards are dense along the pipeline, then a final payout on the last step. Each episode starts with a flow_bank of 0.10. Intermediate transitions subtract small amounts from the bank as “progress” rewards.
The episode completes at the mentor stage. The final step reward is the sum of the accuracy reward, the validity reward, and the remaining flow_bank payout.
The total is clamped to [0, 1]. When the episode ends, diagnostics include accuracy_reward, validity_reward, flow_bank_payout, and total_step_reward.
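Under the scheme above, the final-step payout can be sketched as follows (a minimal illustration with hypothetical function and argument names; the real computation lives in the environment):

```python
def final_step_reward(accuracy_reward: float,
                      validity_reward: float,
                      flow_bank: float) -> float:
    """Hypothetical sketch of the final payout at the mentor stage.

    flow_bank starts each episode at 0.10 and is drained by small
    "progress" rewards on intermediate transitions; whatever remains
    is paid out here alongside accuracy and validity.
    """
    total = accuracy_reward + validity_reward + flow_bank
    return max(0.0, min(1.0, total))  # clamp to [0, 1]

# An untouched bank plus full accuracy/validity still clamps to 1.
print(final_step_reward(0.6, 0.4, 0.10))  # → 1.0
```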