Task context: max_score, student submission, pipeline history (live transcript).

Action: submit an action matching the environment schema. The routing decision matters only in the validator stage.
Human-in-the-loop tester
Play the role of the RL agent: step through arbiter → scrutinizer → validator → mentor, inspect observation state, and submit actions.
Choose a difficulty (or run all three). Auto-run calls the same LLM loop as inference.py via /api/llm/complete, which reads the server environment: HF_TOKEN / API_KEY, API_BASE_URL, MODEL_NAME.
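If auto-run fails, the usual suspects are the server-side environment variables named above. A sketch of setting them before launching the server (all values below are placeholders, not real credentials or endpoints):

```shell
export HF_TOKEN="hf_..."                          # or API_KEY, depending on provider
export API_BASE_URL="https://api.example.com/v1"  # placeholder endpoint
export MODEL_NAME="my-model"                      # placeholder model id
```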
Rewards are dense along the pipeline, then a final payout on the last step. Each episode starts with a flow_bank of 0.10. Intermediate transitions subtract small amounts from the bank as “progress” rewards.
The episode completes at the mentor stage. The final step reward is the sum of the accuracy reward, the validity reward, and the remaining flow_bank payout.
The total is clamped to [0, 1]. When the episode ends, diagnostics include accuracy_reward, validity_reward, flow_bank_payout, and total_step_reward.
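Under the scheme above, the final-step payout can be sketched as follows (a minimal illustration with hypothetical function and argument names; the real computation lives in the environment):

```python
def final_step_reward(accuracy_reward: float,
                      validity_reward: float,
                      flow_bank: float) -> float:
    """Hypothetical sketch of the final payout at the mentor stage.

    flow_bank starts each episode at 0.10 and is drained by small
    "progress" rewards on intermediate transitions; whatever remains
    is paid out here alongside accuracy and validity.
    """
    total = accuracy_reward + validity_reward + flow_bank
    return max(0.0, min(1.0, total))  # clamp to [0, 1]

# An untouched bank plus full accuracy/validity still clamps to 1.
print(final_step_reward(0.6, 0.4, 0.10))  # → 1.0
```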