Model benchmark

List models from an OpenAI-compatible endpoint (e.g. GET …/v1/models), choose five models and a task difficulty, then compare runs. Only the chat model name changes between episodes; prompts and environment settings are identical.

Configuration

The default API root matches Ollama’s OpenAI-compatible surface (ollama.com/v1/models). For a local daemon, use http://127.0.0.1:11434/v1.
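As a rough illustration of the model-listing step, the sketch below queries the models path under an API root and extracts the model ids. It assumes the endpoint returns the usual OpenAI-style list body (`{"data": [{"id": ...}, ...]}`) and accepts an optional bearer token. The function names `models_url`, `parse_model_ids`, and `list_models` are hypothetical and not taken from this tool.

```python
import json
import urllib.request
from typing import Optional


def models_url(api_root: str) -> str:
    """Join the API root and the models path without doubling slashes."""
    return api_root.rstrip("/") + "/models"


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style list response.

    Assumes the shape {"data": [{"id": "..."}, ...]}; entries without
    an "id" key are skipped.
    """
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_models(api_root: str, api_key: Optional[str] = None) -> list[str]:
    """Fetch the model list; the API key is sent only when provided."""
    headers = {"Accept": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    request = urllib.request.Request(models_url(api_root), headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_model_ids(json.load(response))
```

With a local daemon running, `list_models("http://127.0.0.1:11434/v1")` would return whatever model ids that daemon reports.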

API root (list + chat)

Optional API key

Select five models

Task difficulty

Max steps

Seed (optional)
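The fields above could be collected into a single configuration object, from which one episode plan per model is derived. The following is a minimal sketch under assumed names: `BenchmarkConfig`, `episode_plan`, and the difficulty value `"easy"` are hypothetical, and the real tool may represent these settings differently. It shows the one property stated on this page: only the model name differs between episodes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkConfig:
    """Hypothetical container mirroring the form fields on this page."""
    api_root: str
    models: tuple[str, ...]
    difficulty: str
    max_steps: int
    api_key: Optional[str] = None  # optional, as in the form
    seed: Optional[int] = None     # optional, as in the form

    def __post_init__(self) -> None:
        # The page asks for five models; this check is an assumption
        # about how strictly that is enforced.
        if len(self.models) != 5:
            raise ValueError("select five models")
        if self.max_steps < 1:
            raise ValueError("max_steps must be at least 1")


def episode_plan(config: BenchmarkConfig) -> list[dict]:
    """Build one episode spec per model; only "model" varies."""
    shared = {
        "difficulty": config.difficulty,
        "max_steps": config.max_steps,
        "seed": config.seed,
    }
    return [{"model": name, **shared} for name in config.models]
```

Holding every shared setting in one dictionary makes it easy to verify that no per-model difference other than the name slips in.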

Results

Model	Total reward	Steps	Error

Total reward by model

Steps to last transition

Cumulative reward over steps

Per-episode reward sequence (same task and seed for every model).
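The quantities in the table and charts can be derived from each model's per-step reward sequence. The sketch below assumes each episode is recorded as a plain list of per-step rewards and that the step count equals the length of that list. The helper names and the example values in the comments are illustrative only, not results from this tool.

```python
from itertools import accumulate


def cumulative_reward(rewards: list[float]) -> list[float]:
    """Running total of rewards, one entry per step."""
    return list(accumulate(rewards))


def summarize(rewards_by_model: dict[str, list[float]]) -> list[dict]:
    """One summary row per model: total reward and number of steps."""
    return [
        {"model": model, "total_reward": sum(rewards), "steps": len(rewards)}
        for model, rewards in rewards_by_model.items()
    ]


# Illustrative, made-up reward sequence for a single model:
# cumulative_reward([1.0, 0.0, -0.5, 2.0]) gives [1.0, 1.0, 0.5, 2.5]
```

The last entry of the cumulative series equals the total reward, so the bar chart and the line chart stay consistent with each other by construction.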