Mosaic Eval Harness

Model adapters available

5 ready, 2 need setup.

Openai.gpt Oss 120b: ready
Anthropic.claude Sonnet 4 6: ready
Moonshotai.kimi K2.5: ready
Minimax.minimax M2.5: ready
Gemini 2.5 Pro: setup needed
Mistral Large Latest: setup needed
Local Model: ready

Compose a run

Select curated tasks, pick the routing strategy, and launch the run into the background.

Run name

Routing strategy

Judge model

Max steps per task

Max concurrent requests

Cost budget USD

Include baseline runs

Keep solo and random comparison runs in the same session.

Tasks

19 of 19 selected

Proxy set

Model set

Select the model adapters to compare. The first selected model becomes the fallback judge.

Selected models: Bedrock OpenAI, Bedrock Claude, Bedrock Kimi K2.5, Bedrock MiniMax M2.5, LM Studio local-model
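
The fallback rule described above can be sketched as a small helper. This is only an illustration: the function name `resolve_judge` and its parameters are hypothetical and are not taken from the harness itself. It assumes an explicitly chosen judge model takes precedence and that the first selected model is used only when none is set.

```python
from typing import Optional, Sequence


def resolve_judge(selected_models: Sequence[str],
                  judge_model: Optional[str] = None) -> str:
    """Return the judge to use for a run (hypothetical helper).

    Assumption: an explicit judge wins; otherwise the first selected
    model acts as the fallback judge.
    """
    if judge_model:
        return judge_model
    if not selected_models:
        raise ValueError("select at least one model adapter")
    return selected_models[0]


selected = [
    "Bedrock OpenAI",
    "Bedrock Claude",
    "Bedrock Kimi K2.5",
    "Bedrock MiniMax M2.5",
    "LM Studio local-model",
]

fallback = resolve_judge(selected)
explicit = resolve_judge(selected, judge_model="Bedrock Claude")
```

With the selection shown above and no judge chosen, the fallback would be the first entry, so the order in which models are selected matters.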

Launch

The run persists checkpoints, audit logs, and summary metrics in SQLite.
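
As a rough sketch of what SQLite persistence of this kind could look like, the example below uses Python's standard `sqlite3` module. The table names, columns, and helper functions are assumptions made for illustration; the harness's actual schema is not shown on this page.

```python
import json
import sqlite3

# Hypothetical schema: one table each for checkpoints, audit logs,
# and summary metrics. The real harness schema may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS checkpoints (
    run_name   TEXT NOT NULL,
    task_id    TEXT NOT NULL,
    step       INTEGER NOT NULL,
    state_json TEXT NOT NULL,
    PRIMARY KEY (run_name, task_id, step)
);
CREATE TABLE IF NOT EXISTS audit_log (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    run_name   TEXT NOT NULL,
    event      TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS summary_metrics (
    run_name TEXT NOT NULL,
    metric   TEXT NOT NULL,
    value    REAL NOT NULL,
    PRIMARY KEY (run_name, metric)
);
"""


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_checkpoint(conn, run_name, task_id, step, state):
    """Write a checkpoint and a matching audit entry in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?, ?)",
            (run_name, task_id, step, json.dumps(state)),
        )
        conn.execute(
            "INSERT INTO audit_log (run_name, event) VALUES (?, ?)",
            (run_name, f"checkpoint {task_id} step {step}"),
        )


def latest_checkpoint(conn, run_name, task_id):
    """Return (step, state) for the most recent checkpoint, or None."""
    row = conn.execute(
        "SELECT step, state_json FROM checkpoints "
        "WHERE run_name = ? AND task_id = ? ORDER BY step DESC LIMIT 1",
        (run_name, task_id),
    ).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))


conn = open_store()
save_checkpoint(conn, "demo-run", "task-1", 1, {"messages": 2})
save_checkpoint(conn, "demo-run", "task-1", 2, {"messages": 5})
resume_point = latest_checkpoint(conn, "demo-run", "task-1")
```

Writing the checkpoint and the audit entry in the same transaction keeps the two in step, so a background run that is interrupted could resume from the latest recorded step.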

Keep the public corpus aligned with the research scope and curated fixture policy.

Need to configure a provider or test a local LM Studio server? Open model setup.

Need to import tasks first? Open task library.