The fast evaluation checklist
- Accuracy: fewer wrong answers on your real tasks.
- Reliability: follows format and constraints without reminders.
- Tool use: fewer failed tool calls and dead-end steps if your workflow uses tools or agents.
- Latency: response time within your SLA.
- Cost: total cost per successful outcome (not per token).
- Safety: fewer risky outputs in sensitive contexts.
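The first two items on the checklist are easy to measure in a small harness. Here is a minimal sketch: `model_fn` is a hypothetical stand-in for whatever client call you use, and each task carries a `check` callable that judges correctness; format compliance here is simply "output is valid JSON", which you would swap for your own schema check.

```python
import json

def evaluate(model_fn, tasks):
    """Score a model against a small task set.

    `model_fn` is a hypothetical callable: prompt -> raw text response.
    Each task is {"prompt": ..., "check": callable judging correctness}.
    Returns task success rate and JSON-format compliance rate.
    """
    successes = format_ok = 0
    for task in tasks:
        raw = model_fn(task["prompt"])
        try:
            parsed = json.loads(raw)  # format compliance: output parses as JSON
            format_ok += 1
        except json.JSONDecodeError:
            parsed = None
        if parsed is not None and task["check"](parsed):
            successes += 1
    n = len(tasks)
    return {"success_rate": successes / n, "format_rate": format_ok / n}
```

Run it once per model on the same task list and the two rates drop straight into the table below.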
A simple comparison table (fill it with your results)
| Metric | GPT-5.4 | GPT-6 | Notes |
|---|---|---|---|
| Task success rate | [ ] | [ ] | Use 20–50 representative tasks. |
| Format compliance | [ ] | [ ] | Same prompt, same output schema. |
| Latency | [ ] | [ ] | Report p50/p95, not just the mean. |
| Cost per success | [ ] | [ ] | Include retries and tool calls. |
What not to over-weight
- Marketing adjectives: they don’t predict your success rate.
- One cherry-picked demo: run a small benchmark of your own representative tasks instead.
- Token price alone: retries and failure modes usually cost more.
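The last point is worth a worked number. With independent attempts and retry-until-success, expected attempts per task is 1 / success rate, so the effective price is per-call price divided by success rate. The prices below are purely illustrative, not real model pricing:

```python
def cost_per_success(price_per_call, success_rate):
    """Expected cost per successful task when failures are retried.

    Independent attempts imply expected attempts = 1 / success_rate,
    so the effective price is price_per_call / success_rate.
    """
    return price_per_call / success_rate

# Hypothetical pricing: the "cheap" model loses once retries are counted.
cheap = cost_per_success(0.002, 0.50)   # $0.004 per success
strong = cost_per_success(0.003, 0.90)  # ~$0.0033 per success
```

A model 50% more expensive per call can still be cheaper per successful outcome, which is the metric that actually hits your budget.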