Autonomous Benchmarking
TeaQL evaluation reports measure how well AI coding agents can complete business software tasks when they are constrained by generated APIs, modeling rules, compile checks, and audit requirements.
What the Benchmark Measures
The benchmark is intended to show whether an agent can:
- Understand a domain modeling task.
- Modify the model without violating KSML rules.
- Generate or use the correct Java or Rust API.
- Avoid guessed method names.
- Add required query intent and audit metadata.
- Compile and test the resulting application.
- Read Markdown error reports and fix the source cause.
Common Failure Classes
The final score is only part of the benchmark output. The failure classification is often more useful, because it shows where an agent went wrong:
- Invented generated API methods.
- Missing purpose or comment before execution.
- Missing audit_as/auditAs before persistence.
- Empty KSML attributes.
- Excessive nested references.
- Raw SQL added where generated APIs should be used.
- Runtime policy bypassed in application code.
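The failure classes above can be tallied across benchmark runs to see which ones dominate. The following is a minimal sketch, not part of TeaQL: the enum names, the `tally_failures` and `format_tally` helpers, and the sample data are all hypothetical and exist only to illustrate the idea.

```python
from collections import Counter
from enum import Enum


class FailureClass(Enum):
    """Failure classes taken from the list above; the enum names are illustrative."""
    INVENTED_API_METHOD = "Invented generated API method"
    MISSING_QUERY_INTENT = "Missing purpose or comment before execution"
    MISSING_AUDIT = "Missing audit_as/auditAs before persistence"
    EMPTY_KSML_ATTRIBUTE = "Empty KSML attribute"
    EXCESSIVE_NESTING = "Excessive nested references"
    RAW_SQL = "Raw SQL where generated APIs should be used"
    POLICY_BYPASS = "Runtime policy bypassed in application code"


def tally_failures(runs):
    """Count failure classes; each run is a (task_id, [FailureClass, ...]) pair."""
    counts = Counter()
    for _task_id, failures in runs:
        counts.update(failures)
    return counts


def format_tally(counts):
    """Render the tally as plain-text lines, most frequent class first."""
    return [f"{n:>3}  {fc.value}" for fc, n in counts.most_common()]


# Hypothetical data for illustration only, not real benchmark results.
sample_runs = [
    ("task-01", [FailureClass.INVENTED_API_METHOD, FailureClass.MISSING_AUDIT]),
    ("task-02", []),
    ("task-03", [FailureClass.INVENTED_API_METHOD]),
]

if __name__ == "__main__":
    for line in format_tally(tally_failures(sample_runs)):
        print(line)
```

Keeping the classes in an enum means a report cannot silently introduce a misspelled or ad hoc category, which makes tallies comparable between runs.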
How TeaQL Guardrails Help
TeaQL narrows the agent's search space:
- The model defines the vocabulary.
- Generated source defines the exact method names.
- Compile errors expose invalid API calls.
- Query and audit metadata make intent reviewable.
- Markdown reports give agents a structured path from error to fix.
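To illustrate why exact generated method names narrow the search space, the sketch below checks agent-written source against a fixed set of known names. It is a rough illustration only: TeaQL relies on compile errors for this, not on a regular expression, and the method names, the `GENERATED_METHODS` set, and the `find_unknown_calls` helper are hypothetical placeholders rather than real TeaQL APIs.

```python
import re

# Hypothetical names; in practice the set would be derived from generated source.
GENERATED_METHODS = {"filter_by_status", "order_by_created", "run"}

# Matches ".name(" so that chained method calls can be listed.
CALL_PATTERN = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)\s*\(")


def find_unknown_calls(source, known):
    """Return called method names that are not in `known`, in first-seen order."""
    unknown = []
    for name in CALL_PATTERN.findall(source):
        if name not in known and name not in unknown:
            unknown.append(name)
    return unknown


if __name__ == "__main__":
    # The second call is a guessed (misspelled) name and should be reported.
    snippet = "q.filter_by_status('open').order_by_creatd().run()"
    print(find_unknown_calls(snippet, GENERATED_METHODS))
```

A compiler performs this check far more precisely, but the principle is the same: a name that is not in the generated source is rejected before the code runs.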
Report Links
Evaluation reports should be published as ordinary documentation or blog content with links to:
- A human-readable summary.
- The downloadable report artifact.
- Raw data or reproducibility notes.
- Known limitations of the benchmark run.
Keep benchmark pages explicit about the tested version, model, scenario, and command set. Results become less useful when the reader cannot reproduce the environment.
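One way to keep these details explicit is to publish a small manifest with each report. The sketch below assumes a hypothetical `RunManifest` structure; the field names and sample values are placeholders, not a TeaQL report format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of what a benchmark run tested."""
    teaql_version: str  # tested version
    agent_model: str    # model under evaluation
    scenario: str       # scenario or task set
    commands: tuple     # command set used, in order

    def missing_fields(self):
        """Return the names of fields left empty, in declaration order."""
        return [name for name, value in asdict(self).items() if not value]


def to_json(manifest):
    """Serialize the manifest so it can be published next to the report."""
    return json.dumps(asdict(manifest), indent=2, sort_keys=True)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    manifest = RunManifest(
        teaql_version="",
        agent_model="example-model",
        scenario="example-scenario",
        commands=("build", "test"),
    )
    print(manifest.missing_fields())
    print(to_json(manifest))
```

A publishing step could refuse to release a report while `missing_fields()` is non-empty, so that no page goes out without the information a reader needs to reproduce the run.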