Autonomous Benchmarking

TeaQL evaluation reports measure how well AI coding agents can complete business software tasks when they are constrained by generated APIs, modeling rules, compile checks, and audit requirements.

What the Benchmark Measures

The benchmark is intended to show whether an agent can:

Understand a domain modeling task.
Modify the model without violating KSML rules.
Generate or use the correct Java or Rust API.
Avoid guessed method names.
Add required query intent and audit metadata.
Compile and test the resulting application.
Read Markdown error reports and fix the source cause.

Common Failure Classes

The final score is only part of the result. The most useful benchmark output is the classification of failures:

Invented generated API methods.
Missing purpose or comment before execution.
Missing audit_as / auditAs before persistence.
Empty KSML attributes.
Excessive nested references.
Raw SQL added where generated APIs should be used.
Runtime policy bypassed in application code.
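
The classes above can be tallied per run so that a report shows where agents fail, not only how often. The following is a minimal sketch, not part of TeaQL: the `FailureClass` enum and the `summarize_failures` helper are hypothetical names introduced here, and the sketch assumes each failure has already been assigned to one class.

```python
from collections import Counter
from enum import Enum


class FailureClass(Enum):
    # Hypothetical labels mirroring the failure classes listed above.
    INVENTED_API_METHOD = "Invented generated API methods"
    MISSING_PURPOSE = "Missing purpose or comment before execution"
    MISSING_AUDIT = "Missing audit_as / auditAs before persistence"
    EMPTY_KSML_ATTRIBUTE = "Empty KSML attributes"
    EXCESSIVE_NESTING = "Excessive nested references"
    RAW_SQL = "Raw SQL added where generated APIs should be used"
    POLICY_BYPASS = "Runtime policy bypassed in application code"


def summarize_failures(failures):
    """Count failures per class, listing every class so zero counts stay visible."""
    counts = Counter(failures)
    return {failure_class: counts[failure_class] for failure_class in FailureClass}
```

Reporting every class, including those with a count of zero, makes two runs directly comparable even when they fail in different ways.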

How TeaQL Guardrails Help

TeaQL narrows the agent's search space:

The model defines the vocabulary.
Generated source defines the exact method names.
Compile errors expose invalid API calls.
Query and audit metadata make intent reviewable.
Markdown reports give agents a structured path from error to fix.
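
The effect of a fixed method vocabulary can be illustrated with a small check. In practice the compiler performs this check against the generated source; the sketch below only imitates the idea. The function `find_unknown_methods` and the method names in the usage note are hypothetical and are not taken from any TeaQL API.

```python
def find_unknown_methods(generated_methods, called_methods):
    """Return called names that are absent from the generated API, in first-call order."""
    known = set(generated_methods)
    unknown = []
    for name in called_methods:
        # Report each guessed name once, even if it is called repeatedly.
        if name not in known and name not in unknown:
            unknown.append(name)
    return unknown
```

For example, with a hypothetical generated API of `["find_by_id", "save"]` and agent calls `["save", "find_by_name"]`, the function returns `["find_by_name"]`, the one name the agent guessed.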

Report Links

Evaluation reports should be published as ordinary documentation or blog content with links to:

A human-readable summary.
The downloadable report artifact.
Raw data or reproducibility notes.
Known limitations of the benchmark run.

Keep benchmark pages explicit about the tested version, model, scenario, and command set. Results become less useful when the reader cannot reproduce the environment.
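
One way to keep those four items explicit is to record them next to each result and flag any that are empty before publishing. This is a minimal sketch under assumed names: `BenchmarkRun`, its fields, and `missing_fields` are hypothetical and do not describe an existing TeaQL report format.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BenchmarkRun:
    # Hypothetical record of the environment details a benchmark page should state.
    tested_version: str
    model: str
    scenario: str
    command_set: tuple


def missing_fields(run):
    """Return the names of fields left empty, so incomplete pages can be flagged."""
    return [f.name for f in fields(run) if not getattr(run, f.name)]
```

A run whose `missing_fields` result is non-empty is a candidate for the known-limitations section rather than the headline results.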
