TeaQL Evaluation Report 001: Why We Publish the Evidence

TeaQL Team
Core Team

TeaQL Evaluation Report 001 is now available. We are publishing it together with the raw evaluation data because AI coding for business software should be inspectable, not only demonstrated.

Business software has a different risk profile from a small demo. It carries business rules, audit requirements, data boundaries, operational workflows, and long-lived maintenance obligations. If an AI coding agent works in that environment, the important question is not only whether it can produce code once, but whether its work can be measured, reviewed, and traced back to evidence.

That is why this report exists.

What This Report Is

Report 001 is our first public autonomous evaluation report for TeaQL agent workflows. It looks at how an AI coding agent behaves when working against TeaQL's generated business API surface and evaluation materials.

The report is not presented as a universal industry benchmark. It is an early, public, reproducible evaluation artifact for TeaQL's own agent-facing approach. Its value is in the evidence it exposes:

  • the tasks being evaluated
  • the agent behavior observed during the run
  • the measured outcomes
  • the failure patterns
  • the lessons for stronger guardrails and better agent workflows

That evidence matters more than a single success story.

Why Publish Reports

AI coding tools can look convincing when the output is shown without context. For business software, that is not enough. Teams need to understand how the result was produced, where the agent stayed inside the intended API boundary, where it drifted, and what kinds of mistakes still require evaluation and review.

TeaQL is designed around that premise. Generated APIs, structured model information, audit-aware operations, and evaluation workflows should make agent work easier to inspect. Publishing evaluation reports is a way to apply the same discipline to TeaQL itself.

The PDF is the readable summary. The raw data is the evidence trail.

How To Read It

Readers do not need to treat the report as marketing material. It is better read as an engineering artifact:

  1. Start with the task definitions.
  2. Check what the agent actually produced.
  3. Compare the result against the intended TeaQL API boundary.
  4. Look at the evaluation notes and failure patterns.
  5. Use the raw data to verify the summary instead of trusting the summary alone.

This is also how we want TeaQL adoption to work: not by hiding complexity, but by making the important parts reviewable.

What Comes Next

Report 001 is only a first step. Future evaluations should cover more tasks, more agent tools, more models, and clearer comparisons across time. The goal is not to claim that autonomous coding is solved. The goal is to make progress visible, repeatable, and honest enough for serious software teams to reason about.

You can read the report here: