Counsel: A Meta-Evaluation Dataset for Agentic Tasks

A public dataset for measuring whether LLM-as-a-Judge critiques of agent trajectories agree with human meta-evaluations.

Sashank Pisupati¹*, Henry Broomfield¹*, Eujong Choi²*, Antonia Calvi³*, Charlie Wang², Roman Engeler¹, Max Bartolo⁴, Patrick Lewis²

¹Atla AI ²Cohere AI ³Mistral AI ⁴Google DeepMind
*Equal contribution


Abstract

As agentic systems tackle increasingly complex multi-step tasks, evaluating their trajectories presents a major bottleneck: human annotation of a single trajectory on popular agentic benchmarks can take hours, making it difficult to scale evaluations for measuring performance or curating training data. This has driven widespread reliance on automated approaches such as LLM-as-a-Judge (LLMJ) to critique agents at the process and outcome levels at scale, but the soundness of LLMJ critiques often goes unmeasured.

Counsel fills this gap as the first public dataset of meta-evaluations for agentic tasks. It pairs process-level critiques from open-weight LLMJs on two realistic agent benchmarks, τ-bench for customer-support agents and DA-Code for coding agents, with human meta-evaluations of whether those critiques are valid. This enables something outcome-only agent benchmarks do not: researchers can directly measure when an automated judge's reasoning about a trajectory agrees with humans, compare judge failure modes across domains, and train or calibrate evaluators using supervised feedback on the critiques themselves. Counsel therefore provides a reusable testbed for building more faithful LLM judges, debugging evaluation pipelines, and scaling agent development without treating automated feedback as an unchecked black box.

**Figure 1:** Example agent trajectories, judge critiques, and human meta-annotations from Counsel. The left example shows a τ-bench customer-support trajectory where the judge correctly flags an error; the right example shows a DA-Code trajectory where a judge critique is rejected by the human meta-annotation.

1.13k human meta-annotations released in the dataset

225 agent trajectories spanning customer-support and coding tasks

0.78 Krippendorff's alpha for human meta-annotation agreement
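
For readers unfamiliar with the agreement statistic above, here is a minimal Python sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. The function name `krippendorff_alpha_nominal` and the toy ratings are illustrative assumptions. This is not the authors' analysis code, and this page does not state which variant of alpha (nominal, ordinal, and so on) produced the 0.78 figure.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists: each inner list holds the labels that
    different annotators gave to one item, with None for a missing label.
    """
    # Coincidence matrix: every ordered pair of labels from two different
    # annotators on the same item contributes 1 / (m - 1), where m is the
    # number of labels that item received.
    coincidences = Counter()
    for labels in units:
        present = [label for label in labels if label is not None]
        m = len(present)
        if m < 2:
            continue  # an item with a single label cannot be paired
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label, and the total number of pairable labels.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Observed vs. chance-expected disagreement (off-diagonal mass).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Hypothetical meta-annotations from up to three annotators per critique.
toy_ratings = [
    ["valid", "valid", "valid"],
    ["valid", "invalid", None],
    ["invalid", "invalid", "invalid"],
    ["valid", "valid", None],
    [None, "invalid", None],  # ignored: only one label
]
toy_alpha = krippendorff_alpha_nominal(toy_ratings)
print(round(toy_alpha, 3))  # 0.625
```

An alpha of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.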

What Counsel Measures

Error location

Does the judge flag a real problematic step in the agent trajectory?

Critique quality

When the judge flags a step, is its natural-language critique actually sound?

Human alignment

How well do automated critiques agree with human meta-evaluations across agentic domains?
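
The three questions above can be turned into simple per-domain rates. The sketch below shows one way to do that in Python, assuming each judge critique carries two human meta-annotation booleans. The `MetaAnnotation` class, its field names, and the toy rows are hypothetical and are not the released dataset schema; check the dataset card for the actual columns.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaAnnotation:
    """One judge critique with its human meta-annotation (hypothetical schema)."""
    domain: str             # e.g. "tau-bench" or "da-code"
    location_correct: bool  # human: the flagged step is a real problem
    critique_sound: bool    # human: the written critique is valid


def summarize(annotations):
    """Return per-domain rates for error location, critique quality, alignment."""
    by_domain = defaultdict(list)
    for ann in annotations:
        by_domain[ann.domain].append(ann)

    summary = {}
    for domain, rows in by_domain.items():
        located = [r for r in rows if r.location_correct]
        summary[domain] = {
            "n_critiques": len(rows),
            # Error location: share of flagged steps humans confirm.
            "location_rate": len(located) / len(rows),
            # Critique quality: among confirmed steps, share of sound critiques.
            "sound_given_located": (
                sum(r.critique_sound for r in located) / len(located)
                if located else None
            ),
            # Human alignment: share of critiques humans endorse on both counts.
            "fully_endorsed_rate": sum(
                r.location_correct and r.critique_sound for r in rows
            ) / len(rows),
        }
    return summary


# Toy rows for illustration only; these are not values from Counsel.
toy = [
    MetaAnnotation("tau-bench", True, True),
    MetaAnnotation("tau-bench", True, False),
    MetaAnnotation("tau-bench", False, False),
    MetaAnnotation("tau-bench", True, True),
    MetaAnnotation("da-code", False, False),
    MetaAnnotation("da-code", True, True),
]
for domain, stats in summarize(toy).items():
    print(domain, stats)
```

Reporting the conditional rate separately from the location rate keeps the two failure modes apart: a judge can point at the right step and still explain it incorrectly.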

Resources

Paper

Read the full technical report on arXiv or the Hugging Face paper page.

arxiv.org/abs/2606.21627

huggingface.co/papers/2606.21627

Dataset

Browse or load the released Counsel dataset on Hugging Face.

huggingface.co/datasets/AtlaAI/counsel
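
A minimal sketch of loading the dataset with the Hugging Face `datasets` library follows. The repository id is taken from the URL above. Configuration and split names are not listed on this page, so the sketch queries them first instead of guessing. The helper names `describe_counsel` and `load_counsel` are illustrative, and both helpers need network access and `pip install datasets`.

```python
REPO_ID = "AtlaAI/counsel"  # from the dataset URL on this page


def describe_counsel(repo_id=REPO_ID):
    """Return {config_name: [split names]} for the dataset repo (needs network)."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import get_dataset_config_names, get_dataset_split_names

    return {
        config: get_dataset_split_names(repo_id, config)
        for config in get_dataset_config_names(repo_id)
    }


def load_counsel(config_name, split, repo_id=REPO_ID):
    """Load one configuration and split; take the names from describe_counsel()."""
    from datasets import load_dataset

    return load_dataset(repo_id, config_name, split=split)


# Example usage (requires network access):
#   layout = describe_counsel()
#   config, splits = next(iter(layout.items()))
#   data = load_counsel(config, splits[0])
#   print(data.column_names)
```

Inspecting `column_names` before writing analysis code avoids hard-coding field names that may differ from the released schema.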

AtlaAI Collection

Find the paper and dataset grouped together on the AtlaAI organization page.

huggingface.co/collections/AtlaAI/counsel

Citation

@misc{pisupati2026counsel,
  title = {Counsel: A Meta-Evaluation Dataset for Agentic Tasks},
  author = {Pisupati, Sashank and Broomfield, Henry and Choi, Eujong and Calvi, Antonia and Wang, Charlie and Engeler, Roman and Bartolo, Max and Lewis, Patrick},
  year = {2026},
  eprint = {2606.21627},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI},
  url = {https://arxiv.org/abs/2606.21627}
}
