A public dataset for measuring whether LLM-as-a-Judge critiques of agent trajectories agree with human meta-evaluations.
As agentic systems tackle increasingly complex multi-step tasks, evaluating their trajectories has become a major bottleneck: human annotation of a single trajectory on popular agentic benchmarks can take hours, which makes it hard to scale evaluation for measuring performance or curating training data. This has driven widespread reliance on automated approaches such as LLM-as-a-Judge (LLMJ) to critique agents at both the process and outcome levels, but the soundness of LLMJ critiques often goes unmeasured.
Counsel fills this gap as the first public dataset of meta-evaluations for agentic tasks. It pairs process-level critiques from open-weight LLMJs on two realistic agent benchmarks, τ-bench for customer-support agents and DA-Code for coding agents, with human meta-evaluations of whether those critiques are valid. This enables something outcome-only agent benchmarks do not: researchers can directly measure when an automated judge's reasoning about a trajectory agrees with humans, compare judge failure modes across domains, and train or calibrate evaluators using supervised feedback on the critiques themselves. Counsel therefore provides a reusable testbed for building more faithful LLM judges, debugging evaluation pipelines, and scaling agent development without treating automated feedback as an unchecked black box.
Does the judge flag a real problematic step in the agent trajectory?
When the judge flags a step, is its natural-language critique actually sound?
How well do automated critiques agree with human meta-evaluations across agentic domains?
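As a rough illustration of how these three questions could be turned into numbers, the sketch below computes per-domain rates from a handful of toy records. The `MetaEval` class, its field names, and the toy values are hypothetical and are not the released Counsel schema; each record is assumed to be one step flagged by a judge, together with two human verdicts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MetaEval:
    """One judge-flagged step with human verdicts (hypothetical schema)."""
    domain: str                # e.g. "tau-bench" or "da-code"
    step_is_problematic: bool  # human: the flagged step is a real problem
    critique_is_sound: bool    # human: the judge's critique is valid


def agreement_by_domain(records):
    """Return, per domain, the share of flags and critiques humans endorsed."""
    counts = defaultdict(lambda: {"n": 0, "valid_flag": 0, "sound": 0})
    for r in records:
        c = counts[r.domain]
        c["n"] += 1
        c["valid_flag"] += int(r.step_is_problematic)
        c["sound"] += int(r.critique_is_sound)
    return {
        domain: {
            "flag_precision": c["valid_flag"] / c["n"],
            "critique_soundness": c["sound"] / c["n"],
        }
        for domain, c in counts.items()
    }


# Toy data only; these values are made up for illustration.
records = [
    MetaEval("tau-bench", True, True),
    MetaEval("tau-bench", True, False),
    MetaEval("tau-bench", False, False),
    MetaEval("tau-bench", True, True),
    MetaEval("da-code", True, True),
    MetaEval("da-code", False, False),
]

rates = agreement_by_domain(records)
for domain, r in rates.items():
    print(domain, r)
```

Here `flag_precision` corresponds to the first question, `critique_soundness` to the second, and comparing the two rates across domains gives a simple view of the third. Chance-corrected statistics such as Cohen's kappa would be a natural next step when comparing a judge's own validity labels against the human ones.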
Read the full technical report on arXiv or the Hugging Face paper page.
Browse or load the released Counsel dataset on Hugging Face.
Find the paper and dataset grouped together on the AtlaAI organization page.
@misc{pisupati2026counsel,
title = {Counsel: A Meta-Evaluation Dataset for Agentic Tasks},
author = {Pisupati, Sashank and Broomfield, Henry and Choi, Eujong and Calvi, Antonia and Wang, Charlie and Engeler, Roman and Bartolo, Max and Lewis, Patrick},
year = {2026},
eprint = {2606.21627},
archivePrefix = {arXiv},
primaryClass = {cs.AI},
url = {https://arxiv.org/abs/2606.21627}
}