Real failure modes we hit building a multi-database data agent against DataAgentBench (DAB)

Posted by Life_Meringue_4343@reddit | LocalLLaMA

Spent the last week building against DataAgentBench (github.com/ucbepic/DataAgentBench): 54 queries across PostgreSQL, MongoDB, SQLite, and DuckDB. The best frontier model score is 38%.

Here's what actually broke our agent. Not SQL generation. Reality.

Silent join failure: same entity stored as "businessid_49" in MongoDB and "businessref_49" in DuckDB. Agent joins, gets zero rows, returns empty with no error. Looks like a valid answer. Isn't.

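A minimal sketch of how we'd guard against this, assuming the two ID styles share a numeric suffix. The field names (`business_id`, `business_ref`) and helper names are made up for illustration, not taken from DAB:

```python
import re

def entity_key(raw_id: str) -> str:
    """Reduce 'businessid_49' / 'businessref_49' to the shared suffix '49'."""
    match = re.search(r"(\d+)$", raw_id)
    if match is None:
        raise ValueError(f"no numeric suffix in id: {raw_id!r}")
    return match.group(1)

def join_on_entity(left_rows, right_rows, left_field, right_field):
    """Inner-join two lists of dicts on the normalized key.

    Raises instead of returning an empty list when both inputs have rows,
    so an ID format mismatch cannot pass as a valid empty answer.
    """
    right_by_key = {}
    for right in right_rows:
        right_by_key.setdefault(entity_key(right[right_field]), []).append(right)

    joined = [
        {**left, **right}
        for left in left_rows
        for right in right_by_key.get(entity_key(left[left_field]), [])
    ]
    if left_rows and right_rows and not joined:
        raise RuntimeError("join produced zero rows; check for an ID format mismatch")
    return joined

# Hypothetical rows standing in for the MongoDB and DuckDB results.
mongo_rows = [{"business_id": "businessid_49", "name": "Cafe"}]
duck_rows = [{"business_ref": "businessref_49", "stars": 4}]
result = join_on_entity(mongo_rows, duck_rows, "business_id", "business_ref")
```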
Mixed date formats: one column, 6 formats. A single strptime pattern silently drops every row that doesn't match. We were undercounting by nearly half before we caught it.
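
Roughly what the fix looks like, as a sketch: try several patterns per value and keep the rows that match none of them, so the undercount is visible. The format list below is a placeholder, not the actual six formats in the benchmark data:

```python
from datetime import datetime

# Placeholder patterns; the real list should come from profiling the column.
# Order matters when formats are ambiguous (e.g. day-first vs month-first).
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%Y%m%d"]

def parse_date(text):
    """Return a datetime for the first matching pattern, or None if none match."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None

def parse_column(values):
    """Split values into parsed datetimes and the raw strings that failed."""
    parsed, unparsed = [], []
    for value in values:
        result = parse_date(value)
        if result is None:
            unparsed.append(value)
        else:
            parsed.append(result)
    return parsed, unparsed

sample = ["2024-03-05", "05/03/2024", "March 5, 2024", "20240305", "sometime in 2024"]
parsed, unparsed = parse_column(sample)
```

Checking `len(unparsed)` against the row count before aggregating is what would have caught our undercount early. Note that `%B` matches month names in the current locale, so this assumes English month names.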

No category field: categories are embedded in a free-text description field. Querying for a category field returns zero rows, and no error is raised.
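
One way to handle this, sketched with plain keyword matching. The keyword map and the `description` field name are invented for the example; a real mapping would have to come from reading the actual data:

```python
import re

# Hypothetical category -> keyword map, for illustration only.
CATEGORY_KEYWORDS = {
    "restaurant": ["restaurant", "diner", "bistro"],
    "auto repair": ["auto repair", "mechanic"],
}

def extract_categories(description):
    """Return the set of categories whose keywords appear in the free text."""
    text = description.lower()
    found = set()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(re.search(rf"\b{re.escape(kw)}\b", text) for kw in keywords):
            found.add(category)
    return found

def filter_by_category(rows, category, text_field="description"):
    """Filter rows by a category derived from the description text.

    Raises on an unknown category rather than returning an empty list.
    """
    if category not in CATEGORY_KEYWORDS:
        raise KeyError(f"unknown category: {category!r}")
    return [
        row for row in rows
        if category in extract_categories(row.get(text_field, ""))
    ]

rows = [
    {"name": "A", "description": "Family-run bistro with outdoor seating"},
    {"name": "B", "description": "Trusted mechanic since 1990"},
]
restaurants = filter_by_category(rows, "restaurant")
```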

Validator sensitivity: right answer, wrong word order = fail. The validator checks exact format, not just correctness.

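Fix for all of these: load the knowledge into context before the query arrives. That means the ID mappings between databases, the date formats in each column, where the categories actually live, and the exact output format the validator expects.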

Not fine-tuning, not a bigger model. Context engineering.
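
To make that concrete, here is a sketch of the idea, not our actual prompt: keep the known quirks as short notes and prepend them to the system prompt before any query is seen. The note wording and function name are hypothetical:

```python
# Hypothetical notes summarizing the quirks described in this post.
KNOWLEDGE_NOTES = [
    "Business IDs are 'businessid_N' in MongoDB and 'businessref_N' in DuckDB; join on N.",
    "Date columns can mix several formats; try multiple patterns and report unparsed rows.",
    "There is no category field; categories appear inside the free-text description.",
    "The validator checks exact output format, including word order.",
]

def build_system_prompt(base_instructions, notes):
    """Prepend known data quirks to the agent's instructions as a bullet list."""
    lines = [base_instructions.strip(), "", "Known data quirks:"]
    lines.extend(f"- {note}" for note in notes)
    return "\n".join(lines)

prompt = build_system_prompt(
    "You answer questions by querying several databases.",
    KNOWLEDGE_NOTES,
)
```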

Submitting to DAB this week. Will post results.

What's the messiest data issue you've hit building agents in production?