Real failure modes we hit building a multi-database data agent against DataAgentBench (DAB)

Posted by Life_Meringue_4343@reddit | LocalLLaMA | 2 comments

Spent the last week building against DataAgentBench (github.com/ucbepic/DataAgentBench): 54 queries across PostgreSQL, MongoDB, SQLite, DuckDB. Best frontier model score is 38%.

Here's what actually broke our agent. Not SQL generation. Reality.

Silent join failure: same entity stored as "businessid_49" in MongoDB and "businessref_49" in DuckDB. Agent joins, gets zero rows, returns empty with no error. Looks like a valid answer. Isn't.
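To make the join-key problem concrete, here is a minimal sketch of normalizing both key styles before joining and failing loudly on an empty result. The field names `business_id` and `business_ref`, and the assumption that the numeric suffix is the shared identifier, are illustrative guesses, not something confirmed by the benchmark.

```python
import re

# Assumed key shapes, based only on the two examples in the post.
KEY_PATTERN = re.compile(r"^(businessid|businessref)_(\d+)$")

def normalize_key(raw: str) -> str:
    """Reduce 'businessid_49' or 'businessref_49' to the numeric suffix '49'."""
    match = KEY_PATTERN.match(raw.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized business key: {raw!r}")
    return match.group(2)

def join_on_business(mongo_rows, duck_rows):
    """Inner-join two lists of dicts on the normalized key.

    Raises instead of returning [] when both inputs are non-empty but
    nothing matches, so a key mismatch cannot pass as a valid answer.
    """
    duck_by_key = {normalize_key(r["business_ref"]): r for r in duck_rows}
    joined = []
    for row in mongo_rows:
        key = normalize_key(row["business_id"])
        if key in duck_by_key:
            joined.append({**row, **duck_by_key[key]})
    if mongo_rows and duck_rows and not joined:
        raise RuntimeError("join produced zero rows; check key formats")
    return joined
```

The zero-row check is the important part: an empty join between two non-empty sources is treated as a suspected bug rather than an answer.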

Mixed date formats: same column, 6 formats. A single strptime pattern silently drops rows that don't match. We were undercounting by nearly half before we caught it.
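A rough sketch of the alternative to a single strptime pattern: try a list of candidate formats and keep the rows that match none of them, so the undercount is visible. The format list is an assumption for illustration; the post does not say which six formats actually appear.

```python
from datetime import datetime

# Assumed formats, for illustration only. Order matters for ambiguous
# inputs such as 03/04/2021 (read here as month/day/year).
CANDIDATE_FORMATS = [
    "%Y-%m-%d",
    "%m/%d/%Y",
    "%Y%m%d",
    "%Y-%m-%dT%H:%M:%S",
    "%d %b %Y",
    "%B %d, %Y",
]

def parse_date(raw):
    """Return a datetime for the first matching format, or None."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

def parse_column(values):
    """Split a column into parsed datetimes and raw values that failed."""
    parsed, unparsed = [], []
    for value in values:
        result = parse_date(value)
        if result is not None:
            parsed.append(result)
        else:
            unparsed.append(value)
    return parsed, unparsed
```

Reporting `len(unparsed)` next to the final count is what would have caught the undercount early.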

No category field: categories are embedded in a free-text description field. Querying for a dedicated category field returns zero rows with no error raised.
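One simple way to handle this, sketched under heavy assumptions: derive categories from the description text with a keyword map. The category names and keywords below are invented for the example, and a real agent might use an LLM or a richer vocabulary instead.

```python
import re

# Hypothetical vocabulary; the real one would come from inspecting the data.
CATEGORY_KEYWORDS = {
    "restaurant": ["restaurant", "diner", "bistro"],
    "salon": ["salon", "barber"],
}

def extract_categories(description):
    """Return the set of categories whose keywords appear in the text."""
    text = description.lower()
    found = set()
    for category, keywords in CATEGORY_KEYWORDS.items():
        for keyword in keywords:
            # Whole-word match, tolerating a simple plural "s".
            if re.search(rf"\b{re.escape(keyword)}s?\b", text):
                found.add(category)
                break
    return found
```

The returned set can then be filtered on in place of the category column that doesn't exist.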

Validator sensitivity: right answer, wrong word order = fail. The validator checks exact format, not just correctness.
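For triaging failures during development, a small helper like the one below can separate "wrong answer" from "right answer, wrong format". This is only a sketch and does not reproduce DAB's actual validator; the function names and the token-based comparison are my own assumptions.

```python
import re

def _tokens(answer):
    """Lowercased word tokens, sorted so that word order is ignored."""
    return sorted(re.findall(r"\w+", answer.lower()))

def classify_mismatch(predicted, expected):
    """Label a prediction as 'exact', 'format_only', or 'wrong'.

    'format_only' means the same words are present but order, case,
    or punctuation differs from the expected string.
    """
    if predicted.strip() == expected.strip():
        return "exact"
    if _tokens(predicted) == _tokens(expected):
        return "format_only"
    return "wrong"
```

Anything labeled `format_only` points at the output formatting step rather than at the query logic.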

Fix for all of these: load the knowledge into context before the query arrives.

Not fine-tuning, not a bigger model. Context engineering.
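A minimal sketch of what "load the knowledge into context" can look like: keep the known quirks as plain notes and prepend them to every question. The note wording, the `DATASET_NOTES` name, and the prompt layout are assumptions for illustration, not the exact setup described in this post.

```python
# Hypothetical notes distilled from the failure modes listed in this post.
DATASET_NOTES = {
    "join keys": (
        "MongoDB uses businessid_<n>, DuckDB uses businessref_<n>; "
        "normalize before joining and treat zero rows as suspicious."
    ),
    "dates": "One column mixes several date formats; try multiple patterns.",
    "categories": "No category column; categories live in the description text.",
    "answer format": "The validator checks exact format, including word order.",
}

def build_prompt(question, notes=DATASET_NOTES):
    """Prepend the dataset notes to the user question."""
    lines = ["Known dataset quirks (read before writing any query):"]
    lines += [f"- {topic}: {note}" for topic, note in notes.items()]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)
```

The notes are static text, so they cost nothing at query time beyond the extra tokens.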

Submitting to DAB this week. Will post results.

What's the messiest data issue you've hit building agents in production?


sanchita_1607@reddit

yk this is why agents break in prod... not reasoning, but hidden assumptions in data. I like the load-context-before-query approach, almost like giving it a much-needed reality check. We’ve also tried routing different steps (exploration and execution, via kilo and more) to reduce these failures. It helps a bit, but data quality is still the main focus.

Life_Meringue_4343@reddit (OP)

Exactly! Hidden assumptions are the silent killer. The context-before-query approach is basically forcing the agent to read the room before acting. Your routing idea is interesting! Does routing different steps help with join key mismatches specifically, or more with planning failures generally?