is there a better way to track schema changes without silently breaking downstream reports?
Posted by Ok_Abrocoma_6369@reddit | ExperiencedDevs | 12 comments
we have dbt models pushing schema changes to prod pretty regularly but downstream reports and bi dashboards keep breaking silently. No alerts, just find out when someone complains a week later.
current setup is basic git history + dbt docs, but that doesn't catch when a column rename or type change nukes a join in some forgotten looker dashboard. tried adding pre-deploy checks with sqlfluff but it's too static, misses runtime impacts.
our team is small, 4 data engs handling 50+ models across prod/staging. leadership wants zero breakage but manually reviewing every pr is killing us.
anyone got a lightweight way to track this? like dbt macros that flag downstream deps, or some schema diff tool that pings slack on breaks. open source preferred since budget sucks. what've you seen work at scale without turning into a full ci nightmare?
curious how others avoid this treadmill.
mattgen88@reddit
We use events with strict backwards compatibility enforcement. Publish those to kafka. Anyone else consumes those instead of relying on our schema. That allows us to change schemas for our application independent of the events. Events can be create/update/delete or business events, or even a mix of both.
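A minimal sketch of the backwards-compatibility gate this describes, as a CI check run before a new event schema version is published. The `is_backward_compatible` helper and the field names are invented for illustration; real setups would typically lean on a schema registry's compatibility modes instead.

```python
# Hypothetical check: a new event schema is backward compatible only if
# every field existing consumers rely on keeps its name and type.
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    for field, ftype in old_schema.items():
        if new_schema.get(field) != ftype:
            return False
    return True

old = {"order_id": "string", "amount": "double"}
new_ok = {"order_id": "string", "amount": "double", "currency": "string"}
new_bad = {"order_id": "string", "amount": "long"}  # type change breaks consumers

assert is_backward_compatible(old, new_ok)      # adding a field is fine
assert not is_backward_compatible(old, new_bad)  # changing a type is not
```

The point is that consumers couple to the event contract, not to your tables, so the check lives at the one place schemas cross a team boundary.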
caprisunkraftfoods@reddit
I agree with those saying they shouldn't depend on your schema, but the more fundamental antipattern here is a lack of tests. Nothing will ever truly stop you from making break changes besides a test that says "hey you broke this". Either the reports need to be in your test pipeline, or they need to be generated from something tested in your pipeline. Or secret third option: you communicate how much work this would be to leadership and they get off your case.
hurley_chisholm@reddit
I like a lot of what JakoMyto said. I'll add two suggestions, one partially non-technical and the other more technical:
A few things will happen: old dashboards and report generators will get archived/deleted, reducing your overall analytics footprint. Fewer people will come to you annoyed, or at least they'll have specific questions. If you do step 2, it will make the support process on your side go faster since you'll have docs to follow.
Hope this helps & good luck!
dvorgson@reddit
there's a reason apis have /v1/, /v2/, etc
JakoMyto@reddit
That's a good one. In the first place I find it useful to do schema migrations via a dedicated tool with versioning, like Flyway.
Another point is trying to keep the schema backwards compatible for one release back. This means two-step column renaming, for example: first add the new column and duplicate the data, then move towards using the new one everywhere, and only then drop the old column.
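The two-step rename above can be sketched like this, using sqlite3 purely so it runs anywhere; in a real setup each step would be its own Flyway/dbt migration, and the table and column names here are made up.

```python
# Expand/contract rename sketch: old readers survive release N,
# the old column is only dropped once nothing reads it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (cust_name TEXT)")
conn.execute("INSERT INTO orders VALUES ('acme')")

# Release N: add the new column and backfill; old readers keep working.
conn.execute("ALTER TABLE orders ADD COLUMN customer_name TEXT")
conn.execute("UPDATE orders SET customer_name = cust_name")

# Release N+1, after every consumer reads customer_name: drop the old column.
# (DROP COLUMN needs SQLite >= 3.35; older versions rebuild the table instead.)
conn.execute("ALTER TABLE orders DROP COLUMN cust_name")

row = conn.execute("SELECT customer_name FROM orders").fetchone()
print(row[0])  # acme
```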
Another thing I found useful when I was working on a BI solution: I made a pipeline that looks for diffs in the folder containing the db migrations and calls a Slack webhook if it finds something. You can make it more advanced and look for specific changes.
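A rough sketch of that pipeline step: feed it the list of files changed in a PR (e.g. from `git diff --name-only main...HEAD`), keep only migration files, and build a Slack webhook payload. The `migrations/` path and message format are assumptions, not from the comment.

```python
# Hypothetical CI step: turn changed migration files into a Slack payload.
import json

def migration_alert(changed_files):
    migrations = [f for f in changed_files if f.startswith("migrations/")]
    if not migrations:
        return None  # nothing schema-related changed, stay quiet
    text = "Schema change incoming:\n" + "\n".join(f"- {f}" for f in migrations)
    return json.dumps({"text": text})  # POST this to the Slack webhook URL

payload = migration_alert(["migrations/V42__rename_col.sql", "models/orders.sql"])
print(payload)
```

The nice part is the alert is driven by the migrations folder itself, so it cannot drift out of sync with what actually ships.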
One more option is to never read directly from tables but instead use db views. They should break earlier, and you can add tests to verify they work.
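The "break earlier" property can be shown in a few lines, again with sqlite3 for portability and invented names. The base table is rebuilt with a renamed column, and the view that downstream reads fails loudly at query time, which a migration-stage test can catch before any dashboard does.

```python
# Views as a stable read interface: base table changes, view breaks first.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amt REAL)")
conn.execute("CREATE VIEW report_sales AS SELECT amt AS amount FROM sales")
conn.execute("INSERT INTO sales VALUES (9.5)")

# A careless migration rebuilds the table with a renamed column...
conn.execute("DROP TABLE sales")
conn.execute("CREATE TABLE sales (amount_usd REAL)")

# ...and the view now errors immediately instead of a report going stale:
try:
    conn.execute("SELECT amount FROM report_sales")
except sqlite3.OperationalError as e:
    print("view broken:", e)
```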
But in general shared database is an antipattern.
aa-b@reddit
Too right! My org solved this by building reports based on a federated graphql schema backed by distributed microservices, and... wait, no, nothing is solved. Turns out it's just as fragile, but also spread out all over the place
JakoMyto@reddit
That's also a very good point! APIs have a bit more tooling to help with versioning and breaking changes, like specs and contract tests, but we still have the human factor.
At the end of the day a breaking change will eventually come, and you'll need some sort of strategy to deal with it.
PrideDense2206@reddit
Are you using something like open lineage to track dependencies (columns of upstream tables) across the Lakehouse? This can help to show what is being used where (can even track from the reports to get a full end to end dependency graph).
Without adding lineage, what about table formats? If you're using Iceberg or Delta Lake, they both have schema enforcement (validation and evolution capabilities), so even if you want to nuke the schema or drop some columns, you have to make table modifications with explicit overwrite parameters or the operations get short-circuited.
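The enforcement behaviour being described looks roughly like this sketch. This is not the actual Iceberg/Delta API, just the shape of the rule: additive evolution passes, destructive changes are short-circuited unless the writer explicitly opts in (mirroring something like Delta's `overwriteSchema` write option).

```python
# Toy model of table-format schema enforcement on write.
def write(table_cols, incoming_cols, overwrite_schema=False):
    dropped = set(table_cols) - set(incoming_cols)
    if dropped and not overwrite_schema:
        raise ValueError(f"write rejected, would drop columns: {sorted(dropped)}")
    return set(incoming_cols)

write({"id", "amount"}, {"id", "amount", "currency"})  # additive evolution: allowed
try:
    write({"id", "amount"}, {"id"})  # silent column drop: short-circuited
except ValueError as e:
    print(e)
```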
If you test for backwards compatibility (which was a good question from JakoMyto) then these kinds of problems are usually more of a “this column used to have data and now doesn’t” vs this column doesn’t exist anymore.
Testing is always the name of the game. If you can incorporate table lineage at the column level then you also start to understand how each table is being used and then you can of course start capturing data quality metrics (based on the expectations of each table - does it have nulls where it shouldn’t, is this value within the normal range of 1-100, etc)
yxhuvud@reddit
Don't let them use your schema directly - either create an external API or a set of db views that you promise to keep up to date (and test, preferably).
TolfdirsAlembic@reddit
We also had this problem (same stack). Closest we got was using https://github.com/looker-open-source/look-at-me-sideways
And automating an alert or action in GitHub when one of the models does something that might break a model. It required a fair amount of custom linting rules and wasn't 100% reliable.
You could also consider some kind of data contracts. Again not a perfect solution but would be better. No breakages ever isn't realistic. But if your team are the bottleneck, you either need to review every PR as the people accountable for these dashboards, or provide self serve tools and take yourself out of the review loop for dashboards entirely
HolyPommeDeTerre@reddit
Integration tests?
I would build a suite of different dataset structures for the reports, then run them through that suite every time you make a change.
That should give you an anti-regression suite. If there's a missing test, add it.
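One way to sketch that anti-regression suite: declare the columns each report expects and assert them against the live query on every change. The report name, schema, and `check_report` helper here are invented; in the OP's stack this would live alongside dbt tests.

```python
# Toy anti-regression check: each report declares the columns it needs,
# and the suite fails loudly if a schema change removes one.
import sqlite3

EXPECTED = {
    "weekly_revenue": {"week", "revenue"},  # columns the dashboard joins on
}

def check_report(conn, name, sql):
    cols = {d[0] for d in conn.execute(sql).description}
    missing = EXPECTED[name] - cols
    assert not missing, f"{name} lost columns: {sorted(missing)}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (week TEXT, revenue REAL)")
check_report(conn, "weekly_revenue", "SELECT week, revenue FROM revenue")
print("weekly_revenue ok")
```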
hibbelig@reddit
Add tests that test the downstream reports?