is there a better way to track schema changes without silently breaking downstream reports?
Posted by Ok_Abrocoma_6369@reddit | ExperiencedDevs | 12 comments
we have dbt models pushing schema changes to prod pretty regularly but downstream reports and bi dashboards keep breaking silently. No alerts, just find out when someone complains a week later.
current setup is basic git history + dbt docs, but that doesn't catch when a column rename or type change nukes a join in some forgotten looker dashboard. tried adding pre-deploy checks with sqlfluff but it's too static, misses runtime impacts.
our team is small, 4 data engs handling 50+ models across prod/staging. leadership wants zero breakage but manually reviewing every pr is killing us.
anyone got a lightweight way to track this? like dbt macros that flag downstream deps, or some schema diff tool that pings slack on breaks. open source preferred since budget sucks. what've you seen work at scale without turning into a full ci nightmare?
curious how others avoid this treadmill.
mattgen88@reddit
We use events with strict backwards compatibility enforcement. Publish those to kafka. Anyone else consumes those instead of relying on our schema. That allows us to change schemas for our application independent of the events. Events can be create/update/delete or business events, or even a mix of both.
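A minimal sketch of the backwards-compatibility gate this describes, as a CI check run before a new event schema version is published. The `is_backward_compatible` helper and the field names are invented for illustration; real setups would typically lean on a schema registry's compatibility modes instead.

```python
# Hypothetical check: a new event schema is backward compatible only if
# every field existing consumers rely on keeps its name and type.
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    for field, ftype in old_schema.items():
        if new_schema.get(field) != ftype:
            return False
    return True

old = {"order_id": "string", "amount": "double"}
new_ok = {"order_id": "string", "amount": "double", "currency": "string"}
new_bad = {"order_id": "string", "amount": "long"}  # type change breaks consumers

assert is_backward_compatible(old, new_ok)      # adding a field is fine
assert not is_backward_compatible(old, new_bad)  # changing a type is not
```

The point is that consumers couple to the event contract, not to your tables, so the check lives at the one place schemas cross a team boundary.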
caprisunkraftfoods@reddit
I agree with those saying they shouldn't depend on your schema, but the more fundamental antipattern here is a lack of tests. Nothing will ever truly stop you from making break changes besides a test that says "hey you broke this". Either the reports need to be in your test pipeline, or they need to be generated from something tested in your pipeline. Or secret third option: you communicate how much work this would be to leadership and they get off your case.
hurley_chisholm@reddit
I like a lot of what JakoMyto said. I'll add two suggestions, one partially non-technical and the other more technical:
A few things will happen: old dashboards and report generators will get archived/deleted, reducing your overall analytics footprint. Fewer people will come to you annoyed, or at least they'll have specific questions. If you do step 2, it will make the support process on your side go faster since you'll have docs to follow.
Hope this helps & good luck!
dvorgson@reddit
there's a reason apis have /v1/, /v2/, etc
JakoMyto@reddit
That's a good one. In the first place I find it useful to do schema migrations via a dedicated tool with versioning, like Flyway.
Another point is trying to keep the schema backwards compatible for one release back. This means two-step column renaming, for example: first add the new column and duplicate the data, then move towards using the new one everywhere, and only then drop the old column.
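The two-step rename above can be sketched like this, using sqlite3 purely so it runs anywhere; in a real setup each step would be its own Flyway/dbt migration, and the table and column names here are made up.

```python
# Expand/contract rename sketch: old readers survive release N,
# the old column is only dropped once nothing reads it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (cust_name TEXT)")
conn.execute("INSERT INTO orders VALUES ('acme')")

# Release N: add the new column and backfill; old readers keep working.
conn.execute("ALTER TABLE orders ADD COLUMN customer_name TEXT")
conn.execute("UPDATE orders SET customer_name = cust_name")

# Release N+1, after every consumer reads customer_name: drop the old column.
# (DROP COLUMN needs SQLite >= 3.35; older versions rebuild the table instead.)
conn.execute("ALTER TABLE orders DROP COLUMN cust_name")

row = conn.execute("SELECT customer_name FROM orders").fetchone()
print(row[0])  # acme
```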
Another thing I found useful when I was working on a BI solution: I made a pipeline that looks for diffs in the folder containing the db migrations and calls a Slack webhook if it finds something. You can make it more advanced and look for specific changes.
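A rough sketch of that pipeline step: feed it the list of files changed in a PR (e.g. from `git diff --name-only main...HEAD`), keep only migration files, and build a Slack webhook payload. The `migrations/` path and message format are assumptions, not from the comment.

```python
# Hypothetical CI step: turn changed migration files into a Slack payload.
import json

def migration_alert(changed_files):
    migrations = [f for f in changed_files if f.startswith("migrations/")]
    if not migrations:
        return None  # nothing schema-related changed, stay quiet
    text = "Schema change incoming:\n" + "\n".join(f"- {f}" for f in migrations)
    return json.dumps({"text": text})  # POST this to the Slack webhook URL

payload = migration_alert(["migrations/V42__rename_col.sql", "models/orders.sql"])
print(payload)
```

The nice part is the alert is driven by the migrations folder itself, so it cannot drift out of sync with what actually ships.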
One more option is to never read directly from tables but instead use db views. They should break earlier, and you can add tests to verify they work.
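The "break earlier" property can be shown in a few lines, again with sqlite3 for portability and invented names. The base table is rebuilt with a renamed column, and the view that downstream reads fails loudly at query time, which a migration-stage test can catch before any dashboard does.

```python
# Views as a stable read interface: base table changes, view breaks first.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amt REAL)")
conn.execute("CREATE VIEW report_sales AS SELECT amt AS amount FROM sales")
conn.execute("INSERT INTO sales VALUES (9.5)")

# A careless migration rebuilds the table with a renamed column...
conn.execute("DROP TABLE sales")
conn.execute("CREATE TABLE sales (amount_usd REAL)")

# ...and the view now errors immediately instead of a report going stale:
try:
    conn.execute("SELECT amount FROM report_sales")
except sqlite3.OperationalError as e:
    print("view broken:", e)
```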
But in general shared database is an antipattern.
aa-b@reddit
Too right! My org solved this by building reports based on a federated graphql schema backed by distributed microservices, and... wait, no, nothing is solved. Turns out it's just as fragile, but also spread out all over the place
JakoMyto@reddit
That's also a very good point! APIs have a bit more tooling to help with versioning and breaking changes, like specs and contract tests, but we still have the human factor.
At the end of the day a breaking change will eventually come, and you'll need some sort of strategy to deal with it.
PrideDense2206@reddit
Are you using something like open lineage to track dependencies (columns of upstream tables) across the Lakehouse? This can help to show what is being used where (can even track from the reports to get a full end to end dependency graph).
Without adding lineage, what about table formats? If you're using Iceberg or Delta Lake, they both have schema enforcement (validation and evolution capabilities), so even if you want to nuke the schema or drop some columns, you have to make table modifications with explicit overwrite parameters or the operations get short-circuited.
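The enforcement behaviour being described looks roughly like this sketch. This is not the actual Iceberg/Delta API, just the shape of the rule: additive evolution passes, destructive changes are short-circuited unless the writer explicitly opts in (mirroring something like Delta's `overwriteSchema` write option).

```python
# Toy model of table-format schema enforcement on write.
def write(table_cols, incoming_cols, overwrite_schema=False):
    dropped = set(table_cols) - set(incoming_cols)
    if dropped and not overwrite_schema:
        raise ValueError(f"write rejected, would drop columns: {sorted(dropped)}")
    return set(incoming_cols)

write({"id", "amount"}, {"id", "amount", "currency"})  # additive evolution: allowed
try:
    write({"id", "amount"}, {"id"})  # silent column drop: short-circuited
except ValueError as e:
    print(e)
```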
If you test for backwards compatibility (which was a good question from JakoMyto) then these kinds of problems are usually more of a “this column used to have data and now doesn’t” vs this column doesn’t exist anymore.
Testing is always the name of the game. If you can incorporate table lineage at the column level then you also start to understand how each table is being used and then you can of course start capturing data quality metrics (based on the expectations of each table - does it have nulls where it shouldn’t, is this value within the normal range of 1-100, etc)
yxhuvud@reddit
Don't let them use your schema directly - either create an external API or a set of db views that you promise to keep up to date (and test, preferably).
TolfdirsAlembic@reddit
We also had this problem (same stack). Closest we got was using https://github.com/looker-open-source/look-at-me-sideways
And automating an alert or action in GitHub when one of the models does something that might break a model. It required a fair amount of custom linting rules and wasn't 100% reliable.
You could also consider some kind of data contracts. Again not a perfect solution but would be better. No breakages ever isn't realistic. But if your team are the bottleneck, you either need to review every PR as the people accountable for these dashboards, or provide self serve tools and take yourself out of the review loop for dashboards entirely
HolyPommeDeTerre@reddit
Integration tests?
I would build a suite of different dataset structures for the reports, then run them through that suite every time you make a change.
That should give you an anti-regression suite. If there's a missing test, add it.
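One way to sketch that anti-regression suite: declare the columns each report expects and assert them against the live query on every change. The report name, schema, and `check_report` helper here are invented; in the OP's stack this would live alongside dbt tests.

```python
# Toy anti-regression check: each report declares the columns it needs,
# and the suite fails loudly if a schema change removes one.
import sqlite3

EXPECTED = {
    "weekly_revenue": {"week", "revenue"},  # columns the dashboard joins on
}

def check_report(conn, name, sql):
    cols = {d[0] for d in conn.execute(sql).description}
    missing = EXPECTED[name] - cols
    assert not missing, f"{name} lost columns: {sorted(missing)}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (week TEXT, revenue REAL)")
check_report(conn, "weekly_revenue", "SELECT week, revenue FROM revenue")
print("weekly_revenue ok")
```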
hibbelig@reddit
Add tests that test the downstream reports?