Kafka schema evolution & breaking changes: what do production teams actually do?
Posted by Lucky_Psychology8275@reddit | ExperiencedDevs | 47 comments
My company kinda lacks Kafka experts and I really need guidance on what the accepted standard practices are for managing Kafka schemas and ser/deser on the client side (Spring Cloud Stream), especially in the context of an HA deployment.
Obviously using a schema registry like Confluent's seems like a no-brainer. But handling breaking changes does not seem to have, to my knowledge at least, any well-established solution. You could use headers, different topic names, or even union types.
Is there a state-of-the-art reference documenting the issues that teams running it in production have encountered and their solutions? I'm not looking for a cookie-cutter solution, I just want some guidance.
coleavenue@reddit
In my experience what production teams actually do is bitterly regret using Kafka. Not necessarily because it's bad, but because people frequently pick it when all they actually need is SQS or something, and they fuck it all up and have a bad time.
CodelinesNL@reddit
Only the incompetent ones. Bad engineers blame the tools. Good engineers learn what patterns do and don't work.
CodelinesNL@reddit
It does: it's called API versioning. Because that's what you're doing.
We have the version as part of the topic name. Breaking changes can only happen in version upgrades. And we need to maintain the previous version for a certain amount of time too, typically a few months for external topics. That approach is pretty much standard. You don't want different versions of the same event on the same topic.
Using systems like a schema registry and Avro makes it easier to do backward-compatible changes. But the reason this is 'hard' is that API versioning is hard, especially when engineers don't care about thinking ahead.
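A minimal sketch of what the topic-per-version approach can look like with the plain Kafka producer API (topic names and payload shapes here are made up):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class VersionedTopicProducer {
    // Hypothetical topic names: the version lives in the topic name, so a
    // breaking change means a new topic, while the old one is kept alive
    // for the agreed deprecation window (e.g. a few months).
    private static final String TOPIC_V1 = "orders.v1";
    private static final String TOPIC_V2 = "orders.v2";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // During the migration window the producer feeds both topics;
            // consumers move to orders.v2 on their own schedule.
            producer.send(new ProducerRecord<>(TOPIC_V1, "order-42",
                "{\"id\":42,\"amount\":10}"));
            producer.send(new ProducerRecord<>(TOPIC_V2, "order-42",
                "{\"id\":42,\"amount\":{\"value\":10,\"currency\":\"EUR\"}}"));
        }
    }
}
```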
rsalot@reddit
The steps are always the same.
Using protobuf or Avro with a schema registry is just a way to prevent breaking changes.
The schema registry doesn't help you make breaking changes.
Most of the time the solution is not to use Kafka, even if it's cool for your resume.
Also, define what a breaking change is. If you want to rename a field, for example, that is not a breaking change in protobuf, because the wire format identifies fields by number, not by name.
If you run a real Kafka system with high load, the answer is almost always not to make a breaking change.
Lucky_Psychology8275@reddit (OP)
I agree about avoiding breaking changes, although I don’t want this rule to create a huge mess of technical debt just to avoid breaking changes.
And the rest of what you say is interesting but situational. I’m really looking for something broader. I only have one consumer so I’d rather double read than double write.
I agree Kafka does fit our asynchronous logic, but yeah, I didn't make those choices.
Duathdaert@reddit
You will create a mess in prod if you don't follow these rules
Lucky_Psychology8275@reddit (OP)
Why though?
Illustrious_Pea_3470@reddit
If you instead do double reads, but the new write path has issues, then between cutting over the write path and finding the issue, you are losing data. Always unacceptable.
Lucky_Psychology8275@reddit (OP)
I also have another question: how do you roll back prod if user data has been written to new tables? Do you always plan a way to roll back the changes with an undo script?
Illustrious_Pea_3470@reddit
That’s why we double write. If the rollout goes badly, just drop the new table and try again from scratch.
Lucky_Psychology8275@reddit (OP)
I meant outside the Kafka context. Just a REST API with breaking changes in its DB.
Illustrious_Pea_3470@reddit
Yes, all changes should always have an immediate rollback plan. In some rare cases it’s not possible, in which case you either have to consider other solutions that would make it possible (such as decoupling things so you can do the double write pattern), or have an extremely high level of testing and a lot of engineering resources available when you go live.
So e.g. adding an enum value in Postgres should come with a downgrade script that understands what to do if the new value has been written.
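A sketch of what such a downgrade script could look like, assuming a hypothetical orders table whose status column uses an order_status enum and the upgrade added a partially_shipped value. Postgres cannot drop a value from an enum in place, so the type has to be recreated:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DowngradeOrderStatusEnum {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "app", "secret")) {
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                // 1. The downgrade has to understand what to do if the new
                //    value has already been written: here we map it back.
                st.executeUpdate(
                    "UPDATE orders SET status = 'shipped' WHERE status = 'partially_shipped'");

                // 2. Recreate the enum without the new value and re-point the
                //    column at it (ALTER TYPE ... DROP VALUE doesn't exist).
                st.executeUpdate("ALTER TYPE order_status RENAME TO order_status_old");
                st.executeUpdate(
                    "CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered')");
                st.executeUpdate(
                    "ALTER TABLE orders ALTER COLUMN status TYPE order_status "
                    + "USING status::text::order_status");
                st.executeUpdate("DROP TYPE order_status_old");
            }
            conn.commit();
        }
    }
}
```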
Lucky_Psychology8275@reddit (OP)
You could apply the same technique for rolling back a double-read Kafka consumer, couldn't you? Do you prefer a double-write producer because you see it as a simpler alternative?
Illustrious_Pea_3470@reddit
If you’re not double writing at some point, then errors in the new write path will always lead to data loss.
Lucky_Psychology8275@reddit (OP)
Even if the consumer is just a database writer?
Illustrious_Pea_3470@reddit
In your double read scenario, are you creating a new consumer for the new output, or plugging both queues into the consumer and behaving differently if the new version is detected?
Lucky_Psychology8275@reddit (OP)
I would start by updating the consumer and make it so it can double read. It would write the same data type to the database.
Illustrious_Pea_3470@reddit
Then yes, that will lead to data loss. You’ll be ready for both output formats. You’ll swap the writer to the new format.
Now the bug is discovered. It takes non-zero time to swap the writer back.
During that non-zero time, a request happened. The writer tried to write it, but the bug means that whatever got written isn't enough to reconstruct the request.
That request was just lost. Poof. Gone. Hope it wasn’t important!
Lucky_Psychology8275@reddit (OP)
I see. If a few messages are lost in specific circumstances, that might not be a big deal in our case. It's informative more than actionable.
Illustrious_Pea_3470@reddit
Then this is not a high availability system
Illustrious_Pea_3470@reddit
Because this is trivial to roll back one step at a time, so you can get where you're going if things go right AND back to where you started if things go wrong.
Lucky_Psychology8275@reddit (OP)
So the correct way for me would be to upgrade all producers so they double write, then update my consumer?
Illustrious_Pea_3470@reddit
Yes, literally always!
Duathdaert@reddit
Breaking changes to a contract deployed in one fell swoop will lead to events that can't be processed.
Illustrious_Pea_3470@reddit
This IS the general case solution:
Add a feature flag to double your writes and a feature flag to change where you read from.
Cut over the first flag to double your writes. Confirm the system still works.
Cut over the second flag to change the read location to the new (second) write location. Confirm the system still works.
Remove the old write path.
Remove the feature flag for the old read location.
Done.
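A rough sketch of those steps, with made-up flag and topic names; in a real setup the flags would come from a feature-flag service so they can be flipped without a redeploy:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class FeatureFlaggedDoubleWrite {
    // Hypothetical flags, read from the environment here for brevity.
    static final boolean DOUBLE_WRITE = Boolean.parseBoolean(System.getenv("FF_DOUBLE_WRITE"));
    static final boolean READ_FROM_NEW = Boolean.parseBoolean(System.getenv("FF_READ_FROM_NEW"));

    static final String OLD_TOPIC = "events.v1";
    static final String NEW_TOPIC = "events.v2";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The old write path stays untouched until the final cleanup step.
            producer.send(new ProducerRecord<>(OLD_TOPIC, "key", "old-format-payload"));
            if (DOUBLE_WRITE) {
                // Step 1: start double writing. Rolling back is just
                // flipping the flag off again; no data has been lost.
                producer.send(new ProducerRecord<>(NEW_TOPIC, "key", "new-format-payload"));
            }
        }

        // Step 2 is the consumer side: each flip is independently reversible.
        String readTopic = READ_FROM_NEW ? NEW_TOPIC : OLD_TOPIC;
        System.out.println("consumers subscribe to: " + readTopic);
    }
}
```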
Odd_Soil_8998@reddit
The only way to deal with breaking changes without technical debt is to build a monolith. If you aren't migrating everything at the same time, you can't avoid it.
Connect_Detail98@reddit
Don't do breaking changes, keep track of your clients. Once you have no clients consuming the old fields (or whatever), then you can cleanup.
You need to set deadlines. Just notify the company that the fields will be removed from the topic (or whatever the breaking change is) on a specific date. Send this notice multiple times. When the date comes, you decide whether to go "good luck everyone" or to give one last and final warning saying it will be removed in the next 2 days.
This is why designing well from the beginning is important, depending on the nature of what you're building. You don't want to give your clients a bumpy ride or constant deprecation threats.
roger_ducky@reddit
If you’re using protobuf or avro, it’s relatively easy to avoid breaking changes unless you’re literally dropping the whole schema for a new one.
The basic idea is to “always have a default value” and “avoid dropping fields if you don’t want data loss”.
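The "always have a default value" rule can be checked mechanically; a small sketch using Avro's SchemaCompatibility API with a made-up schema:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class AvroCompatCheck {
    public static void main(String[] args) {
        // Hypothetical v1 schema.
        Schema writer = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"}]}");

        // v2 adds a field WITH a default, so a v2 reader can still decode
        // old v1 records: the default fills in the missing field.
        Schema reader = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"currency\",\"type\":\"string\",\"default\":\"EUR\"}]}");

        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);
        // Prints COMPATIBLE; remove the default and it becomes INCOMPATIBLE.
        System.out.println(result.getType());
    }
}
```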
Buffocado@reddit
There are also free tools available for protobuf that allow automated breaking change detection, like buf breaking, which is helpful for those who may not have all of the edge cases memorized, or to protect against a careless AI coding agent. A good schema registry should also have breaking change detection and enforcement built right in.
*Disclaimer: I work for Buf, if it's not obvious :-)
Macrobian@reddit
How are you meant to add or remove a required field safely if Protobuf doesn't have any support for asymmetric fields like typical?
Buffocado@reddit
While it's not my favorite answer, the proto2 advice was: "Don't".
The proto3 solution was to remove required fields entirely.
Macrobian@reddit
I think you agree that pushing boundary type validations into the application layer is unwise. It's a shame more flexible validations were not added to proto3 or editions instead of throwing out the whole validation system because Google couldn't figure out how to build a compatibility checker.
Buffocado@reddit
Partly agree! The proto2 required debacle was a real lesson in how a seemingly simple constraint can become a wire-format compatibility nightmare at Google's scale — hence the nuclear option of just removing it in proto3.
But Protovalidate is essentially the answer to exactly this complaint: CEL-based validation rules that live in the schema, are enforced consistently across languages, and, crucially, don't couple your validation semantics to the wire format. You get expressive field constraints back without the "now your schema is permanently broken" trap that required created.
As for building a compatibility checker — someone did eventually figure it out. 😇
Macrobian@reddit
Doesn't Protovalidate have the same issue? The reason that the asymmetric label works for typical is because the validation is fundamentally asymmetric: it enforces that the constructor validation is stronger than the deserialization validation, and thus can be used as a stepping stone. E.g., converting an optional field to required, you:
1. Convert optional to asymmetric, propagating the required data everywhere through the monorepo.
2. Wait until old schema versions where the field is still optional are no longer active. (Validate this with buf breaking or an equivalent compatibility checker.)
3. Convert asymmetric to required.
https://i.imgur.com/uIWIuNu.png
With protovalidate you only have one validation, which means you can never strengthen or weaken your validation across blue-green deployments.
If you had 2 validations, you could:
Relax the validation by first relaxing the deserialisation validation, then the constructor validation:
https://i.imgur.com/cQ6h71y.png
Or narrow the validation by first narrowing the constructor validation, then the deserialisation validation:
https://i.imgur.com/Ljjm4kr.png
Buffocado@reddit
That's a genuinely interesting design. I wasn't familiar with typical's asymmetric label and the staged constructor/deserialization approach. That's a real gap worth acknowledging. I'll bring it back to the team, as they may find it interesting.
Appreciate the thorough writeup, either way!
Macrobian@reddit
No problem! I worked on a closed source proto fork and compatibility checker for 4 years with these features: happy to answer any questions! Great to see Buf improving the Protobuf open source ecosystem.
nog_ar_nog@reddit
This. We use Avro schemas and manage them entirely through a UI tool that does not allow schema fields to be removed. The UI tool submits PRs with the actual schema to a central schema repo. Services use pinned schema repo git refs.
Next_Dance_6076@reddit
I had a phase where I seriously considered it, but never got past the research stage.
travelinzac@reddit
Monotonic additions, make new topics
Sensitive_West_3216@reddit
With Kafka, you will either be using it as a queue (a known set of downstream services consuming messages from the topic) or as a broadcast (an unknown set of downstream services consuming messages).
We keep a key-value pair in the headers which has the schema name the consumer should use to deserialize the message.
So in the case of queues, we ask downstream services to add support for the new schema as well. They use the headers to decide whether to run the old or new flow. After all consumers have added support, we do a deployment on the producer service so that it only produces messages in the new schema on the same topic.
In the case of using it as a broadcast, we create a new topic and publish the same event to both the old and new topics for a while. We then send a mail to all relevant people (basically all team leads) saying they have to move to the new topic because we will stop producing events on the old one, usually on a date 2 sprints ahead so they have time to migrate.
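A minimal sketch of that header convention with the plain Kafka clients API; the header key and schema names are made up:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class SchemaHeaderRouting {
    static final String SCHEMA_HEADER = "schema-name";

    // Producer side: tag every message with the schema it was written with.
    static ProducerRecord<String, byte[]> tag(String topic, byte[] payload, String schemaName) {
        ProducerRecord<String, byte[]> record = new ProducerRecord<>(topic, null, payload);
        record.headers().add(SCHEMA_HEADER, schemaName.getBytes(StandardCharsets.UTF_8));
        return record;
    }

    // Consumer side: pick the old or new flow based on the header.
    static void route(ConsumerRecord<String, byte[]> record) {
        Header header = record.headers().lastHeader(SCHEMA_HEADER);
        String schema = header == null
            ? "OrderEventV1" // assume the old schema if the header is missing
            : new String(header.value(), StandardCharsets.UTF_8);
        if ("OrderEventV2".equals(schema)) {
            // deserialize with the v2 schema and run the new flow
        } else {
            // deserialize with the v1 schema and run the old flow
        }
    }
}
```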
PredictableChaos@reddit
We just create a new topic and once all clients have migrated we shut down the old one. If the new features/fields are needed they'll migrate quickly. If they aren't needed they might lag a little until we set a sunset date on the old topic but they'll get there.
On the client side you may need to coordinate/synchronize the cutover but if you're lucky it won't matter if you double process and then the approach is easier.
Spare_Helicopter4655@reddit
A sane default IMO is to avoid all this (potentially) unnecessary complexity by basically choosing to not adopt Kafka...
Not to say it doesn't have a proper place in some orgs.
on_the_mark_data@reddit
I'm fully biased because I wrote the book on the topic, but Data Contracts might be the pattern you are looking for. For Kafka specifically, using the write-audit-publish (WAP) pattern is powerful. This is a case study I featured in the book that uses data contracts on Kafka in an enterprise setting. https://adevinta.com/techblog/creating-source-aligned-data-products-in-adevinta-spain/
Old-Worldliness-1335@reddit
Just think of it as immutable infrastructure and your life will be better. If you need to treat it as a flexible thing, then don't. Understand that every breaking change you make will more than likely require you to reprocess your entire topic, which is fine, but that's why I like a microservice per topic that handles that logic specifically.
Tarazena@reddit
Use different topics if there are breaking changes: write a job that converts old messages to the new topic, then migrate old systems to use the new topic.
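A sketch of what such a conversion job could look like with the plain Kafka clients; topic names and the conversion function are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TopicConversionJob {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "v1-to-v2-converter");
        // Replay the old topic from the beginning.
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("orders.v1"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    producer.send(new ProducerRecord<>("orders.v2", r.key(), convertToV2(r.value())));
                }
            }
        }
    }

    private static String convertToV2(String v1Payload) {
        // Placeholder: real logic reshapes the old payload into the new schema.
        return v1Payload;
    }
}
```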
Sheldor5@reddit
a breaking change by definition breaks something
sometimes you can't avoid downtime (to update everything simultaneously)