Kafka schema evolution & breaking changes: what do production teams actually do?
Posted by Lucky_Psychology8275@reddit | ExperiencedDevs | 47 comments
My company kinda lacks Kafka experts and I really need guidance on what the accepted standard practices are for managing Kafka schemas and ser/deser on the client side (Spring Cloud Stream), especially in the context of an HA deployment.
Obviously using a schema registry like Confluent's seems like a no-brainer. But handling breaking changes does not seem to have, to my knowledge at least, any well-established solution. You could use headers, different topic names, or even union types.
Is there a state-of-the-art reference documenting the issues that teams running it in production have encountered and their solutions? I'm not looking for a cookie-cutter solution, I just want some guidance.
coleavenue@reddit
In my experience what production teams actually do is bitterly regret using Kafka. Not necessarily because it's bad, but because people frequently pick it when all they actually need is SQS or something, and they fuck it all up and have a bad time.
CodelinesNL@reddit
Only the incompetent ones. Bad engineers blame the tools. Good engineers learn what patterns do and don't work.
CodelinesNL@reddit
It does: it's called API versioning. Because that's what you're doing.
We have the version as part of the topic name. Breaking changes can only happen in version upgrades. And we need to maintain the previous version for a certain amount of time too, typically a few months for external topics. That approach is pretty much standard. You don't want different versions of the same event on the same topic.
Using systems like a schema registry and Avro makes it easier to do backward-compatible changes. But the reason this is 'hard' is that API versioning is hard, especially when engineers don't care about thinking ahead.
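A minimal sketch of what the topic-per-version approach can look like with the plain Kafka producer API (topic names and payload shapes here are made up):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class VersionedTopicProducer {
    // Hypothetical topic names: the version lives in the topic name, so a
    // breaking change means a new topic, while the old one is kept alive
    // for the agreed deprecation window (e.g. a few months).
    private static final String TOPIC_V1 = "orders.v1";
    private static final String TOPIC_V2 = "orders.v2";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // During the migration window the producer feeds both topics;
            // consumers move to orders.v2 on their own schedule.
            producer.send(new ProducerRecord<>(TOPIC_V1, "order-42",
                "{\"id\":42,\"amount\":10}"));
            producer.send(new ProducerRecord<>(TOPIC_V2, "order-42",
                "{\"id\":42,\"amount\":{\"value\":10,\"currency\":\"EUR\"}}"));
        }
    }
}
```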
rsalot@reddit
The steps are always the same.
Using protobuf or Avro with a schema registry is just a way to prevent breaking changes.
The schema registry doesn't help you make breaking changes.
Most of the time the solution is not to use Kafka, even if it's cool for your resume.
Also, define what a breaking change is. If you want to rename a field, for example, that is not a breaking change in protobuf, because the wire format identifies fields by number, not by name.
If you run a real Kafka system with high load, the answer is almost always not to make a breaking change.
Lucky_Psychology8275@reddit (OP)
I agree about avoiding breaking changes, although I don’t want this rule to create a huge mess of technical debt just to avoid breaking changes.
And the rest of what you say is interesting but situational. I’m really looking for something broader. I only have one consumer so I’d rather double read than double write.
I agree Kafka does fit our asynchronous logic, but yeah, I didn't make those choices.
Duathdaert@reddit
You will create a mess in prod if you don't follow these rules
Lucky_Psychology8275@reddit (OP)
Why though?
Illustrious_Pea_3470@reddit
If you instead do double reads, but the new write path has issues, then between cutting over the write path and finding the issue, you are losing data. Always unacceptable.
Lucky_Psychology8275@reddit (OP)
I also have another question: how do you roll back prod if user data has been written to new tables? Do you always plan a way to roll back the changes with an undo script?
Illustrious_Pea_3470@reddit
That’s why we double write. If the rollout goes badly, just drop the new table and try again from scratch.
Lucky_Psychology8275@reddit (OP)
I meant outside the Kafka context. Just a REST API with breaking changes in its DB.
Illustrious_Pea_3470@reddit
Yes, all changes should always have an immediate rollback plan. In some rare cases it’s not possible, in which case you either have to consider other solutions that would make it possible (such as decoupling things so you can do the double write pattern), or have an extremely high level of testing and a lot of engineering resources available when you go live.
So e.g. adding an enum value in Postgres should come with a downgrade script that understands what to do if the new value has been written.
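A sketch of what such a downgrade script could look like, assuming a hypothetical orders table whose status column uses an order_status enum and the upgrade added a partially_shipped value. Postgres cannot drop a value from an enum in place, so the type has to be recreated:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DowngradeOrderStatusEnum {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "app", "secret")) {
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                // 1. The downgrade has to understand what to do if the new
                //    value has already been written: here we map it back.
                st.executeUpdate(
                    "UPDATE orders SET status = 'shipped' WHERE status = 'partially_shipped'");

                // 2. Recreate the enum without the new value and re-point the
                //    column at it (ALTER TYPE ... DROP VALUE doesn't exist).
                st.executeUpdate("ALTER TYPE order_status RENAME TO order_status_old");
                st.executeUpdate(
                    "CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered')");
                st.executeUpdate(
                    "ALTER TABLE orders ALTER COLUMN status TYPE order_status "
                    + "USING status::text::order_status");
                st.executeUpdate("DROP TYPE order_status_old");
            }
            conn.commit();
        }
    }
}
```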
Lucky_Psychology8275@reddit (OP)
You could apply the same technique for rolling back a double-read Kafka consumer, couldn't you? Do you prefer a double-write producer because you see it as a simpler alternative?
Illustrious_Pea_3470@reddit
If you’re not double writing at some point, then errors in the new write path will always lead to data loss.
Lucky_Psychology8275@reddit (OP)
Even if the consumer is just a database writer?
Illustrious_Pea_3470@reddit
In your double read scenario, are you creating a new consumer for the new output, or plugging both queues into the consumer and behaving differently if the new version is detected?
Lucky_Psychology8275@reddit (OP)
I would start by updating the consumer and make it so it can double read. It would write the same data type to the database.
Illustrious_Pea_3470@reddit
Then yes, that will lead to data loss. You’ll be ready for both output formats. You’ll swap the writer to the new format.
Now the bug is discovered. It takes non-zero time to swap the writer back.
During that non-zero time, a request happened. The writer tried to write it, but the bug means that whatever got written isn't enough to reconstruct the request.
That request was just lost. Poof. Gone. Hope it wasn’t important!
Lucky_Psychology8275@reddit (OP)
I see. If a few messages are lost in specific circumstances, that might not be a big deal in our case. It's informative more than actionable.
Illustrious_Pea_3470@reddit
Then this is not a high availability system
Illustrious_Pea_3470@reddit
Because this is trivial to roll back one step at a time, so you can get where you're going if things go right AND back to where you started if things go wrong.
Lucky_Psychology8275@reddit (OP)
So the correct way for me would be to upgrade all producers so they double write, then update my consumer?
Illustrious_Pea_3470@reddit
Yes, literally always!
Duathdaert@reddit
Breaking changes to a contract deployed in one fell swoop will lead to events that can't be processed.
Illustrious_Pea_3470@reddit
This IS the general case solution:
Add a feature flag to double your writes and a feature flag to change where you read from.
Cut over the first flag to double your writes. Confirm the system still works.
Cut over the second flag to change the read location to the new (second) write location. Confirm the system still works.
Remove the old write path.
Remove the feature flag for the old read location.
Done.
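A rough sketch of those steps, with made-up flag and topic names; in a real setup the flags would come from a feature-flag service so they can be flipped without a redeploy:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class FeatureFlaggedDoubleWrite {
    // Hypothetical flags, read from the environment here for brevity.
    static final boolean DOUBLE_WRITE = Boolean.parseBoolean(System.getenv("FF_DOUBLE_WRITE"));
    static final boolean READ_FROM_NEW = Boolean.parseBoolean(System.getenv("FF_READ_FROM_NEW"));

    static final String OLD_TOPIC = "events.v1";
    static final String NEW_TOPIC = "events.v2";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The old write path stays untouched until the final cleanup step.
            producer.send(new ProducerRecord<>(OLD_TOPIC, "key", "old-format-payload"));
            if (DOUBLE_WRITE) {
                // Step 1: start double writing. Rolling back is just
                // flipping the flag off again; no data has been lost.
                producer.send(new ProducerRecord<>(NEW_TOPIC, "key", "new-format-payload"));
            }
        }

        // Step 2 is the consumer side: each flip is independently reversible.
        String readTopic = READ_FROM_NEW ? NEW_TOPIC : OLD_TOPIC;
        System.out.println("consumers subscribe to: " + readTopic);
    }
}
```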
Odd_Soil_8998@reddit
The only way to deal with breaking changes without technical debt is to build a monolith. If you aren't migrating everything at the same time, you can't avoid it.
Connect_Detail98@reddit
Don't do breaking changes, keep track of your clients. Once you have no clients consuming the old fields (or whatever), then you can cleanup.
You need to set deadlines. Just notify the company that the fields will be removed from the topic (or whatever the breaking change is) on a specific date. Send this notice multiple times. When the date comes, you decide whether to go "good luck everyone" or to give one last and final warning saying it will be removed in the next 2 days.
This is why designing well from the beginning is important, depending on the nature of what you're building. You don't want to give your clients a bumpy ride or constant deprecation threats.
roger_ducky@reddit
If you’re using protobuf or avro, it’s relatively easy to avoid breaking changes unless you’re literally dropping the whole schema for a new one.
The basic idea is to “always have a default value” and “avoid dropping fields if you don’t want data loss”.
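The "always have a default value" rule can be checked mechanically; a small sketch using Avro's SchemaCompatibility API with a made-up schema:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class AvroCompatCheck {
    public static void main(String[] args) {
        // Hypothetical v1 schema.
        Schema writer = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"}]}");

        // v2 adds a field WITH a default, so a v2 reader can still decode
        // old v1 records: the default fills in the missing field.
        Schema reader = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"currency\",\"type\":\"string\",\"default\":\"EUR\"}]}");

        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);
        // Prints COMPATIBLE; remove the default and it becomes INCOMPATIBLE.
        System.out.println(result.getType());
    }
}
```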
Buffocado@reddit
There are also free tools available for protobuf that allow automated breaking change detection, like buf breaking, which is helpful for those who may not have all of the edge cases memorized, or to protect against a careless AI coding agent. A good schema registry should also have breaking change detection and enforcement built right in.
*Disclaimer: I work for Buf, if it's not obvious :-)
Macrobian@reddit
How are you meant to add or remove a required field safely if Protobuf doesn't have any support for asymmetric fields like typical?
Buffocado@reddit
While it's not my favorite answer, the proto2 advice was: "Don't".
The proto3 solution was to remove required fields entirely.
Macrobian@reddit
I think you agree that pushing boundary type validations into the application layer is unwise. It's a shame more flexible validations were not added to proto3 or editions instead of throwing out the whole validation system because Google couldn't figure out how to build a compatibility checker.
Buffocado@reddit
Partly agree! The proto2 required debacle was a real lesson in how a seemingly simple constraint can become a wire-format compatibility nightmare at Google's scale — hence the nuclear option of just removing it in proto3.
But Protovalidate is essentially the answer to exactly this complaint: CEL-based validation rules that live in the schema, are enforced consistently across languages, and, crucially, don't couple your validation semantics to the wire format. You get expressive field constraints back without the "now your schema is permanently broken" trap that required created.
As for building a compatibility checker — someone did eventually figure it out. 😇
Macrobian@reddit
Doesn't Protovalidate have the same issue? The reason that the asymmetric label works for typical is because the validation is fundamentally asymmetric: it enforces that the constructor validation is stronger than the deserialization validation, and thus can be used as a stepping stone. E.g., converting an optional field to required, you:
1. Convert optional to asymmetric, propagating the required data everywhere through the monorepo.
2. Wait until old schema versions where the field is still optional are no longer active. (Validate this with buf breaking or an equivalent compatibility checker.)
3. Convert asymmetric to required.
https://i.imgur.com/uIWIuNu.png
With protovalidate you only have one validation, which means you can never strengthen or weaken your validation across blue-green deployments.
If you had 2 validations, you could:
Relax the validation by first relaxing the deserialisation validation, then the constructor validation:
https://i.imgur.com/cQ6h71y.png
Or narrow the validation by first narrowing the constructor validation, then the deserialisation validation:
https://i.imgur.com/Ljjm4kr.png
Buffocado@reddit
That's a genuinely interesting design. I wasn't familiar with typical's asymmetric label and the staged constructor/deserialization approach. That's a real gap worth acknowledging. I'll bring it back to the team, as they may find it interesting.
Appreciate the thorough writeup, either way!
Macrobian@reddit
No problem! I worked on a closed source proto fork and compatibility checker for 4 years with these features: happy to answer any questions! Great to see Buf improving the Protobuf open source ecosystem.
nog_ar_nog@reddit
This. We use Avro schemas and manage them entirely through a UI tool that does not allow schema fields to be removed. The UI tool submits PRs with the actual schema to a central schema repo. Services use pinned schema repo git refs.
Next_Dance_6076@reddit
I had a phase where I seriously considered it, but never got past the research stage.
travelinzac@reddit
Monotonic additions, make new topics
Sensitive_West_3216@reddit
With Kafka, you will either be using it as a queue (a known set of downstream services consuming messages from the topic) or as a broadcast (an unknown set of downstream services consuming messages).
We keep a key-value pair in the headers which has the schema name the consumer should use to deserialize the message.
So in the case of queues, we ask downstream services to add support for the new schema as well. They use the headers to decide whether to run the old or new flow. After all consumers have added support, we do a deployment on the producer service so that it only produces messages in the new schema on the same topic.
In the case of using it as a broadcast, we create a new topic and publish the same event to both the old and new topics for a while. We then send a mail to all relevant people (basically all team leads) saying they have to move to the new topic because we will stop producing events on the old one, usually on a date 2 sprints ahead so they have time to migrate.
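A minimal sketch of that header convention with the plain Kafka clients API; the header key and schema names are made up:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class SchemaHeaderRouting {
    static final String SCHEMA_HEADER = "schema-name";

    // Producer side: tag every message with the schema it was written with.
    static ProducerRecord<String, byte[]> tag(String topic, byte[] payload, String schemaName) {
        ProducerRecord<String, byte[]> record = new ProducerRecord<>(topic, null, payload);
        record.headers().add(SCHEMA_HEADER, schemaName.getBytes(StandardCharsets.UTF_8));
        return record;
    }

    // Consumer side: pick the old or new flow based on the header.
    static void route(ConsumerRecord<String, byte[]> record) {
        Header header = record.headers().lastHeader(SCHEMA_HEADER);
        String schema = header == null
            ? "OrderEventV1" // assume the old schema if the header is missing
            : new String(header.value(), StandardCharsets.UTF_8);
        if ("OrderEventV2".equals(schema)) {
            // deserialize with the v2 schema and run the new flow
        } else {
            // deserialize with the v1 schema and run the old flow
        }
    }
}
```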
PredictableChaos@reddit
We just create a new topic and once all clients have migrated we shut down the old one. If the new features/fields are needed they'll migrate quickly. If they aren't needed they might lag a little until we set a sunset date on the old topic but they'll get there.
On the client side you may need to coordinate/synchronize the cutover but if you're lucky it won't matter if you double process and then the approach is easier.
Spare_Helicopter4655@reddit
A sane default IMO is to avoid all this (potentially) unnecessary complexity by basically choosing to not adopt Kafka...
Not to say it doesn't have a proper place in some orgs.
on_the_mark_data@reddit
I'm fully biased because I wrote the book on the topic, but Data Contracts might be the pattern you are looking for. For Kafka specifically, using the write-audit-publish (WAP) pattern is powerful. This is a case study I featured in the book that uses data contracts on Kafka in an enterprise setting. https://adevinta.com/techblog/creating-source-aligned-data-products-in-adevinta-spain/
Old-Worldliness-1335@reddit
Just think of it as immutable infrastructure and your life will be better. If you need to treat it as a flexible thing, then don't. Understand that every breaking change you make will more than likely require you to reprocess your entire topic, which is fine, but that's why I like a microservice per topic that handles that logic specifically.
Tarazena@reddit
Use different topics if there are breaking changes: write a job that converts old messages to the new topic, then migrate old systems to use the new topic.
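A sketch of what such a conversion job could look like with the plain Kafka clients; topic names and the conversion function are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TopicConversionJob {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "v1-to-v2-converter");
        // Replay the old topic from the beginning.
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("orders.v1"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    producer.send(new ProducerRecord<>("orders.v2", r.key(), convertToV2(r.value())));
                }
            }
        }
    }

    private static String convertToV2(String v1Payload) {
        // Placeholder: real logic reshapes the old payload into the new schema.
        return v1Payload;
    }
}
```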
Sheldor5@reddit
a breaking change by definition breaks something
sometimes you can't avoid downtime (to update everything simultaneously)