Kafka Fundamentals - Guide to Distributed Messaging
Posted by Sushant098123@reddit | programming | View on Reddit | 96 comments
Blecki@reddit
Kafka should be a mandatory case study in how to over-engineer a simple problem.
beebeeep@reddit
Why do people so often think that Kafka is some sort of complex monstrosity with a tremendous infrastructure footprint? It is literally just two things to deploy (broker and ZK/controller), barely requiring any configuration beyond the hostnames of stuff.
I think that for what it does, Kafka is as simple as it can get, both from perspective of operations and of internal architecture.
CodelinesNL@reddit
Indeed. Also, the reason it works so well is that how it handles writes and reads over partitions is fundamentally very simple: you only ever append to files, and read from an offset into those files.
Thinking Kafka is "complex" shows a complete lack of understanding. If you think synchronous integrations via REST are "simple" and Kafka is "hard", you never worked on something complex. Or you did, but the tough shit was done for you by better engineers.
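That append-only core can be sketched in a few lines of Python. This is a toy illustration of the idea (length-prefixed records, reads starting at a logical offset), not Kafka's actual segment format:

```python
import tempfile

class AppendOnlyLog:
    """Toy append-only log: append records, then read from a logical offset."""

    def __init__(self, path):
        self.path = path
        self.positions = []  # byte position of each record in the file

    def append(self, record: bytes) -> int:
        with open(self.path, "ab") as f:
            self.positions.append(f.tell())
            f.write(len(record).to_bytes(4, "big"))  # length prefix
            f.write(record)
        return len(self.positions) - 1               # the record's offset

    def read_from(self, offset: int):
        """Yield every record from a logical offset to the end of the log."""
        with open(self.path, "rb") as f:
            f.seek(self.positions[offset])
            while header := f.read(4):
                yield f.read(int.from_bytes(header, "big"))

log = AppendOnlyLog(tempfile.NamedTemporaryFile(delete=False).name)
for rec in (b"first", b"second", b"third"):
    log.append(rec)
print(list(log.read_from(1)))  # [b'second', b'third']
```

Writes are sequential appends and reads are a seek plus a scan, which is why the model is so easy to reason about.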
cstoner@reddit
People have a hard time grasping complexity when it veers outside of what they've been doing.
At my current company the architects are afraid of "the complexity of distributed systems" so instead, they have written an orchestration layer using redis pubsub to distribute cache updates to stateful monolithic application servers. We distribute billions of these cache updates per day.
Good thing we avoided all that "complexity" I guess...
josluivivgar@reddit
I think that, depending on your needs, it might be complex to configure, and I think that's where the "complexity" comes from. Once it's set up it's not hard to use, and it's simple and doesn't require a lot of maintenance for most use cases.
BUT, overall anything distributed is technically complex, and I do think that kafka is complex by nature of the problem it's trying to solve.
but as a user of kafka you don't really see anything like that (I also don't think it's overengineered, but it might be an overengineered solution for some problems, AKA you really didn't need kafka)
TheRealStepBot@reddit
I think running Kafka on the metal was a bit of a PITA, as it's the JVM on the metal plus the additional ZooKeeper element. Modern Kafka on k8s isn't bad.
I think the main practical issues are the degree to which the operator needs to be aware of the traffic distribution flowing through the system and pre-partition the topic. More managed alternatives abstract this away, and there might be a way to generally make this easier, but idk - now you are getting into all kinds of auto-rebalancing shit that can have downstream effects. So the Kafka way is not the most automatic way of doing things, but it's easy enough to reason about and prevents downstream weirdness.
At the end of the day I think distributed processing is just not easy and requires a measure of engineering intuition which 80% of engineers don’t possess much of.
People want to not have to think about where data actually is located but you can’t abstract this away forever. Ultimately there is a specific drive that the bits need to go to and you only have so many of them. Kafka forces people to have to grapple with this and boy do they not love that.
Kafka is fundamentally super simple and straightforward to reason about, its append only log files on disks.
ViewTrick1002@reddit
Because you don't need the scale, and previously, with ZooKeeper, you needed a team to run it due to how finicky it was.
Not sure how that has changed with their own raft implementation.
beebeeep@reddit
Idk what that was in your case, but I ran 5 clusters with around a hundred brokers in total - with ZK, later migrated to KRaft. There is nothing finicky there: you set things up and after that it requires almost no attention at all. Even the catastrophic loss of a broker is essentially self-healing - you just bring up a new one and it will eventually replicate all the data on its own. So no, allow me to disagree: Kafka is orders of magnitude easier to run than, idk, replicated Postgres, for example - that's one needy sob.
harshit181@reddit
Just 1 for newer versions, no ZooKeeper needed.
th0ma5w@reddit
I don't know what's up with these other replies, but having used many, including Kafka, I'm with you. I do think the extra complexity has clear benefits if you need it, and certainly I've observed it to be a kind of universal state machine with some robust operation... It is funny that Exactly Once is mentioned when it's Exactly Once*** within certain bounds lol ... but anyway ... yeah ... The way I even saw some seasoned developers and architects assume how Kafka works but then treat it in a way where literally any queue would work for them ... It's a weird mix of obviously amazing and obviously too much, often, heh.
Blecki@reddit
My opinion of Kafka might be different if my organization didn't insist on it being used for literally any data transfer between internal stake holders at all.
A particularly egregious example... we have a fleet of vehicles. If I just want a list of these vehicles, I need to subscribe to and consume a topic. Okay, so already, Kafka is a message queue. It's not great for lists of things. But they publish telemetry in this topic. So if you just consume... you get a message per truck, per second. All for just a list of trucks. Okay, but if the truck is inactive, there's no messages... so to get that initial load you have to start from offset 0. And consume potentially billions of records.
Kafka has a niche but frankly, it just makes nightmares too easy. Devs reach for this monstrosity when they should just be writing an api.
Murky-Relation481@reddit
I love when problems are actively engineered to be harder than they ever could be. Like you really have to wonder what goes on in the heads of those people that would think that is an acceptable way to get a list of something that is functionally known a priori.
TheRealStepBot@reddit
Tell me you don’t understand either the so-called simple problem or its solutions. What are you going to use instead of Kafka? Better not say anything that uses a log under the hood, because Kafka is literally just that log stripped of all the extra crap around it. Kafka, or at least logs like it, are literally the simplest solution to this problem. There might be implementation differences between different logs, and they may have slightly different performance capabilities, but it’s one of the most basic building blocks of basically all datastore technologies, on which everything else is built.
That you don’t know how anything works under the hood doesn’t change how they work.
Blecki@reddit
You think logs are what Kafka is for?
TheRealStepBot@reddit
Yeah sure that’s what I said.
No, Kafka is an append-only log, which is a data structure that sits at the core of many other technologies as well. Kafka is that data structure directly exposed, running without the complexity and overhead imposed on it by other shit like, say, a query engine.
You can store application logs in Kafka if you want, but that’s not what log means in this context. A log is a very simple and quite well behaved data structure.
Blecki@reddit
Okay, I'll give you some ground; my opinion might be different if I worked for an organization that actually used it for that. Instead, we have a mandate to use it for all internal communications between systems.
TheRealStepBot@reddit
Depends on what internal means. Across service boundaries? It’s pretty much perfectly suited for this.
Within some service? Probably overkill to mandate it but idk don’t know your system. Done correctly it can allow complete versioned replay over data streams which can be a key enabler for ML use cases, and may be a nice to have generally for audit and downstream analytics use cases.
As always it’s the right tool for the job. People who can’t explain why they are using Kafka probably shouldn’t be in charge of mandating it but by the same token the average developer spouting off the bs you led with similarly doesn’t really get to have an opinion on this either as they simply fundamentally often don’t have the foggiest clue about how it works or what sorts of capabilities it provides.
Just because it isn’t a PHP CRUD API in front of a MySQL database like the good old days of yore doesn’t mean it’s “over-engineered”. It’s much more likely you are burnt out and salty and haven’t kept up with the increasingly challenging demands made of the data and ML teams.
Kafka is an incredibly good comparatively unopinionated integration layer that allows data stored in it to be reused by a variety of potentially unintended downstream use cases as well as serve to allow system architects to gradually move towards building systems capable of high degrees of repeatability not easily accessible in other ways.
If I were you I’d maybe stop and ask yourself what it is that those requiring its use in your organization may know or understand that you do not. And maybe you discover it’s nothing and it’s just more aimless blind leading the blind from blog posts they don’t understand. But that is no more a problem due to Kafka than it is any other technology. Good technical leadership is not a widely available commodity.
Blecki@reddit
I know that the people making the mandate couldn't write a line of code to save their lives.
In the meantime; holding strong opinions sparks discussion. And I'll be happy when they're done humping kafka... except that it's probably ai next on the agenda.
TheRealStepBot@reddit
There are indeed many such cases.
Just understand your mindless criticism of it is no more substantive than their blind pursuit of it. Taking strong stances on shit you don’t understand is idiotic irrespective of whether that’s for or against a change.
Kafka is in contrast to your original claim a very elegant tool that when used correctly can absolutely unlock massive capabilities, but it takes technical vision that is often lacking in many orgs to actually understand where to use it and what you can do with it once you have it.
Lots of “leaders” will jump on bandwagons like this and they just hope someone else then comes along and can actually build value on top of their blind adoption of a consensus opinion they didn’t understand. And that’s not all bad but it does often leave a sour taste, and can blow up if someone doesn’t come along to actually build on it.
On the ai side of this I hate to break it to you but ai is going to massively increase the volume of high value difficult to process telemetry that will be well suited to Kafka so I don’t think it’s going anywhere except in the “Kafka is dead, long live Kafka” sense of it being replaced by a modern philosophical descendant.
Blecki@reddit
No, if a tool is this easy to use wrong, it's not an elegant tool. Perhaps the next iteration will be better. Kafka fits right into a pattern I've seen repeat several times, but I don't really know what to call this yet. A phenomenon where "they" want to make everything work with just some configuration in a dashboard, like they're afraid to write the few lines of code it would take to just do it.
The BI folks are like this too. code-phobic. Dataflows and power automate everywhere. Their shit is always breaking.
TheRealStepBot@reddit
See again you misunderstand. Yes that is maybe the goal of some people and those people certainly are on a fools errand. But again that because they and you both misunderstand the point of dashboards and interfaces that only accept a couple of clicks. That is not the goal in and of itself.
Correctly designed very reliable and repeatable data pipelines look from the surface like “just some dashboards” but that is only the ui to a massive amount of sophistication and engineering under the hood.
The reason you often end up surfacing it as a dashboard with a couple of stage gates is that you’ve already done all the work of automating the entire process so that it’s extremely repeatable. The repeatability is the goal. That level of repeatability ends up consuming all the complexity so that at the end of it all you basically sit and watch dashboards for anomalies, and click approve on various stage gates.
But you don’t do that because you want to sit and stare at dashboards, staring at dashboards is the natural consequence of correctly built process that run with a high degree of autonomy.
Blecki@reddit
Contrary. YOU do not understand. I did not say that their goal was looking at dashboards or that mine was LOC (how did you make that logical leap??).
What I said was that they want everything to be infinitely configurable by clicking things in dashboards because they are code-phobic.
They are afraid of putting down their mouse and touching their keyboard because code is scary.
Kafka doesn't consume all of the complexity regardless. Like most frameworks, it works great as long as you stay inside its well defined lanes. And becomes a hindrance as soon as you try anything novel.
CodelinesNL@reddit
Yet they are in charge, and no one listens to you. Funny how that works, eh?
Blecki@reddit
I don't buy into your prosperity gospel bullshit. Every executive I've ever known has only had one appreciable skill, and that was running their mouth. Tech executives aren't an exception.
You're assuming they have a "good reason" but there's no good reason to mandate any technology for an entire class of tasks.
CodelinesNL@reddit
They disagree with you. And every dev here seems to disagree with you too. Looks like a pattern with a single common denominator.
Blecki@reddit
Yes and you're all wrong. Consensus doesn't mean correct.
I'm not interested in arguing with you in three different threads.
ykmaguro@reddit
It sounds like your organization should be a mandatory case study for how to use the wrong tool for the wrong problem
CodelinesNL@reddit
I understand that they mandate this, since they have ignorant devs like you who think that just doing REST calls between all the services is 'better'.
Blecki@reddit
We can have a discussion about rest as well. It's just another in a long series of engineers dressing up the old in new clothes.
CodelinesNL@reddit
Yeah, let's shovel files around like you suggested. That's a great idea. Let's do away with databases too.
Blecki@reddit
Hey so, did you know that your database is just a bunch of files?
Anyway: which is going to stand the test of time, and which is faster and transmits the least data: a bunch of servers hosting a topic to feed a consumer a stream of JSON records that get written into a database one at a time, or a CSV file and a BULK INSERT?
Sure use Kafka for your real time message queue, whatever. But you probably don't need a real-time message queue.
TheRealStepBot@reddit
Literally the internal implementation of most databases is as a set of append only log files. Which is what Kafka is. Without all the rest of the crap.
CodelinesNL@reddit
The Dunning-Kruger definition in the dictionary probably has your image next to it.
Blecki@reddit
Do you not know how data storage works? 🤭
CodelinesNL@reddit
Do you? Saying a database is just a bunch of files is like saying a car is just a bunch of wheels.
Also are you going to respond to the rest? Your assertion that the disk transfer would be faster is flat out wrong, and you also did not consider data consistency.
Blecki@reddit
Just so we're clear I'm laughing at you.
TheRealStepBot@reddit
Most people who say KISS are mouth-breathing morons. Change my mind.
The world is a complex place. KISS within the framework of an accurate understanding of your problem? Sure. But most people haven’t the foggiest about the problems they are actually solving or how the various techniques available might fit together.
CodelinesNL@reddit
A lot of developers stagnated years ago and use dogmatic shit like "KISS" to make up for not wanting to learn new things.
Very often async event driven communication via Kafka is architecturally simpler than direct REST communication, with all the nasty retry, bulkhead and circuit breaker stuff you need in real systems. But because John has been with the company for 3 decades and hasn't learned anything new in the past 2, he'll use "KISS" as an argument to keep doing the same shit for 2 more decades.
TheRealStepBot@reddit
Exactly. Simple? By what metric? By the fact you don’t need to learn anything you don’t already know? Sure.
In terms of actually building a reliable system that isn’t dotted with footguns and weird behaviors of so many different kinds? Probably not.
“Each working data pipeline is designed like a log; each broken data pipeline is broken in its own way."—Count Leo Tolstoy translated by jay kreps
Every place where I’ve had to introduce data-centric architecture thinks that their special snowflake of how exactly they fucked up REST is special, and due to their challenging domain or some such hogwash. No, it’s because you didn’t understand side effects and relativity, which means you assumed consistency where it didn’t exist and built an acausal system.
But if those kids could read they would probably be very angry.
_souphanousinphone_@reddit
Sounds like you’re using the wrong tool? You need a message queue, not kafka?
CodelinesNL@reddit
Kafka is excellent for inter-service async communication.
TheCritFisher@reddit
So what does the solution to this "simple problem" look like?
Blecki@reddit
Usually literally a file copy.
Similar-Option467@reddit
I wonder if you’re a really new dev or you’re actually this clueless
Blecki@reddit
Kafka is easily a 9 out of 10 on the yagni scale.
ryuzaki49@reddit
A distributed message queue that guarantees exactly-once and in-order delivery is a simple problem?
jasie3k@reddit
Side question - can you really guarantee an exactly-once delivery in a distributed system?
What if the consumer processes the message, saves the changes to the database, but crashes before sending the ack back to the broker? Won't that mean the message gets redelivered, thus creating an at-least-once scenario?
AFAIK exactly-once delivery is guaranteed only between Kafka nodes, no?
yk313@reddit
No. But the end result - idempotency - can be achieved by combining at-least-once-delivery with exactly-once-processing.
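That combination can be sketched in a few lines: the broker may redeliver (at-least-once), and the consumer deduplicates by message id so each message's effect is applied exactly once. All the names here are illustrative, and a real consumer would keep the processed-id set in a durable store, not in memory:

```python
processed_ids = set()  # in a real system: a durable store, e.g. a DB table
balance = 0

def handle(message):
    """Apply a message's effect at most once, even if it is delivered twice."""
    global balance
    if message["id"] in processed_ids:
        return  # duplicate delivery: acknowledge and do nothing
    balance += message["amount"]
    processed_ids.add(message["id"])

# The broker redelivers message 1 (e.g. because an ack was lost in transit):
for msg in [{"id": 1, "amount": 100},
            {"id": 1, "amount": 100},   # redelivery
            {"id": 2, "amount": 50}]:
    handle(msg)

print(balance)  # 150, not 250
```

The delivery guarantee stays at-least-once; it is the idempotent processing that makes the end-to-end behavior look like exactly-once.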
jasie3k@reddit
Yeah this one I get, it's just that exactly once requires some additional work on the side of the consumer
ForeverAlot@reddit
As a technicality, "exactly-once delivery" is a theoretical impossibility. It doesn't matter how much control over the parties one has or how much effort one puts into it, it cannot be achieved. Some approaches are practically attainable and adequately satisfy the desire for "exactly-once delivery", but they are substitutes with different properties rather than merely implementation details.
josluivivgar@reddit
I mean, exactly-once delivery to me means you abandon messages that arrived but were never processed.
so A sends B a message, B receives it, and somewhere along the line something goes wrong; A will not send the same message again - which is essentially a useless proposition.
it can be achieved - it's called UDP. it's just not a great system when you want reliability.
yk313@reddit
Not just the consumer.
The producer also needs to implement at-least-once-delivery which is also non-trivial.
jasie3k@reddit
Is the producer side non-trivial? I mean as a general principle - you send a message and expect acknowledgement, if the ack doesn't come in the specified time you send the message again, no?
Am I missing something?
yk313@reddit
Imagine doing that synchronously, now you have coupled your availability with that of the downstream system you are trying to write to. And we haven't even begun to talk about the potential dual-write problem here.
To avoid coupling the availabilities of the two systems you could decouple your transaction from the outbound call. Which could lead to data loss in case of crashes.
The solution to these usually is a transactional outbox.
Non-trivial.
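A minimal sketch of that transactional outbox, using SQLite to stand in for the service's database (table names and payload shape are made up): the business write and the outbox record commit in one local transaction, and a separate relay later publishes unpublished rows to the broker.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id, item):
    with db:  # one atomic transaction: both rows commit, or neither does
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, item))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (f'{{"order_id": {order_id}}}',))

def relay_once(publish):
    """Publish unpublished outbox rows, then mark them published.
    A crash between publish and mark means the row is re-published later:
    at-least-once, so downstream consumers still need to be idempotent."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        publish(payload)
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?",
                       (row_id,))

place_order(1, "widget")
sent = []
relay_once(sent.append)  # stand-in for producing to the broker
print(sent)  # ['{"order_id": 1}']
```

The dual-write problem disappears because there is only ever one write path: the local database. The broker publish is moved outside the transaction and made retryable.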
jasie3k@reddit
Yeah I am aware of the outbox pattern, I misspoke in my previous comment.
I meant from the point of view of the broker.
jvallet@reddit
Only if you control the consumer.
yk313@reddit
If you don't control the consumer then it's also not your problem if they are idempotent or not.
The producer can't bypass the laws of physics.
jvallet@reddit
Well, they can complain that you are sending the same request twice and please stop. I do agree with you, that I can at most send the message once or at least once, but not exactly once.
fuscator@reddit
Short answer, no, you can't guarantee exactly once delivery in a distributed system.
lelanthran@reddit
Pretty damn difficult to guarantee that in a non-distributed system too.
If you ever find yourself needing "exactly-once", maybe better to change the recipient to be idempotent.
You are never going to get "exactly-once" working for common failure modes.
kaoD@reddit
I'm curious, how not?
lelanthran@reddit
Consider the following base scenario, which is as simple as you can get.
Scenario 1: Delivery is by copying a message into a RAM buffer shared by both. Copy is complete, before B sees the message the power is cycled. A thinks that the message was delivered, B never saw it.
Scenario 2: Okay, you're using acks to ensure that a message is seen: B sees the message, sends the ack, and then power is cycled - B never processed the message
Scenario 3: Okay, you decide to only send the ack after a successful save! B saves the result, and power is cycled before the ack is sent. After bootup, A realises that the ack was never received, so it sends the message to B again (thus no longer exactly-once).
Scenario 4: You're a smart dev; you look at the above scenarios and think "No prob, DB transaction/commit FTW!" You install PostgreSQL, ensure that the journal/WAL/whatever is turned on. A inserts a record for B to process. B, unfortunately, cannot run as a stored-proc within PostgreSQL, so B starts a transaction that a) Gets the record, b) Does processing c) Saves the result d) updates the record to say "done", then e) commits the transaction. Unfortunately, power is cycled after c) or d) but prior to e). B will reprocess that record again on bootup.
At that point, you give up and realise you need to change your requirement from "exactly-once" to "B is idempotent".
Switching to Kafka won't give guarantees around the edge cases, but B being idempotent will.
kaoD@reddit
Ah, I considered "same machine but IPC" also a distributed system, but I see that according to Wikipedia "distributed" effectively means networked.
lelanthran@reddit
Even if it didn't mean "networked", change all of the occurrences of "Process" in my explanation to "Thread", and you'll still have the same problems. Change it to "Co-routines", and you'll still have the same problem.
As long as we specify "exactly-once" delivery, the implication is that there is a sender and a receiver, and even in a single-threaded program, without an async runtime, you still cannot guarantee exactly-once semantics.
josluivivgar@reddit
I mean, why would you need "exactly once" in the sense that when something fails you have to abandon the message?
there must be a use case for this, but I can't think of one - maybe time-critical things where reprocessing a missed message would cause major delays? what you usually want is no repeated messages, which would imply that once a transaction/ack/db write is done you do not need to send the message again.
which is exactly why the problem is hard in distributed systems: you need both sides to acknowledge that the message was received.
being idempotent is a way to solve it, because then B just doesn't even bother with any repeated messages, or just acknowledges them to shut up A without doing any work.
you could also solve it by inverting the way the system works: B can be the requester and the one who handles what gets called (so A doesn't send anything until B says "send me the next one"), and B can also ask "send me the same one again". but that's only reasonable in 1-to-1 communication, hence why stuff like kafka exists.
I think kafka is useful, but I do agree that it's overcomplicated in its feature set and in how you choose those options. at least from my experience it was a pain to set up to our liking, but once that was done, it was useful and worked well.
stult@reddit
That makes no sense. Why--or perhaps more importantly, how--would a system even deliver messages in a non-distributed way in the first place? What does it even mean to send a message in a non-distributed system? As soon as you are sending and receiving messages between autonomously functioning computing nodes, the system is by definition distributed. If the system components sending and receiving messages were not physically or logically separated from each other into distinct computing nodes (i.e., your system is non-distributed), there wouldn't be any need for messages because messages are only required to coordinate independent processes or components.
audentis@reddit
User input in a non-distributed system can be considered an incoming message.
stult@reddit
I would argue that if user inputs are considered messages, then the human-machine interface taken as a whole forms a distributed system because the user is not hard-wired in and thus operates independently on physically and logically separated compute infrastructure (i.e., their brain). For a message to be a message in any meaningful sense of the word, it must convey information from a sender to a recipient. If the sender and recipient exist within a monolithic program that runs single-threaded (i.e., is very definitely not a distributed system), there isn't any real distinction between sender and recipient so it doesn't make sense to talk about messages at all (which is probably one of the reasons the message-passing interpretation of object-oriented design never quite caught on).
lelanthran@reddit
Tell me, have you never seen co-routines before?
Both sender and receiver run within a single thread and can pass messages between each other.
(I've seen your other reply to me already)
lelanthran@reddit
If that were the case, GP wouldn't have qualified the exactly-once delivery with "distributed system". If your logic holds, then when the system is not distributed, "exactly-once" also makes no sense.
The reality is that "distributed" as a qualifier is well recognised as meaning "different nodes in a network", not "different threads in a program".
stult@reddit
That's objectively not true. In distributed algorithms literature, a distributed system is defined as a collection of independent processes that coordinate via message passing or shared memory primitives. And that is quite intentionally designed to cover interprocess communications (hence the shared memory primitives) because there is no relevant logical difference between a distributed system that passes messages over a network and one that passes messages between processes locally.
This definition also goes beyond characterizing a distributed system as merely any system with some amount of concurrency. So the mere act of multithreading or multiprocessing does not render a system distributed. The concurrent processes must also be independent, meaning a process can fail without the other processes necessarily being able to detect that failure immediately, and they are unable to share a reliable global clock. So if you spin up a bunch of threads to parallelize some workload using a standard scatter-gather type pattern, there's still a parent thread that handles the scattering and gathering, and monitors how each thread progresses, thus rendering them not independent.
They qualified their description of exactly-once as related to distributed systems precisely because it doesn't make sense to talk about exactly-once in a non-distributed system and also to distinguish from non-computing systems where exactly-once delivery does exist practically, e.g. physical mail delivery. Regardless, how someone offhandedly characterized this issue in a brief comment where that wasn't the main focus of their point isn't persuasive evidence either way.
Yes, exactly so. Except, as mentioned above, in the case of non-computing delivery systems like snail mail.
Evening-Gur5087@reddit
That's why you have to implement idempotency correctly in addition
zshift@reddit
You can mitigate this with application-specific transaction IDs. On the disk save, add a where clause requiring that the transaction ID not already exist. If the DB saved but the application failed to ack, then the second message processing will see the existing transaction ID and ignore the message as a duplicate. This requires that the sender generate transaction IDs in some form and can guarantee that the same transaction won't be sent with two different transaction IDs (e.g., a user pressing "submit" multiple times in a GUI).
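A small sketch of that idea with SQLite, where a UNIQUE/PRIMARY KEY constraint plays the role of the where clause: a redelivered message with the same transaction ID inserts nothing. The schema and names are illustrative only:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE payments (
    txn_id TEXT PRIMARY KEY,  -- sender-generated transaction id
    amount INTEGER)""")

def process(txn_id, amount):
    """Apply the message; return True only if this was its first processing."""
    # INSERT OR IGNORE: a duplicate txn_id inserts nothing (rowcount 0)
    cur = db.execute(
        "INSERT OR IGNORE INTO payments (txn_id, amount) VALUES (?, ?)",
        (txn_id, amount))
    return cur.rowcount == 1

print(process("t-1", 100))  # True  (first delivery: applied)
print(process("t-1", 100))  # False (redelivery: ignored as a duplicate)
```

Because the dedup check and the save are the same statement, there is no window where a crash leaves the row saved but "unmarked".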
jasie3k@reddit
Yeah that's idempotent processing but it requires some work on the consumer side.
Crafty_Independence@reddit
RabbitMQ does that job a lot more simply, as do many other options. Using Kafka solely for this use case is over-engineering.
ViewTrick1002@reddit
As long as you keep yourself in Kafka's bounded box. Like any database.
Or implement two-phase commit with your side-effects. Which needs to spread throughout the entire side-effect tree.
fnork@reddit
Every guide to Kafka ever written strongly discourages exactly-once and in-order delivery, so touting this as a killer feature is just as stupid as it sounds. Also you're going to be sold Kafka streams right off the bat if you ever try to adopt Kafka. It's a trap.
worksfinelocally@reddit
Which guide discourages in-order delivery? If you have multiple partitions and design the partition key so that all data for the same key always lands in the same partition, then you can still achieve in-order processing without impacting performance. There are many cases where in-order processing is required for the same session, user, or payment in payment processing systems, so I doubt it is ever discouraged.
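The routing described above can be sketched as follows. Kafka's own default partitioner uses a murmur2 hash of the key bytes; any stable hash shows the idea, so this sketch uses MD5 for simplicity:

```python
import hashlib

NUM_PARTITIONS = 4  # illustrative partition count

def partition_for(key: str) -> int:
    """Map a key deterministically to a partition, like a key-based
    partitioner: the same key always lands on the same partition."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Every event for the same session/user/payment hits one partition,
# so consumption within that partition preserves the event order:
p = partition_for("user-42")
assert all(partition_for("user-42") == p for _ in range(100))
```

Ordering is per partition, not per topic, which is exactly why the key design matters: ordering is only guaranteed among messages that share a key.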
throwaway490215@reddit
This can be done in a few hundred lines of python code. Kafka takes those few hundred lines and makes every line configurable and adds in standardized terminology.
Then it adds an additional few thousand lines of code that do niche things like compaction, quotas, replication, etc, and makes those all configurable as well.
It's not so much whether the system requires that; it's whether the organization is better off with it.
I've yet to experience an organization that would not have been better off re-using whatever database infra they already had. (Though at the same time I've seen people choose Kafka specifically to avoid dealing with the existing org.)
So while I don't agree with the parent comment, adding Kafka should be done with the same level of skepticism as somebody suggesting a rewrite in Rust.
lelanthran@reddit
I'd very much like to see those few hundred lines of Python. You can't guarantee exactly-once, but you can make the recipient idempotent per message.
CodelinesNL@reddit
The confidence of 'developers' like you in being so absolutely wrong about things they know nothing about should be a case study.
Thinking this is a "simple problem" is textbook dunning-kruger.
mexicocitibluez@reddit
Gotta be the most pathetic behavior on Reddit.
Blecki@reddit
[peruses the stack of consumers he had to write in c# because the existing suite of sinks couldn't handle batched deletes]
Hmm yes I am naive and inexperienced with this product I use every day.
CodelinesNL@reddit
What are you talking about? That C# has a shitty ecosystem when it comes to Kafka tooling isn't Kafka's fault. That's on you for sticking to that ecosystem.
Blecki@reddit
C# has a great library published by confluent themselves? Huh?
CodelinesNL@reddit
Okay; show me the issue: https://github.com/confluentinc/confluent-kafka-dotnet
Which one was it?
Blecki@reddit
What are you even on about?
CodelinesNL@reddit
You said there was a bug in the confluent kafka dotnet library. So which one was it that could not "handle batch deletes"?
Blecki@reddit
Try again
moreVCAs@reddit
what exactly is the simple problem? distributed log?
Original_Night_911@reddit
It’s not a simple problem at all. You are just ignorant.
daidoji70@reddit
Kafka is an expensive distributed message queue. It's entirely reasonable if you want to use it as a log database that can distribute messages. I think articles like this muddy the waters between those two paradigms.
CodelinesNL@reddit
Stopped reading at "Kafka introduces a new paradigm".
AI blog slop.
sinedpick@reddit
the number of typos and awkward sentences makes me think that it's not 100% AI
CodelinesNL@reddit
I'm pretty sure he hand-edited it. He only forgot the last emdash in the text.
GeoSystemsDeveloper@reddit
Nice write up - high level overview and very accessible.