I Replaced Redis Locks with Database Atomicity and You Should Too
Posted by soap94@reddit | programming | View on Reddit | 43 comments
Simple_Horse_550@reddit
Cache stampede checks….
PiotrDz@reddit
Wouldn't redis locks be faster?
fkukHMS@reddit
OP has no idea what they are talking about. Both solutions are fast, both are highly scalable (redis more, but irrelevant unless OP is in a FANG sized company), neither will save OP if they don't know how to implement a basic distributed lock.
Old_Pomegranate_822@reddit
I find it hard to believe that the time difference is measurable when compared to the time taken to run the jobs they were processing. You don't win anything by shaving a few milliseconds off a job that lasts a minute
PiotrDz@reddit
Yeah but the title says "replace the locks", so there is one important factor to consider.
pdpi@reddit
"Faster" only matters insofar as you're not sacrificing basic correctness.
PiotrDz@reddit
What do you mean by "correctness"? A Redis lock behaves similarly to a DB lock. Maybe it isn't stored on disk in a consistent manner, but when the connection to the DB/Redis instance is lost, the lock is effectively lost either way. I don't think the consistent disk storage is that useful.
pdpi@reddit
Correctly implemented, redis locks are fine. The point is that it's more complexity, and more things to fuck up. If you're synchronizing access to one service (the DB) by using another service (Redis) instead of using the first service's own sync primitives, you're adding a whole bunch of new failure modes you didn't have to begin with.
Treebro001@reddit
This should be the core takeaway of the entire thread I'd say.
PiotrDz@reddit
I don't think it matters. If you want your locks to also be part of a DB transaction, then that is a different problem. I am thinking about locks in a more generic sense: synchronising access between clustered instances of an app.
TldrDev@reddit
Vibe coders in shambles.
dpark@reddit
I agree the lock in SQL is the better option, but I still want to understand where the bug was in the Redis solution. That flow looks like it should work. How did they have two workers processing the same task if only one lock succeeded?
“The winner sometimes processed tasks that were already being handled”
This implies at least two winners.
ProtectHerEgao@reddit
It sounds like their main issue is managing distributed locks rather than an issue with redis itself. If I had to guess, their redis is sharded while their database is not which leads to these issues. The author mentions distributed locks at some points. If their database is sharded, I would imagine there to be similar issues.
If their redis is sharded, they need to use Redlock or another distributed locking mechanism instead of just writing it to the master shard. Replication lag or failover situations might cause the lock to be acquired by two processes.
Their ghost lock issue can be easily fixed by setting a TTL on the key, something Redis supports natively (sketch below).
I also have some doubts about putting higher load on the database, especially for something high-frequency like locking.
Databases are not magic and redis isn't incapable. Things just have to be designed properly.
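On the TTL point above, a minimal sketch of what that looks like with redis-py (the key name, value, and TTL are invented for illustration):

```python
import redis

r = redis.Redis()

# SET with NX + EX is atomic: the key is created only if it doesn't
# already exist, and Redis expires it after 30 seconds on its own,
# so a crashed worker can't leave a "ghost" lock behind forever.
acquired = r.set("lock:task:123", "worker-a", nx=True, ex=30)
if acquired:
    try:
        pass  # do the work here
    finally:
        # Naive release; a production version would only delete the key
        # if the stored value still matches (e.g. via a Lua script).
        r.delete("lock:task:123")
```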
non3type@reddit
I generally agree with everything you're saying, but I can't help but think that if you need an ACID-compliant transactional database, you start with a traditional RDBMS. This felt like they chased the new shiny without any awareness of its strengths/weaknesses going in.
IQueryVisiC@reddit
If you want ACID at the start of your project, do you set the RDBMS isolation level to "serializable", or what really counts as ACID? Is snapshot or any of the weaker forms ACID? Do they behave like in OP's scenario? What is a practical example of the effect of isolation? Does serializable lock the whole DB, even other tables and records?
GergelyKiss@reddit
That's a very good point, but it doesn't matter in OP's scenario because the usage is so simplistic: one update, no reads, on one specific row (hopefully identified by its PK, even).
Whether a DB engine supports ACID and how it does exactly is vendor-dependent, but all the ones I've seen so far would do an update like this atomically, even with the most lenient isolation level.
Real problems start when there are selects and inserts, and lots of them in one transaction...
IQueryVisiC@reddit
Two inserts (on different tables) are the default example for a transaction, to make sure that the amount of money in a bank stays constant (invariant?) and accounts don't drop below 0. Isolation level can be changed in MS SQL Server, so it does not depend on the vendor (= Microsoft). Row lock, yeah, seems to be just a performance optimization.
ZirePhiinix@reddit
This is typically how most problems arise. Misunderstanding the tech and using it wrong.
robberviet@reddit
I agree on the sharding. About 12 years ago I worked with a PHP codebase that used memcached locks for sharded MySQL (10), only for payment transactions. It worked fine. It also used TTLs. Not sure if the memcached was sharded, but I guess not.
That was right after graduation so I didn't understand much; only now do I understand what it did.
Treebro001@reddit
A Redis lock doesn't wait for the transaction to actually commit on the DB end before releasing; this is not the case for a DB row lock. (At least this is what I've had issues with in the past with bad Redis locks in similar situations.)
Redis locks should never be used to ensure transactional integrity within a DB because of this.
TheMoonMaster@reddit
Yeah, the article implied (as read) that they abandoned redis because they didn’t understand the problem, which isn’t a good message to send.
That omission makes this post lose a ton of its value. What was the mistake? How can others avoid it? Why did a db solve it when you clearly said the logic was sound?
Unfortunate because I was really curious
wtfbbq81@reddit
You're not my supervisor
codesnik@reddit
there's nothing in ACID that would guarantee your "rowcount" to be correct all the time. Use a transaction, SELECT FOR UPDATE, and maybe SKIP LOCKED instead of selecting an exact record by id.
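For anyone who wants the concrete shape of that, a rough sketch with psycopg2 (the table and column names are invented):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical connection string

with conn, conn.cursor() as cur:
    # FOR UPDATE locks the claimed row until the transaction ends;
    # SKIP LOCKED makes concurrent workers skip rows someone else has
    # already claimed instead of blocking on them.
    cur.execute(
        """
        SELECT id FROM tasks
        WHERE status = 'pending'
        ORDER BY id
        LIMIT 1
        FOR UPDATE SKIP LOCKED
        """
    )
    row = cur.fetchone()
    if row:
        # ... do the work for row[0] here, then mark it done;
        # the row lock is released when the transaction commits.
        cur.execute("UPDATE tasks SET status = 'done' WHERE id = %s", (row[0],))
```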
Bayakoo@reddit
Isn’t the issue that we are using locks on system A to safeguard access to system B? You can’t rule out potential edge cases in such a solution.
Since the exclusive-access system already supports locks, use those locks, and you guarantee 100% no edge cases (except application bugs).
kernel_task@reddit
This is a classic producer/consumer problem and you are just reinventing the wheel. Go with Pub/Sub or something and you can distribute the work much more efficiently than a bum-rush of lock contention. You can then impose quotas on the producers to solve your fairness issue.
Soccer_Vader@reddit
I totally agree the SQL lock is the better option than re-inventing the wheel, but I have some questions:
Santrhyl@reddit
Postgres will release the lock as soon as the transaction is done or the connection is closed.
So no need for a TTL, but you should consider what happens if my cron crashes.
I agree they missed that
Soccer_Vader@reddit
I don't think they were using Postgres.js' built-in transaction lock though? These are the example queries:
As far as I know you need to explicitly use SKIP LOCKED with the SELECT query to avoid getting rows that are currently locked by another transaction. What they are using is the simple atomicity of Postgres, which Redis already guarantees. That is why my belief is that Redis was never the problem, their architecture was, and when they migrated to Postgres they probably refactored, which "fixed" the problem, but they will probably re-introduce it once they start tinkering with this piece of code again.
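For reference, the conditional-UPDATE-plus-rowcount pattern being described looks roughly like this; a sketch with psycopg2 and made-up table/column names, not the article's actual code:

```python
import psycopg2

def try_claim(conn, task_id):
    """Return True if this worker won the race to claim the task."""
    with conn, conn.cursor() as cur:
        # Postgres applies the conditional UPDATE atomically, so only one
        # concurrent worker can flip the row from 'pending' to 'processing';
        # everyone else sees rowcount == 0 and backs off.
        cur.execute(
            "UPDATE tasks SET status = 'processing' "
            "WHERE id = %s AND status = 'pending'",
            (task_id,),
        )
        return cur.rowcount == 1

conn = psycopg2.connect("dbname=app")  # hypothetical connection string
if try_claim(conn, 123):
    pass  # process task 123
```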
Santrhyl@reddit
Yeah, my bad, I read too many lock comments and forgot they didn't use that ... (advisory locks)
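For anyone unfamiliar, a transaction-scoped advisory lock looks something like this (sketch; the lock key is arbitrary):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical connection string

with conn, conn.cursor() as cur:
    # pg_try_advisory_xact_lock returns true/false immediately, and the
    # lock is released automatically when the transaction ends, so a
    # crashed worker can never hold it forever.
    cur.execute("SELECT pg_try_advisory_xact_lock(%s)", (42,))
    if cur.fetchone()[0]:
        pass  # we hold the lock; do the critical section here
```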
xeio87@reddit
Did I miss something? How did they solve tasks that crash on the SQL version without a timeout on the processing lock?
Or I guess the real problem is they were locking at the user level (on Redis) rather than the task level, and don't care if individual tasks never finish?
Santrhyl@reddit
Postgre will release the lock as soon as the transaction is done or if the connection is closed.
xeio87@reddit
None of the examples in the article actually used database locks as near as I can tell, just UPDATE ... WHERE and row-count checks to see if the update grabbed the "lock".
Santrhyl@reddit
Oh yes you’re right :-) too many locks comments :-)
crap-with-feet@reddit
Let me add two more lessons to learn from this:
1. Do a little research up front. 99% of the time your problem is not only already a solved problem but there’s probably a standard for the solution.
2. The “new shiny” (Redis locks in this case) is most likely a solution looking for a problem, or a solution for a problem different from your problem. If the new solution wasn’t designed for your specific problem, trying to adapt it to fit will often end in tears. The designers of the new thing haven’t tested it with your problem.
“New” isn’t automatically bad but neither is “old”. Old exists because it works and has survived the test of time.
jimbojsb@reddit
I mean Redis locking has been going on with extreme success for what…15 years? Longer? This is just a classic “distributed systems are hard” problem and unfortunately we’ve proven time and time again you can’t teach that, everyone has to fail at it in their own special way.
ub3rh4x0rz@reddit
Yeah, I keep chuckling at the repeated mention of Redis locking as the "new shiny" when it's one of the most boring, well-documented ways you can accomplish locking (and distributed locking, with Redlock).
"Do everything in postgres" works a lot better and longer than many people think, but redis is standard fare and I'd put them both in the category of "you almost certainly already use them, and you can use them for this other thing and get pretty good results, so don't add a new db". Redis is a better fit for basic locking IMO, but not as a replacement for properly designing your relational database schema and using its builtin features properly
Sweaty-Link-1863@reddit
That cat looks like it just debugged production code
PritchardBufalino@reddit
The author completely fucked up his implementation of a distributed lock then calls them bad. Intellectually irresponsible
ImNotHere2023@reddit
I read "user level locking" and instantly knew everything that came after would be garbage designs. Was not disappointed.
OutOfDiskSpace44@reddit
Not going to comment on the technical aspect here... the bigger question for me is how long this took (sounds like a few weeks), and it isn't clear what scale they're operating at, so it's hard to know whether that's a good trade-off of time for some kind of solution.
RoomyRoots@reddit
Kinda funny this trend of pushing NoSQL solutions and people returning to DBs as they find that they just werk.
angellus@reddit
The problem was not Redis locks; in fact the author is not even using Redis locks. The problem is they tried to roll their own solution and fucked it up because they forgot to add a timeout to their "lock".
Having an extra layer of protection on top of atomic database queries is not a bad idea, especially in those cases where you must not have race conditions. Redis locks are also great when you need distributed locks that automatically get cleaned up.
inputwtf@reddit
Author clearly doesn't know about redis-py lock
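For context, redis-py ships a lock helper that handles the timeout and safe-release details; something like this (key name and timeouts invented):

```python
import redis

r = redis.Redis()

# The Lock helper stores a random token with a TTL and only releases
# the key if that token still matches, so a worker whose lock already
# expired can't clobber someone else's lock. timeout is the TTL;
# blocking_timeout bounds how long we wait to acquire it.
with r.lock("lock:task:123", timeout=30, blocking_timeout=5):
    pass  # critical section
```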