The real lesson here goes beyond Lambda. Connection pooling in any environment with ephemeral compute is fundamentally at odds with how most ORMs and connection libraries are designed. They assume long-lived processes that shut down gracefully, and serverless gives you neither of those guarantees.
The fix in this article (lowering isolation to READ COMMITTED) works but it's masking the actual bug, which is transactions being left open. That's the kind of fix that survives in production for years until someone inherits the codebase and has no idea why the isolation level was changed.
What I've found works better: treat every Lambda invocation as if the connection might be poisoned. Reset the connection state at the start of each handler, set explicit statement timeouts, and add a middleware that logs any connection that's been idle for longer than your expected handler duration. That last one is the early warning system that would have caught this before it became a full outage.
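A sketch of that guard, with Python's sqlite3 standing in for the real driver and a made-up `guarded` decorator name (the 30-second idle budget is an assumption; tune it to your workload):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # stand-in for the reused pooled connection
conn.execute("CREATE TABLE t (x INTEGER)")

EXPECTED_HANDLER_SECONDS = 30  # assumed handler duration budget
_last_used = time.monotonic()

def guarded(handler):
    # Treat the connection as possibly poisoned: roll back any leftover
    # transaction before the handler runs, and log suspiciously long idles.
    def wrapper(event, context):
        global _last_used
        idle = time.monotonic() - _last_used
        if idle > EXPECTED_HANDLER_SECONDS:
            print(f"connection idle {idle:.0f}s -- possible abandoned state")
        if conn.in_transaction:
            print("leftover open transaction detected; rolling back")
            conn.rollback()
        try:
            return handler(event, context)
        finally:
            _last_used = time.monotonic()
    return wrapper

@guarded
def handler(event, context):
    conn.execute("INSERT INTO t VALUES (1)")
    conn.commit()
    return {"statusCode": 200}

conn.execute("INSERT INTO t VALUES (99)")  # simulate an abandoned transaction
result = handler({}, None)
```

After the call, the abandoned transaction (the 99) has been rolled back and only the handler's committed row survives.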
The broader pattern here is that most production database issues are not query performance problems. They're connection lifecycle problems. And they're incredibly hard to catch in staging because you need real concurrency patterns to trigger them.
This is a great reminder that autoscaling compute does not mean autoscaling dependencies. Most outages in serverless systems are not about CPU or memory; they are about downstream limits like database connections, rate limits, or external APIs.
Serverless works best when you design for backpressure and controlled concurrency from day one.
That same bit of advice - setting the transaction isolation level to READ COMMITTED - applies to other DBs as well. I used to do that simply as a safety precaution when running queries against databases with large numbers of writes in MS SQL.
Yes, but sometimes you need REPEATABLE READ to ensure correct results, so you shouldn't change it to READ COMMITTED willy-nilly. Transactions should always use the lowest isolation level that gives you correct results: if you're using REPEATABLE READ where you don't need it, you should definitely change it, but if you actually need it, changing it is a very bad idea.
“We forgot to close our DB connections” is not a kiss of death; it's just the kind of footgun most of us will hit any time we have a large number of parallel processes connecting to the same persistence layer without properly managing the connection pool. Nice write-up, but the title was clickbaity!
I'm not sure the author even knows the difference between a "hug of death" (overloading a server with mass legitimate requests that all happen at the same time because thousands of clients react to the same trigger) and a "kiss of death" (a small number of legitimate requests, or even just one, causing a cascade of updates/locks by adding two or more conflicting tasks, or a cyclical dependency that triggers the server's internal logic to overload itself), or why these terms are named as they are.
Unclosed connections are not legitimate requests, therefore they aren't a "kiss", or a "hug", or a "snuggle", or anything affectionate. Unclosed connections are waste traffic, ergo if you really wanted to name them it'd be a "[fecal/urinal term] of death". Considering unclosed connections are caused by not cleaning up, I would like to suggest "The AWS Lambda crusty flakes of death".
The connections were legitimate (the pool keeping them hot); the problem is that the pool manager library did not make sure that a connection returned to the pool is reset to a "pristine" state.
It's like a server collecting plates and glasses in a restaurant, then just chucking them behind the counter without washing them.
It's the suffocating smell of leftover soggy SQL salmon from requests long gone!
In addition to what others have said, it's more likely long-lived/uncommitted transactions causing the issue. I'd want to know why Lambda is doing this, as that should be the first thing to check because it doesn't sound right.
Architecture-wise, you might have been better off leaning toward an event-driven approach rather than direct database connections, since that's what Lambda is best suited for. Amazon RDS Proxy is recommended for large numbers of Lambda connections, and you could use that as a short-term “bandaid” while diagnosing it, but it wouldn't address the root cause if the real issue is long-lived or uncommitted transactions.
Perhaps unsurprisingly, if you don't have REPEATABLE-READ then you have non-repeatable reads.
Two reads of the same row in a transaction can give different results. Does this matter for your app? Maybe it is fine, but this is something you need to deliberately consider, otherwise you don't understand what it is doing.
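The mechanics can be sketched with Python's sqlite3, where two connections stand in for two concurrent sessions; each autocommit SELECT gets its own snapshot, which is READ COMMITTED-like behaviour (this is an analogy, not MySQL itself):

```python
import os
import sqlite3
import tempfile

# Two connections to one database file stand in for two concurrent sessions.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
reader = sqlite3.connect(path)
writer = sqlite3.connect(path)
writer.execute("CREATE TABLE acct (bal INTEGER)")
writer.execute("INSERT INTO acct VALUES (100)")
writer.commit()

# Each SELECT runs in its own read snapshot, so later reads see later commits.
first = reader.execute("SELECT bal FROM acct").fetchall()[0][0]
writer.execute("UPDATE acct SET bal = 50")
writer.commit()
second = reader.execute("SELECT bal FROM acct").fetchall()[0][0]
# The same row, read twice, gave different results: a non-repeatable read.
```

Under REPEATABLE READ both reads would have returned 100; whether the difference matters is exactly the application-level question the comment raises.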
So, it doesn't change the semantics of your query, first of all.
Secondly, if you are destroying the database, then that is a problem in-and-of itself.
Thirdly, if you have a long transaction where you are expected to read the same rows multiple times and you are concerned about 'phantom reads', then start your transaction with:
START TRANSACTION WITH CONSISTENT SNAPSHOT;
..and that will solve your issue.
Otherwise, READ-COMMITTED should be the default in MySQL/MariaDB and it is the default in Postgres and MS SQL.
It is explained in the article, but basically, REPEATABLE-READ holds a snapshot in the InnoDB undo log for reads and writes when you start a transaction and releases it when you COMMIT. The idea being that if you read something at the start of the transaction, then again in the middle and at the end, you can expect the same rows to give the same results, even if other applications have changed them.
The problem of long history of snapshots is compounded when you have pooled connections and one session forgets to COMMIT.
READ-COMMITTED doesn't hold snapshots for reads. So if your session just doesn't hold snapshots as often, then it won't have this issue.
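For spotting this on the server side, a couple of introspection queries (sketches; they assume MySQL 5.7+/8.0 with the default InnoDB metrics enabled):

```sql
-- Undo history length: the metric that balloons when a stale
-- transaction pins an old read view.
SELECT count FROM information_schema.innodb_metrics
WHERE name = 'trx_rseg_history_len';

-- Open transactions and when they started.
SELECT trx_id, trx_started, trx_state
FROM information_schema.innodb_trx
ORDER BY trx_started;
```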
Tiny_Effect_7024@reddit
I've heard of other issues related to just trusting Lambda to do everything correctly... it seems like a perfect solution at first.
Do you regret using Lambda, or do you think it's still worth it?
tkyjonathan@reddit (OP)
It's fine if you give it that tx_isolation session variable.
Tiny_Effect_7024@reddit
If you had to do it all again, would you still use Lambda?
Skull0Inc@reddit
Great analysis, breakdown, and write-up on this. Thanks for the post; it could save a fellow developer some future stress.
darkklown@reddit
Lambdas should be limited: they should just read from and write to a queue for other workers to process. If they are doing more than that, you're misusing them.
jWoose@reddit
“The connection is being reused, and some of those connections start a transaction, then forget to close it.”
This feels like the real problem. I would want to know why the lambda is “forgetting” to close the transaction. Computers don't tend to forget. There is a bug here.
mikeblas@reddit
This was not a Lambda problem.
jWoose@reddit
I know. I’m suggesting there is a bug in the user written code.
mikeblas@reddit
And I'm agreeing.
jWoose@reddit
Cool thanks!
mikeblas@reddit
Leaking transactions is pretty bad. But it also seems like the author doesn't know the difference between the isolation levels. Or, at the very least, what isolation level their code needs, and why it needs it.
heresyforfunnprofit@reddit
Lambda functions can die with no apparent cause or error. Can’t trace it, can’t duplicate it, can’t log it. All you can do is set up another function to check it.
jWoose@reddit
Is this really true? There has to be a reason. Memory exhaustion, uncaught exception, etc. I can't imagine an AWS feature this widely used has issues where the lambda just dies for no reason. There is a reason. It sounds more likely that the use case doesn't fit Lambda and someone is trying to shove a square peg into a round hole.
OldschoolSysadmin@reddit
Can confirm this is a thing though. That exact problem forced us to migrate an entire stack out of Lambda.
admalledd@reddit
Yeah, in no world should you be leaving transactions open. We have SQL monitoring to alert us if any transaction stays open for more than a certain amount of time (for us, configurable/semi-dynamic, since we have some known processes that take hours syncing large-ish datasets). Further, we have alerts on transaction conflicts, stalls taking more than $TIME, etc.
I get that such monitoring can be annoying/not worth setting up, but surely the instant you see stale transactions you escalate that it's an upstream bug (be it your app, library, SDK/framework, etc.)? I.e., it is normally not something you should change any database settings for.
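A sketch of that sort of alert query for MySQL/InnoDB (the 60-second threshold is illustrative; the comment's real setup is configurable):

```sql
-- Transactions that have been open longer than 60 seconds.
SELECT trx_mysql_thread_id, trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS open_seconds
FROM information_schema.innodb_trx
WHERE trx_started < NOW() - INTERVAL 60 SECOND;
```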
rotinom@reddit
My guess is the instances are being killed after hitting the timeout.
glenrhodes@reddit
The provisioned concurrency workaround is the standard fix but it gets expensive fast. For anything latency-sensitive I've ended up doing a scheduled ping every 5 minutes just to keep at least one container warm, which feels dirty but works. The real problem is that Lambda's pricing model was designed for bursty infrequent workloads and people keep trying to use it for always-on services.
AdUnlucky9870@reddit
This is a great war story and the root cause is something I've seen bite multiple teams: Lambda's execution model fundamentally clashes with how most ORMs manage database connections.
The core issue: Lambda containers are frozen between invocations, not destroyed. So a connection that started a transaction in one invocation can sit there with an open transaction for minutes or hours while the container is warm but idle. Meanwhile InnoDB is dutifully maintaining MVCC read views for that stale transaction, and the undo log grows unbounded.
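That failure mode can be sketched with Python's sqlite3 standing in for a pooled MySQL connection (handler and table names are made up):

```python
import sqlite3

# A module-level connection, as Lambda handlers typically hold one so it
# survives across invocations of a warm container.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER)")

def handler(event, context):
    # Footgun: the driver implicitly opens a transaction for the write,
    # and the handler returns without commit() or rollback().
    conn.execute("INSERT INTO events VALUES (1)")
    return {"statusCode": 200}

handler({}, None)
# The invocation is over, but the connection still holds an open
# transaction -- exactly the state a frozen container preserves.
print(conn.in_transaction)  # True
```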
The READ-COMMITTED fix works because it releases the read view after each statement instead of holding it for the entire transaction. But there are a few more defensive patterns worth knowing:
Always explicitly commit/rollback in a finally block. Don't rely on connection pool cleanup. In Lambda, "cleanup" might never happen.
Set wait_timeout and interactive_timeout aggressively low (30-60s) on the MySQL side for Lambda users. This kills zombie connections before they can accumulate damage.
If you're on RDS, use RDS Proxy. It handles connection pooling at the infrastructure level and properly manages transaction state between Lambda invocations. It's not free, but it's cheaper than 80GB undo logs and production freezes.
Monitor innodb_history_list_length as a first-class metric. Set alerts at 10k, page at 100k. By the time you notice query slowdowns, the undo log is already catastrophic.
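The first pattern above can be sketched like this, again with sqlite3 standing in for the real driver; the names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the module-level pooled connection
conn.execute("CREATE TABLE jobs (id INTEGER)")

def handler(event, context):
    try:
        conn.execute("INSERT INTO jobs VALUES (?)", (event["id"],))
        conn.commit()
        return {"statusCode": 200}
    finally:
        # Whatever happened above, never let the warm container keep an
        # open transaction: roll back anything left uncommitted.
        if conn.in_transaction:
            conn.rollback()

result = handler({"id": 1}, None)
```

On the happy path the finally clause is a no-op; on any exception it guarantees the connection goes back to the frozen container clean.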
The broader lesson: serverless doesn't mean stateless when databases are involved. Every Lambda-to-database architecture needs to be designed with the assumption that connections will be abandoned mid-transaction.
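The timeout advice above, as MySQL statements (a sketch; 60s from the comment's 30-60s range, and note that GLOBAL scope only affects sessions opened after the change):

```sql
SET GLOBAL wait_timeout = 60;         -- idle non-interactive connections
SET GLOBAL interactive_timeout = 60;  -- idle interactive connections
```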
programming-ModTeam@reddit
Your post or comment was removed for the following reason or reasons:
This content is very low quality, stolen, or clearly AI generated.
necrobrit@reddit
What are the ORM footguns that lead to connections with hanging transactions? I'm struggling to think of any examples where you'd get in trouble.
E.g. something like this Python pseudocode should be safe:

    db = ...  # some global db object with a connection pool

    def handler(event, context):
        with db.transaction() as tx:
            tx.execute(...)
        return {"statusCode": 200}
My imagination is failing me haha.
peterzllr@reddit
Wouldn't it be better to commit early (autocommit if it's a single query) to solve the problem of idling transactions? Just changing the isolation level might lead to other kinds of errors.
tkyjonathan@reddit (OP)
Running commits more frequently would be a good idea. I hear things like resetting the connection every now and then also helps (conn.reset()).
But having that DB session variable - tx_isolation=READ-COMMITTED - pretty much covers a lot of the issues.
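For reference, that session variable as SQL (the name changed across versions: tx_isolation was deprecated in MySQL 5.7 and removed in 8.0 in favour of transaction_isolation):

```sql
SET SESSION tx_isolation = 'READ-COMMITTED';          -- MySQL 5.6/5.7, MariaDB
SET SESSION transaction_isolation = 'READ-COMMITTED'; -- MySQL 5.7.20+ / 8.0
```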
thatnerdd@reddit
I'm seeing some snarky comments that aren't explaining themselves, so I really have to weigh in: Lowering the isolation level of the database is a potential source of application bugs. Actually, any isolation level less than true Serializable isolation is dangerous, and the lower it goes, the more dangerous it becomes.
I'm going to try to explain why.
The reason why switching to Read Committed isolation is helping in your performance is that there are a bunch of isolation anomalies that are (potentially) occurring, and your database is doing work (or holding transactions open) in order to prevent these anomalies. Let's call that work "transaction hygiene." In this case, you're risking a non-repeatable read anomaly.
Here's a simple case of a non-repeatable read anomaly:
1. The first transaction performs a read, sending this back to the application.
2. A second transaction commits a write, modifying some of the data the first transaction has already read.
3. The first transaction performs a write based on the information from its previous read, and commits.
If the isolation level is READ COMMITTED, you're saying that's fine. The database does no hygiene work and simply commits the first transaction, even though the data it was based on has since changed. If the isolation is REPEATABLE READ, then transaction hygiene in this case involves the database doing some combination of the following:
If the second transaction successfully COMMITs, the first transaction has to abort. The read has already occurred. The database will send a signal that this has happened to your driver/ORM, and a retry error will be thrown. Good application code will catch the retry error and start a new transaction.
If the second transaction performs a write but hasn't yet committed it, the database may force the second transaction to wait for the first transaction's COMMIT before the second is allowed to COMMIT. The history will show that the first transaction occurred, then the second.
If the second transaction has priority but is still open, the first transaction can wait for the second transaction to COMMIT or ABORT so that the first transaction can ABORT or COMMIT, respectively.
This is a simple case, and it gets more complicated as more read/write operations occur. There are some tricks the database can do to keep things running smoothly while maintaining transaction hygiene (like ordering transactions in a way that prevents a conflict) but in many cases, delays or retries become inevitable.
And because these are transactions, that means that code logic assumes that the data hasn't changed mid-transaction. By lowering your isolation, you're flagging those isolation anomalies as fine. You're saying it's fine if that data changes between the read and the write.
And it is fine! I don't know your code logic! But if so, if the anomalies are not a problem, you shouldn't have been using a transaction in the first place. Because if the code doesn't need an ACID transaction, there's no reason to make the database spend the effort ensuring transaction hygiene. If two or more smaller operations (each of which is transactional on its own) would have caused no bug in the application, it should be done without transactions. So either you shouldn't have been using a transaction at all, or else you did need the transaction, and you are silently creating bugs by using the lower isolation level. This applies to anything less than perfect isolation: Serializable isolation.
With Serializable isolation, the database contract is this: any transaction occurs as if the application were the only thing touching the database for the duration of the transaction, from BEGIN to COMMIT.
When using any isolation level other than Serializable, you have a set of anomalies you may be silently triggering, and each of those anomalies could be triggering bugs. To prevent those bugs, here's what you want to do:
1. Figure out every potential transaction anomaly (part 1, part 2) that could be triggered when any two or more of your transactions overlap. Check every possible race condition outcome. Make a list of all possible anomalies. If this seems time-consuming, it is. Extremely.
2. For every anomaly you are triggering, determine whether or not it could trigger bugs in your application. This will probably be significantly more time-consuming than step 1.
3. In cases where bugs are a concern, refactor your code to harden it against the anomalies, ensuring that the isolation anomalies don't trigger bugs in a running application. Often these refactors will let you break up a transaction into smaller parts, making less work for the database.
4. Go back to step 1 every time a new query is added to the application, and every time there is a schema change.
Or you can avoid all of this and just use Serializable isolation!
This will cause a few potential issues. First, the database will have to work harder to ensure transaction hygiene, and there will be a combination of delays (to wait for other transactions) and transaction retry errors that get thrown (when a transaction gets into a dirty state and needs to ABORT to ensure transaction hygiene). The application will be affected by these.
So you'll need to include timeouts and try/catch clauses in your code to keep it moving in the face of these issues. Log those events. Each such event is going to have a performance impact, but it may also be preventing a bug that would have been triggered silently with a lower isolation. By examining your logs, you can determine if the timeouts or retries are common enough to cause performance problems or affect your customer. Investigate those.
For each, you'll have a decision to make. Perhaps your code logic doesn't require such a long transaction (in which case you can break up the transactions, reducing database resource overhead associated with transaction hygiene). Or perhaps you do need the isolation with your current code logic and paying the cost is cheaper than refactoring. The hardest case is the one where you need the transaction but the performance price has become too high, and you need to rewrite your back-end logic to remove (or at least reduce the size of) the transaction. In some extreme cases, this may even require a change of schema in order to improve performance. That's still better than an isolation bug.
This is why people are reacting with horror when you talk about lowering the isolation.
Source: I worked at several database companies, and have opinions about this.
ptoki@reddit
Leaving uncommitted transactions is a VERY bad idea, and not addressing that but doing workarounds like connection resets is just putting makeup on a zit.
Find out why the lambdas hog the connections/transactions and fix that. I suspect they just don't finish in time and AWS kills them while the DB does not detect that (the connection is reused). This is very poor design, either by the lambda coder or AWS.
rotinom@reddit
I’m not sure how Lambda is even pooling connections. There is a hard kill on all Lambdas after 15 minutes.
My guess is they think that they are connection pooling, but instead they are leaving dangling and unclosed (therefore uncommitted) connections. Hell, if they are arguing that they need pooling, they shouldn't use Lambda at all. That's the wrong tool for the job.
The hard kill adds all sorts of complexity (like this) and they’d be better off using ECS IMHO. Pseudo-serverless but they control when the instances are killed. You don’t need to worry that the control plane will pull the rug out from under you.
Worth_Trust_3825@reddit
It's not. Lambdas are CGI-like solutions. Depending on whether you're on Java or something else, you'll get either one container or multiple containers per request. To properly pool connections you need either RDS Proxy or PgBouncer.
danted002@reddit
The 15 minutes is a soft limit. If you have a support contract and a good reason to keep the lambda alive for more than 15 minutes, support can extend the limit on your account.
rariety@reddit
A single execution has a time out of 15 minutes, but the Lambda environment (that multiple executions can use) can last for hours before the service recycles them.
genesis-5923238@reddit
Basically, Lambda runs on a light VM which is spun up each time your Lambda concurrency increases. After the Lambda function has completed, the runtime keeps running for some time, expecting new calls to the Lambda. You can set up resources when the runtime starts and reuse them each time your entry-point function is invoked. This is a common pattern to avoid creating all resources on every Lambda call, creating them only when a new runtime starts.
So I guess here the DB connection is created with this pattern. The core issue seems to be that transactions are left open for way too long, which is not addressed by the post.
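The runtime-reuse pattern described above might look roughly like this (a minimal sketch; `sqlite3` stands in for a real MySQL driver, and all names here are illustrative, not taken from the article):

```python
import sqlite3

# Module-scope state survives between warm invocations of the same runtime,
# which is exactly how a "pooled" DB connection outlives a single handler call.
_conn = None

def get_connection():
    """Create the connection once per runtime (cold start), reuse it after."""
    global _conn
    if _conn is None:
        _conn = sqlite3.connect(":memory:")  # stand-in for a MySQL connect
        _conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER)")
        _conn.commit()
    return _conn

def handler(event, context=None):
    conn = get_connection()
    try:
        conn.execute("INSERT INTO events (id) VALUES (?)", (event["id"],))
        conn.commit()  # commit on every exit path, or the transaction leaks
        return {"status": "ok"}
    except Exception:
        conn.rollback()  # never leave a dirty connection for the next call
        raise
```

The failure mode in the article is what happens when the `commit`/`rollback` in the handler is skipped on some path: the connection survives into the next invocation with a transaction still open.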
ptoki@reddit
Yeah. I find lambdas OK, but way too many uses aren't what Lambda is meant for. Practically, a lambda is supposed to be a quick hit-and-run, and that 15-minute timeout is way too high (I mean, it's OK, but typical use should be less than 1-3 minutes).
I agree that if you need fancier processing, EC2/ECS is better, but then you need to put more effort into event handling; lambdas mostly do that for you...
Anyway, I find this post one of those "we fought the battle and won" stories, while it's really "they fought the wrong windmill and lost".
danskal@reddit
Totally this. Connection pooling and lambda seems like a very high-friction approach. Like trying to tow a caravan with a racing motorbike.
Pinball-Lizard@reddit
Yes, it covers them. Like wallpaper covers a crack.
tkyjonathan@reddit (OP)
The database is much happier since.
Pinball-Lizard@reddit
Specious argument.
Kusibu@reddit
When you're running inside somebody else's system you can't rewrite, sometimes "it is demonstrably no longer on fire right now" is what you can get.
GergelyKiss@reddit
Exactly my thoughts... I'm no InnoDB expert, but isolation levels are not meant to resolve this problem, the fact that it helped sounds very implementation-specific and accidental.
If you use a connection pool, then either use autocommit, or have a proper transaction boundary around each unit of work - leaving connections in an inconsistent state when giving them back to the pool is asking for trouble.
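The "proper transaction boundary around each unit of work" that GergelyKiss describes is easy to enforce with a context manager. A hedged sketch (again with `sqlite3` standing in for a pooled MySQL connection; the names are illustrative):

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def transaction(conn):
    """Commit on success, roll back on any error, so the connection
    always goes back to the pool with no transaction left open."""
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, balance INTEGER)")
conn.commit()

# Happy path: the unit of work is committed as one transaction.
with transaction(conn):
    conn.execute("INSERT INTO accounts VALUES (1, 100)")

# Failure path: nothing from the failed unit of work survives,
# and no transaction is left dangling on the connection.
try:
    with transaction(conn):
        conn.execute("UPDATE accounts SET balance = 0 WHERE id = 1")
        raise RuntimeError("handler died mid-transaction")
except RuntimeError:
    pass
```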
fubes2000@reddit
Less relevant to Lambda than it is to pooled connections in general.
GhostPilotdev@reddit
Exactly. This is a connection pool hygiene problem wearing a Lambda costume.
fubes2000@reddit
But ig "AWS bad" gets the clicks..
spidermonk@reddit
Yeah this is why people put http proxies around services that use long running tcp connections for serverless.
TL-PuLSe@reddit
Yup. As someone who works deeply with InnoDB on the reg, I've seen all of this plenty of times... but never quite to this level.
hipsterdad_sf@reddit
The real lesson here goes beyond Lambda. Connection pooling in any environment with ephemeral compute is fundamentally at odds with how most ORMs and connection libraries are designed. They assume long lived processes that gracefully shut down, and serverless gives you neither of those guarantees.
The fix in this article (lowering isolation to READ COMMITTED) works but it's masking the actual bug, which is transactions being left open. That's the kind of fix that survives in production for years until someone inherits the codebase and has no idea why the isolation level was changed.
What I've found works better: treat every Lambda invocation as if the connection might be poisoned. Reset the connection state at the start of each handler, set explicit statement timeouts, and add a middleware that logs any connection that's been idle for longer than your expected handler duration. That last one is the early warning system that would have caught this before it became a full outage.
The broader pattern here is that most production database issues are not query performance problems. They're connection lifecycle problems. And they're incredibly hard to catch in staging because you need real concurrency patterns to trigger them.
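The "treat every connection as possibly poisoned" idea from this comment can be sketched like so (hypothetical helper names; `sqlite3` as a stand-in driver, and the idle threshold is an assumed value):

```python
import sqlite3
import time

_last_used = None  # module scope, survives between warm invocations

def checkout(conn, max_idle_seconds=30):
    """Defensive checkout at the start of each handler: discard any
    transaction a previous (possibly frozen) invocation left open,
    and flag connections that sat idle suspiciously long."""
    global _last_used
    conn.rollback()  # no-op if clean, otherwise clears the leaked txn
    now = time.monotonic()
    if _last_used is not None and now - _last_used > max_idle_seconds:
        # the "early warning system" the comment describes
        print(f"warning: connection was idle {now - _last_used:.0f}s")
    _last_used = now
    return conn
```

With a real driver you would also set a server-side statement timeout on checkout; the rollback-first step is the part that directly prevents the leaked-transaction scenario in the article.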
maybes_some_back2002@reddit
This is a great reminder that autoscaling compute does not mean autoscaling dependencies. Most outages in serverless systems are not about CPU or memory; they are about downstream limits like database connections, rate limits, or external APIs.
Serverless works best when you design for backpressure and controlled concurrency from day one
intheforgeofwords@reddit
That same bit of advice - setting transaction level to read committed - applies to other DBs as well. I used to do that simply as a safety precaution when running queries against databases with large numbers of writes in MS SQL.
klausness@reddit
Yes, but sometimes you need repeatable read to ensure correct results. You shouldn’t just change repeatable read to read committed willy-nilly. Yes, transactions should always use the lowest possible isolation that gives you correct results. So if you’re using repeatable read where you don’t need it, you should definitely change it. But if you actually need repeatable read, changing it is a very bad idea.
intheforgeofwords@reddit
As always - trust, but verify
_no_wuckas_@reddit
Nice write up! (And appreciate the actual rather than AI slop nature of the content, best as I can tell.)
Kusibu@reddit
I'm pretty sure it's been chatbot-assisted in a few places, but I do get the impression it is an actual problem worked around by actual people.
unapologeticjerk@reddit
I'm pretty sure ur a chatbot.
Kusibu@reddit
??????????? There are 3 different lengths of dash and random bolding; there's no way this guy didn't use at least some chatbot.
unapologeticjerk@reddit
Hang on, I need to have ChatGPT translate your comment from bot to human. Beep boop.
Kusibu@reddit
Okay, that's a more fair reading of it, yeah.
travelinzac@reddit
Step 1: don't use lambdas
kjeft@reddit
This is why you have tcp keepalives configured, as well as idle in transaction timeouts.
Clients. Never. Behave.
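Concretely, the server-side guards kjeft mentions look something like the following (values are illustrative only, and availability varies by server and version, so treat this as a sketch):

```sql
-- MariaDB 10.3+: kill transactions that sit idle for more than 60s
SET GLOBAL idle_transaction_timeout = 60;

-- MySQL: drop connections idle for more than 10 minutes
-- (the server rolls back whatever transaction they were holding)
SET GLOBAL wait_timeout = 600;

-- PostgreSQL equivalent, for comparison:
-- SET idle_in_transaction_session_timeout = '60s';
```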
CallMeKik@reddit
“We forgot to close our DB connections” is not a kiss of death; it's just the kind of footgun most of us will hit any time we have a large number of parallel processes connecting to the same persistence layer without properly managing the connection pool. Nice write-up, but the title was clickbaity!
Zwets@reddit
I'm not sure the author even knows the difference between a "hug of death" (overloading a server with mass legitimate requests that all happen at the same time, because thousands of clients react to the same trigger) and a "kiss of death" (a small number of legitimate requests, or even just one, causing a cascade of updates/locks by adding two or more conflicting tasks, or a cyclical dependency that triggers the server's internal logic to overload itself), and why these terms are named the way they are.
Unclosed connections are not legitimate requests, therefore they aren't a "kiss", or a "hug", or a "snuggle", or anything affectionate. Unclosed connections are waste traffic, ergo if you really wanted to name them it'd be a "[fecal/urinal term] of death". Considering unclosed connections are caused by not cleaning up, I would like to suggest "The AWS Lambda crusty flakes of death".
Pas__@reddit
The connections were legitimate (the pool keeping them hot); the problem is that the pool manager library did not make sure that a connection returned to the pool was reset to a "pristine" state.
It's like a server collecting plates and glasses in a restaurant, then just chucking them behind the counter without washing them.
it's the suffocating smell of leftover soggy SQL salmon from requests long gone!
CallMeKik@reddit
I really appreciate the level of effort you put into writing this.
rayreaper@reddit
In addition to what others have said, it's more likely long-lived/uncommitted transactions causing the issue. I'd want to know why Lambda is doing this, as that should be the first thing to check because it doesn't sound right.
Architecture-wise, you might have been better off leaning toward an event-driven approach rather than direct database connections, since that's what Lambda is best suited for. Amazon RDS Proxy is recommended for large numbers of Lambda connections, and you could use that as a short-term “bandaid” while diagnosing it, but it wouldn't address the root cause if the real issue is long-lived or uncommitted transactions.
Rhoomba@reddit
The recommended solution changes the semantics of your DB queries. Can your application handle the different anomalies of READ-COMMITTED?
If you can't answer that question or have no idea what it means then you are already in trouble.
Fixing your broken client that holds long lived transactions is the correct solution.
tkyjonathan@reddit (OP)
How does it change the semantics of your DB queries?
Rhoomba@reddit
https://en.wikipedia.org/wiki/Isolation_(database_systems)
Perhaps unsurprisingly, if you don't have REPEATABLE-READ then you have non-repeatable reads.
Two reads of the same row in a transaction can give different results. Does this matter for your app? Maybe it is fine, but this is something you need to deliberately consider, otherwise you don't understand what it is doing.
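The anomaly Rhoomba describes can be illustrated in a few lines. This is a hedged sketch using Python's `sqlite3` as a stand-in: the reader runs in autocommit, so each SELECT sees the latest committed data, which approximates what individual statements inside a READ-COMMITTED transaction see (MySQL's MVCC details differ):

```python
import sqlite3

# Two connections to one shared in-memory database.
uri = "file:demo?mode=memory&cache=shared"
reader = sqlite3.connect(uri, uri=True, isolation_level=None)
writer = sqlite3.connect(uri, uri=True, isolation_level=None)

writer.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v INTEGER)")
writer.execute("INSERT INTO t VALUES (1, 10)")

first = reader.execute("SELECT v FROM t WHERE id = 1").fetchone()[0]
writer.execute("UPDATE t SET v = 20 WHERE id = 1")  # committed immediately
second = reader.execute("SELECT v FROM t WHERE id = 1").fetchone()[0]

# first != second: a non-repeatable read. Under REPEATABLE-READ, two
# reads inside the same transaction would both have returned 10.
```

Whether your application tolerates `first != second` is exactly the question you have to answer before lowering the isolation level.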
tkyjonathan@reddit (OP)
So, it doesn't change the semantics of your query, first of all.
Secondly, if you are destroying the database, then that is a problem in-and-of itself.
Thirdly, if you have a long transaction where you expect to read the same rows multiple times and you are concerned about 'phantom reads', then start your transaction with:
START TRANSACTION WITH CONSISTENT SNAPSHOT;
..and that will solve your issue.
Otherwise, READ-COMMITTED should be the default in MySQL/MariaDB and it is the default in Postgres and MS SQL.
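For reference, the statements mentioned above look roughly like this in MySQL/MariaDB (a sketch, not the article's configuration):

```sql
-- Make READ-COMMITTED the default for this session:
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- When a specific transaction really does need a stable view:
START TRANSACTION WITH CONSISTENT SNAPSHOT;
-- ... repeated reads here all see the same snapshot ...
COMMIT;
```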
gazofnaz@reddit
AWS best practice is to use RDS Proxy in front of any RDS instance accessed by a Lambda. I'd love to see if the issue reproduces when a proxy is used, when all other variables are kept the same. In theory, this is one of the big problems RDS Proxy is designed to mitigate.
AuxxAiCRM@reddit
Great read
Acceptable-Yam2542@reddit
connection pooling in serverless is just pain with extra steps.
jivedudebe@reddit
Seems like badly written transaction management in your software.
Relic_Warchief@reddit
Maybe I missed it. What's the difference between REPEATABLE-READ VS READ-COMMITTED?
Is it that in the former, multiple connections can reuse the same snapshot instead of grabbing a fresh state of the db like in the latter?
tkyjonathan@reddit (OP)
It is explained in the article, but basically, REPEATABLE-READ holds a snapshot in the InnoDB undo log for reads and writes when you start a transaction and releases it when you COMMIT. The idea is that if you read something at the start of the transaction, again in the middle, and again at the end, you can expect the same rows to give the same results, even if other applications have changed them in the meantime.
The problem of a long snapshot history is compounded when you have pooled connections and one session forgets to COMMIT.
READ-COMMITTED doesn't hold snapshots for reads. So if your sessions don't hold snapshots as often, they won't hit this issue.
LevelIndependent672@reddit
rds proxy handles the connection explosion part but frozen lambdas with open txns still bite u if ur not explicitly committing on every exit path ngl