Is this use of Postgres insane? At what point should you STOP using Postgres for everything?
Posted by LawBlue@reddit | ExperiencedDevs | 190 comments
Currently working at a startup. We process recruiting applications with video/audio input and do additional deep research checks with it.
Oftentimes, the decision or scoring of an app changes with incremental input as people provide new videos or audio content. This doesn’t create a particularly large load, but we’ve had lock contention problems recently when we weren’t careful about processing.
Here’s the story. We decided to put a flag on the main “RecruitApp” table, plus a global trigger on the Postgres database that sets this “needs processing” flag whenever any other table gets updated. Another worker then polls every minute and submits flagged apps for async processing. The processing is fairly expensive, AND it can update downstream tables.
We got into trouble when there was a processing loop between the triggers and the processing. Downstream update => trigger updates flag => resubmit for processing. Some apps grew to 1M rows (because each iteration was tied to an INSERT).
I suggested that maybe we should stop using triggers and move things outside of Postgres so we stop using it as a distributed queue or pub/sub system, but I was hard-blocked and they claimed “we don’t need it at our scale”. But we basically cooked the DB for a week where simple operations to access the app were turning into 5-10sec ordeals. Looked really bad for customers.
I suggested that we instead do some sort of transactional outbox pattern or instead do a canonical event stream log, then enforce single-consumer processing. That seems less write-heavy and creates consistency on the decisioning side. (We also have consistency issues, there’s no global durability strategy or 2 phase commit structure to recover/resume processing for async workflows.) I suggested Temporal for this; it’s been shot down as well.
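A minimal sketch of the transactional outbox idea (sqlite3 standing in for Postgres; all table and column names are invented for illustration): the business write and the event record commit atomically, and a single poller drains the events instead of triggers setting a shared flag.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE recruit_app (id INTEGER PRIMARY KEY, score INTEGER);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        app_id INTEGER NOT NULL,
        event TEXT NOT NULL,
        consumed INTEGER NOT NULL DEFAULT 0
    );
""")

def update_score(app_id, score):
    # The business write and the event record commit in one transaction,
    # so an event exists if and only if the update actually happened.
    with db:
        db.execute("INSERT OR REPLACE INTO recruit_app (id, score) VALUES (?, ?)",
                   (app_id, score))
        db.execute("INSERT INTO outbox (app_id, event) VALUES (?, 'score_changed')",
                   (app_id,))

def poll_outbox():
    # The single consumer claims all unconsumed events atomically, then
    # hands them to the expensive async processor outside the transaction.
    with db:
        rows = db.execute(
            "SELECT id, app_id FROM outbox WHERE consumed = 0").fetchall()
        db.execute("UPDATE outbox SET consumed = 1 WHERE consumed = 0")
    return rows

update_score(1, 80)
update_score(1, 85)
events = poll_outbox()
```

Because the processor consumes from the outbox rather than reacting to arbitrary table updates, a downstream write cannot re-trigger itself unless it explicitly emits a new event.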
Am I just stupid or are my concerns warranted?
visicalc_is_best@reddit
The fact that you intentionally shot yourself in the foot with a beautiful, well engineered gun does not mean you should change the gun.
LawBlue@reddit (OP)
These are not my technical decisions. This is also like the 4th incident we’ve had related to database/app performance. Part of this is that we were not thinking about indexes carefully. I’m just wondering at what point the default “just use SQL” stops being the answer, especially in this NewSQL age where Spanner and FoundationDB exist.
WarWizard@reddit
Postgres isn't the problem. I work on a data acquisition system that has Postgres in the loop... this system records data, typically at 10-25 Hz, and can have over 5000 datapoints (125k points every second). It stores metadata, events, logs, etc.
If you think you'll magically have no issues swapping to a different DB tech stack, you are just going to have a bad time in addition to wasting a bunch of time and effort to migrate.
cajunjoel@reddit
Just gonna say those numbers are impressive. :)
WarWizard@reddit
People much smarter than me put it together; I just work there! lol
zenware@reddit
I think you really want the answer to be something different, but it’s probably “sometime after you are a $1bn company”.
Basically any tool you could use is going to create some level of educational and maintenance burden across the whole team at least and possibly the whole org. So changing a tool when you should be increasing profitability is a straightforward waste of engineering resources. You may need some extra education time for the team so that better decisions start getting made on average, or a specialized contractor to dig you out of a specific mess.
TimMensch@reddit
I agree with the above comment.
You need better technical architecture design, not different tools. Your problems aren't rooted in the fact that you're using SQL, but in the fact that no one on the team seems to really understand the implications of the current architecture or how to improve its design.
Switching to a better tool for queuing won't fix your problems. Switching to a better use of queuing might, but that doesn't require switching away from Postgres.
Solve the underlying problems first. Only switch away from Postgres if by switching away you'll solve the underlying issues. Be sure you're using the right locking semantics as well; there are several options for how to lock rows or a table. Understand what you actually need and why you need it.
Good luck.
teerre@reddit
Although that's certainly true, switching to an out-of-the-box tool is particularly useful for inexperienced teams. The tool will likely guide you to the correct data setup, even if you don't know it, because it was built with such assumptions in mind.
Most pub sub systems have some kind of notion of batching, for example, which might be all OP needs
TimMensch@reddit
The solution is to understand the problem. If they do the analysis and decide that batching is right, then choosing a tool that does batching the way they need could be a good solution. Or rewriting their current solution to work in batches could be an effective answer. Or ditching pub/sub and queuing all tasks in a table that gets processed every 15 minutes could be the answer.
Or maybe locking in a more narrow manner would fix everything.
Switching to a new tool without knowing what the tool does better is effectively throwing the dice and hoping the new tool fixes the problem. That's like trying to get to a specific square on a chessboard via a random knight's walk. You'll get there eventually, but you'll get there a whole lot faster if you understand how knights move, where you are, and where you're trying to arrive.
Inexperienced teams don't get better by throwing the dice more often. They get better by doing a deep dive into problems and actually solving them. Otherwise they become "experienced" but not very effective teams, repeating the same year of experience over and over.
teerre@reddit
There's no "throwing dice" here. It's using a tool that was made for that specific job and therefore nudges you in the right direction. Of course you could understand the problem, but that's beside my point.
TimMensch@reddit
It comes down to throwing dice because choosing a tool without knowing its capabilities can end up with the wrong result.
Say they decide to use Kafka. It's the most widely known dedicated message queue after all.
And what happens? Their batches sometimes get executed more than once, because Kafka doesn't have guarantees about deliveries being atomic. So now they have two problems.
Three, since by all accounts, Kafka server management is non-trivial.
If you don't understand what you need or why you need it, choosing a tool is a roll of the dice.
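One standard mitigation for the at-least-once behavior described above, whatever broker you pick, is an idempotent consumer. A minimal sketch (all names illustrative; in production `processed` would be a unique-keyed table or a Redis set rather than an in-memory set):

```python
# At-least-once delivery means duplicates are normal; the consumer must
# make reprocessing a no-op the second time a message arrives.
processed = set()
results = []

def handle(message_id, payload):
    if message_id in processed:
        return  # duplicate delivery: work was already done
    results.append(payload.upper())  # stand-in for the expensive work
    processed.add(message_id)

# Message 1 is delivered twice, as an at-least-once broker may do.
for mid, body in [(1, "a"), (2, "b"), (1, "a")]:
    handle(mid, body)
```

The point stands either way: you only know to build this if you first understood that duplicate deliveries are part of the tool's contract.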
Whitchorence@reddit
yeah I suppose that's true if you don't bother even reading the most introductory explanation what the tool does before deciding to use it
TimMensch@reddit
Or if they don't understand the problem they're trying to solve and don't have any criteria to look for in the tool docs?
Whitchorence@reddit
At-least-once delivery is not exactly an obscure detail that's going to be buried deep in the docs. It's a fundamental design principle.
TimMensch@reddit
And it's only relevant to them if they understand what they're reading and what their goals are! I actually chose to not use Kafka exactly because it wasn't suited to use as a job queue, and no, it was not obvious at first glance. I had to dig for it and know what I was looking for.
The recommendation above was to use a tool because "purpose built tools are good." Not to use a tool because it solves the exact problem they're having.
My argument is that first they need to understand what their actual needs are. You're arguing that they will pick a tool based on their actual needs. How will they do that if they don't understand their actual needs? So are you actually agreeing with me that they need to first understand what will solve their problem? In which case, why do they need the tool to begin with since they can just solve it using the tools they have?!
This isn't rocket science. You don't do software engineering by switching out random tools and seeing if they improve the situation after the fact.
Either they learn what their actual bottlenecks are and can therefore fix them without installing a new tool, or they don't understand their bottlenecks and they're rolling the dice.
Understanding is crucial to software engineering. Experiments are fine--profiling is critical--but you need to have a grasp of what's happening or you're gluing together popsicle sticks to build a bridge for trucks, and you're going to end up with disasters more often than not.
RiPont@reddit
It's a tool made for a specific job to solve someone else's problem. If you don't understand the problem in the first place, you're not going to be able to use someone else's tool for your problem.
teerre@reddit
There's no single "problem". There are several "problems" that OP's team has different levels of understanding of. Understanding that "this looks like a pub sub" and actually implementing an efficient pub sub architecture are not comparable.
RiPont@reddit
It doesn't matter if you use a tool for pub-sub if you are blocking on the pub-sub in your own logic.
teerre@reddit
It doesn't, but like I said, a dedicated tool will have its architecture, its API, its documentation, its support, all geared towards disincentivizing you from doing that.
RiPont@reddit
That's pretty rose-colored glasses you've got on, there.
Who, on your team, is going to maintain and support that tool? Do you think you can just set it up and forget about it?
If the team can't use Postgres properly, how are they going to set up and maintain and use Kafka properly? Worse, are they going to start using Kafka for the every-problem-is-a-nail scenario?
Yes, Kafka is better at what it does than anything your team could whip up in a reasonable amount of time. But if you don't actually understand the problem you have in the first place, how can you know that Kafka will be a net benefit for the problems you're having?
I am absolutely NOT a build-don't-buy advocate. But throwing OTS tech at a problem does not make the problem go away. It does, however, vastly increase your IT complexity and/or cloud budget.
teerre@reddit
I'm not OP. Your questions don't really make sense. They are also completely beside the point. The discussion here isn't about the particularities of supporting Kafka or whatever other technology. The discussion here is whether there's a place for adopting a new specialized tool given the right circumstances in a team, in particular, a lack of specialized architectural knowledge. The answer is yes. There's a place to adopt new tools even if your current stack could theoretically support some workflow. That's all.
RiPont@reddit
To start with...
INSERT into a "RequiresUpdate" table, rather than updating a single flag that will have contention. The polling process can do an SQL query to condense it into a boolean. And probably clean up the obsolete entries in a timely manner.
FluffySmiles@reddit
Insert with unique key and on conflict do nothing. All problems solved.
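That dedup-on-insert idea in runnable form (sqlite3 used for portability; the `ON CONFLICT ... DO NOTHING` syntax is the same in Postgres, and the table name is made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE requires_update (app_id INTEGER PRIMARY KEY)")

# Triggers or application code can fire this as often as they like:
# duplicate marks for the same app are silently dropped, so there is no
# single hot flag row being rewritten under contention.
for _ in range(1000):
    db.execute("INSERT INTO requires_update (app_id) VALUES (?) "
               "ON CONFLICT (app_id) DO NOTHING", (42,))

pending = db.execute("SELECT COUNT(*) FROM requires_update").fetchone()[0]
```

The poller then reads (and clears) `requires_update` rather than re-scanning flags on the main table.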
lost12487@reddit
For sure, but if you've got a team of engineers that primarily write programs in a "standard" language like Java or C#, and you've then asked them to build this complicated pub/sub logic in PL/pgSQL, is it really that surprising that they've shot themselves in the foot by not understanding the implications of the architecture?
thekwoka@reddit
Feels like they'd shoot themselves in the foot regardless of it being Postgres or not.
RiPont@reddit
That won't change by throwing an entirely different system at it. People can and do shoot themselves in the foot with Zookeeper or RabbitMQ or any number of things.
If they're not able to do the smallest amount of performance troubleshooting on their use of Postgres, then switching to something else isn't going to help.
The way they're using Postgres has a Big Giant Lock. You can use a Big Giant Lock in any system, and that's your problem, not the system's problem.
TimMensch@reddit
I didn't pick up that they were using PL/pgSQL code.
For very narrow use cases I can get behind that, but in general I like to keep code that isn't declarative SQL out of my database. Easier to run tests on it. But embedding some of the pub/sub logic in the DB isn't de facto bad. My favorite "easy pub/sub" library uses Redis, but I think it also embeds a tiny amount of Lua in Redis as part of the implementation. So that alone isn't enough to make me think the team is not using a solid design.
On the other hand, I swear a significant fraction of the industry codes by superstition and luck. So hearing about a team that doesn't understand what they're doing is not only not surprising, it's completely expected. Especially when you consider that teams that know what they're doing don't generally need to post questions about how to fix things that are failing.
canbooo@reddit
IMO, it is basically never SQL (exceptions like Google scale confirm the rule)
thekwoka@reddit
Or the issue with the query sanitization built into postgres being broken, which let the Chinese hackers compromise the entire US money database...
pheonixblade9@reddit
as someone who found a bug in the spanner query planner while working at google... it's still basically never SQL.
Ok-Dimension-5429@reddit
Making the team smarter and more clueful can be nearly impossible. Reaching for a new technology may be easier and more effective.
TimMensch@reddit
Point.
Though raising the average skill of a team is not impossible, except as much as convincing the ones that hold the purse strings to pay for a sufficiently skilled consultant.
I've been that consultant in the past. The right consultant can make a huge difference. Finding the right consultant is its own challenge though...
Embarrassed-Tear-146@reddit
have you considered any specific case studies or data on switching from postgres?
3legdog@reddit
I'm sensing a LOTR parody here...
tmarthal@reddit
It’s not Postgres, it’s your message handling. Use a message broker instead of a database to pass messages, sounds like you need guaranteed ordering.
Western_Objective209@reddit
Okay you raised good points and no one is giving you solid engineering advice, as is often the case on this sub.
You are using 1 DB to do too many things, so you have tight coupling of async state across systems, which is extremely fragile. You are using Postgres as a badly designed event bus; you need a separate control plane.
So, you can make postgres your control plane. Have a queue table that takes in jobs, and processors pick them off. It's not a great fit for this, something like redis is better, but if your team wants to use postgres for everything it should work. I've worked on projects where they used oracle DB as a control plane, and it scaled pretty well
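A sketch of such a queue table (sqlite3 for runnability, illustrative names; on Postgres you would claim rows with `SELECT ... FOR UPDATE SKIP LOCKED` so concurrent workers skip each other's claims instead of blocking):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending')""")

def enqueue(payload):
    with db:
        db.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))

def claim_one():
    # Atomically flip one pending job to 'running' and return it.
    # The rough Postgres equivalent, safe under concurrent workers:
    #   UPDATE jobs SET status = 'running'
    #   WHERE id = (SELECT id FROM jobs WHERE status = 'pending'
    #               ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED)
    #   RETURNING id, payload;
    with db:
        row = db.execute("SELECT id, payload FROM jobs "
                         "WHERE status = 'pending' ORDER BY id LIMIT 1").fetchone()
        if row is not None:
            db.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    return row

enqueue("app-1")
enqueue("app-2")
first = claim_one()
```

A worker that claims a job this way can then run the expensive processing outside the transaction and write a terminal status (done/failed) when finished.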
smerz-@reddit
The other fancy tools have the same flaws. Indexing is universal
nooneinparticular246@reddit
It sounds like they’re using Postgres wrong. If I were you I’d consider letting the incidents continue and let them figure it out.
evergreen-spacecat@reddit
I agree with your colleagues that you won’t need another database. Even ChatGPT at massive scale use a single Postgres for writes (multiple replicas for reads). You are not even in the same ballpark of scale. However you do need a totally redesigned system of async processing.
Ecksters@reddit
I'm a huge fan of Postgres, but this is a bit misleading: the article that gave everyone this idea explicitly said that they were already starting to shift any write-heavy workloads to systems better suited for them.
That being said, I do agree that Postgres is adequate for far more needs than people give it credit for.
Whitchorence@reddit
Is it? I feel like the pendulum has gone way in the other direction with like "if you're not literally 3 companies in the world you should only ever use Postgres"
dudeaciously@reddit
I agree with @visicalc_is_best. (Great handle. The younguns don't get it.)
Why you would "need" to update 1M rows as a side effect of an insert makes no sense. Thinking about transaction isolation and locking, now this weird "requirement" needing a large enough platform to perform it, all just does not add up.
Please describe in more detail the business requirement.
NoSQL is not the best for transaction consistency. The use case of Facebook, Twitter etc. is that two sessions need not display the exact same results at the same time. So updates are not made to collide against retrievals. That logic fails on ATM machines etc., where we bet our lives on ACID transactions.
RedTuna777@reddit
I would also consider just using another Postgres server. You've got the expertise, so changing tools would be costly; instead, maybe silo your tasks and throw another server on the fire.
NewFuturist@reddit
"Someone accidentally wrote an infinite loop so we switched to a language with no loops" - said no software developer ever.
dbxp@reddit
Your system is only as reliable as its weakest component; you can't patch around a DB with fundamental issues.
jl2352@reddit
I would second the other comments who are making the point that how you use it is the issue, not what you are using.
More than that, as databases go Postgres is surprisingly resilient when misused. There are plenty of mainstream DBs that when misused, will not get slow like you describe. No no no, they just crash. Bringing production fully down.
T0c2qDsd@reddit
Unless you're working at a lot larger scale than I think you are, Postgres is a perfectly good solution. Even 100 million rows (even more so if they occurred because of an exceptional case -- a bug -- instead of intentional behavior...) is not the type of scale you should be worrying about your DBMS at. Like, you will get a lot better ROI for your time and energy by focusing on which operations are triggering full table scans, and run that sort of thing on read-only mirrors (and like... don't do RW transactions that require full table scans, full stop).
The point where "just use Postgres" (or any other single node DBMS) stops being good advice is very limited.
The first case is "you can't keep your workload on a single node", no matter how much you're willing to pay for it -- this is scale in the realm of "you expect to have hundreds of gigabytes of in-flight transactions at the same time".
The second case is if you genuinely need the kind of reliability and uptime guarantees that single node systems cannot reliably offer (e.g. seamless multi-region failover with zero data loss or incorrect transactions/transaction reconciliation requirements/etc.) If you aren't losing a LOT of money every minute you have any downtime or have the sorts of issues like being unable to tolerate losing a few minutes of transactions during a failover or w/e (for example, you're handling banking transactions), traditional replication & failover is almost certainly fine.
Something to understand is you don't get to avoid the challenges of lock contention / event triggers causing loops / needing to think deliberately about indexes / etc. just by using a NewSQL DBMS. FDB and Spanner are incredibly good solutions... for problems that are not the ones you're describing, which sound a lot more like "We didn't build protections against (x) into the system and (x) happened" or "We held a perfectly good database wrong" than "Our single node database literally cannot manage the load".
NewSQL systems also come with their own challenges, like avoiding hotspotting -- but those challenges are *compounded* on top of almost all of the same challenges that traditional DBMS systems come with. As someone pretty senior who has been working on and with NewSQL databases professionally for a lot of my career -- they are technical marvels, super cool, and you probably shouldn't use them if you're able to use or continue to use a single node database.
DeathByClownShoes@reddit
Isn't this the exact problem that Kafka solves?
LawBlue@reddit (OP)
That’s what I was thinking! Except we have been explicitly barred from using other subsystems because “it’s too complicated.”
I don’t think this concern is invalid, FWIW. Some of my past tech leadership has expressed disdain for Kafka before.
roger_ducky@reddit
Kafka is a good use but, if you’re the one maintaining the cluster, it is a pain to maintain.
LawBlue@reddit (OP)
So doing one of the managed solutions would probably take away the operational burden right?
PabloZissou@reddit
Try NATS; it is simple to operate and you can get very good performance and replication. Production setup takes 5 minutes and it rarely fails (I am consuming 10K msgs per second with no problem).
roger_ducky@reddit
Well, yes. At the cost of more “overhead” and possible privacy headaches. There are always tradeoffs. Don’t know your specific situation and budget so can’t make the decision in a rational way from here.
new2bay@reddit
What privacy concerns are you alluding to? I don't see how running an in-house instance is inherently any more or less private than using, say, AWS MSK.
roger_ducky@reddit
It depends on what specific laws you need to follow.
It probably won't matter initially, but certain countries require all your infrastructure for that country to be within their borders. Or at least the data.
Xgamer4@reddit
I mean, your problem is that the team is misusing Postgres. The fact that they're misusing it in a way that better aligns with systems like Kafka or pubsub doesn't necessarily mean they'll use Kafka or pubsub anymore competently. I'd be legitimately worried that introducing the "proper" subsystem for the process flow is just going to lead to two very misused systems instead of one, and no problems are solved.
shawmonster@reddit
Managed temporal is also an option, seems like a good fit
admiral_nivak@reddit
Plug Debezium into Postgres, then stream events off of Kafka. Kafka really is not that hard. We have run it in production for 8 years with almost zero hassle. It’s a pain to set up properly initially, but once it’s running you won’t regret it.
Batmanbacon@reddit
If your setup is simple, I'm not sure what those people imagine under "too complicated"
zrail@reddit
Postgres makes a fine queue if you hold it right. Check out LISTEN/NOTIFY:
https://oneuptime.com/blog/post/2026-01-25-use-listen-notify-real-time-postgresql/view
Cell-i-Zenit@reddit
anonyuser415@reddit
FYI https://news.ycombinator.com/item?id=47640722
zrail@reddit
JFC. Thanks.
anonyuser415@reddit
np, hate this is the new web
eraserhd@reddit
Do not.
LISTEN/NOTIFY is not reliable for multiple reasons.
It may be possible to build good queues on top of PostgreSQL, but please, for the love of god, at least understand all of the technical requirements of queues before you try it. And then don't try it.
I feel like I'm missing half the points.
In any case, I inherited a system with a "queue" table ... actually I think it was 3 related tables. I spent months fixing it.
zrail@reddit
I honestly didn't read that article before I posted it, but I have used [Que](https://github.com/que-rb/que) at an admittedly small scale for about a decade with no real problems. Que combines `LISTEN/NOTIFY`, polling, advisory locks, and stored procs to get a reliable durable queue.
That said, the applications that I have worked on at real scale have all used Sidekiq Enterprise which uses a specific set of Redis operations to get reliable queue operations. If you're using Ruby it should probably be the default choice. I don't have any opinions on non-Ruby things.
Megamygdala@reddit
My workplace uses a lot of triggers in MSSQL and it's an absolute pain to track down bugs when they occur in the DB. They also place something like 40% of core business logic in SQL procs, so I'll never argue against moving business logic out of the DB.
dreamingwell@reddit
We use triggers and business logic in the DB. It’s great.
You need more tests.
Megamygdala@reddit
Can't imagine anything worse than writing tests for SQL (didn't even know people did this because our testing is pretty barebones and mostly QA). Not to mention at enterprise scale there are thousands of tables and thousands of procs all calling each other; it's impossible to track down a trigger.
dreamingwell@reddit
LLMs are great at writing tests.
TestContainers is a fantastic system for throwaway docker containers for testing.
DB logs work great for debugging. And you can even use an LLM to search the logs. Super fast debugging.
The advantage of DB triggers is uniform business logic - no matter the application using that DB.
pheonixblade9@reddit
triggers are a red flag in DB design for me. hidden behavior.
magick_bandit@reddit
They can make sense for audits, but that’s about it.
pheonixblade9@reddit
if you need audits, build it into the DB or use log shipping.
smerz-@reddit
And in my experience performance wise they offer nothing extra
Megamygdala@reddit
100% I hate whoever decided to use triggers / stored procs
vinny_twoshoes@reddit
this made me shudder
ThinkLettuce2459@reddit
relying too much on triggers might complicate things more than it helps in the long run
confusedanteaters@reddit
Relatable. We got views dependent on user defined functions dependent on views dependent on user defined functions.
A simple UI get request taking ~60 seconds to load? Implementing a simple feature in the UI that queries data taking 2 weeks? Well, that's just "normal".
SikhGamer@reddit
You don't mention what scale you are, but if it's only a few thousand a second then something is wrong somewhere.
Also, forget suggesting an alternative until it has blown up a couple of times.
Far_Office3680@reddit
How do you keep everything append only? Version fields? Date fields?
slotix@reddit
This is not “Postgres is insane”, this is “trigger-driven workflow orchestration is a footgun.”
Postgres can handle a lot. Hidden recursive side effects, lock contention on shared state, and non-idempotent async reprocessing are the actual problem.
I’d stop updating a central flag from arbitrary table triggers and move to an explicit work model.
That still works fine on Postgres. The issue is not the database, it’s invisible control flow.
dutchman76@reddit
I'm confused why you need a trigger at all, I'd just have my application track the status of "needs processing" etc. Every minute polling shouldn't hurt performance. I've used MySQL for queues like that before and it was no issue
spez_eats_nazi_ass@reddit
triggers get smelly fast. I love sprocs, as long as all they are doing is simplifying data logic or functioning as a db interface. Biz logic? Hard no, unless it can be argued there is no other way. Absolutely must be managed in source control.
cloroxic@reddit
Your concerns are definitely warranted. The fact that it’s updating the main table constantly just for events is wild imo. Yes, startups do need to be careful to not overcook things for scale they have not yet achieved, but that is not a good justification for poor architecture decisions.
Using Postgres's built-in pub/sub system is better than triggers like this, but there are so many better options nowadays. Using Inngest or Temporal would solve this and offload work from your database.
jedberg@reddit
Sure, but it would also send all their data to a 3rd party, and add a new service, both of which OP said they aren't keen on.
cloroxic@reddit
At this stage, they should be open to third-party tools so they can grow more rapidly, then build their own system when they reach the necessary stage of growth.
Cooking the db now because they are hard-pressed to do it all themselves will just kill the company. 5-10 seconds is an eternity for attention spans of users nowadays.
mpanase@reddit
I don't allow any logic in Postgres
It can be done? Yes
Will it balloon up and become unmaintainable, impossible to debug, impossible to monitor, cost tons of money to run, only be editable by very few people who should be doing other stuff, ...? Yes
I don't negotiate with terrorists. No logic in Postgres
UncleSkippy@reddit
After too many years of working with DB managers who used triggers everywhere, I am at the point where I will never use them unless absolutely necessary.
Side note: it has never been absolutely necessary.
DB triggers are "hidden code" and quickly lead to unmaintainable distributed complexity. The temptation to use them can be overwhelming, but once that door is open, it is VERY difficult to close it.
Just say no to DB triggers unless there is absolutely no other solution.
MinimumArmadillo2394@reddit
I worked at a startup using firebase with db triggers + cloud functions as the entire backend.
It was genuinely ass to work with
azuredrg@reddit
Yeah, when db logic fails, it can be very invisible too.
exploradorobservador@reddit
I made the mistake of writing data processing logic in PL/pgSQL. That was a nightmare
taco_tuesday_4life@reddit
I'm surprised I had to scroll this far; a lot of others don't share the concern about embedding logic in the db like that. Seems like an anti-pattern for most cases.
annette7inker4454@reddit
the spacing makes it hard to read
Leading_Yoghurt_5323@reddit
“we don’t need it at our scale” is usually what teams say right before they keep rebuilding a less reliable version of a queue inside their database
crazyeddie123@reddit
"I have to have queue items created for successful transactions but not for unsuccessful transactions. Hey, what if creating the queue items was part of the transaction?"
cajunjoel@reddit
In 20+ years, I can count on one hand how many times I've used triggers in my database.
The fault is not with Postgres. It is an insanely performant database out of the box, which means you need to look elsewhere for the problem.
You know you have a trigger loop. And you know you need to remove it.
The first step to solving a problem is admitting you have one. :)
I think you need some sort of control table to track what needs processing by the async script. The trigger can still manipulate the control table, but something else can and should poll the control table to determine what needs processing. This way you decouple the trigger from the async script and probably solve many of your problems.
That will also satisfy your database zealots and allow you to keep most of your existing structure.
I'm not suggesting a message queue, because I don't think that would fit your model, but I could be wrong.
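The control-table decoupling described above can be sketched with an actual trigger (sqlite3 here for runnability; on Postgres the trigger would be a small PL/pgSQL function, and the schema is invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE recruit_app (id INTEGER PRIMARY KEY, score INTEGER);
    CREATE TABLE control (app_id INTEGER PRIMARY KEY);

    -- The trigger only records THAT an app changed; the expensive
    -- processing happens later in the poller, outside this transaction.
    CREATE TRIGGER mark_dirty AFTER UPDATE ON recruit_app
    BEGIN
        INSERT OR IGNORE INTO control (app_id) VALUES (NEW.id);
    END;
""")

db.execute("INSERT INTO recruit_app VALUES (1, 50)")
db.execute("UPDATE recruit_app SET score = 60 WHERE id = 1")
db.execute("UPDATE recruit_app SET score = 70 WHERE id = 1")

# The async poller drains the control table; trigger and processing
# never fight over a flag column on the main table.
with db:
    dirty = [row[0] for row in db.execute("SELECT app_id FROM control")]
    db.execute("DELETE FROM control")
```

Crucially, the async script must write its results without going back through paths that repopulate the control table, or the loop OP describes comes right back.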
rgv1993@reddit
Why not just use a message queue?
BosonCollider@reddit
If it isn't too big and it is used for short lived state only you can host a second postgres instance on tmpfs.
03263@reddit
You're the fuckers that try to get me to sing and dance instead of just submitting a resume?
couch_crowd_rabbit@reddit
You all got any advice for me as I refactor the torment nexus my startup built?
rupayanc@reddit
your concerns are warranted but the issue isn't Postgres, it's using the DB as an event bus with nothing preventing a processing loop. the transactional outbox pattern is the right call, and you don't need Temporal for that at your scale.
kagato87@reddit
Triggers INSIDE the database? No. They should be moved to the application instead. Whatever is updating the tables should also be setting that flag.
SQL triggers are for when you DON'T have access to the code or developers capable of making the changes, to add event based logic. If you can do it in a service or worker process, do it there.
Many reasons for this:
First and foremost is what you've encountered here. They generate extra queries and can make locking chains longer. Worse, a trigger can interrupt an insert or update, delaying the write. It can get bad fast, as you've seen, even before a loop gets involved...
Second, if you ever move beyond postgres and into MSSQL or worse, Oracle, the SQL processors are your most expensive processors (because they need the OS license plus the SQL license). You generally want to push processing as far downstream as is practical (as long as you're not sending more data down the pipe to do it).
SQL is my primary domain. I optimize and architect it on my team. I will say with conviction: Triggers are a bandaid solution until the code can be updated to deal with it. Whether that's the code itself setting the flag, or moving the writes into a stored proc that also sets the flag, it should be code driven, not trigger driven.
SkellyJelly33@reddit
Never? Not even for storing history/audit tables that just keep track of what's changed? I have been happy using them for that purpose at my current job
kagato87@reddit
Even then. Your DACs should handle the audit logging, and nobody should have direct write access through the database - stand up a simple CRUD app that goes through those DACs. Heck, most DBAs will be very hesitant to let anyone even have read access...
A trigger in SQL interrupts the write. Even for auditing it carries the risk of looping and write amplification. Triggers can go off when you don't expect them to.
And it gets worse. Triggers happen while the lock is active, and because they're the same transaction the original lock doesn't release until the triggered action also completes. For the audit example, this can actually cause unrelated queries to block each other, and increases the chance of encountering a deadlock.
By contrast, if the application instead makes two writes, the two writes can be independent (or not, there's a flag for that). And, more importantly, it's not going to go off when you don't need it to. Like, say, ETL or Retention.
SQL isn't like normal procedural languages. SQL's behavior will not make sense until you've beaten your head on the desk at least half a dozen times. More, if your first foray into it is backed by significant dev experience.
Yellowbrickshuttle@reddit
We use an audit setter inside our code that overrides the save to the database. If the entity being saved is auditable, it sets the audit properties. This is useful if you need user context, since the db doesn't know that.
thekwoka@reddit
I don't see anything here that suggests the use of postgres is the issue.
It's just the actual process, which would kind of suck with any system.
Whitchorence@reddit
"use postgres for everything" is as much of a stupid and thought-terminating cliche as "sql isn't webscale." hopefully people get tired of that soon
Az4hiel@reddit
Have you tried like... breaking the dependency loop? Why are you even suggesting the problem is related to the underlying technology (which is incidentally Postgres)? I can guarantee you that any messaging/processing/persistence solution can be ground to a halt with the good old async job spawning recursively more async jobs.
LawBlue@reddit (OP)
That’s a good point. The dependency loop is enforced in SQL trigger functions right now, which is why I’m worried. It’s not in business logic or application logic. It’s just directly in the DB. The tech architect right now is very attached to it. It’s hard to read PRs for it and we’re enforcing new triggers for this on every table.
I foresee a world where there are 50 tables and a lot of cognitive load to remember what the triggers do. My juniors are not SQL-strong, so I am balking at committing to this strategy.
Az4hiel@reddit
What does "very attached" mean? Is there some aspect here that makes this approach attractive in some dimension?
Listen, from the vibe of your messages I feel frustration - and I get it but... Have you talked to the guy with empathy and with the assumption that he might know something that you don't? Have you explained the problem calmly and clearly? Outlined the risk? Are you on the same page in terms of facts and is there just a difference of opinions? Have you explored different approaches or at least talked about possible different approaches? Do you know what other possibilities even are? Do they still need to be discovered? What's exactly going on?
I mean I am not sure if reddit is the right place for this but in general the whole thing still seems vague and abstract with critical parts of the context still missing.
LawBlue@reddit (OP)
Yes, I’ve raised these issues before. The primary issue is that the async processing and triggers were added unilaterally in response to perf issues caused by library usage. The triggers were initially too wide, and the logic for triggering downstream processing was not PR’d; it was merged in on the weekend without any review.
It is better now after we went back and fixed it, but I’m wary that it’s very delicate. We had a major slowdown for a week because of a big refactor that went in (approved by the architect) where the triggers were not getting caught. My intuition is that our usage of SQL triggers is too careless and is turning into a footgun considering our Eng practices
mixedCase_@reddit
Would that problem not have occurred if your team had written the business logic in, say Ruby? or Rust? Or Haskell?
Get your code review process in order first. Worry about Postgres and SQL later.
And if you feel uncomfortable with SQL, grab some books and an LLM, make yourself comfortable.
If you're lacking tests for the SQL procs and triggers, add them; it's not too hard once you actually build a harness to run them. Again, have an LLM help you. Welcome to the new world: the time sinks of yesteryear are now viable.
JrSoftDev@reddit
You're making essential points, it would be funny if the response was guy just making sure he is creating enough problems today in order to keep his job relevant tomorrow 😂
WiseHalmon@reddit
Just use another postgres instance 😂
LawBlue@reddit (OP)
Are you serious? Isn’t Postgres not good for multiple-writer and high-volume workloads?
I do agree that using multiple instances would decrease resource-coupling and reduce the blast radius if only certain parts of the app get overloaded
dektol@reddit
Who's going to manage this new infra you're proposing? If you can't make Postgres work and your team thinks you can, slow down and give it a try. I know you don't want to hear this, but 95% of the time it's a skill issue.
Tired__Dev@reddit
Postgres Timescale is.
WiseHalmon@reddit
Yes. And your team told you they don't want added complexity. I'm just jokingly suggesting this to fix your current problem given what you've told me
realdevtest@reddit
The team will love the idea
NoInitiative4489@reddit
reminds me of that one south park episode
Ecksters@reddit
Your team should check out dedicated libraries for using Postgres as a job queue if they want to continue down this path. They use LISTEN/NOTIFY to avoid polling so much, and SKIP LOCKED to allow concurrent job processing efficiently. I've seen them in most languages; pg-boss is one for NodeJS, for example.
throwaway_0x90@reddit
Nobody has ever convinced me that putting business logic in the DB is a good idea
ninetofivedev@reddit
So… I think database triggers are almost always the wrong answer.
magichronx@reddit
Personally I hate the idea of having any kind of business logic living in SQL-land. It's basically hidden logic that's difficult to keep eyes on, difficult to test, and will likely cause many unnecessary headaches.
If you need to handle some kind of event queueing logic then do it in a real programming language, not a query language
single_plum_floating@reddit
Your backend is as coherent as your entire startup concept. So it's whatever.
But yeah a queue solves most of the problem...
RedditNotFreeSpeech@reddit
Bad engineering is bad engineering
Mendon@reddit
I don't think Postgres is the issue here as much as it is the architecture of your updates and background processing. Draw boundaries around what process impacts what, see where you might need optimistic locks or semaphores, and then architect your solution around that. Race conditions are tough no matter the tech.
We use postgres as an event log for instance, and after fixing a tricksy bug around transaction write race conditions, it's very nice we don't have to own or manage a kafka cluster. We use a background processing library that persists into postgres and it also works great. Those other tools are amazing, but sometimes it's worth keeping focused rather than risking a new dependency.
carroteroo2@reddit
Can you elaborate on that tricksy bug? Im just curious...
Mendon@reddit
Basically this: https://blog.sequinstream.com/postgres-sequences-can-commit-out-of-order/
That's not me, just the article we used to fix a condition where consumers were skipping messages.
carroteroo2@reddit
Many thanks
jmking@reddit
Yup - whether you use Postgres or MySQL or Oracle or whatever they'd be having similar or same issues.
Any simple pub/sub service would immediately solve this.
nullbyte420@reddit
Yeah pg is great, I also use it like this. So happy not to need a kafka cluster!
Acceptable_Durian868@reddit
Triggers are so easy to get wrong, to the extent that I'll almost always work around their need with a different solution, like CDC or application level events.
LawBlue@reddit (OP)
Thanks for the responses everyone! As I suspected, Postgres is fine, the way that we’re using it is not fine.
I think the trigger setup was the real issue here. Blaming the DB is not acceptable here, but resource isolating parts might be a good strategy as well in case we get into this issue again.
Unfortunately I don’t think the company politics will allow me to advocate a move away from this easily. Frankly, this has all been good food for thought for perspective…
GrizzRich@reddit
This sounds insanely hard to even reason about much less manage at any scale.
vansterdam_city@reddit
you are acting like proposing the adoption of a brand new tech is some trivial thing. assuming you have a 24/7 uptime SaaS, you need a healthy on-call roster to all learn this new tech and support it. what you are proposing is a super non trivial business investment.
why can't you implement an outbox pattern in another postgres table instead of triggers if triggers are the performance problem?
LawBlue@reddit (OP)
Because triggers have been a top-down mandate. We also do not have an on-call roster. We would not self-host; we’d ofc go with a managed cluster. Ain’t nobody want to deal with the infra when we’re this small
vansterdam_city@reddit
Ok, based on everything you’ve said I don’t see a technology problem here. You use words like “mandate”, “shot down”, “hard blocked”. This is a leadership alignment problem.
Who is the person blocking you? Can you explain the reasons why they might be doing this?
LawBlue@reddit (OP)
There’s a technical architect. They’re blocking other methods because they’re convinced that “triggers are the appropriate way to get consistent on-write signals for downstream side effects at our scale.”
However, they deployed and pushed it unilaterally without review. When it was deployed, it caused an outage because the trigger had been applied to all tables, causing a processing loop. The table updates caused processing to get triggered, which updated the table, which caused processing to get triggered, etc.
It felt bad that we could even just generically apply the trigger to all tables…
chuch1234@reddit
Top-down mandate? Is the person mandating triggers also working on the system?
To be clear though, 'update causes trigger which causes update which causes trigger...' could happen with any system if the person designing it doesn't take loops into account. So the first thing is to just fix that bug. Then you have to establish that "X decision directly led to this problem, and literally anything else could not have had this problem". If that's not true, then it may not be worth the cost to make a dramatic change to your architecture.
MorgulKnifeFight@reddit
Using Postgres like this is suboptimal - you know this would be simple to setup in a pub/sub async task system with Redis or another MQ.
Lots of open source solutions here like Celery etc.
You don’t have to be “big scale” to use async tasks - it’s a fundamental and proven solution.
new2bay@reddit
Last time I used Celery, it was an absolute nightmare. Has it gotten any better in recent years?
MorgulKnifeFight@reddit
I’ve used it in production for many years myself, and I avoid all the pain points. I do hear what you mean though - it has that reputation for a reason.
I know there are some newer task runners within the FastAPI ecosystem, I haven’t worked with any of them personally but I am planning to check them out.
jedberg@reddit
The use of Postgres is fine; the way your company is using it, not so much.
I would suggest you check out DBOS for an example of durable execution using only Postgres, which would fit with your company's "use Postgres for everything" ethos.
I would also look where SKIP LOCKED might apply, that could be a quick fix to the immediate locking problems.
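The SKIP LOCKED idiom usually looks something like this (pseudo-SQL sketch; table and column names are assumed, not from the original post):

```sql
-- Each worker claims one pending row. SKIP LOCKED means a row already
-- locked by another worker is passed over instead of blocked on, so
-- concurrent workers never queue up behind each other's claims.
BEGIN;
SELECT id, payload
FROM processing_queue
WHERE status = 'pending'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- ...do the work, then, still in the same transaction
-- ($claimed_id is a placeholder for the id returned above):
UPDATE processing_queue SET status = 'done' WHERE id = $claimed_id;
COMMIT;
```

If the worker crashes mid-job, the transaction rolls back and the row becomes visible to other workers again, which is most of what you want from a queue.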
dbxp@reddit
The event queuing is a valid use case and might work but I have a feeling you've got other issues in your DB which you would still have issues with just in a different form.
It's almost midnight here and I'd have to look at your telemetry to be sure. I work with SQL Server but it sounds like you may have table locking due to poor indexing on the main applications table. The loop of triggers seems problematic just from a business logic perspective never mind the technical issues you're seeing, I think this needs picking apart or you'll end up in an infinite loop sooner or later.
Really the long term solution is to hire a Postgres DBA as it sounds like your team is missing skills on that side.
amejin@reddit
Maybe instead of a flag use a datetime column and make a priority queue for processing?
cuterebro@reddit
What if you, idk, make two timestamps, updated_at and processed_at, and kinda select entities which were updated after processed?
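Something like this (made-up names, sqlite3 just to show the query shape; same SQL works in Postgres):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE recruit_app (
        id INTEGER PRIMARY KEY,
        updated_at INTEGER NOT NULL,                -- bumped by any write
        processed_at INTEGER NOT NULL DEFAULT 0     -- bumped when the worker finishes
    )
""")
conn.executemany("INSERT INTO recruit_app VALUES (?, ?, ?)",
                 [(1, 100, 100),   # processed and untouched since
                  (2, 250, 200),   # updated after last processing
                  (3, 300, 0)])    # never processed

# The poller needs no trigger and no flag column: "dirty" is derived.
dirty = [r[0] for r in conn.execute(
    "SELECT id FROM recruit_app WHERE updated_at > processed_at ORDER BY id"
)]
print(dirty)  # -> [2, 3]
```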
Great_Distance_9050@reddit
Redis + Redis Streams is the way to go for a startup. Low-effort infra setup and maintenance, and it can easily handle the scale a startup would need. It's generally my starting point for streaming data, similar to Postgres being my starting point for a db.
executivesphere@reddit
Is durability required for your use case? If so, how do you achieve it?
whitehouse3001@reddit
You are not stupid, it sounds like it's time to branch into using more appropriate technologies to handle events and async processing.
VictoryMotel@reddit
This is all to process job applications?
I would probably put files in a directory myself. Maybe a GUI that loads and saves Json files to be fancy.
alicem0onpie6123@reddit
your concerns seem valid. relying heavily on triggers can get messy quickly
Throwitaway701@reddit
I worked in the civil service. I wish they had this problem, there's nothing they won't use excel for instead
AnimaLepton@reddit
Like everyone is saying, it's not really a Postgres/scale limitation you're solving for, it's about the application access pattern and guardrails.
But yeah, I'm another person that'd recommend either Redis with their streams data type, or a lightweight Redis-backed worker setup as a cleaner approach. At the extreme end, you could even do a Debezium/RDI or other CDC type pattern to keep Postgres as the source of truth while driving processing off Redis.
aFqqw4GbkHs@reddit
The glaring, blinking red flag here is all those triggers ... that's an architectural mistake regardless of the db choice. I haven't seen a system like that in more than 20 years, not since we got reliable queueing and pub/sub tech.
TheseHeron3820@reddit
Isn't this exact use case the reason why Change Data Capture was invented?
terrany@reddit
OpenAI uses a single primary postgres instance, if you’re bigger than their traffic maybe there’s a point
GoonOfAllGoons@reddit
I'm more of a MSSQL guy, but this sounds nasty.
Narrow it down to when it really needs processing; this looks like killing a housefly with a howitzer.
wrex1816@reddit
If you can't explain why you use something or do something a certain way besides "Well, uh, that's what I heard everyone is doing", then you shouldn't be in a decision making position, your company needs to hire people with the experience they need.
hippydipster@reddit
Updating data requires processing the data.
Processing the data causes updates to the data.
Which requires processing the data.
Which updates the data.
And around and around we go. What does this have to do with Postgres or queues or database triggers or application logic? There's a fundamental problem here. Probably at some point someone's going to suggest "hey, let's make parallel methods to update data that don't run the triggers" (or some version of that), and then maybe you'll get things working, but it'll always be a breath away from a mistake grinding it to a recursive halt.
The real solution would seem to be to think very hard about your data model and these processing jobs, and figure out how to design processing jobs that never update data that's used to trigger new processing.
TheEclecticGamer@reddit
Look into dbt? Maybe with fifo queued changes?
Dbt with the interim tables is helpful for debugging the steps, and it will let you use basically the same logic you have in the triggers. The FIFO queue will stop concurrent processes on the same object.
corny_horse@reddit
I've been a data administrator and/or engineer for a little over 10 years now, and... I'm going to be honest... I've never seen a database trigger that I thought was great engineering.
SessionIndependent17@reddit
I'm stuck on why an essentially-internal system for such a narrow HR purpose would be allowed to be entwined with other things in such a way to be able to impact customer-facing systems.
okayifimust@reddit
You don't have a "using postgres for all the wrong things" problem, you have a "working with idiots" problem.
If you have millions of rows for... anything, then you are at the scale where you can't ignore best practices anymore.
And you're having issues and blockages.
Any argument about whether it was okay at the start is moot - and it's not like a message queue is arcane magic, or a godless abomination, or something. It's standard, not niche, not bizarre.
When I read your title, I was looking forward to singing the praises of weird graph databases; then again, I didn't understand all of the optimisations you speak of - but all of that is beside the point.
Postgres is good for a lot of stuff, but not absolutely everything fits the shape of a relational database. Of course it doesn't.
ecethrowaway01@reddit
It sounds like you're failing to gather alignment. Are other people not agreeing that the 5-10s delays are a problem? If they are, what solutions are they proposing?
xpingu69@reddit
Okay, then refactor it. What's stopping you?
Boring_Pay_7157@reddit
This is what happens when companies kick out ops and let pure devs architect the system. postgres really shouldn't be used as a pub/sub system, there are dedicated tools for that.
Independent_Two7517@reddit
why's the calendar marked on saturday?
CerealkillerNOM@reddit
Postgres has excellent performance. I used both PostgreSQL and MySQL in Adtech, managing significant load... Your problem is likely somewhere else.
headinthesky@reddit
How big is the instance? Are they cheaping out? Do you have read replicas?
LawBlue@reddit (OP)
No read replicas. The problem is the write traffic. We want consistent updates and processing of apps, especially when humans are awarded $$$ for manually reviewing or processing them sometimes. The volume for that subsystem is growing and is expected to be 2x next year
headinthesky@reddit
It's not the dumbest thing I've seen. We're designing a job queue which uses pg and polling, but it's very quick. I'm not a DBA but maybe there's some optimizations
But an event driven queue that's not in postgres is the right answer. If you're really inclined, build a small poc and show them the hard data of how much more efficient it is. Yeah, it needs more infra, but maybe you can get away with using Redis instead of kafka
breek727@reddit
Rabbit typically has lower overhead, and I'm not convinced a stream is needed, especially as OP is going off a flag; I assume that when processing gets picked up, it will be working with the current state of affairs.
headinthesky@reddit
I haven't used rabbit in a long time, I remember it being a pain to manage in HA mode, but hopefully that's improved now. It's also a solid route
breek727@reddit
Yeah, there’s a pretty solid k8s controller for it now, with quorum queues across the cluster. With push/pop you still have ack issues in the case of failure during processing, though, I think
headinthesky@reddit
Yeah, there are some solidified implementations using Lua scripts to handle that.
But if you're going from scratch, rabbit is probably the better choice. Redis would be good if you are already using it or need the other features
LawBlue@reddit (OP)
That’s a good point. I was thinking serial processing of the incoming events in order is not necessary 90% of the time. Some side effects like SMS back to recruiters or hiring manager — some of these should be done in order or skipped to only reflect the latest state. I’m just worried that if you do not start with in-order processing, it’ll be harder to get it later as we scale
breek727@reddit
There's a few patterns you can use, like a state machine, that would allow you to architect an ordered state of actions without forcing the queuing itself to be ordered.
IMHO if you can architect away from ordered queuing you’ll be making your scaling challenges easier rather than harder down the line
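e.g. a rough sketch of "only act on the latest state" without ordered delivery (all names hypothetical): stamp each row with a version, carry it on the event, and drop any event whose version is no longer current when the worker picks it up.

```python
# Hypothetical sketch: events carry the row version they were created at.
current_version = {}   # app_id -> latest version (would live in the DB)

def record_update(app_id):
    current_version[app_id] = current_version.get(app_id, 0) + 1
    # the event can be enqueued in any order, with the version it saw
    return {"app_id": app_id, "version": current_version[app_id]}

def handle(event, send_sms):
    # The side effect fires only if this event still reflects the latest
    # state, so out-of-order or duplicate deliveries are harmless no-ops.
    if event["version"] == current_version[event["app_id"]]:
        send_sms(event["app_id"])
        return True
    return False

sent = []
e1 = record_update(7)
e2 = record_update(7)    # a newer update supersedes e1
handled = [handle(e, sent.append) for e in (e2, e1)]  # delivered out of order
print(handled, sent)     # -> [True, False] [7]
```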
andreortigao@reddit
If using a pub/sub is overkill for your needs, you may want to look at different approaches to replace the triggers. For example, you could use a view that checks the tables for an update timestamp. This would require minimal refactoring in your codebase and be pretty straightforward to implement in the db.
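Rough sketch of the view approach (hypothetical table names, sqlite3 standing in for Postgres): the poller only ever queries the view, and adding a table later is a view change, not a code change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE videos      (app_id INTEGER, updated_at INTEGER, processed_at INTEGER);
    CREATE TABLE audio_clips (app_id INTEGER, updated_at INTEGER, processed_at INTEGER);

    -- One view unions every table that can dirty an application.
    CREATE VIEW apps_needing_processing AS
        SELECT app_id FROM videos      WHERE updated_at > processed_at
        UNION
        SELECT app_id FROM audio_clips WHERE updated_at > processed_at;

    INSERT INTO videos      VALUES (1, 200, 100);  -- dirty
    INSERT INTO audio_clips VALUES (2, 100, 100);  -- clean
""")

pending = [r[0] for r in conn.execute("SELECT app_id FROM apps_needing_processing")]
print(pending)  # -> [1]
```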
unconceivables@reddit
That's exactly what I was going to suggest. Triggers are a really dumb idea here (well, almost everywhere), and it's much easier to do this in other ways. They're polling anyway, may as well just poll for timestamps. Don't need a view either.
andreortigao@reddit
I'd use a view just in case new tables need to be added or removed down the line, I wouldn't need to touch the code that polls the data.
unconceivables@reddit
I guess that's just a difference in philosophy, I'm in the code-first camp and make all database schema changes from code. I find it much easier to keep my sanity that way.
Hairy_Assistance_125@reddit
You lost me at DB triggers. Not a fan.
jcpj1@reddit
I recommend checking out the pgmq extension. It gets you the queueing semantics of a service like SQS, without having to add more infra
DarthCalumnious@reddit
Sounds like the problem might be that you want differential updates when data changes, but there is no discipline around the data dependencies needed to do the updates.
Maybe the right answer is an abstraction over postgres that knows what data needs to be read to do the updates.
It's probably not helpful, but an immutable DB like datomic wouldn't have the locking problems and would probably make the data dependencies crisper.
need-not-worry@reddit
If you poll it every minute, why would it cause a resubmit? Wouldn't it wait till the next minute?
rco8786@reddit
> Currently working at a startup
Use Postgres.
FishGiant@reddit
If management will not greenlight a project for fixing the coded workflows then they must be comfortable with increasing infrastructure spend in order to upsize the system resources for faster execution of the workflows.
HiphopMeNow@reddit
Not much to say: company is dumb, and it will cost them in the future. If you can't work with it, then leave. In a few weeks to a month, a junior can build two prod-ready microservices implementing such a basic producer-consumer event pattern.
dashingThroughSnow12@reddit
Your concerns aren’t wrong. I can understand why they still want to do it this way. I think they can cleanup some of the edges to make it better.