How do you migrate big databases?
Posted by CiggiAncelotti@reddit | ExperiencedDevs | View on Reddit | 96 comments
Hi, first post here. I don't know if this is a dumb question, but we have a legacy codebase that runs on Firebase RTDB and frequently sees scaling issues, at points crashing with downtime or hitting 100% usage on the Firebase database. The data is not that huge (about 500GB and growing), but Firebase's own dashboards are very cryptic and don't help at all with diagnosis. I would really appreciate pointers or content that would help us migrate off Firebase RTDB 🙏
MocknozzieRiver@reddit
I have been involved in or led 5+ zero-downtime database migrations on services that handle millions of requests a second and millions of records, with no or negligible problems (issues only the engineers notice). My current project is a database migration from Cassandra to DynamoDB on the biggest service yet. We've developed an internal library to do it.
Most replies here describe the same idea we've used. The library we wrote handles dual writing without additional latency, self-repairs, and reports standardized metrics/logs, which lets you know for sure everything is in sync.
You need some feature flags:
* dual write (true/false)
* dual read (true/false)
* is new DB source of truth (true/false)

We have a few extras:
* read repairs (true/false)
* delete repairs (true/false)
* synchronous repairs (true/false)
So, if dual writes are on, every database write also writes to the secondary database in an async thread. If the secondary write fails, the request still succeeds, but it publishes metrics/logs saying the dual write failed. If the write produces output, it also records metrics/logs on whether the data matches.
If dual reads are on, every database read reads from both databases in parallel and gathers metrics/logs on whether the data matches. If the secondary read fails, the request still succeeds but metrics/logs are published. If both succeed but the data from primary and secondary don't match, and read repairs and dual writes are on, it repairs the data (meaning it may create, update, or delete data). How it repairs the data depends on whether synchronous repairs are on: if they're off (the default), it repairs in an async thread. And it won't do delete repairs (when the primary DB does not have data the secondary does, meaning it needs to be deleted from the secondary) unless delete repairs are enabled.
So the rollout works like this (a minimal sketch of the flag-gated wrapper is below):
1. Turn dual writes/dual reads/read repairs on, keeping the data in sync (in applications with large traffic you must do a percentage rollout).
2. Do the data migration. Because of what happens during a read when dual reads/dual writes/read repairs are on, you could just retrieve every item in the database; it ends up checking both sources, comparing them, and migrating anything that differs. The longer you wait between steps 1 and 2, the less you need to migrate.
3. Flip the "is new DB source of truth" flag to true.
4. Check metrics. At this point it should not be reporting mismatches.
5. Turn off dual writes/dual reads/read repairs.
6. BURN THE OLD DB WITH FIRE!!
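Here's a minimal TypeScript sketch of what a flag-gated wrapper like that can look like. The `Store` interface, flag names, and metrics hook are placeholders made up for illustration (not the commenter's actual library), and the repair logic is simplified to the read-repair case only.

```typescript
// Sketch of a flag-gated dual-read/dual-write wrapper. All names are illustrative.
interface Store<T> {
  get(id: string): Promise<T | null>;
  put(id: string, value: T): Promise<void>;
}

interface Flags {
  dualWrite: boolean;
  dualRead: boolean;
  readRepair: boolean;
  newDbIsSourceOfTruth: boolean;
}

class MigratingStore<T> implements Store<T> {
  constructor(
    private oldDb: Store<T>,
    private newDb: Store<T>,
    private flags: () => Flags,
    private metrics: (event: string) => void,
  ) {}

  private primary() { return this.flags().newDbIsSourceOfTruth ? this.newDb : this.oldDb; }
  private secondary() { return this.flags().newDbIsSourceOfTruth ? this.oldDb : this.newDb; }

  async put(id: string, value: T): Promise<void> {
    await this.primary().put(id, value); // the request only fails if the primary write fails
    if (this.flags().dualWrite) {
      // fire-and-forget: a secondary failure must not fail the request
      this.secondary().put(id, value).catch(() => this.metrics('dual_write_failed'));
    }
  }

  async get(id: string): Promise<T | null> {
    if (!this.flags().dualRead) return this.primary().get(id);

    const [primary, secondary] = await Promise.allSettled([
      this.primary().get(id),
      this.secondary().get(id),
    ]);
    if (primary.status === 'rejected') throw primary.reason;
    if (secondary.status === 'rejected') {
      this.metrics('dual_read_failed');
      return primary.value;
    }
    const match = JSON.stringify(primary.value) === JSON.stringify(secondary.value);
    this.metrics(match ? 'read_match' : 'read_mismatch');
    if (!match && this.flags().readRepair && this.flags().dualWrite && primary.value !== null) {
      // async repair: copy the primary's view into the secondary
      this.secondary().put(id, primary.value).catch(() => this.metrics('repair_failed'));
    }
    return primary.value;
  }
}
```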
personalfinanceta5@reddit
How do you think about correctness for something like this approach? Is this guaranteed to converge to something 100% correct or is this a good enough solution?
Naively this seems like a setup that could lead to inconsistent data. To make up a corruption case: imagine two async writes issued to the new DB that are delayed significantly and run out of order. If these incorrectly ordered writes overlap with the table-scan step of the migration, and the entries then don't get touched again, couldn't that leave them permanently out of sync?
The general case of migrating even relatively simple tables across databases that don't support cross-database transactions has always seemed very challenging and interesting to me.
MocknozzieRiver@reddit
> How do you think about correctness for something like this approach? Is this guaranteed to converge to something 100% correct or is this a good enough solution?
It would be a "good enough" case, unless you did a second pass of the migration (which would still be "good enough," but with additional certainty; in the second pass, it should report that everything matches). If you didn't do something like that, then once you migrate, the match-percentage metrics tell you whether active users' data matches.
> Naively seems like a setup that could lead to inconsistent data... permanently out of sync
Yes, that is a possible situation. 😄 We have had it happen before: Say you're going to do a delete. The code gets the entity to see if it exists before deleting. The read triggers an async read repair because they don't match. The delete happens, and then the read repair ends up bringing the entity "back from the dead."
There is a chance it would be permanently out of sync if you turned off the read-repair flag after this happened, or if the user never did another operation that repairs it. But the out-of-sync entity should be repaired on a listing read (e.g. if they did a listing operation, the extraneous entity would be cleaned up). Our services typically do way more reads than writes, so there's usually ample opportunity for things to be repaired.
For the migration I'm currently working on, I have also been thinking about passing additional information to the feature flag to disable read repairs on certain endpoints (we use LaunchDarkly for our feature flags). I haven't done it before, but for example, it would make sense to have a rollout rule where read repairs are always disabled on a delete endpoint to prevent the situation above from happening.
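For what it's worth, a hypothetical sketch of that per-endpoint idea, assuming the LaunchDarkly Node server SDK: the flag key, context kind, and `endpoint` attribute are made up, and the actual targeting rule (e.g. return `false` whenever the endpoint is a delete) would live in the LaunchDarkly dashboard.

```typescript
// Hypothetical: gate read repairs per endpoint via a LaunchDarkly flag.
// Flag key, context kind, and attribute names are illustrative only.
import * as ld from '@launchdarkly/node-server-sdk';

const ldClient = ld.init(process.env.LAUNCHDARKLY_SDK_KEY!);

async function readRepairsEnabled(endpoint: string): Promise<boolean> {
  await ldClient.waitForInitialization();
  // A targeting rule in LaunchDarkly could return false when endpoint === 'deleteNote'.
  return ldClient.variation(
    'read-repairs-enabled',
    { kind: 'service', key: 'notes-service', endpoint },
    false,
  );
}
```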
Keep in mind that the migrations I've done have been on tables with several million to a billion distinct items, for services with several hundred thousand to a million requests per second, which makes it virtually impossible to 100% guarantee everything was migrated correctly with zero downtime. 😅 But it's totally a fun and interesting challenge!!
_sagar_@reddit
Qq: why move from Cassandra to DynamoDB? Isn't cost an issue? Have you also evaluated other DB choices for the migration? Curious to know.
MocknozzieRiver@reddit
The choice of DynamoDB was made a long time ago (either before I joined or when I was a very new employee), but I'm guessing it's mostly because everything else we use is AWS. Maybe we have a deal or something. Also, our Cassandra DB is self-maintained, so we're paying for the AWS infra it runs on and the team that maintains it, whereas DynamoDB wouldn't need a team to run it.
All that to say I don't know but I can guess lol.
CiggiAncelotti@reddit (OP)
You are a Gem 💎 🙌 Thank you so much for such a detailed and well thought out comment! I will save this and present this plan to the team soon 🫡 I don’t know if this is considered okay here, but would you mind if I DM you later on if I have some more questions?
MocknozzieRiver@reddit
Absolutely!! I will try to answer in a timely manner but I am busy with this data migration hahaha! It also comes with redesigning the table for DynamoDB sooo that's also challenging. And I'm trying to buy a house and plan a wedding 😂😭 (everything at once I guess lmao)
eastern-ladybug@reddit
Many folks have mentioned dual writes to both databases, but make sure not to do it at the application layer. You don't want to deal with failure cases where only one DB write succeeds or writes happen in a different order. You want to use something called Change Data Capture (CDC) to stream writes sequentially from the current database's log to the new database. That will make your life easy.
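Firebase RTDB doesn't expose a write-ahead log to tap into, but its realtime listeners can serve as a rough change stream. A sketch using the Firebase Admin SDK, where `writeToNewDb`/`deleteFromNewDb` are placeholders for whatever client the new database uses (note this only preserves per-path ordering, not a true sequential log):

```typescript
// Sketch: use RTDB realtime listeners as a poor man's change stream.
// Target-side functions are placeholders, not a real API.
import * as admin from 'firebase-admin';

admin.initializeApp({ databaseURL: 'https://your-project.firebaseio.com' });
const ref = admin.database().ref('/notes');

declare function writeToNewDb(key: string, value: unknown): Promise<void>;
declare function deleteFromNewDb(key: string): Promise<void>;

// child_added also fires once for every existing child, which doubles as an initial backfill.
ref.on('child_added',   snap => writeToNewDb(snap.key!, snap.val()));
ref.on('child_changed', snap => writeToNewDb(snap.key!, snap.val()));
ref.on('child_removed', snap => deleteFromNewDb(snap.key!));
```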
Double_Meaning_4885@reddit
This should be the top rated comment. There are so many edge cases this solves.
UnC0mfortablyNum@reddit
Without downtime it's harder. You have to build something that writes to both databases (old and new) while all reads still happen on the old one. Then you ship some code that switches the reads over. Once that's up and tested you can delete the old DB.
That's the general idea. It can be a lot of work depending on how your db access is written.
zacker150@reddit
Instead of just reading from the old database, read from both, validate that the resulting data is the same, and discard the result from the new system.
That way, you can build confidence that the new system is correct.
Fair_Local_588@reddit
This. Add alerts when there’s a mismatch and let it run for 2ish weeks and you’re golden.
Complex_Panda_9806@reddit
I would say have an integrity batch that compares against the new database instead of reading from both. It's practically the same but reduces unnecessary DB reads.
Fair_Local_588@reddit
An integrity batch? Could you elaborate some more?
Complex_Panda_9806@reddit
It might be called something else elsewhere, but the idea is to have a batch job that, daily or more frequently, queries both databases as a client and compares the results to check for mismatches. That way you don't have to read the new DB every time there is a read on the old one (which might be costly if you are handling millions of requests).
Fair_Local_588@reddit
Oh I see. Yeah how we’ve (usually) handled the volume is just to pass in a sampling rate between 0% and 100% and do a best-effort check (throw the comparison tasks on a discarding thread pool with a low queue size) and then keep that running for a month or so. Ideally we can cache common queries on both ends so we can check more very cheaply.
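A rough TypeScript sketch of that sampled, best-effort check; the sampling rate, in-flight limit, and JSON-based comparison are all illustrative choices (the bounded in-flight counter stands in for a discarding thread pool with a small queue):

```typescript
// Best-effort comparison: only a fraction of reads are checked, and checks are
// dropped when too many are already in flight. All constants are illustrative.
const SAMPLE_RATE = 0.05;   // check ~5% of reads
const MAX_IN_FLIGHT = 20;
let inFlight = 0;

function maybeCompare<T>(
  readOld: () => Promise<T>,
  readNew: () => Promise<T>,
  reportMismatch: (oldVal: T, newVal: T) => void,
): void {
  if (Math.random() > SAMPLE_RATE || inFlight >= MAX_IN_FLIGHT) return; // drop the check
  inFlight++;
  Promise.all([readOld(), readNew()])
    .then(([oldVal, newVal]) => {
      if (JSON.stringify(oldVal) !== JSON.stringify(newVal)) reportMismatch(oldVal, newVal);
    })
    .catch(() => { /* best effort: ignore comparison failures */ })
    .finally(() => { inFlight--; });
}
```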
I’ve used batch jobs in that way before, and they can be a better option if it’s purely a data migration and core behavior doesn’t change at all. But a lot of migrations we do are replacing certain parts of our system with others where a direct data comparison isn’t as easy, so I think I just default to that usually.
That’s a good callout!
Complex_Panda_9806@reddit
I will definitely also consider the low queue size. It might help avoid overloading the server, because even with the batch you still have peak-time usage to consider. Thanks for the tip.
tcpukl@reddit
It's not always going to be 2 weeks. Depends on usage.
Capaj@reddit
no you're not, in 2 weeks you find 100s of mismatches :D
hibikir_40k@reddit
It's typically a multi step affair, where you fire metrics on discrepancies and return the old value.
EnotPoloskun@reddit
I think having a script which runs through all records once and checks that they are the same in both DBs should be enough. Doing a double read on every request + compare logic looks like a total performance killer.
briank@reddit
You can do the read check async
craulnober@reddit
You don't need to do the check for every request. You can do random sampling.
zacker150@reddit
The point is to make sure that all your queries are right and that there's no edge case that your unit tests missed.
TopSwagCode@reddit
This. Making 2 database queries won't kill performance. Run both at the same time, so you don't call one, wait, and then call the next. Then the only real overhead is the RAM to keep both results in memory and do the comparison.
forbiddenknowledg3@reddit
This. Feature flag + scientist pattern.
GuyWithLag@reddit
This.
You also get for free a performance gadget identifying regressions or wins in execution speed.
pheonixblade9@reddit
I would also add that it's a smart idea to have a job running in the background that compares both DBs and makes sure the data is consistent between the two, both before and after the write switch-over. Extra insurance.
CiggiAncelotti@reddit (OP)
The biggest shitty problem with Firebase RTDB is that you can't confirm the actual schema of the models, and God forbid you access/read a node (Firebase RTDB is hierarchical) with a lot of data (>10MB): you are doomed. I did consider the double writes, and for rollback I thought the best approach would be to keep double writing, but I don't quite understand how to automate checking both databases for whether we missed something.
pm_me_n_wecantalk@reddit
You need a third system here:
* Write to both DBs. The system should know that after date X, data is written to both systems.
* Read from Firebase.
* Add a third party (call it a checker/auditor) which runs at regular intervals and verifies whether the data written to Firebase between T-1 and T exists in the new DB. If it doesn't, it should either page or do the write.

That's the general idea (a sketch is below); there is a lot more to unpack here, which can't be done without knowing more details.
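A sketch of what such a checker could look like against RTDB, assuming records carry an `updatedAt` timestamp (and an index on it); `fetchFromNewDb`, `repairNewDb`, and `page` are placeholders, not real APIs:

```typescript
// Periodic auditor: pull records written to Firebase since the last run and
// verify they exist (and match) in the new DB. Paths, fields, and the
// new-DB client are assumptions for illustration.
import * as admin from 'firebase-admin';

admin.initializeApp({ databaseURL: 'https://your-project.firebaseio.com' });

declare function fetchFromNewDb(key: string): Promise<unknown | null>;
declare function repairNewDb(key: string, value: unknown): Promise<void>;
declare function page(message: string): void;

async function auditSince(lastRunMs: number): Promise<void> {
  const snap = await admin.database()
    .ref('/notes')                      // assumes an .indexOn rule for updatedAt
    .orderByChild('updatedAt')
    .startAt(lastRunMs)
    .once('value');

  const recent: Record<string, unknown> = snap.val() ?? {};
  for (const [key, oldVal] of Object.entries(recent)) {
    const newVal = await fetchFromNewDb(key);
    if (JSON.stringify(oldVal) !== JSON.stringify(newVal)) {
      page(`audit mismatch for ${key}`);
      await repairNewDb(key, oldVal);   // or page only, depending on policy
    }
  }
}

// Run hourly over the previous hour's writes.
setInterval(() => auditSince(Date.now() - 60 * 60 * 1000), 60 * 60 * 1000);
```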
UnC0mfortablyNum@reddit
Rollback in my view is just turning off a feature flag. This is a long process and you need to give it time to discover any issues. Have a feature flag for writing to new db and a separate feature flag for reading. Maybe do that on a per table/domain basis. It's a big job.
CiggiAncelotti@reddit (OP)
Thank you so much again I will try this and hopefully have a better update in the future 🙏
casualPlayerThink@reddit
Maintenance mode on when it is least used; upscale DB; restart.
Or switch to a DB provider where you can do CI/CD-style deploys (like canary or blue/green) for the DB as well.
A few companies where I worked used sync solutions (either managed by the DB or via a replica), or developed their own data sync & validation to know what had been synced.
[tl;dr]
I have seen a large project from the Nordics where a bank spent millions of dollars on an Oracle DB, and when it crashed, they spent 4 days just restarting the database itself; a migration or backup ran for weeks (no joke).
They hired a company that pushed their data into a custom NoSQL store on servers with tremendous amounts of memory, then went into maintenance during a weekend and, in under 8 hours, managed to import all the data into the new database solution. They then developed a sync-to-disk solution to write the NoSQL database back into SQL.
Perhaps you are on the wrong tier or the wrong database engine. Do you know why it crashes? Do you know the bottlenecks? Firebase & co. DBs tend to be a hype train that sounds great, but in reality it is just a waste of money with no benefits. (One of the companies I worked with spent 15K USD per month on MongoDB databases, instead of something like 10 dollars for Postgres with normalized data...)
lisnter@reddit
As many comments have suggested, you can build infrastructure that writes to both the old and new databases. I did this for a very risk-averse corporation (Fortune 500). The system wrote to both but used the old DB as the system of record for several months; then we switched to the new one as the system of record, but with a check against the old DB. After several more months we turned off the dual-write code and just went with the new system. Total time in this mode was 6 months. This was after a full QA period proving out the quality of the new system before it went into production.
Very risk averse corp.
Was migrating from old mainframe green-screen to modern (15+ years ago) Java infrastructure.
CiggiAncelotti@reddit (OP)
Congratulations! Seems like one big achievement under the belt 😄
No-Row-Boat@reddit
This topic is way too complex to handle in a Reddit thread.
I have done many database migrations (and storage cluster migrations). A 50TB Cassandra migration, done by rewriting a seed driver so the cluster was migrated fully without downtime, was one of the most beautiful. I've also migrated Oracle data warehouses that had 500TB in them.
It all depends on the requirements of the organisation, what time span it needs to be completed in and how important uptime is.
Best is: search for someone in your network who has done this before. Involve your developer community. Spread risks and make proper project planning.
Ensure you have backups in some form or another.
These moments are also a great time to ask whether the data needs to be pruned and whether the retention period is sufficient. And don't do this alone.
craulnober@reddit
One thing that's often overlooked is horizontal scaling. Especially if you have date-based partitioning, it can be completely fine to create multiple Firebase instances. Remember, you are an engineer: you just need to solve the problem, you don't need a perfect solution.
CiggiAncelotti@reddit (OP)
That is very smart! I did consider a fall-through option for date-based partitioning: an interface would check if the data exists in the new database and, if not, find it in the old database and write it to the new one. It would probably be a big headache because of the undefined schemas, where someone writes to /notes/372/isActive while /notes/372 is otherwise empty.
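A sketch of that fall-through read (lazy backfill on first access); `newDb` is a placeholder client, and this only handles the whole-node case, not the partial-path writes mentioned above:

```typescript
// Read-through migration: serve from the new DB when present, otherwise read
// the old RTDB path and backfill the new DB. `NewDb` is an assumed interface.
import * as admin from 'firebase-admin';

interface NewDb {
  get(path: string): Promise<unknown | null>;
  put(path: string, value: unknown): Promise<void>;
}
declare const newDb: NewDb;

async function readThrough(path: string): Promise<unknown | null> {
  const fromNew = await newDb.get(path);
  if (fromNew !== null) return fromNew;

  const snap = await admin.database().ref(path).once('value');
  const fromOld = snap.val();
  if (fromOld !== null) {
    await newDb.put(path, fromOld); // lazy backfill on first access
  }
  return fromOld;
}
```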
Rascal2pt0@reddit
Write to both databases at the same time and backfill old data into the new one. The old DB stays the read source until you transition. You can leverage hashes and timestamps to make sure you don't apply out-of-order writes: update where hash = expected hash, where the expected hash is the current version's. That avoids the backfill writing over the top of a record that has already been migrated or updated in the meantime in the new storage. A sketch is below.
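A sketch of that hash-guarded backfill; `conditionalPut` is a placeholder for whatever compare-and-set your new store offers (DynamoDB's ConditionExpression or a SQL `UPDATE ... WHERE hash = ?` would play this role):

```typescript
// Only overwrite a record in the new store if it still carries the hash we read
// before the backfill started, so records already touched by live dual writes
// are left alone. The store interface is an assumption for illustration.
import { createHash } from 'node:crypto';

interface VersionedRecord { value: unknown; hash: string; }

interface NewStore {
  get(id: string): Promise<VersionedRecord | null>;
  // Writes only if the stored hash still equals expectedHash (or the record is absent).
  conditionalPut(id: string, record: VersionedRecord, expectedHash: string | null): Promise<boolean>;
}

const hashOf = (value: unknown) =>
  createHash('sha256').update(JSON.stringify(value)).digest('hex');

async function backfillOne(id: string, oldValue: unknown, store: NewStore): Promise<void> {
  const existing = await store.get(id);
  const applied = await store.conditionalPut(
    id,
    { value: oldValue, hash: hashOf(oldValue) },
    existing ? existing.hash : null,
  );
  if (!applied) {
    // A live dual write won the race; the newer version stays.
  }
}
```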
engineered_academic@reddit
You are probably going to write something like AWS's DMS
marketlurker@reddit
Why are you migrating? You have to answer that question first. In this instance, would it be easier/cheaper just to scale up your hardware?
CiggiAncelotti@reddit (OP)
Firebase is a managed DB service by Google, and it doesn't scale at all. If anything, once you hit the limits, the most you can do is shard the databases, but even after all that the graphs and alerts are so cryptic you can never find the issues in the database. All it is really good for is realtime updates.
marketlurker@reddit
Thank you.
Fair_Local_588@reddit
You can use this as a loose framework for any data migration really.
uuggehor@reddit
This is the answer I'd go with. As mentioned, it also applies to restructuring (shit schema to an improved one), etc. Make it easy to fall back, and reserve a week or two for the inconsistencies that might appear after the first switch-over, before deleting the old implementation.
rks-001@reddit
For a large critical database, if it has a shit schema and you want to migrate to an improved one, I would do that as a separate exercise post-migration. Migration is hard as it is. Having the same schema on both ends makes life just a bit easier.
PajamasArentReal@reddit
How do you keep identities straight between the two DBs? Does the replication code carry over the ID from the old one?
Fair_Local_588@reddit
If you don’t have any foreign keys based on the autoincremented PK I think you just ignore that field and allow a different ID to be generated. If you do, then I don’t know off the top of my head.
Impossible-Ear669@reddit
If you are looking at migrating the service as well as the database then something like the strangler fig pattern would work for you. https://martinfowler.com/bliki/StranglerFigApplication.html
BeenThere11@reddit
Build a system with feature flags that writes to both databases. Have a feature flag that controls whether reads come from the old or the new database.
On a weekend, migrate all data from the old database to the new database while the system is down.
Now switch feature flags to read from the new system while writes still go to both systems.
If all is well, after 2 weeks switch off the writes to Firebase.
After 3 months, remove the feature flags from the code and all traces of the Firebase DB.
Keep a DBA handy for the migration.
Most likely Postgres is your option.
Also think about moving old data to a monthly/yearly archive DB, etc.
higeorge13@reddit
I wouldn't perform migrations during weekends, especially if other vendors are involved; you might experience extremely slow support responses if something happens. I've had this bad experience with AWS RDS, btw, and I'll take working-hours migrations any time.
Open_Technician121@reddit
I wouldn't advocate for working on weekends. Sets a bad precedent.
CiggiAncelotti@reddit (OP)
Of course, no working on the weekend unless something has caught fire. This is just me going manic over Pingdom alerts every few hours.
hkf57@reddit
For an entire DB migration? Chalk it up to the sunk cost of tech debt and give the team a few days of TOIL. It's not like you're moving DB stacks every weekend.
Watchful1@reddit
You could read from both and compare the results. Could even do it for a percentage of traffic and ramp up to 100%.
CiggiAncelotti@reddit (OP)
Thank you so much for the detailed response and the idea for feature flags, that actually does seem pretty useful 🫡🤲🏻
age_of_empires@reddit
AWS Database Migration Service
EnderMB@reddit
Additive, always. NEVER delete or move anything unless there is a replica that you've already shifted reads and writes to, and can confirm all is working and can handle full load.
Darealm@reddit
Like you would eat an elephant - one byte at a time.
saposapot@reddit
In all honesty, you get a good DBA with some experience on that….
A 500GB DB, assuming it's useful data and not junk, is already serious enough to warrant having dedicated people dealing with it.
CiggiAncelotti@reddit (OP)
I am not sure DBAs for Firebase (NoSQL) even exist
Sss_ra@reddit
Postgres has jsonb. I believe DB2 and Oracle also have JSON/BSON support; it might be a good idea to ask some DBAs to confirm.
Content-Particular84@reddit
I think you can approach it like this.
Stage 1: Solve your scaling issue by duplicating the database; the duplicate becomes your historical snapshot. In the current live read & write DB, delete all historical records past an acceptable period, i.e. data older than 1 year. (Banks do this; that's why statement requests are different from transaction-history calls in their interfaces/APIs.)
Stage 2: Data migration. Map the data to the database of your choice and migrate the historical snapshot first.
Stage 3: Since you don't want downtime, proceed with the earlier advice of using feature flags and doing dual writes to the two DBs.
Stage 4: Migrate the missing data between the historical snapshot and the first live writes. Then enable the flags to read only from the new DB. If everything is good, disable the old DB.
Voila
Alpheus2@reddit
Spin up a new DB, write to both. When stable, read from both. When that’s working, set preference to new db.
Stop writing to old DB. If that’s ok then start migrating old data to new DB. Stop reading from old DB. Archive old DB. Take old DB offline.
Do it all again because you need a bigger new DB than you thought.
01010101010111000111@reddit
My first thought would be to identify the root cause of your problem. 500GB of data is practically nothing today, and you should not be experiencing any issues whatsoever UNLESS something is extremely inefficient.
I highly recommend taking a snapshot of your current system, logging your queries for a bit and then doing 5x load testing on whichever solution you want to evaluate.
p_bzn@reddit
Another option besides the ones already given: 1. Shut the app down on Sunday night. 2. With no data coming in, migrate the data calmly. 3. Resume operations a few hours later.
This may be the fastest and safest route if the nature of the application allows it.
You can build double sourced repositories, feature flags, what not, but don’t forget to evaluate the simplest solution first.
CiggiAncelotti@reddit (OP)
We could have done that if Firebase RTDB were not an ass; it's a NoSQL DB whose access patterns and data storage are like no other. And there are no migration tools like a SQL dump available for it. The vendor lock-in is pretty deep 😭😂
p_bzn@reddit
Sorry to hear man, then no easy solution. Although, it’s rough to migrate NoSQL overall.
Then feature flagging and a repository that pushes data into two databases and waits on both of them. Here it's better to be safe than sorry - data corruption is a bitch to fix.
CiggiAncelotti@reddit (OP)
We once had that happen, when a colleague deleted some data from the DB directly. I had to spend 14 hours, pre-GPT, building a streaming jq script to read through the 500GB backup JSON file (Notepad and all other text editors just give up at that size) and pick out that data, which itself took hours just to run and find. So yes, I know the burns well 😭😭😂
SamplingCheese@reddit
Redpanda, Redpanda Connect, and Debezium.
Two streams for each table is the simplest for replication.
You can do any data manipulation you need there too.
This, of course, requires a working CDC solution. Debezium exists and works but is pretty clunky imo. If this is migration-only, that shouldn't be an issue, as you'll most likely throw it away afterwards.
Stephonovich@reddit
Is this MySQL or Postgres? If it’s the latter, set up logical replication to the new DB with copy_data=true, which will snapshot all existing data, and then stream WAL for new data. Be aware that this will read old data into the buffer, which will impact performance. Best to do this off-peak. Or, better yet, snapshot your existing DB and launch a clone. Before you snapshot, create a replication slot so it starts holding WAL. Once the replica is up, subscribe to the source DB’s publication. It’ll pick up the changes pretty quickly. Now launch the new DB, and do a full load from the clone.
If it’s MySQL, despite it having logical replication long before Postgres, inexplicably it doesn’t have the ability to snapshot and stream existing data. Use MySQL Shell to do a parallel dump and restore.
CiggiAncelotti@reddit (OP)
Firebase RTDB is NoSQL 🙁
Stephonovich@reddit
Welp. Good luck.
TransCapybara@reddit
Take snapshots and dump them, load them into the new DB in staging, then scale-test it. Continue doing that with snapshots until you can cut over seamlessly.
midwestrider@reddit
U-haul
Can't beat the TB/s, and the price can be as low as $19.99
*In town
AssistFinancial684@reddit
1 byte at a time
both-shoes-off@reddit
Is there a concept of full and differential backups there (I haven't used Firebase)? We've migrated data ahead of time by doing full backups to move the bulk of it, and then moved the changes (diffs) in a short planned maintenance window to reduce downtime.
Beyond that, the approach may be similar to what others are saying where you develop a parallel write solution to two databases, stand up some sort of replication, or some combination of full backup and one of these options.
loosed-moose@reddit
You don't
cjthomp@reddit
Very carefully.
Material_Policy6327@reddit
Carefully and after hours
CiggiAncelotti@reddit (OP)
🫡😂 100%
0x11110110@reddit
Doing this right now at work, actually. In our case we had to develop a repository-layer interface with two implementations (one for the old DB, one for the new). Then a feature flag makes one the primary and the other the secondary. For any results or errors that come back, we log to Splunk if there's a mismatch. Right now we're slowly rolling this out to customers, monitoring for any mismatches in the data, and releasing patches.
CiggiAncelotti@reddit (OP)
How do you check for mismatches, and how does the repository layer work? Firebase RTDB doesn't allow middleware, as far as I know.
0x11110110@reddit
We check at the time of the read or write and do a deep compare of the results.
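A sketch of what that repository-level check can look like; the repository implementations, flag, and logger are placeholders, and the deep compare is a simple recursive check (the comparison lives in application code, so no DB middleware is needed):

```typescript
// Repository-layer dual read with deep compare and mismatch logging.
// Interfaces, the flag, and the logger are assumptions for illustration.
interface NoteRepository {
  getNote(id: string): Promise<unknown>;
}
declare const firebaseRepo: NoteRepository;   // old implementation
declare const newDbRepo: NoteRepository;      // new implementation
declare const newDbIsPrimary: boolean;        // feature flag
declare function logMismatch(id: string, a: unknown, b: unknown): void; // e.g. to Splunk

function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) return false;
  const ka = Object.keys(a as object), kb = Object.keys(b as object);
  return ka.length === kb.length && ka.every(k => deepEqual((a as any)[k], (b as any)[k]));
}

async function getNote(id: string): Promise<unknown> {
  const [oldVal, newVal] = await Promise.all([firebaseRepo.getNote(id), newDbRepo.getNote(id)]);
  if (!deepEqual(oldVal, newVal)) logMismatch(id, oldVal, newVal);
  return newDbIsPrimary ? newVal : oldVal;  // the secondary result is discarded
}
```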
ummaycoc@reddit
You use an unladen swallow.
dablya@reddit
The keywords to search/ChatGPT for are going to be "CDC", "change data capture", "live sync" and… that's all I can think of. I doubt Debezium has Firebase support, but it might be something you can implement…
CiggiAncelotti@reddit (OP)
Thank you so much. While talking to Claude these did come up when I mentioned "Designing Data-Intensive Applications" (shameful disclosure: I still haven't read it, it's a huge book 😂😭)
dablya@reddit
That book is awesome! And a second edition is scheduled to come out at the end of this year. Spotify also has it as an audiobook.
CiggiAncelotti@reddit (OP)
Damnn Thank you so much for the spotify recommendation 🙏 This is definitely going to be on my list now🙏🤲🏻
metaconcept@reddit
You need to have some variant of:
or
There's a lot of variants of this and you need to work out as a team how best to do it.
CogitoErgoNope@reddit
I am not familiar with that database but, contrary to what was said, it is pretty doable without needing any downtime:
I don't know if it is possible for the database you are using. But that is pretty much how any serious database would allow big migrations. They might call things different names but they all solve the same problems pretty much the same way. If you search for "replication" or "master-slave replication" you will usually find what you need.
CiggiAncelotti@reddit (OP)
Thank you so much for such a detailed response. I don't understand a lot of it right now 🙏 I will read about it and get back here
AustinYQM@reddit
You can export your data as a giant JSON file, then write a script to move it to something else
CiggiAncelotti@reddit (OP)
We did try that. The thing is, the giant JSON file is not easily parseable or able to be kept in memory by NodeJS. Our safest bet was to stream through jq, for which we need to know exactly what node we are accessing and make sure it's still a small size, otherwise jq malfunctions. Even after all that, it still takes a lot of time to get to that node.