A major Postgres upgrade with zero downtime

249 points by stopachka 5 months ago

pilif 5 months ago

> But we had less than a terabyte of data

I really wonder how an in-place `pg_upgrade` of such small amounts of data would take 30+ minutes.

My experience from a less mission-critical situation where 5 minutes of maintenance are absolutely acceptable is that an in-place `pg_upgrade` with `--link` of a 8 TB database takes less than a minute and will not accidentally lose data or fail to properly configure schema search paths or whatever other mess the article was talking about.

I understand that 30 minutes of downtime are not acceptable. But if it's 5 minutes or less, I would seriously consider an offline upgrade using `pg_upgrade`

And if it takes 30 minutes to hard-link less than 1 TB of data files, you should seriously consider changing hosts because that's absolutely unacceptable performance.

stopachka 5 months ago

> 30 minutes
The Lyft team reported 30 minutes for their 30TB database. Our db took about 15 minutes. In the essay we wrote:
> So we cloned our production database and tested an in-place upgrade. Even with our smaller size, it took about 15 minutes for the clone to come back online.
nijave 5 months ago

I don't think pg_upgrade takes the whole time. Some of it is overhead of AWS managed database service where it's creating a snapshot before and after, applying new config, spinning for no apparent reason
timacles 5 months ago

Yeah we just did it with the --link option on a 6TB database and it took like 30 seconds. Something has to be off with their OS settings or disk speeds.
The main challenge with that is running an ANALYZE on all the tables though, that took like 30 minutes during which time the DB was unusable
- pilif 5 months ago
  
  These days it does analyze in stages where it does multiple passes with increasing stats sampling.
  From personal experience, most of the queries become useable after the first stage has completed which on my 8TB database took less than 5 minutes
  - timacles 5 months ago
    
    We did use the --analyze-in-stages option, I think our data model is just not optimal. We have a lot of high frequency queries hitting very large tables of .5 to 1 billion rows. Proper indexing makes them fast but until all the stats are there, the frontend is unusable.
- zonkd1234 5 months ago
  
  Was it unusable because cpu/io was maxed out during ANALYZE?
  - pilif 5 months ago
    
    Analyze itself isn’t the problem.
    After pg_upgrade, no stats will be available for the optimizer which means that any query will more or less sequence-scan all affected tables.
  - timacles 5 months ago
    
    No, its the lack of stats on the tables, any query hitting a medium to large table would be extremely slow.

jedberg 5 months ago

This is impressive! I know others are questioning the "no downtime" bit, but that is why service level objectives exist -- because it really depends on the customer experience.

If you managed to have a cutover with no noticeable dip in business metrics (aka the users didn't notice) then I'd call that a no-downtime upgrade!

Very clever on the improvement over Lyft's methods. Thanks for the writeup. Now maybe someone can get it down from 3 seconds of pausing. :)

honestSysAdmin 5 months ago

> then I'd call that a no-downtime upgrade!
It'd be really convenient for me, well not me but others, if we could tell our customers this. However, those of us running DBaaS do have to offer an actual no-downtime upgrade.

briffle 5 months ago

Logical replication is a great tool, and we are using it for our next DB upgrade coming up in the next few months. It just has a few limitations I wish they would address, especially since logical replication is quickly becoming one of the more popular ways to upgrade databases with minimal downtime.

I understand that logical replication handles partial replication, and the data warehouse use case, and I understand WHY many of the choices they made for it are there.

I just wish that there was a flag you could set in a subscription that would enable it to be a complete 1:1 copy of the database, including DDL, sequences, etc, without having to do all the workarounds that are necessary now.

Currently if a dev adds a column to a table on the publisher, and forgets to do it on the subscriber, there is no error, no indication, until it actually tries to replicate data for that table.

Then you have to be monitoring for that error, and go figure out what other new tables, columns, etc, might have been added to the publisher that are missed on the subscriber. Its a huge opportunity for problems.

svilen_dobrev 5 months ago

dump schema on both, and compare before copying? or that may have false positives?

darth_avocado 5 months ago

Pause all writes > let 16 to catch up > resume writes on 16

Isn’t that….. downtime? Unless you mean downtime to be only when reads are also not available.

stopachka 5 months ago

We count downtime if a service is unavailable and drops requests. In this case, since the pause took about 3.5 seconds, we were able to service all requests.
- darth_avocado 5 months ago
  
  Interesting. If that was the case, I’m curious why you had to write custom code. Would RDS proxy not work?
  The way we usually upgrade is create a new upgraded cluster, replicate all data to new cluster, pause writes and redirect RDS proxy to the new cluster. And that usually takes a few seconds.
- dboreham 5 months ago
  
  By that definition you can have any length pause and still claim zero downtime.
  - nijave 5 months ago
    
    Only if nothing has timeouts. Likely components somewhere in the stack have timeouts so you're bound by that
dspillett 5 months ago

Not if the process doesn't take long, so the most the user (or consuming service) sees is a slower response than usual. That I would class as a temporary performance degradation. If the degradation is significant it might still not be acceptable, of course, but I'd not call it downtime.

honestSysAdmin 5 months ago

Zero-downtime Postgres upgrades have been kind of normalized, at least in the environments I have been exposed to, with pgcat

  https://github.com/postgresml/pgcat

Izkata 5 months ago

Indenting with 2 spaces is for code formatting, which is why the URL isn't a link. Don't indent it if you want a link:
https://github.com/postgresml/pgcat
stopachka 5 months ago

Is there some resource that explains how to do a major version upgrade with pgcat? Would love to take a look
- Nelkins 5 months ago
  
  Probably something like the steps listed in this blog post: https://www.instacart.com/company/how-its-made/zero-downtime...
- honestSysAdmin 5 months ago
  
  As far as I know, there is not. I could probably write something up.
wswope 5 months ago

This is really cool/useful to know about - thanks for dropping the link!

n_u 5 months ago

They say the "stop the world" approach that causes more downtime is

  Turn off all writes.
  Wait for 16 to catch up
  Enable writes again — this time they all go to 16

and instead they used a better algorithm:

  Pause all writes.
  Wait for 16 to catch up. 
  Resume writes on 16.

These seem pretty similar.

1. What is the difference in the algorithm? Is it just that in the "stop the world" approach the client sees their txns fail until "wait for 16 to catch up" is done? Whereas in the latter approach the client never sees their txns fail, they just have a bit more latency?

2. Why does the second approach result in less downtime?

stopachka 5 months ago

> in the "stop the world" approach the client sees their txns fail until "wait for 16 to catch up" is done? Whereas in the latter approach the client never sees their txns fail, they just have a bit more latency?
Yes, this is the main difference. For "stop the world", we imagined a simpler algorithm: instead of a script, we could manually toggle a switch for example.
However, by writing the script, the user only experiences a bit more latency, rather than failed transactions.
- n_u 5 months ago
  
  > If we went with the ‘stop the world approach’, we’d have about the same kind of downtime as blue-green deployments: a minute or so.
  > After about a 3.5 second pause [13], the failover function completed smoothly! We had a new Postgres instance serving requests
  > [13] About 2.5 seconds to let active queries complete, and about 1 second for the replica to catch up
  Why is the latter approach faster though? It seems in the "stop the world" approach wouldn't it still take only 1 second for the replica to catch up? Where do the other ~59 seconds of write downtime come from?
  - stopachka 5 months ago
    
    In the "stop the world approach", I imagined our algorithm to be a bit more manual: for example, we would turn the switch on manually, wait, and then turn it back on.
    You make a good point though, that with enough effort it could also be a few seconds. I updated the essay to reflect this:
    https://github.com/instantdb/instant/pull/774/files
    
    n_u 5 months ago
    
    Did you test the "stop the world" approach? I wonder how the write downtime compares. It seems the 1 second of replication lag is unavoidable. The arbitrary 2.5 seconds of waiting for txns to finish could be removed by just killing all running txns, which your new approach does for txns longer than 2.5 seconds already.
    > ;; 2. Give existing transactions 2.5 seconds to complete.
    > (Thread/sleep 2500)
    > ;; Cancel the rest
    > (sql/cancel-in-progress sql/default-statement-tracker)
    Then you have 2.5 seconds less downtime and I think you can avoid the problem of holding all connections on one big machine.
    > Our switching algorithm hinges on being able to control all active connections. If you have tons of machines, how could you control all active connections?
    > Well, since our throughput was still modest, we could temporarily scale our sync servers down to just one giant machine
    > In December we were able to scale down to one big machine. We’re approaching the limits to one big machine today. [15] We’re going to try to evolve this into a kind of two-phase-commit, where each machine reports their stage, and a coordinator progresses when all machines hit the same stage.
    I guess it depends on what your SLO is. With your approach only clients with txns longer than 2.5 seconds started before the upgrade see them fail, whereas with the "stop the world" approach there would be a period lower-bounded by the replication lag time where all txns fail.
    Cool work thanks for sharing!
    Edit: I feel like a relevant question regarding the SLO I'm not considering is how txns make their way from your customers to your DB? Do your customers make requests to your API and your application servers send txns to your Postgres instance? I think then you could set up a reasonable retry policy in your application code and use the "stop the world" approach and once your DB is available again the retries succeed. Then your customers never see any txns fail (even the long-running ones) and just a slight increase in latency. If you are worried about retrying in cases that are not related to this upgrade you could change the configuration of your retry policy shortly before/after the upgrade. Or return an error code specific to this scenario so your retry code knows.
    Then you get the best of both worlds: no downtime perceivable to customers, no waiting for 2.5 seconds, and you don't have to write a two-phase-commit approach for it to scale.
    If your customer sends txns to your Postgres instance directly, this wouldn't work I think.

wenbin 5 months ago

Awesome!

The best we’ve achieved is 0 downtime for read operations and less than 1 minute downtime for write ops [1]

Achieving 0 downtime for write ops is super hard!

[1] https://www.listennotes.com/blog/a-practical-way-to-upgrade-...

upghost 5 months ago

I can't believe they took the time to make such an amazing write-up. With formatting and everything. Normally I just grab whatever broken crayon is closest and scribble on the back of an overdue bill "don't fsck up next time"

xyzzy_plugh 5 months ago

The title is pretty misleading. They're not even running Postgres, but AWS Aurora, which is Postgres compatible, but is not Postgres.

Also, pausing queries does count as downtime. The system was unavailable for that period of time.

stopachka 5 months ago

> The title is pretty misleading. They're not even running Postgres, but AWS Aurora, which is Postgres compatible, but is not Postgres.
For what it's worth, every command ran works on normal Postgres. Hence we didn't think it mattered to mention Aurora specifically in the title.
> Also, pausing queries does count as downtime.
If a query takes a bit longer to respond, I don't think that counts as downtime. From the perspective of the user, they couldn't distinguish this migration event from some blip of slightly slower queries.
- scottlamb 5 months ago
  
  > If a query takes a bit longer to respond, I don't think that counts as downtime. From the perspective of the user, they couldn't distinguish this migration event from some blip of slightly slower queries.
  It comes down to defining Service Level Objectives (SLOs) that are meaningful to your users. For one system I worked on, latency was important, and so one SLO was "99.999% of <a certain class of> requests with a deadline >=1s should succeed with latency <1s", so if this affected more than 0.0001% of requests in <time interval defined in our SLO>, we'd have called it an outage. But I've also worked on systems with looser SLOs where this would have been fine.
  - nijave 5 months ago
    
    Not only that but I think you also need to take upstream systems into account. With a reasonably robust frontend that handles transient issues and retries reasonably, I think it's ok to say "no downtime"
- RadiozRadioz 5 months ago
  
  Completely depends on what the "user" is. Are they a human, or a machine that explicitly requires timings within a particular threshold?
- lionkor 5 months ago
  
  It depends if it feels like an outage
- awesome_dude 5 months ago
  
  > If a query takes a bit longer to respond, I don't think that counts as downtime
  "We're sorry that your query took 7 hours to be responded to, but it wasn't an outage - honest"
  - stopachka 5 months ago
    
    We would count 7 hours as downtime too. Our pause was less than 5 seconds.
    
    libraryofbabel 5 months ago
    
    Nice job, then! Technical downtime that’s virtually undetectable to users is a big win. In fact, “less than 5 seconds of downtime” in the title would actually make me want to read the article more as I tend to be suspicious of “zero downtime” claims for database upgrades, whereas <5s is clearly almost as good as zero and actually quantified :)
    
    _flux 5 months ago
    
    On the other than "less than 5 seconds of downtime" might give the impression that new queries sent within that time period would be rejected, while zero implies this doesn't happen, i.e. that it's undistinguishable from normal operation for the client.
    And being even more precise in the title would just make it less titley :).
    
    awesome_dude 5 months ago
    
    Yeah - a quantifiable amount in the headline would change the likelihood of the article being taken seriously - it goes from "No downtime? I call BS" to "Less than 5 seconds, that seems reasonable, and worth investigating"
    
    ElijahLynn 5 months ago
    
    Less than 5 seconds seems pretty reasonable to me to call it zero down time.
    
    tossandthrow 5 months ago
    
    5 seconds pause on queries would make our app server drop connections and throw errors under cyclical high load - which would result in a incident.
  - paulddraper 5 months ago
    
    Strong energy of "someone brushed up against me and that's assault" going on here
    
    awesome_dude 5 months ago
    
    [flagged]
SahAssar 5 months ago

AWS Aurora Postgres is a forked Postgres with a different storage engine. Sure you are technically correct, but there are many things called "Postgres compatible" that are very much less Postgres that AWS Aurora Postgres (like for example CockroachDB).
- nijave 5 months ago
  
  Iirc AWS explicitly calls out they still use upstream Postgres query engine and some other parts. It very much _is_ Postgres but not 100% pure upstream Postgres.
  - SahAssar 5 months ago
    
    Yep, for example that is how they advertise protocol, feature and language compatibility.
unethical_ban 5 months ago

They reduced their potential downtime from 60s to what I assume is only a few seconds (they don't state in the article).
If there is not noticeable user impact or unavailability of services (this is unique to each service in existence) then there is no downtime.
- stopachka 5 months ago
  
  > they don't state in the article
  Thank you for pointing this out. I updated the essay to mention how long the pause took explicitly:
  After about a 3.5 second pause [^13], the failover function completed smoothly! We had a new Postgres instance serving requests, and best of all, nobody noticed.
  [^13]: About 2.5 seconds to let active queries complete, and about 1 second for the replica to catch up
  - metadat 5 months ago
    
    What is the [^13] notation? Is it different than a *?
    
    Izkata 5 months ago
    
    They copy/pasted from the article, that's how they're formatting footnote links. Article has 15 footnotes and that's number 13.
paulddraper 5 months ago

> They're not even running Postgres, but AWS Aurora
But everything described is also PostgreSQL compatible.
> downtime
Context switching pauses execution too FYI.

sgarland 5 months ago

I have to wonder – are they using a connection pooler? I'm leaning towards no, since what they did in code can be natively done with PgBouncer, PgCat, et al. That would also explain the last footnote:

> The big bottleneck is all the active connections

For anyone who is unaware, Postgres (and Aurora-compatible Postgres, which sucks but has a great marketing team) uses a process per connection, unlike MySQL (and others, I think) which use a thread per connection. This is inevitably the bottleneck at scale, long before anything else.

I did feel for them here:

> We couldn’t create a blue-green deployment when the master DB had active replication slots. The AWS docs did not mention this. [emphasis mine]

The docs also used to explicitly say that you could run limited DDL, like creating or dropping indices, on the Green DB. I found this to be untrue in practice, notified them, and I see they've since updated their docs. A painful problem to discover though, especially when it's a huge DB that took a long time to create the B/G in the first place.

stopachka 5 months ago

> are they using a connection pooler
We use Hikari [1] an in-process connection pooler. We didn't opt for pgbouncer at al, because we didn't want to add the extra infra yet.
> since what they did in code can be natively done with PgBouncer, PgCat, et al.
Can you point me to a reference I could look at, about doing a major version upgrade with PgBouncer et al? My understanding is that we would still need to write a script to switch masters, similar to what we wrote.
> The big bottleneck is all the active connections
The active connections we were referring too were websocket connections; we haven't had problems with PG connections.
Right now the algorithm we use to find affected queries and notify websockets starts to falter when the number of active websocket connections on one machine get too high. We're working on improving it in the coming weeks.
I updated the footnote to clarify that it was about websocket connections.
> I did feel for them here:
Thank you! That part was definitely the most frustrating.
[1] https://github.com/brettwooldridge/HikariCP
- sgarland 5 months ago
  
  I’m not sure about a reference, other than their docs [0]. Basically, you’d modify the config to point to the new servers, issue PAUSE to PgBouncer to gracefully drain connections, then RELOAD to pick up the new config, then RESUME to accept new traffic.
  This would result in client errors while paused, though, so perhaps not quite the same. To me, a few seconds of downtime is fine, but everyone has their own opinions. EDIT: you could of course also modify your client code (if it doesn’t already) to gracefully retry connections, which would effectively make this zero downtime.
  ProxySQL (which I think now supports Postgres) has a global delay option where you can effectively make clients think that the query is just taking a long time; meanwhile, you can do the same sequence as outlined.
  If you had HA Bouncers (which hopefully you would), you could cheat a little as you eluded to in the post, and have one still allow read queries to hit the old DB while cutting over writes on the other one, so the impact wouldn’t be as large.
  [0]: https://www.pgbouncer.org/usage.html
  - stopachka 5 months ago
    
    > you’d modify the config to point to the new servers, issue PAUSE to PgBouncer to gracefully drain connections, then RELOAD to pick up the new config, then RESUME to accept new traffic.
    The function we wrote effectively executes these steps [1]. I think it would look similar if we had used PgBouncer. I could see it be an option though if we couldn't scale down to "one big machine".
    [1] https://github.com/instantdb/instant/blob/main/server/src/in...
  - Izkata 5 months ago
    
    > This would result in client errors while paused, though, so perhaps not quite the same.
    What? Docs say:
    > New client connections to a paused database will wait until RESUME is called.
    Which fits what I remember when I was testing pgbouncer as part of automatic failover ages ago, if the connection from pgbouncer to the database dropped it would block until it reconnected without the app erroring.
    
    sgarland 5 months ago
    
    I stand corrected! It may also depend on the application itself, timeouts, etc. I’ve seen errors before when doing this, but now that I think about it, it was on the order of a handful of connections out of thousands, so it was probably poor client handling, or something else.
    
    LtdJorge 5 months ago
    
    I thin he means already established connections, but not sure.
    Edit: not true, actually. PAUSE will wait for the connections to be released (disconnected in session pooling, transaction ended in transaction pooling...)
nijave 5 months ago

Curious what you don't like about Aurora? We've found it to generally be better than the older PG offering since it uses clustered storage, you don't pay storage per replica. Additionally, you can pay 30% more per instance for unlimited IOPs
Serverless is generally a non starter unless you have a really really spikey workload
- sgarland 5 months ago
  
  As a disclaimer, I generally dislike most managed offerings of anything, because I don’t think you get nearly the value out of them for the price hike (and performance drop). For DBs especially, I don’t see the value, but I’m also a DBRE with extensive Linux experience, so the maintenance side doesn’t bother me.
  For Aurora in general, here’s a short list:
  * Since the storage is separated, and farther than even EBS, latency is worse. Local, on-hardware NVMe is blindingly fast, enough that you can often forget that it isn’t RAM.
  * I’ve yet to see Aurora perform better; MySQL or Postgres variants. My 13 year old Dell R620s literally outperform them; I’ve tested it.
  * The claimed benefit of being able to take a DB up to 128 TiB is a. an artificial limit that they’ve made worse by denying the same to RDS b. difficult to reach in practice, because of a bunch of gotchas like fixed-size temporary storage, which can make it impossible to do online DDL of large tables.
  * For the MySQL variant, they removed the change buffer entirely (since storage is distributed, it was necessary for their design), which dramatically slows down writes to tables with secondary indices.
  * It’s not open-source. I can and have pored through Postgres and MySQL source code, built debug builds, etc. to figure out why something was happening.
  - gtaylor 5 months ago
    
    I’ve never been on a team that migrated to Aurora PG for raw query perf. It is slower than a bespoke setup that is optimized for raw latency, but Aurora is going to hold up under much higher traffic with much less fuss. It also has an excellent snapshot/restore facility.
  - nijave 5 months ago
    
    Lack of local storage is a fair criticism. I understand balancing reliability with performance but there's some more middle ground like allowing NVMe storage on replicas but not the primary.
    I don't know much about the MySQL variant.
    Aurora isn't open source but I'm also not sure there's a compelling reason. It's highly reliant on AWS ability to run massive scale storage systems that amortize the IO cost across tons of physical devices (their proprietary SAN).
    If you have dedicated staff, managed services are definitely less compelling. We have 2 infrastructure engineers to run 15+ platforms so we're definitely getting a lot of leverage out of managed services. We'd have to 5x in size/cost to justify a specialist.
    
    zokier 5 months ago
    
    Aurora has "Optimized Reads" feature these days that allows using local nvme storage: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...
    
    sgarland 5 months ago
    
    I know, but IMO if you’re wanting higher performance, then Aurora is the wrong approach. Just learn how to manage Linux; it’s not that hard.
    
    nijave 5 months ago
    
    Yeah but it's only tiered and/or temp files.
- dalyons 5 months ago
  
  Aurora has been excellent in my experience. Many operational problems (eg managing replica lag) disappear

mkleczek 5 months ago

I wonder why they didn't use synchronous_commit option. That would eliminate replication lag and allow for real zero-downtime.

stopachka 5 months ago

The script we wrote though already has a method to detect exactly when the replica has caught up [^1]. Had we enabled synchronous commit, we'd had the same kind of performance as now.
A few changes we could do to improve this even further:
Right now the algorithm waits a) 2.5 seconds for transactions to complete, and b) cancels the rest.
We could make a) the amount of time we wait more exact, by actually subscribing to active transactions and waiting for them to complete. This way in most cases we'd only have to wait up to 500ms.
We couldn't do that when we wrote this essay, because we didn't differentiate read connections from write connections. We do now [^2]
We could improve b) how many transactions we have to cancel, by having the function make "attempts" throughout the day. For example, the function could try to wait 500ms for transactions to complete, but if it's a particularily busy period and we have too many straggling transactions in process, the function could abort the migration and try again another time.
[^1]: https://github.com/instantdb/instant/blob/main/server/src/in...
[^2]: https://github.com/instantdb/instant/pull/743

jillyboel 5 months ago

> The next few hours was frustrating: we would change a setting, start again, wait 30 minutes, and invariably end up with the same error.

Sounds about right for cloud services.

nijave 5 months ago

I haven't it tried it, but in another post recently someone mentioned enabling synchronous replication after the clone is up to date

https://www.postgresql.org/docs/current/warm-standby.html#SY...

paulgb 5 months ago

When I see instantdb in the domain on here I always know it's going to be a good read :)

Nicely done!

jatins 5 months ago

> all Instant databases are hosted under one Aurora Postgres instance

Doesn't the increase the chances of one bad tenant taking the database down for all other tenants?

stopachka 5 months ago

> Doesn't the increase the chances of one bad tenant taking the database down for all other tenants?
We have safe-guards in place to prevent and isolate bad actors. It's more tricky then if resources weren't shared, but that is the trade-off for multi-tenancy.