Finding and fixing a data-corruption bug with the help of the community
Thursday, October 24 at 13:40–14:30
Room: Olympia B
Earlier this year, while rolling out PostgreSQL 16, we suddenly saw a very small percentage (0.15%) of services alerting on data corruption. Luckily, we traced the corruption to a faulty FSM (free space map), which is easily fixable without too much downtime. This talk describes how we leveraged the help of the community to find, mitigate, and then fix the bug.
We will take a deep dive into how PostgreSQL writes data to disk and what we did to fix the issue in the end. We will also show how to fix these specific issues without downtime or VACUUM FULL, using a new function we proposed to expose in pg_freespacemap.