
Amazon RDS PostgreSQL Consistency Issues Found
Amazon RDS for PostgreSQL multi-AZ clusters violate Snapshot Isolation, exhibiting G-nonadjacent cycles and Long Fork anomalies in every tested version from 13.15 through 17.4. This may affect data consistency, especially for queries against read-only nodes.
Amazon RDS for PostgreSQL: Consistency Issues
Key Takeaways
- Amazon RDS for PostgreSQL multi-AZ clusters violate Snapshot Isolation, the strongest consistency model supported across all endpoints.
- Under normal conditions, clusters exhibited G-nonadjacent cycles (a violation of Snapshot Isolation) every few minutes.
- Observed phenomena include Long Fork anomalies.
- These issues were present in all tested versions (13.15 to 17.4).
- Amazon RDS for PostgreSQL might provide Parallel Snapshot Isolation instead.
Background
- PostgreSQL uses MVCC (multiversion concurrency control) to provide transaction isolation levels.
- RDS automates tasks like provisioning, storage management, replication, and backups.
- Multi-AZ deployments distribute database nodes across availability zones for higher availability.
- RDS uses synchronous replication.
- RDS provides a primary (read-write) and a reader (read-only) endpoint. Snapshot Isolation is the strongest consistency level supported across all nodes.
Test Setup
- Jepsen adapted its PostgreSQL test suite for Amazon RDS.
- Tests were performed on RDS clusters using gp3 storage and db.m6id.large instances.
- A single EC2 node ran tests against the primary and read-only endpoints.
- No fault injection or failovers were triggered.
- Workload involved transactions over lists of unique integers, stored as comma-separated values in a TEXT field.
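The list-append workload described above can be sketched roughly as follows. This is a minimal illustration, not the Jepsen test code: it uses an in-memory SQLite database as a stand-in for the RDS cluster, and the table name `lists`, the column names, and the helpers `append_element` and `read_list` are all assumptions made for this example.

```python
import sqlite3

# Stand-in only: an in-memory SQLite database replaces the RDS PostgreSQL cluster.
# The table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lists (id INTEGER PRIMARY KEY, val TEXT)")


def append_element(conn, key, element):
    """Append one integer to the comma-separated list stored under `key`."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE lists SET val = val || ',' || ? WHERE id = ?",
            (str(element), key),
        )
        if cur.rowcount == 0:
            # No existing row: this append creates the list.
            conn.execute(
                "INSERT INTO lists (id, val) VALUES (?, ?)",
                (key, str(element)),
            )


def read_list(conn, key):
    """Read the list under `key`, parsing the TEXT field back into integers."""
    row = conn.execute("SELECT val FROM lists WHERE id = ?", (key,)).fetchone()
    return [] if row is None else [int(x) for x in row[0].split(",")]


append_element(conn, 90, 11)
append_element(conn, 90, 3)
print(read_list(conn, 90))  # prints [11, 3]
```

Because every appended integer is unique, a reader that sees a list can tell exactly which appends it observed and in what order, which is what makes dependencies between transactions recoverable from a test history.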
Example of Consistency Violation
- A specific two-minute test run showed a cycle of four transactions (T1, T2, T3, T4):
  - T1 appended 9 to row 89, resulting in [4 9], which T2 observed.
  - T3 appended 11 to row 90, resulting in [11], which T4 then overwrote.
  - T4 appended 3 to row 90, resulting in [11, 3].
  - T2 observed T1's append but not T3's.
  - T4 observed T3's append but not T1's.
- This cycle is G-nonadjacent: the read-write dependencies it contains are not adjacent to one another. It violates Snapshot Isolation because no ordering of the four transactions by timestamp is consistent with everything they observed.
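The structure of this cycle can be checked mechanically. The sketch below is an illustration only: the edge labels are inferred from the bullet points above (assuming T2 also read row 90 and T4 also read row 89), and the helper names are hypothetical. Here `wr` means "read the other's write", `ww` means "overwrote the other's write", and `rw` means "failed to observe a write that the other transaction later installed".

```python
# Dependency edges inferred from the example (source, destination, kind).
# These labels are an interpretation of the summary, not output from a checker.
edges = [
    ("T1", "T2", "wr"),  # T2 observed T1's append to row 89
    ("T2", "T3", "rw"),  # T2 did not observe T3's append to row 90
    ("T3", "T4", "ww"),  # T4 overwrote T3's value of row 90
    ("T4", "T1", "rw"),  # T4 did not observe T1's append to row 89
]


def cycle_from(edges, start):
    """Follow outgoing edges from `start`; return the edge path if it loops back.

    Assumes each transaction has at most one outgoing edge, as in this example.
    """
    out = {src: (src, dst, kind) for src, dst, kind in edges}
    path, node, seen = [], start, set()
    while node in out and node not in seen:
        seen.add(node)
        edge = out[node]
        path.append(edge)
        node = edge[1]
    return path if path and node == start else None


def adjacent_rw(cycle):
    """True if two consecutive edges (wrapping around) are both rw."""
    n = len(cycle)
    return any(
        cycle[i][2] == "rw" and cycle[(i + 1) % n][2] == "rw" for i in range(n)
    )


def is_g_nonadjacent(cycle):
    """A cycle with at least one rw edge and no two rw edges next to each other."""
    has_rw = any(kind == "rw" for _, _, kind in cycle)
    return has_rw and not adjacent_rw(cycle)


cycle = cycle_from(edges, "T1")
print([kind for _, _, kind in cycle])  # prints ['wr', 'rw', 'ww', 'rw']
print(is_g_nonadjacent(cycle))         # prints True
```

The two `rw` edges are separated by `wr` and `ww` edges, which is the shape Snapshot Isolation rules out: under Snapshot Isolation, any dependency cycle must contain two adjacent `rw` edges.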
Implications & Mitigation
- Amazon RDS for PostgreSQL multi-AZ clusters may offer weaker safety guarantees than single-node PostgreSQL.
- Consider examining transaction structures for Long Fork patterns.
- Verify invariants through targeted experiments.
- Using only the writer endpoint, or ensuring every safety-critical transaction includes a write, might help recover Snapshot Isolation.
- Read transactions may disagree about the order in which transactions were executed.
- Anomalies appear to involve queries against read-only secondaries.
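One way to apply the writer-endpoint suggestion above is to route transactions by how much they depend on Snapshot Isolation. The sketch below is a hypothetical illustration: the endpoint hostnames, the `TxnPlan` type, and `choose_endpoint` are assumptions for this example and not part of any RDS API. Whether this routing actually restores Snapshot Isolation for a given workload would still need to be verified experimentally.

```python
from dataclasses import dataclass

# Hypothetical hostnames; substitute the cluster's real writer and reader endpoints.
WRITER_ENDPOINT = "my-cluster.writer.example.internal"
READER_ENDPOINT = "my-cluster.reader.example.internal"


@dataclass(frozen=True)
class TxnPlan:
    """Describes a transaction the application is about to run (hypothetical type)."""
    name: str
    safety_critical: bool  # True if correctness depends on Snapshot Isolation


def choose_endpoint(txn: TxnPlan) -> str:
    """Send safety-critical transactions to the writer; others may use the reader."""
    return WRITER_ENDPOINT if txn.safety_critical else READER_ENDPOINT


print(choose_endpoint(TxnPlan("check-balance-invariant", safety_critical=True)))
print(choose_endpoint(TxnPlan("dashboard-summary", safety_critical=False)))
```

Keeping the decision in one function makes it easy to audit which transactions still touch the read-only endpoint.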