Monday, November 17, 2025

Core System Design Principles (Theoretical Foundations)

The core theoretical principles of system design include:

  • Consistency
  • Availability
  • Partition tolerance
  • CAP theorem
  • PACELC theorem

These are usually grouped under Distributed System Principles.

1.     Availability

Every request receives a response (success or failure), but without a guarantee that it contains the latest data.

Availability refers to the ability of a system to provide its services to clients even in the presence of failures.

Formula for Availability

Availability = Uptime / (Uptime + Downtime)

Where:

  • Uptime → the duration when the system is working and accessible
  • Downtime → the duration when the system is not reachable (due to failure, upgrades, network issues, etc.)

Availability is often expressed using the number of 9s, showing how reliable a system is.

If a service has:

  • 99.00% availability → it has 2 nines
  • 99.9% availability → it has 3 nines
  • 99.99% availability → it has 4 nines

More nines = higher reliability and much less downtime.
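As a quick illustration, here is a minimal Python sketch (the sample downtime figure and the 365-day year are assumptions for illustration) that applies the availability formula and converts each number of nines into an annual downtime budget:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability = Uptime / (Uptime + Downtime)."""
    return uptime_hours / (uptime_hours + downtime_hours)

# Example: 8.76 hours of downtime in a year -> 99.9% ("3 nines").
print(f"{availability(HOURS_PER_YEAR - 8.76, 8.76):.3%}")  # 99.900%

# Annual downtime budget implied by each number of nines.
for nines in (2, 3, 4, 5):
    target = 1 - 10 ** -nines
    budget_hours = (1 - target) * HOURS_PER_YEAR
    print(f"{nines} nines ({target:.{nines}%}): "
          f"{budget_hours:.2f} h/year (~{budget_hours * 60:.0f} min)")
```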

2.     Consistency

Consistency refers to the system's ability to ensure that all users see the same data, regardless of where or when they access it. In a consistent system, any update to the data is immediately visible to all users, and there are no conflicting or outdated versions of the data.

In distributed systems:

All nodes see the same data at the same time.

Types of Consistency Models

  • Strong Consistency: After an update is made to the data, it is immediately visible to any subsequent read operation. Put simply, reads always reflect the latest write.

Ex: A financial system where users transfer money between accounts. Because the system is designed for high data integrity, an update is propagated to every other location before the operation completes. When a user transfers funds from one account to another, both balances are updated and every other component of the system sees the change at once, so all users observe the same, accurate balances and no discrepancies can arise.


  • Weak Consistency: After an update is made to the data, there is no guarantee that a subsequent read operation will immediately reflect the change; a read may or may not see the most recent write.

Ex: A gaming platform where users play online multiplayer games. A player's actions are immediately visible to other players in the same data center, but under lag or a temporary connection loss some players may never see those actions, and the game continues anyway. This can lead to inconsistencies between different versions of the game state, but it allows a high level of availability and low latency.


  • Eventual Consistency: A form of weak consistency. After an update is made to the data, it will eventually become visible to subsequent read operations. The data is replicated asynchronously, ensuring that all copies of the data are eventually brought up to date.

Ex: A social media platform where users post updates, comments, and messages. The platform is designed for high availability and low latency, so the data is stored in multiple data centers around the world. When a user posts an update, it is immediately visible to users served by the same data center, but it may take some time to propagate to the other data centers, so some users see the update before others do, depending on which data center they are connected to. This permits temporary inconsistencies between replicas in exchange for high availability and low latency. The sketch after this list contrasts strong and eventual replication.
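To make the difference concrete, here is a minimal Python sketch: a toy single-leader store, not a production design, contrasting a strongly consistent write (which blocks until every replica has applied it) with an eventually consistent write that replicates in the background:

```python
import random
import threading
import time

class ReplicatedStore:
    """Toy single-leader store with 3 replicas (illustration only)."""

    def __init__(self, n_replicas: int = 3):
        self.replicas = [None] * n_replicas  # each slot is one replica's copy

    def write_strong(self, value):
        # Strong consistency: the write completes only after EVERY replica
        # has applied it, so any later read from any replica sees it.
        for i in range(len(self.replicas)):
            self.replicas[i] = value

    def write_eventual(self, value):
        # Eventual consistency: apply locally, replicate asynchronously.
        self.replicas[0] = value

        def propagate():
            for i in range(1, len(self.replicas)):
                time.sleep(random.uniform(0.01, 0.05))  # simulated lag
                self.replicas[i] = value

        threading.Thread(target=propagate, daemon=True).start()

    def read(self, replica_id: int):
        return self.replicas[replica_id]

store = ReplicatedStore()
store.write_eventual("v1")
print(store.read(2))  # likely None: replica 2 has not caught up yet
time.sleep(0.2)
print(store.read(2))  # "v1": all replicas eventually converge

store.write_strong("v2")
print(store.read(2))  # "v2" immediately: reads reflect the latest write
```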


Strong consistency usually increases latency and reduces availability.


3.     CAP Theorem

CAP theorem states that in a distributed system, during a partition (network failure), you can only guarantee two of the following three:

1.     Consistency (C)

2.     Availability (A)

3.     Partition Tolerance (P)

Because partitions are unavoidable, systems must choose between:

  • CP → Consistent & Partition-Tolerant (sacrifice availability)

  • AP → Available & Partition-Tolerant (sacrifice consistency)

CP Systems (prioritize consistency)

  • Reads/writes may fail during a partition
  • Always return up-to-date data

Examples:

  • ZooKeeper
  • MongoDB with majority writes
  • HBase
  • Spanner (via Paxos/TrueTime)

AP Systems (prioritize availability)

  • Always return a response
  • Data may be stale or eventually consistent

Examples:

  • Cassandra
  • DynamoDB
  • Riak
  • CouchDB

Key Insight

You cannot avoid partition tolerance in real systems, so CAP is really about choosing C or A during a partition.
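To make that choice concrete, here is a minimal Python sketch (a toy model, not a real database) of how a CP node and an AP node can respond to the same read while partitioned from the rest of the cluster:

```python
class Unavailable(Exception):
    """Raised by a CP node that refuses to answer rather than risk staleness."""

class Node:
    def __init__(self, mode: str):
        self.mode = mode           # "CP" or "AP"
        self.data = {"x": 1}       # this node's local (possibly stale) copy
        self.partitioned = False   # True when the rest of the cluster is unreachable

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            # CP: sacrifice availability; never serve data that might be stale.
            raise Unavailable("cannot reach quorum; rejecting the read")
        # AP (or a healthy cluster): always answer, possibly with stale data.
        return self.data[key]

node = Node("CP")
node.partitioned = True
try:
    node.read("x")
except Unavailable as err:
    print("CP:", err)             # the request fails during the partition

node.mode = "AP"
print("AP:", node.read("x"))      # succeeds, but the value may be stale
```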

4.     PACELC Theorem (Advanced Version of CAP)

CAP only considers trade-offs during a partition.

PACELC expands this:

If there is a Partition (P), choose Availability or Consistency.

Else (E), even without a partition, choose between Latency and Consistency.

Interpretation

  • PA/EL → During a partition, choose Availability; otherwise, choose Low Latency
  • PC/EC → During a partition, choose Consistency; otherwise, still choose Consistency

The PACELC theorem was developed to address a key limitation of the CAP theorem: CAP makes no provision for performance or latency.

For example, according to the CAP theorem, a database can be considered available if a query returns a response after 30 days. Obviously, such latency would be unacceptable for any real-world application.

Why PACELC Was Introduced (Simple Explanation)

Problem with CAP

CAP theorem says:

During a partition, you must choose Consistency (C) or Availability (A).

But CAP does not answer an important question:

🟡 What trade-off does the system make when the network is healthy — i.e., when there is NO partition?

Modern distributed databases need additional trade-offs even when everything is working normally.

Example questions CAP doesn't answer:

  • Should the system prioritize low latency?
  • Or should it prioritize strong consistency for every read/write?

This gap led to PACELC.

What PACELC Adds

PACELC expands CAP by adding the Else (E) part:

If a Partition happens (P), choose Availability (A) or Consistency (C).

Else (E), when there is no partition, choose between Latency (L) and Consistency (C).

This clarifies two trade-offs:

1. During a partition:

  • Pick A or C (same as CAP).

2. During normal operation:

  • Pick L (low latency) or C (strong consistency).

Why is this needed? (Practical view)

Modern distributed databases (Cassandra, DynamoDB, Spanner, CockroachDB, MongoDB, etc.) operate across:

  • many regions
  • multiple data centers
  • hundreds of nodes

These systems must make latency vs consistency decisions all the time, even without any failures.

CAP says nothing about this.

PACELC explains:

  • Why some systems are always fast (latency-first)
  • Why some systems always enforce strong consistency (consistency-first)

Concrete Example

Cassandra / DynamoDB (PA/EL)

  • If partition → pick Availability
  • Else (normal) → pick Low Latency

→ gives fast, eventually consistent writes.
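For instance, Cassandra exposes this trade-off per query through tunable consistency levels. Here is a sketch using the DataStax Python driver; the contact point, keyspace, table, and query are hypothetical placeholders, and a reachable cluster is assumed:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("demo_keyspace")  # placeholder keyspace

# EL (no partition, latency first): any single replica may answer,
# so the read is fast but can be stale.
fast_read = SimpleStatement(
    "SELECT balance FROM accounts WHERE id = 42",
    consistency_level=ConsistencyLevel.ONE,
)

# Leaning toward consistency: a majority of replicas must respond,
# which costs latency but narrows the window for stale reads.
safer_read = SimpleStatement(
    "SELECT balance FROM accounts WHERE id = 42",
    consistency_level=ConsistencyLevel.QUORUM,
)

row = session.execute(fast_read).one()
```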