Monday, November 17, 2025

Core System Design Principles (Theoretical Foundations)

Core system design theoretical principles include:

  • Consistency
  • Availability
  • Partition tolerance
  • CAP theorem
  • PACELC theorem

These are usually grouped under Distributed System Principles.

1.     Availability

Every request receives a response (success or failure), but without a guarantee that it contains the latest data.

Availability refers to the ability of a system to provide its services to clients even in the presence of failures.

Formula for Availability

Availability = Uptime / (Uptime + Downtime)

Where:

  • Uptime → the duration when the system is working and accessible
  • Downtime → the duration when the system is not reachable (due to failure, upgrades, network issues, etc.)

Availability is often expressed using the number of 9s, showing how reliable a system is.

If a service has:

  • 99.00% availability → it has 2 nines
  • 99.9% availability → it has 3 nines
  • 99.99% availability → it has 4 nines

More nines = higher reliability and much less downtime.
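The formula and the "nines" scale can be sketched in a few lines of Java (a minimal illustration; the class and method names here are my own, not a standard API):

```java
public class Availability {

    // Availability = Uptime / (Uptime + Downtime), expressed as a percentage.
    public static double percent(double uptimeHours, double downtimeHours) {
        return 100.0 * uptimeHours / (uptimeHours + downtimeHours);
    }

    // Number of leading nines: 99% -> 2, 99.9% -> 3, 99.99% -> 4, ...
    public static int nines(double availabilityPercent) {
        int count = 0;
        double unavailable = 1.0 - availabilityPercent / 100.0; // fraction of downtime
        while (unavailable <= Math.pow(10, -(count + 1)) + 1e-12) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 8,760 hours in a year; 8.76 hours of downtime works out to 99.9% ("three nines")
        double a = percent(8760 - 8.76, 8.76);
        System.out.printf("%.2f%% -> %d nines%n", a, nines(a));
    }
}
```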

2.     Consistency

Consistency refers to the system's ability to ensure that all users see the same data, regardless of where or when they access it. In a consistent system, any update to the data is immediately visible to all users, and there are no conflicting or outdated versions of the data.

In distributed systems:

All nodes see the same data at the same time.

Types of Consistency Models

  • Strong Consistency — After an update is made to the data, it is immediately visible to any subsequent read operation. In simple terms: reads always reflect the latest write.

Ex: A financial system where users transfer money between accounts. Because the system is designed for high data integrity, every update is propagated to all locations before it is acknowledged, so all users and applications work with the same, accurate data. When a user transfers funds from one account to another, both balances are updated at once and every component of the system sees the change immediately. All users therefore see the same updated balances, and no discrepancies can arise.

 

  • Weak Consistency — After an update is made to the data, there is no guarantee that a subsequent read operation will reflect the change; the read may or may not see the recent write.

Ex: A gaming platform where users play online multiplayer games. A player's actions are immediately visible to other players in the same data center, but under lag or a temporary connection loss some players may miss those actions while the game continues. This can leave players with inconsistent versions of the game state, but it allows high availability and low latency.

 

  • Eventual Consistency — Eventual consistency is a form of weak consistency. After an update is made to the data, it will eventually become visible to subsequent read operations. The data is replicated asynchronously, so all copies of the data are eventually updated.

Ex: An example of eventual consistency is a social media platform where users can post updates, comments, and messages. The platform is designed for high availability and low latency, so the data is stored in multiple data centers around the world. When a user posts an update, the update is immediately visible to other users in the same data center, but it may take some time for the update to propagate to other data centers. This means that some users may see the update while others may not, depending on which data center they are connected to. This can lead to inconsistencies between different versions of the data, but it also allows for a high level of availability and low latency.
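The social-media scenario above can be reduced to a toy sketch (this is an illustration, not a real replication protocol): writes land on a primary replica, a separate replicate() step models asynchronous propagation, and a read from the lagging replica returns stale data until replication runs.

```java
import java.util.HashMap;
import java.util.Map;

// Two in-memory "replicas": writes hit the primary immediately;
// the secondary only catches up when replicate() runs.
public class EventualConsistencyDemo {
    private final Map<String, String> primary = new HashMap<>();
    private final Map<String, String> secondary = new HashMap<>();

    public void write(String key, String value) {
        primary.put(key, value);          // write is acknowledged immediately
    }

    public String readFromSecondary(String key) {
        return secondary.get(key);        // may be stale (or missing) until replication runs
    }

    public void replicate() {
        secondary.putAll(primary);        // "eventually", all copies converge
    }
}
```

Before replicate() runs, a reader connected to the secondary sees the old state; afterwards, all readers agree — exactly the window of inconsistency the example describes.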

 

Strong consistency usually increases latency and reduces availability.

 

3.     CAP Theorem

CAP theorem states that in a distributed system, during a partition (network failure), you can only guarantee two of the following three:

1.     Consistency (C)

2.     Availability (A)

3.     Partition Tolerance (P)

Because partitions are unavoidable, systems must choose between:

  • CP → Consistent & Partition-Tolerant (sacrifice availability)

  • AP → Available & Partition-Tolerant (sacrifice consistency)

CP Systems (prioritize consistency)

  • Reads/writes may fail during a partition
  • Always return up-to-date data

Examples:

  • Zookeeper
  • MongoDB with majority writes
  • HBase
  • Spanner (via Paxos/TrueTime)

AP Systems (prioritize availability)

  • Always return a response
  • Data may be stale or eventually consistent

Examples:

  • Cassandra
  • DynamoDB
  • Riak
  • CouchDB

Key Insight

You cannot avoid partition tolerance in real systems, so CAP is really about choosing C or A during a partition.
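The CP-vs-AP choice during a partition can be sketched as a toy quorum check (illustrative only, not any real database's logic): a CP-style write refuses to proceed without a majority of reachable nodes, while an AP-style write accepts on whatever it can reach.

```java
public class CapDemo {
    // CP: refuse the write (lose availability) unless a majority quorum is reachable.
    public static boolean cpWrite(int totalNodes, int reachableNodes) {
        return reachableNodes > totalNodes / 2;
    }

    // AP: accept the write on any reachable node (risk returning stale data later).
    public static boolean apWrite(int totalNodes, int reachableNodes) {
        return reachableNodes >= 1;
    }
}
```

With a 5-node cluster split 3/2 by a partition, the majority side keeps accepting CP writes and the minority side rejects them, while AP writes succeed on both sides.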

4.  PACELC Theorem — Definition (Advanced Version of CAP)

CAP only considers trade-offs during a partition.

PACELC expands this:

If there is a Partition (P), choose Availability or Consistency.

Else (E = Else), even without partition, choose between Latency or Consistency.

Interpretation

  • PA/EL → During partition choose Availability, otherwise choose Low Latency
  • PC/EC → During partition choose Consistency, otherwise choose Consistency (still prioritize C)

The PACELC theorem was developed to address a key limitation of the CAP theorem: CAP makes no provision for performance or latency.

For example, according to the CAP theorem, a database can be considered available if a query returns a response after 30 days. Obviously, such latency would be unacceptable for any real-world application.

Why PACELC Was Introduced (Simple Explanation)

Problem with CAP

CAP theorem says:

During a partition, you must choose Consistency (C) or Availability (A).

But CAP does not answer an important question:

🟡 What trade-off does the system make when the network is healthy — i.e., when there is NO partition?

Modern distributed databases need additional trade-offs even when everything is working normally.

Example questions CAP doesn't answer:

  • Should the system prioritize low latency?
  • Or should it prioritize strong consistency for every read/write?

This gap led to PACELC.

What PACELC Adds

PACELC expands CAP by adding the Else (E) part:

If a Partition happens (P), choose Availability (A) or Consistency (C).

Else (E) when no partition, choose between Latency (L) or Consistency (C).

This clarifies two trade-offs:

1. During a partition:

  • Pick A or C — same as CAP.

2. During normal operation:

  • Pick L (low latency) or C (strong consistency).

Why is this needed? (Practical view)

Modern distributed databases (Cassandra, DynamoDB, Spanner, CockroachDB, MongoDB, etc.) operate across:

  • many regions
  • multiple data centers
  • hundreds of nodes

These systems must make latency vs consistency decisions all the time, even without any failures.

CAP says nothing about this.

PACELC explains:

  • Why some systems are always fast (latency-first)
  • Why some systems always enforce strong consistency (consistency-first)

Concrete Example

Cassandra / DynamoDB (PA/EL)

  • If partition → pick Availability
  • Else (normal) → pick Low Latency

→ gives fast, eventually consistent writes

 

Friday, August 1, 2025

AWS cloud concepts

 Here’s a concise summary of AWS Cloud Architecture concepts from an interview point of view—tailored for technical roles like Cloud Architect, DevOps Engineer, or Backend Developer:


✅ 1. Core AWS Services to Know

Category        | Key AWS Services                    | Purpose
Compute         | EC2, Lambda, ECS, EKS, App Runner   | Run applications
Storage         | S3, EBS, EFS, FSx                   | Object/block/file storage
Database        | RDS, Aurora, DynamoDB, Redshift     | Relational and NoSQL
Networking      | VPC, ALB/NLB, Route 53, API Gateway | Private network, DNS, load balancing
IAM & Security  | IAM, KMS, Secrets Manager, Cognito  | Identity, secrets, encryption
Monitoring      | CloudWatch, CloudTrail, X-Ray       | Logs, metrics, auditing
Messaging       | SQS, SNS, EventBridge, Kinesis      | Messaging/event streaming

✅ 2. Design Principles

Be ready to discuss and apply:

Principle         | Example
Scalability       | Use Auto Scaling, Lambda for on-demand scaling
Fault Tolerance   | Multi-AZ RDS, ELB, Route 53 failover
Security          | Least privilege with IAM, VPC isolation
Cost Optimization | Use S3 for storage, Spot Instances for batch jobs
Automation        | Use CloudFormation or Terraform for IaC

✅ 3. Architecture Patterns

Explain how you design using:

🟦 Microservices

  • Containerized with ECS/EKS

  • Decoupled via API Gateway + Lambda or REST APIs

  • Communication via EventBridge/SQS

🟦 Serverless

  • Lambda for compute

  • API Gateway + Lambda + DynamoDB

  • S3 for storage, SNS for notifications

🟦 3-Tier Architecture

  • Web Layer: S3 + CloudFront

  • App Layer: ECS/EC2/Lambda

  • DB Layer: RDS/Aurora


✅ 4. Security Best Practices

  • Use IAM roles rather than long-lived credentials

  • Store secrets in Secrets Manager/SSM

  • Use VPC, NACLs, and Security Groups

  • Enable encryption at rest (S3, RDS) and in transit (TLS)


✅ 5. High Availability & Resilience

  • Use Multi-AZ for RDS, ECS

  • Use Auto Scaling Groups for EC2

  • Design for failure (e.g., fallback logic)


✅ 6. Cost Awareness

  • Monitor via AWS Cost Explorer

  • Use S3 Lifecycle policies and Intelligent-Tiering

  • Right-size EC2 instances and leverage Spot Instances


✅ 7. CI/CD and DevOps

  • Use CodePipeline, CodeBuild, CodeDeploy

  • Automate deployments with CloudFormation or Terraform

  • Store configs/secrets securely using SSM Parameter Store


✅ 1. How would you design a high-availability web app on AWS?

🔹 Architecture:

  • Frontend: Host static assets (HTML/CSS/JS) in Amazon S3, served via CloudFront for global distribution.

  • Backend: Use Amazon ECS (Fargate) or Auto Scaling Group with EC2 in Multi-AZ for fault tolerance.

  • Load Balancer: Use Application Load Balancer (ALB) across multiple Availability Zones (AZs).

  • Database: Use Amazon RDS (Multi-AZ) or Aurora for high availability and automated failover.

  • Storage: Store user uploads or logs in S3.

  • DNS: Use Amazon Route 53 with health checks and failover routing.

🔹 Key Concepts:

  • Redundant resources across AZs

  • Auto scaling and self-healing

  • Health checks and monitoring (CloudWatch)

  • Database backups & replication


✅ 2. How do you secure a Lambda function that accesses S3 and RDS?

🔹 IAM-Based Access Control:

  • Assign a dedicated IAM role to the Lambda function with:

    • s3:GetObject, s3:PutObject permissions for specific S3 bucket

    • RDS access via Secrets Manager (for DB credentials)
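The role's permissions above could look something like the following IAM policy document (a minimal sketch — the bucket name, region, account ID, and secret name are illustrative placeholders, and real policies should be scoped even tighter where possible):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-app-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-db-secret-*"
    }
  ]
}
```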

🔹 Secure Environment:

  • Use VPC: Place Lambda in a private subnet to connect to RDS in VPC.

  • Restrict outbound internet via NAT Gateway, if needed.

🔹 Secrets Handling:

  • Store DB credentials in AWS Secrets Manager or SSM Parameter Store with encryption.

  • Grant the Lambda role access to decrypt secrets.

🔹 Encryption:

  • Enable S3 bucket encryption (SSE).

  • Use SSL for RDS connections.


✅ 3. How do you handle blue-green deployment on ECS?

🔹 Blue-Green Strategy with ECS (Fargate or EC2):

  • Use AWS CodeDeploy with ECS and Application Load Balancer (ALB).

  • Register both blue (current) and green (new) task sets under the same service.

  • CodeDeploy shifts traffic from blue to green in a controlled manner.

🔹 Steps:

  1. Green version is deployed alongside blue.

  2. Health checks validate the green version.

  3. If healthy, traffic is rerouted via ALB listener rules.

  4. If failure occurs, rollback to blue.

🔹 Tools:

  • CodePipeline + CodeDeploy for automation

  • CloudWatch alarms for rollback triggers
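The shift-and-rollback flow above can be modeled as a small simulation (this is a sketch of the traffic-shifting logic only, not the CodeDeploy API; the step size and health-check sequence are illustrative):

```java
public class BlueGreenDemo {
    // Shift traffic toward green in fixed steps; any failed health check
    // triggers a full rollback to blue. Returns green's final traffic share.
    public static int shiftTraffic(int stepPercent, boolean[] healthChecks) {
        int greenPercent = 0;
        for (boolean healthy : healthChecks) {
            if (!healthy) {
                return 0;               // rollback: all traffic back to blue
            }
            greenPercent = Math.min(100, greenPercent + stepPercent);
            if (greenPercent == 100) {
                break;                  // cutover complete
            }
        }
        return greenPercent;
    }
}
```

With 25% steps and all health checks passing, traffic reaches 100% on green; a single failed check anywhere along the way sends everything back to blue, mirroring step 4 above.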


✅ 4. What's your approach for a multi-region architecture?

🔹 Goals:

  • Global availability

  • Regional failover

  • Lower latency for users

🔹 Strategy:

  1. Frontend: Use CloudFront to serve static content from edge locations.

  2. Backend: Deploy app stacks in multiple AWS regions (e.g., us-east-1 and eu-west-1).

  3. Database:

    • Read Replicas in other regions (for RDS)

    • Or use Amazon Aurora Global Database

    • Or DynamoDB Global Tables for NoSQL

  4. DNS Routing: Use Route 53 latency-based routing or failover routing.

  5. Data Sync:

    • Use S3 Cross-Region Replication

    • EventBridge + Lambda for syncing data/events

🔹 Considerations:

  • Handle data consistency

  • Use infrastructure as code (CloudFormation/Terraform) across regions

  • Monitor each region with CloudWatch dashboards


✅ General AWS Architecture

1. What is the difference between scalability and elasticity in AWS?

  • Scalability means the ability to handle increasing workload by adding resources (either vertical or horizontal).

  • Elasticity means the automatic provisioning and de-provisioning of resources based on current demand (e.g., AWS Lambda auto-scales per invocation).


2. How do you design for fault tolerance in AWS?

  • Use multi-AZ deployments (RDS, ALB, EC2 ASG).

  • Replicate data (e.g., S3 replication, cross-region).

  • Use Auto Scaling for redundancy.

  • Implement health checks, retries, and graceful failover (e.g., Route 53 failover routing).


3. Explain the Well-Architected Framework's pillars.

  1. Operational Excellence – Monitor, automate, and evolve procedures.

  2. Security – Apply least privilege, enable traceability, encrypt data.

  3. Reliability – Recover quickly from failures, test recovery.

  4. Performance Efficiency – Use the right resource types and scaling strategies.

  5. Cost Optimization – Avoid overprovisioning, use spot instances, monitor usage.

  6. Sustainability – Minimize the environmental impact of running workloads (added by AWS as a sixth pillar in 2021).


4. What is a VPC and why is it important?

  • A Virtual Private Cloud (VPC) is an isolated network within AWS.

  • It lets you control networking (IP ranges, subnets, route tables), security (NACLs, security groups), and connectivity (VPN, Direct Connect).

  • Essential for securing and segmenting your AWS environment.


5. How do you secure a web application on AWS end to end?

  • Use HTTPS with ACM.

  • Place app behind WAF and ALB.

  • Authenticate with Cognito or OAuth.

  • Encrypt data using KMS and S3/RDS encryption.

  • Apply IAM with least privilege, restrict S3 access, and use private subnets for backend resources.


✅ VPC & Networking

6. What's the difference between Security Groups and NACLs?

  • Security Groups: Stateful, instance-level firewall.

  • NACLs: Stateless, subnet-level firewall.

  • Use security groups for primary control; NACLs for coarse subnet rules.


7. How do you connect your on-premise data center to AWS?

  • VPN Connection: Encrypted IPsec tunnel over the internet.

  • AWS Direct Connect: Dedicated high-speed line for low-latency, secure data transfer.


✅ Compute

8. Difference between EC2, ECS, EKS, and Lambda?

Service | Description
EC2     | Virtual servers you manage
ECS     | AWS container orchestration
EKS     | Managed Kubernetes
Lambda  | Serverless function execution without managing servers

9. When would you choose Lambda over ECS or EC2?

  • Choose Lambda for event-driven, stateless, short-lived functions (like API triggers).

  • Choose ECS/EC2 for long-running apps, custom networking, or stateful services.


10. How do you handle session state in a stateless EC2 or Lambda setup?

  • Use ElastiCache (Redis), DynamoDB, or S3 to store session state.

  • Never store session state on the compute instance itself.


✅ Storage & Database

11. When would you use DynamoDB over RDS?

  • Use DynamoDB when you need:

    • High throughput and low latency

    • NoSQL schema flexibility

    • Serverless scaling

  • Use RDS for complex relationships, joins, and strong ACID transactions.


12. How do you design a backup strategy for databases in AWS?

  • Enable automated backups and snapshots in RDS.

  • Use Point-In-Time Recovery (PITR) for DynamoDB.

  • Schedule cross-region backups using AWS Backup or Lambda automation.


✅ Security

13. What is the principle of least privilege in IAM?

  • Users and services should only get the minimum permissions needed to perform their job.

  • Prevents lateral movement and security breaches.


14. What are resource-based vs identity-based policies?

  • Identity-based: Attached to IAM users, roles, or groups.

  • Resource-based: Attached directly to resources (e.g., S3 bucket policy, Lambda function policy).


✅ Automation & DevOps

15. How do you automate deployments in AWS?

  • Use:

    • CloudFormation or Terraform for infrastructure as code.

    • CodePipeline, CodeBuild, CodeDeploy for CI/CD.

    • GitHub Actions, Jenkins, or GitLab CI can also integrate with AWS.


16. How do you troubleshoot a failing Lambda function in production?

  • Check CloudWatch Logs for stack traces.

  • Use AWS X-Ray for distributed tracing.

  • Validate IAM permissions and environment variables.

  • Check timeout, memory limits, or input payload issues.


✅ High Availability & Disaster Recovery

17. How would you implement DR (Disaster Recovery) for an RDS database?

  • Use:

    • Multi-AZ for automatic failover.

    • Snapshots for backup/restore.

    • Cross-region read replicas for regional disaster recovery.


18. What are different Route 53 routing policies?

  • Simple: Single IP or record.

  • Weighted: Distribute traffic by percentage.

  • Latency-based: Route to region with lowest latency.

  • Failover: Active-passive setup.

  • Geolocation: Based on user’s location.

  • Multivalue: Return multiple records for load balancing.
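Weighted routing in particular is easy to illustrate: a record is picked in proportion to its weight. The sketch below stands in for Route 53's behavior (it is not the AWS API; the `roll` parameter replaces a random draw so the logic is easy to test, and the record names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WeightedRoutingDemo {
    // Select a record by walking cumulative weight shares; roll is in [0, 1).
    public static String select(Map<String, Integer> weights, double roll) {
        int total = weights.values().stream().mapToInt(Integer::intValue).sum();
        double cumulative = 0.0;
        String last = null;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            cumulative += (double) e.getValue() / total;
            last = e.getKey();
            if (roll < cumulative) {
                return e.getKey();
            }
        }
        return last; // guard against floating-point edge cases near roll = 1.0
    }
}
```

With weights 70/30 across two regions, rolls below 0.7 land on the first record and the rest on the second — roughly a 70/30 traffic split over many requests.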


19. What is an active-active vs active-passive multi-region architecture?

  • Active-active: All regions serve traffic simultaneously; requires data sync.

  • Active-passive: One region serves all traffic; another is on standby for failover.


✅ ECS / Deployment

20. How do you handle blue-green deployment on ECS?

  • Use AWS CodeDeploy with ECS and ALB:

    1. Deploy new version (green) alongside old (blue).

    2. Register new tasks in ALB target group.

    3. Shift traffic after health checks.

    4. Roll back if issues arise.

Wednesday, July 23, 2025

What is Domain-Driven Design (DDD)?

DDD is a software design approach focused on modeling software based on the real-world domain it is intended to serve, using a ubiquitous language and breaking systems into bounded contexts.


💡 Key Principles of DDD

Principle           | Description
Ubiquitous Language | Shared vocabulary used by developers and domain experts
Bounded Context     | A logical boundary around a domain model (e.g., Order, Payment)
Entities            | Objects defined by identity and lifecycle (e.g., Customer, Invoice)
Value Objects       | Immutable objects defined by attributes, not identity (e.g., Address)
Aggregates          | Cluster of domain objects with one root (Aggregate Root)
Repositories        | Provide access to aggregates (e.g., CustomerRepository)
Domain Events       | Represent something that happened in the domain (OrderPlaced)
Services            | Domain logic that doesn't naturally belong to an entity or value object
Factories           | Encapsulate object creation logic

🧱 Building Blocks of DDD (Example in Java)

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Entity
public class Customer {
    private UUID id;
    private String name;
    private Address address; // value object
}

// Value Object
public class Address {
    private String street;
    private String city;
}

// Aggregate Root
public class Order {
    private UUID orderId;
    private List<OrderItem> items = new ArrayList<>();

    public void addItem(OrderItem item) {
        items.add(item);
    }
}

🔀 Bounded Contexts & Integration

Each bounded context represents a specific part of the domain with its own model. They integrate using:

Method        | Description
REST APIs     | Synchronous communication
Events (EDA)  | Asynchronous via Kafka/RabbitMQ
Shared Kernel | Shared code in tightly coupled domains

📌 Real-World DDD Scenario (E-commerce)

Bounded Context   | Description
Order Context     | Handles order creation and status
Inventory Context | Manages stock and warehouse data
Payment Context   | Deals with payment and refunds

“Each context has its own model, entities, services, and database. They interact via REST or Kafka events using a common ubiquitous language.”
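The OrderPlaced flow between the Order and Inventory contexts can be sketched in plain Java (no framework — a minimal illustration of an aggregate recording a domain event; all names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Domain event: records that an order was placed in the Order context.
class OrderPlaced {
    final UUID orderId;
    OrderPlaced(UUID orderId) { this.orderId = orderId; }
}

// Aggregate root that records events instead of calling other contexts directly.
public class Order {
    private final UUID orderId = UUID.randomUUID();
    private final List<Object> pendingEvents = new ArrayList<>();

    public void place() {
        // ...check domain invariants here (items present, payment authorized)...
        pendingEvents.add(new OrderPlaced(orderId)); // record what happened
    }

    public List<Object> pendingEvents() {
        return pendingEvents;
    }
}
```

An infrastructure layer would then publish the recorded events (e.g., to Kafka) so the Inventory context can reserve stock asynchronously, keeping the two contexts decoupled.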


🎯 Why Use DDD?

“We used DDD to model a complex domain with multiple teams working in parallel. It helped us isolate domain logic into bounded contexts, enabling clear ownership, better code organization, and easier scaling.”

✅ Benefits:

  • Aligns code with business

  • Improves maintainability

  • Scales well across teams

  • Enhances testability

❌ When Not to Use:

  • Simple CRUD apps or reporting systems

  • Domains not well understood


🛠️ Spring Boot DDD Best Practices

Practice                          | Tool/Pattern
Package by domain                 | com.company.order, com.company.payment
Use interfaces for repositories   | JpaRepository<Order, UUID>
Handle events via listeners       | @DomainEvent, @TransactionalEventListener
Avoid anemic models               | Put behavior inside entities
Use @Service, not fat controllers | Keeps domain logic clean

📘 Sample Package Structure

com.example.order
├── domain
│   ├── model
│   ├── service
│   ├── repository
│   └── event
├── application
├── infrastructure
└── api

🧠 Interview Answer Template

“We adopted DDD to model our logistics and finance systems separately. We defined clear bounded contexts and used domain events to enable async communication. For example, the OrderPlaced event in the Order context triggers inventory reservation in the Inventory context. This allowed us to scale teams independently and maintain high cohesion within each context.”