What is fault tolerance?

Fault tolerance is the ability of a workload to remain operational with zero downtime or data loss in the event of a disruption. In a fault-tolerant environment, instances of the same workload are typically hosted on two or more independent sets of servers. Put another way, a fault-tolerant system withstands subsystem failure and keeps doing the right thing within an established service-level agreement (SLA).

What is high availability?

High availability (HA) is a system's ability to keep functioning even when some components fail. It sustains uptime over an extended period by eliminating single points of failure. In AWS discussions, 99.999% uptime, also known as "five nines", is often cited as the benchmark for a highly available system.

High Availability vs Fault Tolerance

A fault-tolerant environment has no service interruption but costs significantly more, while a highly available environment accepts a minimal service interruption.

Fault tolerance relies on specialized hardware to detect a hardware fault and instantaneously switch to a redundant hardware component—whether the failed component is a processor, memory board, power supply, I/O subsystem, or storage subsystem. Although this cutover is apparently seamless and offers non-stop service, a high premium is paid in both hardware cost and performance because the redundant components do no processing. More importantly, the fault tolerant model does not address software failures, by far the most common reason for downtime.

High availability views availability not as a series of replicated physical components, but rather as a set of system-wide, shared resources that cooperate to guarantee essential services. High availability combines software with industry-standard hardware to minimize downtime by quickly restoring essential services when a system, component, or application fails. While not instantaneous, services are restored rapidly, often in less than a minute.

Many sites are willing to absorb a small amount of downtime with high availability rather than pay the much higher cost of providing fault tolerance. Additionally, in most highly available configurations, the backup processors are available for use during normal operation.

High availability systems are an excellent solution for applications that must be restored quickly and can withstand a short interruption should a failure occur. Some industries have applications so time-critical that they cannot withstand even a few seconds of downtime. Many other industries, however, can withstand small periods of time when their database is unavailable.
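The high-availability behaviour described above, where a backup takes over after a short detection delay rather than instantaneously, can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Node` and `ActiveStandbyPair` names, the 30-second heartbeat timeout, and the use of plain numeric timestamps are assumptions made for illustration, not part of any particular HA product.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # timestamp (seconds) of the most recent heartbeat


class ActiveStandbyPair:
    """Promotes the standby when the active node misses its heartbeat window."""

    def __init__(self, active: Node, standby: Node, timeout_seconds: float = 30.0):
        self.active = active
        self.standby = standby
        self.timeout_seconds = timeout_seconds  # hypothetical detection window

    def heartbeat(self, name: str, now: float) -> None:
        """Record that the named node reported in at time `now`."""
        for node in (self.active, self.standby):
            if node.name == name:
                node.last_heartbeat = now

    def _is_alive(self, node: Node, now: float) -> bool:
        return now - node.last_heartbeat <= self.timeout_seconds

    def check(self, now: float) -> str:
        """Fail over if the active node is silent and the standby is alive.

        Returns the name of the node that is active after the check.
        """
        if not self._is_alive(self.active, now) and self._is_alive(self.standby, now):
            self.active, self.standby = self.standby, self.active
        return self.active.name


if __name__ == "__main__":
    pair = ActiveStandbyPair(Node("primary", 0.0), Node("secondary", 0.0))
    print(pair.check(now=10.0))       # primary is still within its window
    pair.heartbeat("secondary", now=40.0)
    print(pair.check(now=45.0))       # primary has been silent for 45 seconds
```

The service is unavailable between the failure and the next check that notices the missed heartbeat. That gap is the short interruption a highly available design accepts and a fault-tolerant design pays to eliminate.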

High Availability & Fault Tolerance in AWS

High availability aims to keep systems continuously operable for desirably long periods of time. It is a firm requirement for enterprises because it protects the business against the risks brought by a system outage.

The availability of a system is commonly expressed as a number of "nines". 100% availability translates to 0 minutes of downtime in a year, which is an ideal benchmark and practically infeasible. Three-nines availability, 99.9%, allows about 8 hours and 46 minutes of downtime per year. Four-nines availability, 99.99%, allows about 52 minutes and 36 seconds of downtime per year. Five-nines availability, 99.999%, which is the accepted standard for emergency response systems and mission-critical operations, allows about 5 minutes and 15 seconds of downtime per year.
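The downtime figures above follow from simple arithmetic: the allowed downtime is the unavailable fraction multiplied by the length of a year. The sketch below assumes a year of 365.25 days, which appears to match the figures quoted; the function name `allowed_downtime` is made up for this example.

```python
from datetime import timedelta

# Assumption: one year is 365.25 days long.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def allowed_downtime(availability_percent: float) -> timedelta:
    """Return the yearly downtime budget for an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(seconds=SECONDS_PER_YEAR * unavailable_fraction)


if __name__ == "__main__":
    for label, percent in [
        ("three nines", 99.9),
        ("four nines", 99.99),
        ("five nines", 99.999),
    ]:
        print(f"{label} ({percent}%): {allowed_downtime(percent)} per year")
```

Each additional nine cuts the downtime budget by a factor of ten, which is why every step up the scale is markedly harder and more expensive to achieve.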

AWS Regions and Availability Zones

Amazon hosts its web services in multiple geographic Regions, with each AWS Region consisting of multiple Availability Zones and availability ranging from 99.9% to 99.999%.
Each AWS Region is designed to operate independently of the others. This isolation provides a high level of fault tolerance and stability for user and application workloads.
Availability Zones (AZs) within a Region are linked to one another by inexpensive, low-latency network connections. They are also connected to multiple Internet Service Providers (ISPs) and to different power grids.
You can safeguard your application against the failure of a single data center by deploying EC2 instances across several Availability Zones.
It is important to run independent application stacks in more than one Availability Zone, either in the same Region or in another Region, so that if one zone fails, the application in the other zone can continue to run.
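The benefit of running independent stacks in more than one Availability Zone can be estimated with the standard formula for redundant components: if any one replica is enough to serve traffic, the application is down only when all replicas are down at once. The sketch below assumes that zone failures are fully independent, which real deployments only approximate, and the 99% single-zone figure is an illustrative number, not a published AWS value.

```python
def combined_availability(single_zone_availability: float, zones: int) -> float:
    """Availability of `zones` independent replicas where one survivor suffices.

    Assumes failures are independent, so the chance that every replica is
    down at the same time is the product of the individual failure chances.
    """
    if not 0.0 <= single_zone_availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    if zones < 1:
        raise ValueError("at least one zone is required")
    return 1 - (1 - single_zone_availability) ** zones


if __name__ == "__main__":
    for zone_count in (1, 2, 3):
        result = combined_availability(0.99, zone_count)
        print(f"{zone_count} zone(s): {result:.6%}")
```

Under these assumptions, each extra zone adds roughly two more nines to a 99% stack, which is why spreading a workload across zones is usually the first step toward high availability.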

AWS Services Used to Achieve High Availability

AWS delivers high availability through a scalable, load-balanced cluster or an active-standby pair, among other approaches. Many AWS services are designed to be fault-tolerant and highly available. The following list includes some of them:

Amazon S3
Amazon SimpleDB
Amazon Relational Database Service (RDS)
Amazon Simple Queue Service (SQS)
Elastic Load Balancing (ELB)
Amazon Simple Notification Service (SNS)
Amazon Virtual Private Cloud (VPC)
Amazon Machine Image (AMI)

Advantages of Using AWS High Availability for Web Applications

AWS high availability for web applications provides you with the following benefits:

A network protected by a Web Application Firewall (WAF), which helps prevent common web exploits.
Business Continuity (BC) and Disaster Recovery (DR) provisions that help businesses resume operations with minimal disruption.
When an instance's hardware fails or is about to fail, AWS Auto Scaling detects the problem and automatically launches a new instance.
Cloud metrics that let you monitor the application closely, for example by the number of users on the application or the memory consumed by a particular instance.
New features or updates can be deployed without causing problems for current users.
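The self-healing behaviour mentioned in the list above can be pictured as a reconciliation loop: drop the instances that fail their health check, then launch replacements until the fleet is back at its desired size. The following is a pure-Python simulation and does not call any AWS API; the `reconcile` function and the `i-sim-` instance IDs are invented for this sketch.

```python
from itertools import count
from typing import Dict, Iterator


def reconcile(
    fleet: Dict[str, bool],
    desired_capacity: int,
    id_source: Iterator[int],
) -> Dict[str, bool]:
    """Return a new fleet with unhealthy instances replaced.

    `fleet` maps an instance ID to its health status (True means healthy).
    Replacement instances get simulated IDs drawn from `id_source`.
    """
    # Keep only the instances that passed their health check.
    healthy = {instance_id: True for instance_id, ok in fleet.items() if ok}
    # Launch simulated replacements until the desired capacity is met.
    while len(healthy) < desired_capacity:
        healthy[f"i-sim-{next(id_source)}"] = True
    return healthy


if __name__ == "__main__":
    current = {"i-a": True, "i-b": False, "i-c": True}
    print(reconcile(current, desired_capacity=3, id_source=count(1)))
```

Running the loop on a schedule keeps the fleet at its desired size without manual intervention, which is the essence of what the list attributes to AWS Auto Scaling.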

Q: A challenging AWS interview question

How would you design a highly scalable and fault-tolerant architecture for a web application on AWS?

Answer: To design a highly scalable and fault-tolerant architecture for a web application on AWS, you can consider the following key components:

1. Load Balancing: Use Elastic Load Balancing (ELB) to distribute traffic across multiple EC2 instances in different availability zones to achieve high availability and distribute the load efficiently.

2. Auto Scaling: Utilize Auto Scaling groups to automatically adjust the number of EC2 instances based on traffic demands. This ensures that your application can handle increased load and scale down during periods of low traffic.

3. Elastic Beanstalk or ECS: Use AWS Elastic Beanstalk or Amazon Elastic Container Service (ECS) to easily deploy, manage, and scale your web application without worrying about the underlying infrastructure.

4. Database: For a highly scalable and fault-tolerant database solution, consider using Amazon RDS with Multi-AZ deployment for automatic replication across availability zones, or Amazon DynamoDB for a fully managed NoSQL database.

5. Caching: Implement a caching layer using Amazon ElastiCache (Redis or Memcached) to improve performance and reduce load on your backend infrastructure.

6. Content Delivery: Utilize Amazon CloudFront, a global content delivery network (CDN), to distribute content to users with low latency and high data transfer speeds.

7. Data Backup and Recovery: Set up regular backups using services like Amazon S3, and implement disaster recovery mechanisms such as cross-region replication to ensure data durability and availability.

8. Monitoring and Logging: Use Amazon CloudWatch to monitor your application's performance, set up alarms, and collect logs for troubleshooting and analysis. Consider integrating with AWS Lambda for automated actions based on certain events or metrics.

9. Security: Implement appropriate security measures such as using Virtual Private Cloud (VPC) for network isolation, security groups for instance-level firewall rules, and AWS Identity and Access Management (IAM) for fine-grained access control.

This answer provides a high-level overview, but it's essential to dive deeper into each component and understand how they integrate to create a robust and scalable architecture.
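To make the first two components more concrete, here is a minimal pure-Python simulation of load balancing across zones and of a scaling decision. It is a sketch under stated assumptions and does not use any AWS API: the `Instance`, `RoundRobinBalancer`, and `desired_capacity` names are hypothetical, round-robin is only one possible routing policy, and the proportional scaling rule is a simplification chosen for illustration rather than the algorithm AWS Auto Scaling actually uses.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Instance:
    instance_id: str
    zone: str
    healthy: bool = True


class RoundRobinBalancer:
    """Cycles through instances and skips any that are marked unhealthy."""

    def __init__(self, instances: Iterable[Instance]):
        self._instances = list(instances)
        self._next = 0

    def route(self) -> Instance:
        # Try each instance at most once per request.
        for _ in range(len(self._instances)):
            candidate = self._instances[self._next]
            self._next = (self._next + 1) % len(self._instances)
            if candidate.healthy:
                return candidate
        raise RuntimeError("no healthy instances available")


def desired_capacity(
    current: int,
    cpu_percent: float,
    target_cpu_percent: float,
    minimum: int,
    maximum: int,
) -> int:
    """Scale the fleet in proportion to load, clamped to [minimum, maximum]."""
    proposed = math.ceil(current * cpu_percent / target_cpu_percent)
    return max(minimum, min(maximum, proposed))


if __name__ == "__main__":
    fleet = [
        Instance("web-1", "zone-a"),
        Instance("web-2", "zone-b"),
        Instance("web-3", "zone-c"),
    ]
    balancer = RoundRobinBalancer(fleet)
    fleet[0].healthy = False  # simulate an outage in zone-a
    print([balancer.route().instance_id for _ in range(4)])
    print(desired_capacity(4, cpu_percent=90, target_cpu_percent=60,
                           minimum=2, maximum=10))
```

Traffic keeps flowing to the surviving zones while the failed instance is skipped, and the scaling rule adds capacity when average CPU runs above the target. The minimum and maximum bounds guard against scaling to zero or growing without limit.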

(Continued..)

Compiled by: Azizul maqsud

Reference: https://www.ibm.com/docs/en/powerha-aix/7.2?topic=aix-high-availability-versus-fault-tolerance

https://hevodata.com/learn/aws-high-availability/

Fault Tolerance & High Availability in the Cloud: AWS caters perfectly with its High Efficiency

Both approaches detect failures, but a fault-tolerant system typically costs more than a highly available one because of its physical redundancy.