High Availability

Are technical issues bringing down your website? Learn how to integrate disaster recovery into the design and test the solution for ease of operational management.

Selecting Highly Available Technology

The cloud provides cost-efficient technology to deliver high availability. However, the solution design still needs to leverage those capabilities.

Automated Hardware Deployment

Software automation has been around for many years. However, the cloud extends that concept to servers, networks, and firewalls. It supports rapid deployment and teardown of environments, making it useful for prototyping and sophisticated testing. Those tests include disaster recovery, fully loaded failover, and server right-sizing.
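As an illustration, here is a minimal sketch of deploying and then clearing a short-lived test environment, assuming an AWS account, the boto3 library, and a placeholder machine image of the web server:

```python
import boto3

# Assumed region and machine image; substitute your own values.
ec2 = boto3.resource("ec2", region_name="us-east-1")

def deploy_test_environment(count=2):
    """Launch short-lived servers for a disaster-recovery or right-sizing test."""
    instances = ec2.create_instances(
        ImageId="ami-0123456789abcdef0",   # hypothetical image of the web server
        InstanceType="t3.small",
        MinCount=count,
        MaxCount=count,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "purpose", "Value": "dr-test"}],
        }],
    )
    for instance in instances:
        instance.wait_until_running()
    return instances

def clear_test_environment(instances):
    """Tear the environment down again so it stops accruing cost."""
    for instance in instances:
        instance.terminate()
```

Because the whole environment comes from a script, the same deployment can be repeated for every test cycle instead of maintaining long-lived test hardware.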

Cost Per Change

The cloud reduces the costs to update and tune solutions. That encourages software maintenance, security patching, and design tuning. These significantly enhance availability.

Web Server Software

The dominant web server software packages by market share are Apache, NGINX, and Microsoft IIS, as shown in the following chart. Apache is the most flexible because of its large number of configuration options. NGINX is faster and often used as a proxy in front of the web server. IIS has higher running costs because of Windows licensing fees and because the operating system lacks capabilities, such as process forks, that make solutions less efficient.
Market Share of Web Server Software.

Operating Systems

Linux and its distributions have over 80% of the web server market, as shown in the following chart. The open-source concept behind Linux led to a more vibrant community. That resulted in transparent architecture, kernel tracing tools, lower resource utilization, and other capabilities that make it easier to fix and stabilize applications. However, Windows still has better support for less common software drivers. The free licensing and large install base of Debian, CentOS, and Ubuntu make it practical to run production without support contracts. Windows and Red Hat have associated fees, while other systems lack the install base to maintain without a support agreement.
Operating System Market Share.

Server Replacement

Cloud solutions make it possible to replace a failed server on demand. That reduces cost by eliminating the need for server redundancy. It also reduces costs by supporting repeated prototyping and deployment testing.
Automated Website Repair.
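A minimal sketch of on-demand replacement follows, again assuming boto3 and a stored disk image; the health-check URL and image ID are placeholders:

```python
import urllib.request
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

def is_healthy(url, timeout=5):
    """Return True when the web server answers its health-check URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False

def replace_server(failed_instance_id, image_id="ami-0123456789abcdef0"):
    """Terminate a failed server and boot a fresh copy from its disk image."""
    ec2.Instance(failed_instance_id).terminate()
    (replacement,) = ec2.create_instances(
        ImageId=image_id, InstanceType="t3.small", MinCount=1, MaxCount=1
    )
    replacement.wait_until_running()
    return replacement
```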

Multiregional Data Centers

Many cloud providers own data centers located around the world. The contractual agreements and identical user interfaces for each site make it easy to optimize deployment locality.

Multi-Site Redundancy

Redundancy provides multiple components with excess capacity so the system survives partial failures. Redundant nodes are robust to the broadest range of issues when they use a share-nothing architecture, because the failure of one node has no impact on the others. Always sending user load to every node reduces the percentage of users affected by a crash, confirms that every recovery node is working, and reduces the changes needed to complete a failover. Each of these points increases availability. The design also delivers faster response times.
Architecture Diagram for a Multi-Location Website.
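The following sketch illustrates the idea of always sending traffic to every node. The regional hostnames are hypothetical, and a production setup would normally do this with DNS or a global load balancer rather than in client code:

```python
import itertools
import urllib.request

# Hypothetical standalone website instances, one per region (share-nothing).
NODES = [
    "https://us.example.com",
    "https://eu.example.com",
    "https://ap.example.com",
]
_rotation = itertools.cycle(NODES)

def fetch(path, attempts=len(NODES)):
    """Spread live traffic across every node and fail over when one is down."""
    for _ in range(attempts):
        node = next(_rotation)
        try:
            with urllib.request.urlopen(node + path, timeout=5) as response:
                return response.read()
        except OSError:
            continue  # this node is down; the remaining nodes absorb its share
    raise RuntimeError("all nodes failed")
```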

Data Center Connectivity

Adding network connections to a data center enhances availability. It simplifies network restructuring, recovery from router failures, and mitigation of denial-of-service attacks. Data center connections run over the multi-route Border Gateway Protocol (BGP) network. Websites have anywhere from a single BGP connection to over 100, as shown in the following chart.
The number of Network Connections into a Data Center.

Recovering from Disasters

Designing websites for recovery makes them less likely to crash, restores operations more quickly, and increases the probability of a successful recovery. It goes beyond taking backups to testing the following scenarios.

Malware and Ransomware

Most sites lack the encrypted communication, server hardening, and secure hosting needed to deter basic attacks. Detection of malware can take months, resulting in corrupted backups. Content Management Systems (CMS) like WordPress make it challenging to separate application code from customizations and malicious content.

Human Errors

Human errors, especially during deployment, are the most frequent cause of outages. Automation enables the rapid deployment of environments, allowing for extensive prototyping and detailed testing. It becomes easy to check rollover to production, rollback from failure, disaster recovery, system overloading, and security hardening. Automation allows companies to move human errors from production to the test environment.

Hardware Failure

The probability of server failure in the first year is 5% and increases after that. Routers and other equipment can also fail. Virtualization simplifies hardware recovery by supporting rapid replacement. Examples include replacing a server, booting the disk image on a newer server model, and replacing disks while the website is running.
High Availability Checklist.

Software Failure

Accelerated software updates result from faster changes to business requirements and more frequent security patching. Stabilizing the solution requires more careful update controls. Automated testing provides a rapid way to detect failures. Monitoring detects evolving issues and reduces the time between an outage and its detection. Embedding rollback into the design recovers from failures that testing did not catch.
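As one way to embed rollback into the design, here is a hedged sketch that rolls out a release, runs a health check, and reverts on failure. The deploy script and health-check URL are hypothetical placeholders:

```python
import subprocess
import urllib.request

def healthy(url="https://www.example.com/health"):
    """Hit the health-check page and report whether it answers normally."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except OSError:
        return False

def deploy(new_version, previous_version):
    """Roll out a release, then revert automatically if the health check fails."""
    subprocess.run(["./deploy.sh", new_version], check=True)           # hypothetical script
    if not healthy():
        subprocess.run(["./deploy.sh", previous_version], check=True)  # automatic rollback
        raise RuntimeError(f"{new_version} failed its health check; rolled back")
```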

Data Center Outage

Data center outages result from internal and external issues. Planning to recover from these scenarios involves migrating operations to other data centers. Running standalone website instances in several regions offers the best protection.

Preventing Disasters

Disasters come from a variety of internal and external issues. Mitigating their risk requires contingency planning.
Planning for Disaster Recovery.

Recovery Testing

Recovery testing simulates failure and validates the resumption of operations. Techniques to simulate disasters include changes to network routing, shutting down a server, deleting disk content, and overloading systems. Cycling through repeated tests provides the working environment to tune the solution rapidly. Rerunning disaster recovery tests is more straightforward in the cloud due to the low cost and speed of deployments.
High Availability Solutions in the Cloud.
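One way to simulate a server failure is sketched below with boto3 against hypothetical instance IDs and a placeholder health-check URL; it stops a random node, confirms the site still answers, then restores the node:

```python
import random
import time
import urllib.request
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def site_up(url="https://www.example.com/health"):
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except OSError:
        return False

def simulate_server_failure(instance_ids):
    """Shut down one node at random and confirm the site keeps serving."""
    victim = random.choice(instance_ids)
    ec2.stop_instances(InstanceIds=[victim])
    time.sleep(30)                                # allow failover mechanisms to react
    survived = site_up()
    ec2.start_instances(InstanceIds=[victim])     # restore the environment afterwards
    return survived
```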

Fixing Logged Errors

Logged errors are a leading indicator of outages. Addressing them in a non-crisis mode leads to higher quality solutions and reduces the noise when an outage does happen. Reviews should cover logs from operating systems, web servers, databases, routers, firewalls, and other components. Periodic reviews should look at stability, security, performance, and monitoring. If an outage does happen, there will be less clutter masking the underlying issues, meaning the fix will take less time.
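A periodic review can start from a script like the following sketch, which counts recurring error lines; the log paths and keywords are assumptions to adapt to your stack:

```python
from collections import Counter
from pathlib import Path

# Hypothetical log locations; adjust for your web server and operating system.
LOG_FILES = ["/var/log/nginx/error.log", "/var/log/syslog"]

def summarize_errors(paths=LOG_FILES, keywords=("error", "warn", "crit")):
    """Count recurring error lines so they can be fixed in non-crisis mode."""
    counts = Counter()
    for path in paths:
        text = Path(path).read_text(errors="replace")
        for line in text.splitlines():
            if any(word in line.lower() for word in keywords):
                counts[line.split("]")[-1].strip()[:120]] += 1  # rough message key
    return counts.most_common(20)
```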
Mitigate Security Risks

Security Testing

Security testing should cover browser connections, infrastructure protection, and security validation. Fixing vulnerabilities reduces the probability of an attack because many weaknesses are highly visible.

Overloads

Overload testing establishes safe limits for user loads. Tuning for an overload ensures the system recovers immediately when the load subsides. Instead of hanging, it quickly rejects new requests when there is insufficient capacity. That increases throughput by concentrating resources on successful completion. It also helps protect against denial-of-service (DoS) attacks.
Building Technology for Endurance and to Recover from Overloads.
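A minimal sketch of this kind of load shedding uses a counting semaphore to cap in-flight requests; the limit itself should come from overload testing:

```python
import threading

MAX_IN_FLIGHT = 100                       # safe limit found through overload testing
_capacity = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def handle_request(process):
    """Reject work immediately when the server is at capacity instead of hanging."""
    if not _capacity.acquire(blocking=False):
        return 503, "Service overloaded, try again shortly"   # fast rejection
    try:
        return 200, process()   # resources stay focused on requests that can finish
    finally:
        _capacity.release()
```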

Endurance

Endurance testing verifies the absence of resource leaks such as memory and open files. Ideally, tests include both successful and unsuccessful requests. Leaks come from a variety of sources. For example, programs allocate and release memory, but it never goes back to the operating system. Testing for this condition establishes an appropriate setting for the life span of a web server process.
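A hedged sketch of watching for such leaks during a long test run, assuming the third-party psutil package and an NGINX-style worker process name:

```python
import time
import psutil   # third-party package, assumed installed on the web server host

def watch_for_leaks(process_name="nginx", samples=30, interval=60):
    """Sample total worker memory during a long mixed-load test; steady growth
    that never returns to the operating system suggests a leak."""
    history = []
    for _ in range(samples):
        rss = sum(
            proc.info["memory_info"].rss
            for proc in psutil.process_iter(["name", "memory_info"])
            if proc.info["name"] == process_name and proc.info["memory_info"]
        )
        history.append(rss)
        time.sleep(interval)
    return history
```

The growth curve from a run like this helps choose a sensible life span for the web server processes.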

Detecting Failure

The goal of monitoring is to identify issues before they cause an outage. It also reduces the time between failure and detection. Modern designs increasingly launch recovery scripts to minimize outage duration.

Availability

Measuring availability is the most common form of monitoring. Simple checks can run frequently to reduce the time between failure and detection. More involved checks ensure all the subsystems work, but calling them too frequently increases system load. Originating calls from thousands of miles away confirms the data center connects to the internet. More rigorous monitoring increases operational costs, but the goal remains minimizing the time between failure and detection.
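A minimal external availability probe might look like the following sketch; the URLs, interval, and alert hook are placeholders:

```python
import time
import urllib.request

SITES = ["https://www.example.com/", "https://www.example.com/health"]

def check_availability(urls=SITES, interval=60):
    """Cheap, frequent checks shorten the gap between failure and detection
    without adding much load."""
    while True:
        for url in urls:
            try:
                with urllib.request.urlopen(url, timeout=10) as response:
                    ok = response.status == 200
            except OSError:
                ok = False
            if not ok:
                alert(f"{url} failed its availability check")  # hypothetical alert hook
        time.sleep(interval)
```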

System Resources

System resources include CPU, disk, and memory. Triggering alerts based on expected loads helps identify impending performance issues. Solutions that reduce costs by removing excess hardware capacity are more sensitive to load fluctuations and have a greater need for this type of monitoring.
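A sketch of threshold-based resource alerts, assuming the psutil package; the limits are examples to tune against expected load:

```python
import psutil   # third-party package assumed to be installed

# Thresholds tuned to the expected load of a right-sized server.
LIMITS = {"cpu_percent": 85.0, "memory_percent": 90.0, "disk_percent": 90.0}

def resource_alerts():
    """Compare current CPU, memory, and disk usage against expected limits."""
    readings = {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }
    return {name: value for name, value in readings.items() if value > LIMITS[name]}
```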
Forensic Analysis of Security Logs.

Security Monitoring

Security monitoring flags infrequent administrative activity. Typical examples are administrative logins, content changes on the website, and updates to the DNS registrar. The goal is to detect uncommon, unexpected events that pose a significant risk.

Equipment

Equipment monitoring covers printers, routers, and firewalls in offices and data centers. It uses the Simple Network Management Protocol (SNMP). These monitors provide more meaningful alerts and detect issues before users call the help desk.
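A hedged sketch of an SNMP poll, assuming the classic synchronous high-level API of the pysnmp package and an SNMP v2c community string; the device address and credentials are placeholders:

```python
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

def poll_uptime(address, community="public"):
    """Query a router, firewall, or printer for its uptime over SNMP."""
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData(community, mpModel=1),          # SNMP v2c
        UdpTransportTarget((address, 161)),
        ContextData(),
        ObjectType(ObjectIdentity("SNMPv2-MIB", "sysUpTime", 0)),
    ))
    if error_indication or error_status:
        return None            # device did not answer; raise an alert
    return var_binds[0][1]     # timeticks since the device booted
```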