How do servers operate 24/7, 365 days a year without stopping?

This blog post explores the principles behind servers operating stably 24/7, 365 days a year without interruption. Discover how hardware and software technologies enable uninterrupted service.

Many internet services are in active operation and development today. Representative examples include social networking services like Facebook and Twitter. These services give users platforms for communication and information sharing and have established themselves as essential tools in modern society. Through them, users share news in real time and experience global connectivity. Online games and mobile games are also internet services. They have evolved beyond simple entertainment into platforms where users worldwide compete and collaborate in real time. Mobile games, in particular, have gained explosive popularity because they can be played anytime, anywhere.
Many users have likely encountered situations where site access is slow, error pages appear, or messages indicate the server is under maintenance. This causes significant inconvenience to users, and if such issues occur at critical moments, it can lead to a loss of user trust. Users often describe these situations as ‘the server is down’. Why does the server suffer and eventually crash? And why does this result in users being unable to access the services they want?
To answer these questions, we first need to understand the role of servers, the core of internet services. Servers are the central computer systems that provide services to users. They handle numerous user requests simultaneously, loading webpages and transmitting data. If a server malfunctions, users cannot properly use the service. For this reason, stable and reliable server operation is a critical factor in the success of internet services.
Non-stop operation technology, as the name implies, is technology that provides internet services without interruption 24 hours a day, 365 days a year. Users of a service where this technology is well implemented can access it whenever they want. This maximizes user convenience and is also essential for service providers to maintain stable revenue. As a rough model, the revenue of an internet service is proportional to the product of the service’s uptime and the number of users connected simultaneously. In other words, a company can boost revenue by increasing either the service’s uptime or the number of concurrent users. The latter depends on marketing and on how the service is strategically designed, while the former is a challenge that engineers must address.
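The revenue relationship described above can be sketched in a few lines of Python. The function names and the sample figures (99.9% uptime, 1,000 concurrent users) are illustrative assumptions, not numbers from any real service.

```python
HOURS_PER_YEAR = 24 * 365  # a non-leap year

def downtime_hours_per_year(uptime_fraction: float) -> float:
    """Hours per year a service is unavailable at the given uptime fraction."""
    return HOURS_PER_YEAR * (1 - uptime_fraction)

def relative_revenue(uptime_fraction: float, concurrent_users: int) -> float:
    """Revenue in arbitrary units: proportional to uptime times concurrent users."""
    return uptime_fraction * concurrent_users

# Hypothetical service: 99.9% uptime with 1,000 concurrent users.
print(round(downtime_hours_per_year(0.999), 2), "hours of downtime per year")
print(round(relative_revenue(0.999, 1000), 1), "revenue units")
```

Under this simple model, even 99.9% uptime still means several hours of lost service, and lost revenue, each year.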
Non-stop operation technology broadly falls into two categories: hardware-based and software-based. In internet services, the program running on the server computer is called the server application. Here, the server computer is the hardware, and the server application is the software. Hardware-based non-stop operation technology refers to techniques that keep an ordinary server computer from stopping by adding safeguards to the machine itself. Software-based non-stop operation technology refers to techniques that keep an ordinary server application from stopping by adding safeguards to the program and the way it is run.
How can we create a server computer that never stops? One method is to connect CPUs or hard disks in parallel. To see why this helps, start with how a computer represents data. Computers can only handle 0s and 1s, so numbers are represented in the binary system. Each character also corresponds to a specific number under a scheme called ASCII code: the uppercase letter A is the number 65, and B is the number 66. In this way, all characters and numbers can be represented using 0s and 1s.
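As a small illustration of this encoding, the Python sketch below prints the ASCII code and the 8-bit pattern for two letters. The helper name to_bits is made up for this example; ord and format are Python built-ins.

```python
def to_bits(ch: str) -> str:
    """Return the 8-bit binary pattern of a character's ASCII code."""
    return format(ord(ch), "08b")

for ch in "AB":
    # character, its ASCII number, and the 0s and 1s the computer stores
    print(ch, ord(ch), to_bits(ch))
```

In standard ASCII this prints 65 and 01000001 for A, and 66 and 01000010 for B.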
Occasionally, computers experience unintended flipping of 0s and 1s. When this happens, the computer may freeze because the intended number or character changes. Components prone to this issue are the CPU and hard disk. Simply put, the CPU is the component that performs arithmetic operations, and the hard disk is the component that stores the results. Since the CPU may reuse data stored on the hard disk, both components must produce correct results to provide normal internet services.
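To show how much damage a single flipped bit can do, here is a minimal Python sketch. The function flip_bit is a hypothetical helper that simulates the fault in software; real bit flips happen in hardware without any such call.

```python
def flip_bit(ch: str, position: int) -> str:
    """Flip one bit of a character's code; position 0 is the lowest bit."""
    return chr(ord(ch) ^ (1 << position))

original = "A"
damaged = flip_bit(original, 1)  # one 0 becomes a 1
print(original, "->", damaged)
```

Here flipping a single bit turns the letter A into the letter C, so any program relying on that character now works with the wrong data.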
Two are better than one, so connect two CPUs in parallel. For any given operation, perform the calculation on both CPUs and compare the results. If the results differ, one of them has a problem, so the operation is retried. Suppose the probability of an error occurring in one CPU is 10%, and that the two CPUs fail independently of each other. The results from the two CPUs differ in two cases: (true, false) and (false, true). Because a mismatch triggers a retry, these two cases are not problematic. However, if the result is (false, false), unfortunately, the computer will halt. Yet the probability of such a halt is only 1%, which is 10% squared. Since the actual error probability of a CPU is far smaller than 1%, connecting two, three, or more CPUs in parallel makes it extremely rare for the computer to halt due to a CPU issue.
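The arithmetic and the compare-and-retry idea above can be sketched in Python. This is a simplified simulation, not how real redundant hardware is built: the names are hypothetical, the two CPUs are modeled as two calls to the same function, faults are assumed to be independent, and a fault is modeled as flipping the lowest bit of the result.

```python
from typing import Callable, Iterable

def outcome_probabilities(p: float) -> dict[str, float]:
    """Chance of each outcome when two CPUs each fail with probability p,
    assuming the failures are independent."""
    return {
        "both_correct": (1 - p) ** 2,
        "mismatch_so_retry": 2 * p * (1 - p),
        "both_wrong": p ** 2,
    }

def make_flaky_adder(a: int, b: int, faults: Iterable[bool]) -> Callable[[], int]:
    """Simulated CPU operation: each call adds a and b. A True entry in
    `faults` corrupts that call by flipping the lowest bit of the result."""
    schedule = iter(faults)

    def compute() -> int:
        corrupted = next(schedule, False)  # no more entries means no fault
        return (a + b) ^ 1 if corrupted else a + b

    return compute

def run_and_compare(compute: Callable[[], int], max_attempts: int = 5) -> int:
    """Run the operation twice per attempt and retry while the results differ."""
    for _ in range(max_attempts):
        first, second = compute(), compute()
        if first == second:
            return first
    raise RuntimeError("results never agreed")

print(outcome_probabilities(0.10))
# One fault on the first call: the mismatch is caught and the retry succeeds.
print(run_and_compare(make_flaky_adder(2, 3, [True, False])))
```

With a 10% error rate the two results disagree about 18% of the time and are both wrong about 1% of the time. Only the second case is dangerous, because a mismatch is simply retried.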
The same applies to hard disks. Data stored on a hard disk can also change from 0 to 1 or from 1 to 0 at any moment. Typically, a hard disk has a built-in function to determine whether data is normal or abnormal. For example, for each sequence of ten bits it stores how many 1s the sequence contains. When the computer reads this section, it compares the current count of 1s with the stored count. If they differ, it recognizes the data as abnormal. However, a standard hard disk has no way to recover the original data on its own. Therefore, server computers employing non-stop operation technology install multiple hard disks and store the same data on all of them. When the CPU requests stored data and one copy turns out to be abnormal, the system replaces it with a normal copy from another disk before passing the data to the CPU.
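The detect-then-repair scheme described above can be sketched in Python. This is a toy model under stated assumptions: each disk is a list of blocks, each block keeps its bits together with a stored count of 1s, and all names (write_block, is_intact, read_mirrored) are invented for this example. Real disks and mirrored storage use stronger checks than a simple count.

```python
def write_block(bits: str) -> dict:
    """Store the bits together with the number of 1s they contain."""
    return {"bits": bits, "ones": bits.count("1")}

def is_intact(block: dict) -> bool:
    """A block is considered normal if its current count of 1s matches the stored count."""
    return block["bits"].count("1") == block["ones"]

def read_mirrored(disks: list[list[dict]], index: int) -> str:
    """Read one block from mirrored disks, repairing damaged copies from a good one."""
    good = next((disk[index] for disk in disks if is_intact(disk[index])), None)
    if good is None:
        raise IOError("every copy of this block is damaged")
    for disk in disks:
        if not is_intact(disk[index]):
            disk[index] = dict(good)  # overwrite the abnormal copy
    return good["bits"]

disk_a = [write_block("1010011100")]
disk_b = [write_block("1010011100")]
disk_a[0]["bits"] = "1010011101"  # simulate one flipped bit on disk A

print(read_mirrored([disk_a, disk_b], 0))  # data comes from the good copy
print(is_intact(disk_a[0]))                # disk A has been repaired
```

Note that a plain count of 1s cannot notice two flips that cancel each other out (one 0 becoming 1 and one 1 becoming 0), which is one reason practical systems rely on stronger checks.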
How should we build server applications that never stop? A server application stops for two main reasons. The first is errors in the program. The second is updates to the server application when features are added to the internet service. Accordingly, there are two remedies: first, use verification programs that can detect errors in the program early; second, run the server application on multiple server computers. Program errors are a persistent problem dating back to the earliest computers and are not unique to internet services. Specialized verification programs therefore exist to detect program errors early, and they can prevent server application crashes to a certain extent.
To add a new feature to an operational internet service, the server application must be stopped and restarted with the new feature applied. Assuming that the old and new versions cannot run simultaneously on a single server computer, the old application must be shut down before the new one is started. Users cannot access the service between the shutdown and the moment the new application starts. But what if we run the server application on multiple server computers? This solves the problem, because while one server application is down for the upgrade, another can handle the workload. However, it introduces a new challenge: implementing seamless communication between multiple server computers.
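The idea of upgrading one server computer at a time while the others keep working can be sketched as follows. This is a minimal simulation, assuming each server is just a dictionary with a version and an online flag; the function name rolling_upgrade and the version strings are hypothetical.

```python
def rolling_upgrade(servers: list[dict], new_version: str) -> list[int]:
    """Upgrade servers one by one and record how many stay online at each step."""
    online_during_upgrade = []
    for server in servers:
        server["online"] = False          # stop the old application
        online_during_upgrade.append(sum(s["online"] for s in servers))
        server["version"] = new_version   # apply the new feature
        server["online"] = True           # restart with the new version
    return online_during_upgrade

servers = [{"version": "1.0", "online": True} for _ in range(3)]
print(rolling_upgrade(servers, "2.0"))   # servers still online at each step
print([s["version"] for s in servers])
```

With three servers, two remain online at every step, so users always have somewhere to connect. With a single server the count would drop to zero during the upgrade.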
One technology used for this is ‘load balancing’. Load balancing distributes incoming tasks evenly across multiple servers so that no single server becomes overloaded. This technology is particularly crucial for large-scale internet services. For example, during events where millions of users access the system simultaneously, failing to balance the load properly significantly increases the risk of server crashes. Load balancing is therefore an essential element of non-stop operation technology.
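One simple way to distribute requests evenly is round-robin, where servers take turns. The Python sketch below is an assumption-laden illustration, not the method any particular service uses: the class RoundRobinBalancer and its methods are invented for this example, and servers are represented only by their names.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hand out servers in turn, skipping any that are marked as down."""

    def __init__(self, servers: list[str]):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server: str) -> None:
        self.healthy.discard(server)

    def mark_up(self, server: str) -> None:
        if server in self.servers:
            self.healthy.add(server)

    def pick(self) -> str:
        """Return the next healthy server, checking each server at most once."""
        for _ in range(len(self.servers)):
            server = next(self._ring)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")

balancer = RoundRobinBalancer(["a", "b", "c"])
print([balancer.pick() for _ in range(4)])  # requests rotate across servers
balancer.mark_down("b")                     # e.g. "b" is being upgraded
print([balancer.pick() for _ in range(3)])  # "b" is skipped
```

Marking a server as down while it is being upgraded, then marking it up again, is how a balancer like this lets the remaining servers absorb the workload.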
To implement non-stop operation technology, all the techniques mentioned above must be applied as a foundation. Services like Facebook, Twitter, and Instagram, which many people use, already apply techniques like these. Even so, the result isn’t perfect. Even if CPUs or hard disks are connected in parallel, the possibility of errors still exists, albeit at a lower rate than before. Furthermore, verification programs cannot detect every problem within a server application. For this reason, the services mentioned above and many other development companies have developed and applied their own proprietary non-stop technologies. Have you ever seen Facebook’s service go down while using it? While I’ve often experienced issues like images loading slowly or posts taking a long time to publish, I’ve never personally encountered Facebook’s servers crashing. This is likely the result of Facebook’s own non-stop operation technology.
Such non-stop operation technology is directly linked to a company’s competitiveness. Uninterrupted service builds user trust, which ultimately plays a crucial role in securing loyal users. Conversely, services that frequently go down can accelerate user churn. Therefore, until internet service providers can consistently deliver uninterrupted service and users can consistently receive it, more research is needed on non-stop operation technology. Through this, we can build a more stable and reliable internet environment.

 
