A distributed system is a collection of computers, also known as nodes, that collaborate to perform a specific task or provide a service.
These nodes are physically separate and communicate with each other by passing messages over a network. Distributed systems can span geographical boundaries, enabling them to utilize resources from different locations.
Distributed systems have several characteristics that distinguish them from traditional centralized systems:
The computers in a distributed system are physically separate and connected via a network. They do not share memory or a common clock.
From an external perspective, a distributed system appears as a single, unified entity to the end user.
Distributed systems offer the flexibility to add or remove computers from the system.
The nodes in a distributed system need to coordinate and agree with each other to perform actions consistently.
Nodes in a distributed system can fail independently, and messages can be lost or delayed over the network.
Distributed systems are ubiquitous in our daily lives. Examples include large web applications like Google Search, online banking systems, multiplayer games, etc. These systems leverage the power of multiple computers working together to provide a seamless and responsive user experience.
In this post, we’ll explore the benefits and challenges of distributed systems. We will also discuss common approaches and techniques used to address these challenges and ensure the reliable operation of distributed systems.
Distributed System – Definition
Also known as distributed computing and distributed databases, a distributed system is a collection of independent components located on different machines that share messages with each other in order to achieve common goals.
As such, the distributed system will appear as if it is one interface or computer to the end-user. The goal is that together, the system can maximize resources and information while tolerating failures, so that if one component fails, the availability of the service is not affected.
Today, data is more distributed than ever, and modern applications no longer run in isolation. The vast majority of products and applications rely on distributed systems.
Elements of a Distributed System
The most important functions of distributed computing are:
Resource sharing – whether it’s the hardware, software or data that can be shared
Openness – how openly the software is designed to be developed, extended and shared
Concurrency – multiple machines can process the same function at the same time
Scalability – how computing and processing capacity grows when extended to many machines
Fault tolerance – how easily and quickly failures in parts of the system can be detected and recovered from
Transparency – how easily one node can locate and communicate with other nodes in the system.
Modern distributed systems have evolved to include autonomous processes that might run on the same physical machine, but interact by exchanging messages with each other.
Distributed System Examples
Networks
The earliest examples of distributed systems appeared in the 1970s, when Ethernet was invented and LANs (local area networks) were created. For the first time, computers could send messages to other machines on the same local network. Peer-to-peer networks evolved, and e-mail and then the Internet as we know it remain the biggest, ever-growing examples of distributed systems. As the internet moved from IPv4 to IPv6 and kept growing, distributed systems evolved from being “LAN”-based to “Internet”-based.
Telecommunication networks
Telephone and cellular networks are also examples of distributed networks. Telephone networks have been around for over a century, and they started as an early example of a peer-to-peer network. Cellular networks are distributed networks with base stations physically distributed in areas called cells. As telephone networks have evolved into VoIP (voice over IP), they continue to grow in complexity as distributed networks.
Distributed Real-time Systems
Many industries use real-time systems that are distributed locally and globally. Airlines use flight control systems, Uber and Lyft use dispatch systems, manufacturing plants use automation control systems, logistics and e-commerce companies use real-time tracking systems.
Parallel Processing
There used to be a distinction between parallel computing and distributed systems. Parallel computing was focused on how to run software on multiple threads or processors that accessed the same data and memory. Distributed systems meant separate machines with their own processors and memory. With the rise of modern operating systems, processors and cloud services these days, distributed computing also encompasses parallel processing.
Distributed artificial intelligence
Distributed Artificial Intelligence is a way to use large-scale computing power and parallel processing to learn from and process very large data sets using multiple agents.
Distributed Database Systems
A distributed database is a database that is located over multiple servers and/or physical locations. The data can be either replicated or partitioned across systems.
Most popular applications use a distributed database and need to be aware of the homogeneous or heterogeneous nature of the distributed database system.
A homogeneous distributed database means that each system has the same database management system and data model. Such databases are easier to manage, and performance scales by adding new nodes and locations.
Heterogeneous distributed databases allow for multiple data models and different database management systems. Gateways are used to translate data between nodes; such setups usually arise as a result of merging applications and systems.
Distributed System Architecture
Distributed systems must have a network that connects all components (machines, hardware, or software) together so they can transfer messages to communicate with each other.
That network could be an IP network, a set of direct cable connections, or even traces on a circuit board.
The messages passed between machines contain forms of data that the systems want to share like databases, objects, and files.
How messages are communicated reliably (whether they are sent, received and acknowledged, and how a node retries on failure) is an important feature of a distributed system.
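As a rough illustration of that send/acknowledge/retry loop, here is a minimal Python sketch; FlakyTransport and send_with_retry are purely hypothetical names standing in for whatever real network layer and retry policy a system would use.

```python
import random
import time

class FlakyTransport:
    """Stand-in for a real network layer: it drops some messages at random."""
    def send(self, node, message):
        return random.random() > 0.3      # ~70% of sends come back acknowledged

def send_with_retry(node, message, transport, max_retries=3):
    """Send a message and wait for an acknowledgement, retrying on failure."""
    for attempt in range(1, max_retries + 1):
        if transport.send(node, message):
            return True                   # the receiver acknowledged this message
        time.sleep(0.1 * attempt)         # simple backoff before the next attempt
    return False                          # caller decides how to handle the failure

print(send_with_retry("node-B", {"type": "ping"}, FlakyTransport()))
```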
Distributed systems were created out of necessity as services and applications needed to scale and new machines needed to be added and managed. In the design of distributed systems, the major trade-off to consider is complexity vs performance.
To understand this, let’s look at types of distributed architectures, pros, and cons.
Types of Distributed System Architectures:
Distributed applications and processes typically use one of four architecture types below:
Client-server:
In the early days, distributed systems architecture consisted of a server acting as a shared resource, like a printer, database, or web server. It had multiple clients (for example, users behind computers) that decide when to use the shared resource, how to use and display it, how to change data, and how to send it back to the server. Code repositories like Git are a good example, where the intelligence is placed on the developers committing changes to the code.
Today, distributed systems architecture has evolved with web applications into:
Three-tier: In this architecture, the clients no longer need to be intelligent and can rely on a middle tier to do the processing and decision making. Most of the first web applications fall under this category. The middle tier acts as an agent that receives requests from clients (which may be stateless), processes the data and then forwards it on to the servers.
Multi-tier: Enterprise web services first created n-tier or multi-tier systems architectures. This popularized application servers that contain the business logic and interact with both the data and presentation tiers.
Peer-to-peer: There is no centralized or special machine that does the heavy lifting and intelligent work in this architecture. All the decision making and responsibilities are split up among the machines involved, and each can take on client or server roles. Blockchain is a good example of this.
Pros and Cons of Distributed Systems
Advantages of Distributed Systems:
The ultimate goal of a distributed system is to enable the scalability, performance and high availability of applications.
Major benefits include:
Unlimited Horizontal Scaling – machines can be added whenever required.
Low Latency – having machines geographically located closer to users reduces the time it takes to serve them.
Fault Tolerance – if one server or data centre goes down, others could still serve the users of the service.
Disadvantages of Distributed Systems:
Every engineering decision has trade-offs. Complexity is the biggest disadvantage of distributed systems. There are more machines, more messages, and more data being passed between more parties, which leads to issues with:
Data Integration & Consistency – synchronizing the order of changes to data and application state in a distributed system is challenging, especially when nodes are starting, stopping or failing.
Network and Communication Failure – messages may not be delivered to the right nodes, or may arrive in the incorrect order, which leads to a breakdown in communication and functionality.
Management Overhead – more intelligence, monitoring, logging and load balancing functions need to be added for visibility into the operation and failures of the distributed system.
Why is it Important to Learn Distributed Systems?
Over the years, I have realized that no matter what technology I try to learn, I not only learn it better but also enjoy it far more when I have properly met its prerequisites first.
I have also come to realize that in some niches the general fundamentals never really go away, and the better one understands those fundamentals, the better one can perform in the depths of that niche.
Given that distributed systems are at the backbone of:
Cloud computing
Blockchain technology
System design
Software architecture
… etc., I deeply feel we must understand distributed systems properly in order to understand everything that hinges upon them.
In software engineering interviews, system design questions often show up, especially in interviews for senior positions. Recently, even in interviews for more junior, mid-level engineering roles, many interviewers touch upon system design. So, understanding distributed systems is vital not only for architecting better software in our day-to-day jobs, but also for standing out as better engineers during the interview process and qualifying for higher positions in the corporate engineering hierarchy.
A Distributed Cloud System
Article Outline
In this blog post, we will be going over some basic but very crucial concepts of distributed systems. The main topics covered in this article include:
Part-0: Understanding Distributed Systems & Their Need
0-A. What is a Distributed System?
0-B. The Need for Distributed Systems
Part-I: Understanding the Design & Algorithms of Centralized Vs. Decentralized Distributed Systems
Centralized Vs. Decentralized DSs
Leader Election in Centralized DSs
Centralized Vs. Decentralized Failure Detection
Centralized Vs. Decentralized Distributed Consensus
Part-II: Understanding the Main Trade-Offs in Distributed Systems
5. Partition Tolerance & The CAP Theorem
6. Trade-Off: Availability Vs. Consistency
7. Quorums – Write & Read Quorums
8. Trade-Off: Latency Vs. Fault Tolerance
Part-III: Beginner-Friendly Misc. Topics in Distributed Systems
9. Consistent Hash Ring & Virtual Nodes
10. Distributed Time Synchronization Via Logical Clocks
So, without any further ado, let’s dive in! But before we look at the individual topics, we need to see what a DS really is.
Part-0: Understanding Distributed Systems & Their Need
0-A. What is a Distributed System?
A distributed system (abbreviated as “DS”), putting it as simply as I can, is a computer system in which more than one physical computer works together, communicating over the internet, to run one single software application.
In other words, the components of a single application are spread over multiple machines, but they are presented to us after being aggregated from multiple sources, appearing as if they come from a single source.
For Example: If we are using Facebook, and we open up our profile page, we see multiple components, like our timeline pictures, timeline videos, friends list, “people you may know” suggestions, past events and check-ins, etc. All of them are coming from different computer services (called microservices), but to us it appears as if a single server is serving us this entire web page — the timeline.
0-B: The Need for Distributed Systems
The population of the world is around 8 billion. The total number of people with access to smartphones and the internet is around 6.4 billion. This means one thing: more and more users accessing application software globally over the network. Hence, we need to make our app’s infrastructure scalable to accommodate all these users. Distributed systems help us achieve that.
There are several other benefits to using DSs too, like:
Partition-tolerant systems
Failure-tolerant systems
Highly available systems
Highly durable systems
Low latency, owing to multi-zone replication
… and the list goes on. Suffice it to say that none of the big tech software services — Microsoft, Meta, Amazon, Apple, Alphabet (Google), Netflix, Oracle — could exist if distributed cloud computing had not been invented.
Part-I: Understanding the Design & Algorithms of Centralized Vs. Decentralized Distributed Systems
01. Centralized Vs. Decentralized Distributed Systems
DSs can be centralized as well as decentralized. We must understand that distribution has got absolutely nothing to do with decentralization.
Centralization refers to the degree of “control” within the system, whereas distribution refers to having more than one physical machine, which may or may not be located in different geographic locations.
So, if one central entity exerts all the control in that system — like the central coordinator of the CNS (Central Nervous System) of human beings, i.e., the brain — then the system will be centralized regardless of its distribution status.
For Example: Imagine we have lots of users in the system, and we want to scale up the DB system. So, we create two copies of our DB, thereby replicating the data. Now, let us say that this system allows all three of the DB instances to serve read requests for all our clients. But if we put a constraint on the system that only DB instance “A” can accept write requests from users, and afterwards forwards that newly written data to the replicas “B” and “C” as well, then it is a centralized DS, where the central point of the DS is DB instance “A”.
DB server A, therefore, is called the “leader” of this DS. Centralized DSs are also often called “leader-based” DSs. Decentralized DSs are sometimes called “leaderless” DSs.
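To make that write path concrete, here is a tiny Python sketch of the same leader-based setup (A is the leader, B and C are read-only replicas); the DBNode class and its methods are illustrative assumptions, not any real database’s API.

```python
class DBNode:
    """Toy model of leader-based replication: only the leader accepts writes
    and then forwards them to its follower replicas."""

    def __init__(self, name, is_leader=False, followers=None):
        self.name = name
        self.is_leader = is_leader
        self.followers = followers or []
        self.data = {}

    def write(self, key, value):
        if not self.is_leader:
            raise RuntimeError(f"{self.name} is a follower; send writes to the leader")
        self.data[key] = value
        for replica in self.followers:        # the leader pushes the change to B and C
            replica.apply_replicated(key, value)

    def apply_replicated(self, key, value):
        self.data[key] = value                # followers apply, but never originate, writes

    def read(self, key):
        return self.data.get(key)             # any replica may serve reads


# Usage: A is the leader, B and C are read-only replicas.
b, c = DBNode("B"), DBNode("C")
a = DBNode("A", is_leader=True, followers=[b, c])
a.write("user:1", "Alice")
assert b.read("user:1") == c.read("user:1") == "Alice"
```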
The Next Question: Now, the question is, how do we elect a leader for our DS in a centralized DS? Because that is the only way we can even start such a centralized DS!
02. Leader Elections in Centralized DSs
So, if there is a centralized DS, then there must be a way to pick the leader of that DS, right? We have algorithms for that. There are two reasons why we need an algorithm to pick the system’s leader:
The leader is prone to failures and crashes, so somebody should fill in as the new leader. After all, we must have a leader who can decide for everyone how to proceed.
The algorithm automates the choice of a new leader, i.e., if the leader crashes, the system won’t need any human intervention to choose a replacement.
The Problem: The problem, however, is that all the replicas (servers) in our DS often have the exact same power and the exact same resources (RAM, CPU, etc.), which is literally why they are called “replicas.” So, how do we choose the leader in such a situation?
Remember, the “old” leader has crashed. So, there is literally nobody to tell the “followers” what to do!
The Solution — Bully Algorithm: We can simply assign IDs to all the servers. The server with the biggest ID (the “big” guy, the bully!) automatically becomes the leader whenever there is no leader or the current leader has crashed. If it crashes in turn, the next biggest server detects that failure and becomes the leader.
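A minimal Python sketch of just the selection rule is below; the real Bully algorithm also exchanges election and coordinator messages between nodes, which this sketch deliberately leaves out, and the node IDs are made up.

```python
def elect_leader(node_ids, alive):
    """Bully-style selection rule: among the nodes believed to be alive,
    the one with the largest ID becomes the leader."""
    candidates = [n for n in node_ids if n in alive]
    if not candidates:
        raise RuntimeError("no live nodes left to elect")
    return max(candidates)


# Node 9 (the old leader) has crashed, so node 7 takes over.
nodes = [1, 3, 7, 9]
print(elect_leader(nodes, alive={1, 3, 7}))  # -> 7
```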
Note: There are other algos out there too for choosing a leader of a centralized DS, but for a beginner’s series, that would be a bit out of scope — I think. Therefore, we are not going into the details of those algos.
It is worth mentioning, though, that if we have a system where one server is disproportionately bigger than the others, it naturally becomes the leader, and an ID-based scheme might not be needed. We can hardcode into all the servers which replicas are the strongest; if the strongest one crashes, the second strongest naturally becomes the leader.
The Next Question: Now, the next question is, how do we even detect that our leader server has failed / crashed? Because that is the only way we can even start the new leader elections!
03. Centralized Vs. Decentralized Failure Detection in DSs
To start a new leader election in a centralized DS, we first need a way to detect that the leader has crashed. There are two ways we can go about it. But before we go into the details, we need to see what a “heartbeat” is.
Heartbeat: This refers to an “empty” or a “bogus” message sent by a computer service, say a server, to another computer service, say another server, to let it know that it is up and running and that its “heart” hasn’t stopped “beating” yet. The second service will therefore know that the first server — the leader of a centralized DS in our case — hasn’t crashed.
Now, let us see the two very famous methods of detecting failures.
Method # 01 — Centralized Failure Detection: Our leader can continually send heartbeats to a very highly available distributed service, say “ZooKeeper”, a coordination service built around a key-value store. As long as those heartbeats keep arriving, we know there is no need for a new election, as the leader is alive and healthy.
Method # 02 — Decentralized Failure Detection: The problem with the first approach is that it relies on ZooKeeper, which acts as a single point of failure (even though, as mentioned, it is itself a very highly available distributed service) and can therefore fail to detect crashes in the system. Another issue is that it is an external system: using it incurs extra costs for my client, and as an architect I would always like to reduce the costs my clients incur. So, what do we do then? Well, let’s make our system do what we ourselves should never do in our lives, especially with our colleagues in an office setting: gossip! Basically, we allow all the servers in the DS to send heartbeats to each other, so that they can all communicate or “gossip” with each other and keep track of all those that have crashed. This is called the “Gossip Algorithm.”
Note: Gossiping this way, verbosely, is an N-squared algorithm. It will consume a lot of memory and a lot of time in a large distributed system. Therefore, some optimization of gossiping is usually used, where each server talks to only a subset of servers in the entire system.
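As a rough sketch of the heartbeat bookkeeping each node would keep, whether the table lives in one central service or is gossiped around, here is a minimal Python failure detector; the class name, the timeout value and the node names are assumptions for illustration only.

```python
import time

class FailureDetector:
    """Minimal heartbeat table: a node is suspected dead if we have not heard
    from it within `timeout_s` seconds. In a gossip setup, every node keeps a
    table like this and periodically exchanges it with a few random peers."""

    def __init__(self, timeout_s=5.0):
        self.timeout_s = timeout_s
        self.last_heard = {}              # node name -> time of last heartbeat

    def record_heartbeat(self, node, now=None):
        self.last_heard[node] = time.time() if now is None else now

    def suspected_dead(self, now=None):
        now = time.time() if now is None else now
        return [n for n, t in self.last_heard.items() if now - t > self.timeout_s]


# The leader "A" stops sending heartbeats and gets flagged.
fd = FailureDetector(timeout_s=5.0)
fd.record_heartbeat("A", now=0.0)
fd.record_heartbeat("B", now=9.0)
print(fd.suspected_dead(now=10.0))        # -> ['A']
```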
The Next Question: Now, the next question is: if we have an alive and healthy leader (or if we don’t have a leader at all in our DS), how do we decide on things and reach a globally accepted consensus so that the system does not stay stuck?
04. Centralized Vs. Decentralized Distributed Consensus
In a distributed system, we have several instances of a single component running on many different servers. So, if more than one possibility is open to our system at any point, how do we reach a consensus on which path to take?
Centralized Consensus: For a centralized DS, the answer is crystal clear: the leader will decide everything.
But things become way more democratic and way more complex in a decentralized, leaderless distributed system where everybody truly acts like everybody else’s peer, as in global blockchain networks.
Decentralized Consensus: For decentralized DSs, some kind of “voting” has to be performed to decide in which direction the system should move. There are several algos, like:
Raft
Paxos
… which can help us reach consensus in a very democratic and non-authoritative way. For a beginner’s high-level blog series, we will DELIBERATELY avoid going into the details of these algos. Similarly, in the blockchain world, we use algorithms, like:
PoS (Proof-of-Stake) on Ethereum chain
DPoS (Delegated Proof-of-Stake) on the Hive chain
PoW (Proof-of-Work) on Bitcoin chain
… etc. to reach consensus about which block to accept and add to the existing chain of blocks. Once again, we will completely avoid diving into these algorithms, but I have tried to give the reader at least some awareness of the subject by sharing the problem and the names of the solutions.
Part-II: Understanding the Main Trade-Offs in Distributed Systems
05. Partition Tolerance & the CAP Theorem
Partition: In a DS, a partition can occur because of unexpected node crashes (a node here could be a single server, a group of servers, or a whole data center). We define a partitioned DS as one in which a certain part of the DS is not able to communicate with another part of the same DS.
Imagine we have 3 nodes, A, B and C, in our system. Nodes A and C are indirectly connected with each other via node B. Node B, however, is directly connected with both A and C. Now the only way A and C can communicate with each other and be aware of each other is via node B. If B goes down (crashes), there will be no way for A and C to communicate with each other. Hence, we say there is a “partition” in the system, as one node (or a group of nodes on one side of the DS) cannot communicate with the other node (or a group of nodes on the other side).
Partition Tolerance: If we allow our distributed system to operate regardless of having internal partitions, then our DS is a partition tolerant DS. If we completely shut down the DS in the face of partitions, then it means we are not partition tolerant, as we expect the entire system to be perfectly operational internally to be available to the external world.
The CAP Theorem: This is perhaps the MOST IMPORTANT aspect of distributed systems, as it lays a solid foundation for understanding the trade-off between two very important DS elements. This theorem says:
In a distributed, partition-tolerant system, we can either have very, very strong data consistency at the expense of availability or very, very high availability at the expense of data consistency, but not both at the same time.
To understand the theorem better, we need to understand consistency and availability at first.
Consistency: Consistency here refers to data consistency. If we have multiple replicas of our DB on different physical servers, then the data on all of them must be exactly the same, i.e., they must all be in a consistent data state.
Availability: Availability means that the system must be up and running. Meaning, if there is some node unavailable in the system, then the traffic should automatically get routed to the other nodes and the system should present itself as available.
The Next Question: The question now is, why the heck is there a trade-off between consistency and availability? What is the CAP theorem really trying to say?
06. Trade-Off: Availability Vs. Consistency
There is a fundamental trade-off between availability and consistency of a distributed data storage system.
Imagine we have a two-node, replicated distributed database, like DynamoDB or MongoDB NoSQL instances, that has become partitioned (the nodes can no longer reach each other). Now, if a client sends a write request to one of the two nodes, that node has two options:
Option-01 — Favour Consistency: The node can reject the client’s write request, presenting itself as unavailable, because it cannot replicate the data (to ensure consistency) to the other node, as it is not connected to it (there is a partition in the system). Given that it cannot replicate the data, accepting the write request would make the system’s data state inconsistent. So, we keep the state consistent by not allowing anybody to write anything. Such a system will be unavailable (for writing) on both nodes.
Option-02 — Favour Availability: The node can accept the write request, presenting itself as available, thereby making this particular replica of the DB the “latest truth” of the data, leaving the other replica in the “old” state, and leaving the whole DB system inconsistent. That’s why there is a trade-off.
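A toy Python sketch of these two options might look like the following; the ReplicaNode class, its prefer_availability flag, and the peer_reachable parameter are illustrative assumptions, not how any real database exposes this choice.

```python
class ReplicaNode:
    """Toy write handler during a partition. `prefer_availability` picks the
    AP behaviour (accept locally, reconcile later); otherwise the node behaves
    CP-style and rejects writes it cannot replicate."""

    def __init__(self, prefer_availability):
        self.prefer_availability = prefer_availability
        self.local_data = {}
        self.pending_replication = []     # writes to push once the partition heals

    def handle_write(self, key, value, peer_reachable):
        if peer_reachable:
            self.local_data[key] = value  # a real system would replicate synchronously here
            return "ok"
        if self.prefer_availability:      # AP: stay available, accept temporary divergence
            self.local_data[key] = value
            self.pending_replication.append((key, value))
            return "ok (will sync later)"
        return "rejected: cannot reach replica"  # CP: stay consistent, go unavailable


print(ReplicaNode(prefer_availability=False).handle_write("x", 1, peer_reachable=False))
print(ReplicaNode(prefer_availability=True).handle_write("x", 1, peer_reachable=False))
```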
Partitions are Inevitable: We cannot ensure both availability and consistency simultaneously, because that would mean having absolutely 0 probability of a partition. But partitions are commonplace; they are inevitable; nothing has 0 probability. We cannot avoid partitions. So, we always assume that our system will get partitioned sooner or later because of some fault or crash, and we put mechanisms in place beforehand that determine whether our system chooses consistency or availability.
Eventual Consistency Model: Usually, most systems prefer availability to consistency. The data eventually gets replicated across all nodes, and the entire system becomes consistent over time. This is the eventual consistency model of distributed systems.
DynamoDB by AWS, for example, typically converges under its eventual consistency model within about a second. AWS S3, historically, could take several minutes to reach a globally consistent state (it has since moved to strong read-after-write consistency).
The Next Question: The question now is, how do we go about tuning the availability or consistency of the system? Well, that we can do through distributed “quorums”.
07. Quorums – Write & Read Quorums
A quorum refers to a subset of servers — or nodes — in our DS. Usually, such a subset has to agree over a certain action for it to be performed successfully.
Write Quorum: The write quorum refers to a subset of nodes that must replicate the incoming write request’s data before sending the confirmation to the client that the data has been written.
Read Quorum: The read quorum refers to a subset of nodes in our DS that must agree upon the consistency and correctness of a value (read from the DB) for the incoming read request of the client before sending the response back to the client.
Effects of Having Large Quorums: If we have large read and write quorums, they will have the following effects on our system:
Strong Consistency: Replicate data internally to multiple nodes and reach a rather consistent state or read the same data from multiple replica nodes and then send back the most consistent and correct value
Increased Latency: Make the client wait for us to replicate data or to confirm the read value from our DB from multiple replicas
Possible Unavailability: If the write and read quorums are 50% of the nodes, and 90% of the nodes have crashed in our system, then the system presents itself as completely unavailable for writing and reading.
On the other hand, if the read and write quorums are very small, we will have the exact opposite effects: eventual consistency, low latency, high availability.
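Here is a minimal Python sketch of quorum-based writes and reads under these assumptions: N=3 replicas, a crashed replica modelled as None, and “most common answer” standing in for real version reconciliation.

```python
def write_with_quorum(replicas, key, value, write_quorum):
    """Write to every reachable replica; succeed only if at least
    `write_quorum` of them acknowledge. A crashed replica is modelled as None."""
    acks = 0
    for replica in replicas:
        if replica is not None:           # an unreachable replica simply never acks
            replica[key] = value
            acks += 1
    return acks >= write_quorum


def read_with_quorum(replicas, key, read_quorum):
    """Collect answers from live replicas until the read quorum is met, then
    return the most common answer (a stand-in for 'newest version wins')."""
    answers = [r.get(key) for r in replicas if r is not None][:read_quorum]
    if len(answers) < read_quorum:
        raise RuntimeError("not enough live replicas to satisfy the read quorum")
    return max(set(answers), key=answers.count)


# Usage with N=3 replicas, W=2, R=2, so W + R > N and reads overlap with writes.
replicas = [{}, {}, None]                 # the third replica has crashed
print(write_with_quorum(replicas, "k", "v1", write_quorum=2))   # -> True
print(read_with_quorum(replicas, "k", read_quorum=2))           # -> 'v1'
```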
The Next Question: How can the read and write quorums help us understand the fundamental trade-off between latency and fault tolerance? Let us see how we can adjust the quorum sizes to tune the latency and our tolerance for faults.
08. Trade-Off: Latency Vs. Fault Tolerance
The effects of changing the write and read quorums present a fundamental trade-off: we can either have low latency or high fault tolerance, but not both.
Imagine we have a replicated distributed database, like DynamoDB or MongoDB NoSQL instances. Now, if a client sends a write request to one of the nodes, that node has two options:
Option-01 — Low Latency, Low Fault Tolerance: The node can choose not to replicate the data internally before sending the write confirmation back to the client. In such a case we have very low latency, but given that our write quorum is only one, we run the risk of losing this data, as this replica might crash before replicating it.
Option-02 — High Fault Tolerance, High Latency: In this case the node first replicates the data internally and only then sends the write confirmation back to the client. In such a case we have rather high latency, but given that our write quorum is not just one server, we do not run the risk of losing this data; the replica has already copied it to a couple of other servers, and if it crashes, we can retrieve the data from one of them.
Hence, it is very important to control our quorums and strike an appropriate balance between latency and our tolerance for faults; getting it wrong leaves us with either needlessly slow writes or the risk of complete data loss.
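To see the trade-off in toy form, here is a Python sketch where the number of acknowledgements we wait for controls both the latency and how many copies exist before the client hears “done”; the function and the simulated replication delay are illustrative assumptions only.

```python
import time

def replicated_write(replicas, key, value, acks_required, replication_delay_s=0.05):
    """Write to replicas one by one and return once `acks_required` copies exist.
    The sleep stands in for network/replication delay, so more acks = more latency."""
    start = time.time()
    copies = 0
    for replica in replicas:
        replica[key] = value
        copies += 1
        time.sleep(replication_delay_s)   # pretend each replication hop takes time
        if copies >= acks_required:
            break
    return copies, round(time.time() - start, 3)


replicas = [{}, {}, {}]
print(replicated_write(replicas, "k", "v", acks_required=1))  # fast, but only 1 copy exists
print(replicated_write(replicas, "k", "v", acks_required=3))  # slower, 3 durable copies
```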
Part-III: Beginner-Friendly Misc. Topics in Distributed Systems
09. Consistent Hash Ring & Virtual Nodes
Oftentimes in a DS, we have to store static assets. Those static assets, like Netflix videos, can be very large (dozens of GBs). So, rather than storing the entire file on a single node, which may not even have enough space left, we split it into multiple chunks and spread those chunks across multiple nodes.
Problem — Retrieving Chunks of Data: Well, if we have a file spread across servers in chunks, how do we find it the next time? The answer is hashing. We hash each chunk, and the hash value (taken modulo the number of nodes) tells us which server holds that chunk.
Problem — Addition of New Nodes: If we add a new node, almost all of the previously placed data has to be re-mapped across the new set of nodes, so the system won’t scale. The same problem occurs if we remove a node from the system. Therefore, we use a consistent hash ring, where all the nodes are placed on a circular ring. The way a hash ring works is that the positions on the ring are marked within a fixed range, say 0 to 99, and each node takes responsibility for the positions that come after its predecessor node. This solves the following two problems:
Node Crashing: If a node crashes, we only remap its keys to the next server on the ring, moving in a clockwise fashion
Addition of Node: If we add a new node, only a subset of the successor’s keys will be remapped to this new node.
So, such a ring accommodates the addition and removal of nodes very well, thereby ensuring system scalability.
Problem — Unfair Load Balancing After Crashes: If there are only a few nodes in the system, or if one node controls a large share of the keys, then when it crashes it dumps a big burden onto a single successor on the ring. So, we place several virtual nodes for every physical node on the ring. Now, if a physical node crashes, its keys are spread across the multiple virtual nodes that follow its slices on the ring, and since each virtual node maps back to a specific real node, the load gets spread across several physical machines fairly evenly.
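Below is a small Python sketch of a consistent hash ring with virtual nodes, using the toy 0 to 99 ring from above; the class, the number of virtual nodes, and the MD5-based toy hash are assumptions for illustration, not a production implementation.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Sketch of a consistent hash ring with virtual nodes. Each physical node
    is placed on the ring `vnodes` times, so when a node joins or leaves, only
    the keys in its slices move, and they spread across several other nodes."""

    def __init__(self, nodes, vnodes=8):
        self.ring = []          # sorted list of (position, physical node)
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % 100  # toy 0-99 ring

    def add_node(self, node, vnodes=8):
        for i in range(vnodes):
            self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    def remove_node(self, node):
        self.ring = [(pos, n) for pos, n in self.ring if n != node]

    def get_node(self, key):
        positions = [pos for pos, _ in self.ring]
        idx = bisect.bisect_right(positions, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]   # first node clockwise from the key's position


ring = ConsistentHashRing(["A", "B", "C"])
print(ring.get_node("video-chunk-42"))
ring.remove_node("B")              # only B's key ranges get remapped
print(ring.get_node("video-chunk-42"))
```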
10. Distributed Time Synchronization Via Logical Clocks
A wall clock solves a fundamental problem for us: it tells us time in an ordered way. Since 4:01 PM always comes before 4:02 PM, looking at the clock lets us understand how to order our life events in the past, present and future. For example, if Jason takes a shower at 3:00 PM and has his lunch at 3:30 PM, we can easily place both events in a topological order, where showering always comes before eating lunch.
In a DS, we have to order events. It is EXTREMELY important for us to get the order of events right to remove any data inconsistencies from the system. The only problem is that there is no absolute time in a distributed system that every node can rely on. Every node operates in its own geographic region, so their local clocks are not a single source of truth we can rely on. An event that happens in the UK at one local time is observed at the same instant, but at a different local time, in the US.
So, how do we synchronize our DS if our clocks are not synchronized? The answer is to use logical clocks, which help us order events based on what happened first, no matter at what local time each event took place. Time is represented as versions of a document: 99 always comes before 100, so version 99 of the doc occurred before version 100.
For example, MongoDB and DynamoDB both rely on logical ordering rather than wall-clock time to remove data inconsistencies: MongoDB orders operations using logical timestamps on top of the multi-version concurrency control (MVCC) in its storage engine, while the original Dynamo design used vector clocks to detect conflicting versions of an item.
Time is represented as integers, not absolute physical time. If one integer is smaller, that event is ordered before the other one — that’s the rule in logical clocks.
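Here is a minimal Python sketch of a Lamport-style logical clock, the simplest such integer clock; the class and the usage scenario are illustrative only.

```python
class LamportClock:
    """Minimal Lamport logical clock: an integer counter per node that only
    ever moves forward, so comparing counters orders events without wall time."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send_event(self):
        self.time += 1
        return self.time          # attach this timestamp to the outgoing message

    def receive_event(self, message_time):
        self.time = max(self.time, message_time) + 1
        return self.time


# Usage: node A has seen more events than node B, then sends B a message.
a, b = LamportClock(), LamportClock()
for _ in range(5):
    a.local_event()
stamp = a.send_event()            # A's clock is now 6; the stamp travels with the message
print(b.receive_event(stamp))     # -> 7: the receive is ordered after the send
```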
Until next time!