Distributed Systems
"A distributed system is one that stops you from getting any work done when a machine you have never even heard of crashes." Leslie Lamport.
Tanenbaum and Van Renesse gave the following definition: "A distributed system is one that looks to its users like an ordinary centralized operating system but runs on multiple independent CPUs. The key concept here is transparency, that is, the use of multiple processors should be invisible (transparent) to the user."
However, I believe this definition is not sufficient. A distributed operating system should also have no single point of failure: no single part failing should bring the whole system down.
This is not an easy condition to fulfil. For a start, it means a distributed system should have several power supplies; if it had only one and that one failed, the whole system would stop.
It is dangerous to attempt an exact definition of a distributed system. However, we can give a list of symptoms of a distributed system.
- Multiple processing elements: Each element can run independently and must therefore possess at least one CPU and its own memory.
- Intercommunication hardware: Intercommunication allows communication between the processing elements. This in turn allows processes running in parallel to communicate and synchronize.
- Processing elements should fail independently: A distributed system cannot be fault tolerant if all nodes fail simultaneously.
- Shared state: This is necessary to recover from failures.
To see more clearly what constitutes a distributed system we shall look at some examples of systems.
- A multi-processor computer with shared memory: It has multiple processing elements, and these can interact via shared memory, interprocessor interrupt mechanisms and a memory bus. Thus it has several of the characteristics of a distributed system. However, what disqualifies multiprocessors is that there is no independent failure, that is, when one processor crashes, the whole system stops working.
- Diskless workstations with NFS file servers: Each workstation and file server has a processor and memory, and a network interconnects the machines. When a workstation crashes, the other workstations and file servers continue to work. When a file server crashes, its client workstations do not crash (although client processes may hang until the server comes back up). But there is no shared state. Therefore, when a server crashes, its information is inaccessible until the server comes back up; and when a client crashes, all its internal state is lost. This network is therefore not a distributed system.
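The symptom list can be read as a checklist. The following Python sketch applies it to the two examples above. The class and field names are hypothetical, and `shared_state=True` for the multiprocessor is an assumption: the text names only the lack of independent failure as its disqualifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """Records which of the four symptoms a system exhibits (hypothetical type)."""
    name: str
    multiple_processing_elements: bool
    intercommunication_hardware: bool
    independent_failure: bool
    shared_state: bool

    def missing_symptoms(self) -> list[str]:
        """Return the symptoms this system lacks, in checklist order."""
        symptoms = {
            "multiple processing elements": self.multiple_processing_elements,
            "intercommunication hardware": self.intercommunication_hardware,
            "independent failure": self.independent_failure,
            "shared state": self.shared_state,
        }
        return [label for label, present in symptoms.items() if not present]

    def is_distributed(self) -> bool:
        """A system qualifies only if no symptom is missing."""
        return not self.missing_symptoms()


multiprocessor = SystemProfile(
    name="shared-memory multiprocessor",
    multiple_processing_elements=True,
    intercommunication_hardware=True,
    independent_failure=False,  # one processor crash stops the whole machine
    shared_state=True,          # assumption; not stated in the text
)

nfs_network = SystemProfile(
    name="diskless workstations with NFS servers",
    multiple_processing_elements=True,
    intercommunication_hardware=True,
    independent_failure=True,
    shared_state=False,         # server data is unreachable while the server is down
)

for system in (multiprocessor, nfs_network):
    print(system.name, "->", system.is_distributed(), system.missing_symptoms())
```

Both examples fail the check, each on a different symptom, which is the point of looking at them side by side.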
Why build distributed systems?
- People are distributed, information is distributed: Distributed systems often evolve from networks of workstations. The owners of the workstations connect their systems together to share information and resources.
- Performance versus cost: Computers are getting cheaper and cheaper. Today a processor of sufficient power to serve most needs of a single person costs less than one tenth of a processor powerful enough to serve ten. The cost of communication depends on the bandwidth of the communication channel and the length of the channel. Bandwidth increase is limited by the cables and interfaces used. Also, wide area networks have to remain in use for decades, since replacing them is extremely expensive. Communication costs are therefore going down much less rapidly than computer costs. Users want instant visual (or audio) feedback from the user interface, and the delay caused by distances greater than a few kilometers is often too high. These reasons make distributed systems not only economical but also necessary.
- Expandability: Storage and processing capacity of a distributed system can be increased by adding file servers or processors one at a time.
- Scalability: The capacity of any centralized component of a system imposes a limit on the system's maximum size. A system that avoids centralized components is not bound by such a limit.
- Availability: Since distributed systems replicate data and have built-in redundancy in all resources that can fail, they have the potential to remain available even when arbitrary failures occur.
- Reliability: A distributed system does what it claims to do correctly, even when failures occur.
The complexity of distributed systems
The main reason that the design of distributed systems is hard is that the enormous complexity of these systems is still beyond our understanding.
The cause of this extraordinary complexity can be understood by comparing distributed systems to railway systems, another kind of distributed system that most of us are familiar with. Just as the railway system has become safe only at the cost of many accidents, so distributed systems can become reliable and fault tolerant only at the cost of many system crashes, discovering design bugs and learning about system behaviour by trial and error.
An interconnection of well-understood components can generate new problems that are not apparent in the components themselves. These problems then produce complexity beyond the limits of our methodical understanding. Example: Consider the well-known fairness of a token ring network. In this network a station may obtain the token to send one packet at a time; then the other stations get the opportunity to send a packet if they have one. It therefore appears that a station A sending a large multi-packet message to server S cannot lock out station B from sending a message at the same time. However, the following scenario was observed:
- A sends the first packet of a message to S, which is received successfully.
- B sends the first packet of a message to S, behind A's packet, but it is ignored because it is too close behind the previous packet.
- A sends the second packet immediately behind B's first packet. Again A's packet is received, since S has recovered from A's first packet by now.
- B sends the second packet or retransmits the first. In either case, S ignores the packet again, since it immediately followed A's second packet.
The effect is that B is totally locked out from sending, because the intended fairness of the token ring actually guarantees that A and B send alternate packets; A's packets are always received while B's are never received.
The above example nicely illustrates the sort of problems to which a combination of well-understood components can lead. In some cases, formal methods can be used to predict the outcome of interconnecting two systems, but these are of limited help, especially when we have not completely understood the systems or their interconnection. Complexity limits what we can build.
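The token ring lockout can be reproduced with a very small simulation. The Python sketch below is a simplified model and not a description of any real token ring implementation. It assumes time is divided into slots, that the token visits the stations in strict rotation, that every station always has a packet to send, and that after accepting a packet S ignores anything arriving in the next `busy_slots` slots. The function and parameter names are hypothetical.

```python
from itertools import cycle, islice


def simulate_ring(stations, rounds, busy_slots):
    """Count how many packets from each station the server S accepts.

    stations   -- station names in token order, e.g. ["A", "B"]
    rounds     -- number of full rotations of the token
    busy_slots -- slots after an accepted packet during which S ignores input
    """
    accepted = {station: 0 for station in stations}
    ready_at = 0  # first slot in which S can accept a packet again

    token_order = islice(cycle(stations), rounds * len(stations))
    for slot, station in enumerate(token_order):
        if slot >= ready_at:
            accepted[station] += 1
            ready_at = slot + 1 + busy_slots
        # Otherwise S ignores the packet; the station simply tries again
        # the next time it holds the token.
    return accepted


print(simulate_ring(["A", "B"], rounds=5, busy_slots=1))  # {'A': 5, 'B': 0}
print(simulate_ring(["A", "B"], rounds=5, busy_slots=0))  # {'A': 5, 'B': 5}
```

With a recovery time of one slot, the strict alternation enforced by the token means B always arrives while S is still busy, so B is never received. With no recovery time, both stations are served equally. Neither the ring nor the server is faulty on its own terms; the lockout arises only from their combination.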