Why can't software run on distributed systems?

Question

I remember when everything was single threaded, both the OS and the applications. I remember using MS-DOS and PCs with single core CPUs.Now, both operating systems and applications are happily using how many cores and even CPUs you let them, provided they are on the same physical system.I wonder why we can't have operating systems and software that can run on the same time on more than a single machine. Why we can't run software on a pool of machines and add or remove physical machines as we see fit?Now when we need to do some work in parallel, we span more and more instances of the software on different machines and we load balance the work between them.I wonder why we can't have one instance of the software span across many machines, as we see fit?Yes, of course, network latency is a big deal but I think there are some types of high throughput interconnect and data transport that might help like Infiniband or HyperTransport. And if existing interconnect tech isn't enough, I believe they can be improved.With the emerging of computer clusters like Beowulf, helped by software stacks like Kerrighed and MOSIX, I thought we will make a step in that direction, but those innitiatives fade away.

pestatije · Accepted Answer

But software _does_ run on distributed systems...you might want to see:https://en.wikipedia.org/wiki/Distributed_computinghttps://en.wikipedia.org/wiki/Inter-process_communication

pjkundert · Answer

The problems arise because we envision the programs’ State as a large, consistent data-centric model.
If you must maintain a single, globally consistent state, you must implement global blocking primitives, which are bound by latency and correctness requirements on all parties in the transaction.
The lower bound for cost on these types of global state engines in a Byzantine environment seems to have been found by eg. Hashgraph: a few tens of thousands of distributed transactions per second in a globally distributed environment. Falls apart for multi-planetary (high latency) participants…
Reframing the global State as “agent-centric” — only participating agents must establish consistency (instead of data-centric - all participants must agree upon one globally consistent state) removes this limit; it appears that we can scale overall transaction rate linearly with node count, roughly.
This is the big breakthrough coming with eg. Holochain based systems. It takes some time to reframe algorithms in an agent-centric form…

perrygeo · Answer

Obviously software does run on distributed systems. It just has to use different hardware and software instructions. Many have tried to abstract away the difference between local memory and distributed memory but there's very few success stories, if any.
Check out "The fallacies of distributed computing", the history of CORBA, and "A note on distributed computing". The conclusion from the later:
> objects that interact in a distributed system need to be dealt with in ways that are intrinsically different from objects that interact in a single address space
The memory bus is not the network. They differ in fundamental ways (concurrency, semantics, atomicity, ordering, latency, control, responsibility, determinism, observability) that require entirely different programming approaches.

toast0 · Answer

It sounds like you're interested in 'Single System Image' distributed systems. And hopefully that keyword helps you find more of them to look into.
It's a lot harder to put together something like that than a more conventional distributed system based on multiple independent servers. I don't think there's a large benefit really, either, as there's plenty of ways to operate similarly across independent servers.
The biggest problem with single system image systems is around managing failure domains. Conventional systems don't have portions of the memory and CPU and X go offline. And when those things do go offline, how do you know if it'll come back with the same state (momentary network disruption) or come back with a loss of state (unscheduled reboot) or not at all (something broke). Or, actually worse, how do you handle when the networking gets degraded and suddenly you have half the throughput or twice the latency or both? These are real issues that need to be solved in any type of distributed system, but it's got to be a lot harder when your program is not aware of the node boundaries it's crossing; it's hard already in a conventional distributed system where nodes are aware of the network, but not of the network boundaries (ex: nodeA wants to talk to nodeB or nodeC, but doesn't know that connectivity to nodeB is congested, so it just divides load between B and C, contributing to the congestion)

brudgers · Answer

Erlang was built for doing that.It has been field tested at industrial scale and its ecosystem is mature and extensive.Good luck.

neximo64 · Answer

Isn't that like lambdas, service weaver, temporal..?

moremetadata · Answer

Alot of software is written for single instance data, thats the way IT has evolved, so whilst clusters are in their infancy, the market perhaps hasnt evolved enough to recognise their benefits, and the laws would need to be upgraded to be consistent across borders.
But you do already see instances like this, like window machines distributing updates to other local window machines. A great way for MS to offload its bandwidth costs onto users, whilst saving itself a packet or two.
So maybe the market has recognised the virtues of distributed systems, but they are not how you envisage them to be?

Why can't software run on distributed systems?

But software _does_ run on distributed systems...you might want to see:
https://en.wikipedia.org/wiki/Distributed_computing
https://en.wikipedia.org/wiki/Inter-process_communication

Erlang was built for doing that.
It has been field tested at industrial scale and its ecosystem is mature and extensive.
Good luck.

Isn't that like lambdas, service weaver, temporal..?