ANONYMOUS wrote:
Hello!
This has been something that's fascinated my imagination for a while, and I'd love to know how feasible it is to implement. Not too important, obviously, but feel free to chip in! (hehe)
We know that small rules can create complex emergent patterns in nature. My question: Would we be able to create many small chips (call them little-guys or LGs) to solve complex problems by computing relatively mundane tasks quite quickly, storing minimal state or simply reacting to an input?
Take, for example, sorting an unordered list. Our current von Neumann (big-guy) solution requires the (forgetful) CPU to keep fetching data from memory and storing new data back into it. We can analogise it as sorting a sequence of face-down cards: knowledge is hidden.
The little-guy solution would have chips with just enough memory to store one element of the array and its index, a method to interact with other little-guys, and a general "desire" to fit where they need to. This can be analogised as a bunch of people trying to sort themselves in order of birth year (something they all know): everyone talks to everyone else, and they sort themselves.
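To make the picture concrete, here is a toy sketch of what I imagine (purely illustrative Python; the random "who talks to whom" rule is just one possible choice, not a real hardware model):

```python
import random

def little_guy_sort(values, rng=random.Random(0)):
    """Each "little guy" (cell) holds one value; repeatedly, a random
    adjacent pair talks and swaps if they are out of order, until no
    neighbouring pair is inverted."""
    cells = list(values)  # cell i is little-guy i's stored element
    while any(cells[i] > cells[i + 1] for i in range(len(cells) - 1)):
        i = rng.randrange(len(cells) - 1)  # a random pair starts talking
        if cells[i] > cells[i + 1]:
            cells[i], cells[i + 1] = cells[i + 1], cells[i]
    return cells

print(little_guy_sort([5, 2, 9, 1, 7]))  # [1, 2, 5, 7, 9]
```

Each little-guy only ever looks at one neighbour at a time, yet the global order still emerges.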
Obviously you need to make sure the little-guy algorithms are provably correct, but when you do, you can initialise every node in one go, run the arena, watch the little guys scramble, then extract your tasty ordered list.
This way, while the time complexity of the problem being solved might stay the same, we will have parallelized it so heavily that the wall-clock reduction would be noticeable for a large number of little-guys. There is no CPU bureaucracy to hold the little-guys back.
The space complexity would skyrocket, obviously, as would the cost of manufacturing, but for a problem as common as sorting, if you're a large enterprise, why would you not use something like this for optimising your performance?
Thanks
You will be unsurprised to learn that this is an extant field of study.
I am not quite following the "little guy" architecture you are describing, but various models of computation have been studied, differentiated by the capabilities and interconnection of their compute elements.
What probably resembles your description the most is various custom FPGA (Field-Programmable Gate Array) structures. FPGAs use an array of (field-)programmable logic blocks that can be "rewired" into custom-purpose hardware. In this way you can make a system that has exactly the minimum amount of hardware it needs to store the state you need (the elements) and perform the actions you take (compare and swap, or similar).
The next most relevant is probably GPU architectures. GPUs are what are called Single Instruction Multiple Data (SIMD) processors, where you have many more but simpler cores ("little guys") compared to a CPU ("big guy"). In particular, you run the same program on multiple cores at the same time, but each core can store and work with different data, applying the same algorithm to different values.
More general parallel processing models exist.
For (comparison-based) sorting in particular, we typically abstract parallel processing as a sorting network: a number of "wires" propagate values forward through space/time, and at various points we can connect pairs of wires together to perform a "compare and swap" (CAS) operation, where the elements on those two wires are compared and swapped if they are inverted. Parallel processing generally lets us do more compare-and-swap operations at once (but any wire can only safely be involved in one operation at a time).
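As a sketch of this abstraction (in Python; the representation of a network as layers of wire pairs is standard, though the function name is mine):

```python
def apply_network(values, layers):
    """Apply a sorting network given as layers of compare-and-swap (CAS)
    wire pairs. Every pair within a layer could run in parallel, because
    no wire appears twice in the same layer."""
    v = list(values)
    for layer in layers:
        for i, j in layer:  # conceptually simultaneous
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]
    return v

# A known optimal 4-wire network: 3 layers, 5 comparators.
net4 = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(1, 2)]]
print(apply_network([3, 1, 4, 2], net4))  # [1, 2, 3, 4]
```

The depth of the network (number of layers) is what a parallel machine pays for in time, rather than the total comparator count.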
To come up with parallel sorting algorithms, the analysis is not that different from single-threaded sorting algorithms: if we schedule our compare-and-swap operations in the right order, we can potentially perform many at once.
The first algorithm I will suggest is based on the same idea as bubble sort of just finding any adjacent inverted pairs and removing them. Since we can do this in parallel, we can CAS every odd-even index pair at once, then every even-odd pair, and repeat this process until all inversions have been removed. This is known as odd-even sort or brick sort (because it stacks like bricks in a wall).
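A sequential simulation of brick sort might look like this (Python sketch; the swaps inside each round are exactly the ones a parallel machine would perform simultaneously):

```python
def odd_even_sort(values):
    """Brick (odd-even transposition) sort: alternate CAS on all
    even-odd index pairs, then all odd-even pairs. For n elements,
    n rounds suffice, and every CAS within a round is independent."""
    v = list(values)
    n = len(v)
    for round_no in range(n):
        start = round_no % 2  # 0: pairs (0,1),(2,3),...; 1: (1,2),(3,4),...
        for i in range(start, n - 1, 2):  # independent -> parallelizable
            if v[i] > v[i + 1]:
                v[i], v[i + 1] = v[i + 1], v[i]
    return v

print(odd_even_sort([4, 3, 2, 1, 0]))  # [0, 1, 2, 3, 4]
```

With one core per pair, each round costs constant time, so the parallel running time is O(n) rather than the sequential O(n^2).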
We can also attempt to improve merge sort by doing the merges in parallel. The lowest layer of merges (merging pairs of individual elements) takes a single CAS. Merging longer lists is more involved, but we can basically do it by comparing pairs at the same index from each list all at once, and then correcting anything that was missed. You can see a full example at Batcher odd-even merge sort.
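Here is a sketch of Batcher's odd-even merge sort in plain Python (following the classic recursive formulation; for simplicity it assumes the length is a power of two — real implementations pad or special-case other sizes):

```python
def batcher_sort(values):
    """Batcher's odd-even merge sort, expressed as comparators ("wires").
    Sketch only: assumes a power-of-two input length."""
    v = list(values)
    n = len(v)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes power-of-two length"

    def cas(i, j):  # compare-and-swap two wires
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]

    def odd_even_merge(lo, length, r):
        """Merge the stride-r subsequence of v[lo:lo+length], whose two
        sorted halves are interleaved; recurse on even/odd subsequences,
        then apply a row of fix-up comparators."""
        m = r * 2
        if m < length:
            odd_even_merge(lo, length, m)      # even-index subsequence
            odd_even_merge(lo + r, length, m)  # odd-index subsequence
            for i in range(lo + r, lo + length - r, m):
                cas(i, i + r)                  # fix-up comparators
        else:
            cas(lo, lo + r)

    def sort(lo, length):
        if length > 1:
            half = length // 2
            sort(lo, half)
            sort(lo + half, half)
            odd_even_merge(lo, length, 1)

    sort(0, n)
    return v

print(batcher_sort([7, 3, 5, 1, 6, 2, 8, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Each row of comparators generated here is independent, giving O(log^2 n) parallel depth.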
Other algorithms such as bitonic sort exist, including some more asymptotically efficient than anything mentioned here, but in practice we typically implement either Batcher or bitonic when sorting on GPUs.
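For completeness, a bitonic sort sketch in the same style (again assuming a power-of-two length; the XOR-partner formulation below mirrors how it is usually written for GPU kernels):

```python
def bitonic_sort(values):
    """Bitonic sort: build ever-larger bitonic runs, then halve them.
    Every iteration of the inner i-loop is independent, so each (k, j)
    pair is one parallel layer, giving O(log^2 n) depth.
    Sketch only: assumes a power-of-two input length."""
    v = list(values)
    n = len(v)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes power-of-two length"
    k = 2
    while k <= n:          # size of the bitonic runs being sorted
        j = k // 2
        while j >= 1:      # comparator stride within a run
            for i in range(n):  # all CASes here are independent
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (v[i] > v[partner]) == ascending:
                        v[i], v[partner] = v[partner], v[i]
            j //= 2
        k *= 2
    return v

print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Bitonic sort's appeal on SIMD hardware is that the partner index `i ^ j` is the same cheap bit operation for every core, with no data-dependent branching in the access pattern.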
Scaling our number of cores to the size of the list quickly becomes infeasible, however, so you then need a strategy for sorting longer lists. Given that in many situations the list is going to take O(N) time to copy into memory anyway, and how fast modern CPUs are, these parallel sorting techniques are very much not universally better.
Hopefully that has given you plenty to explore. I will also mention the game TIS-100, a programming puzzle game based on solving weird parallel-processing puzzles not dissimilar to your "little guys".
Cheers,
Gozz