Here's a summary of what you learned from these experiments:
For a large set of workers, managers need a lot of CPU. CPU usage spikes whenever the Raft recovery process kicks in.
If the leading manager dies, it's better to stop Docker on that node and wait until the cluster becomes stable again with n-1 managers.
Keep snapshot reservation as small as possible, because persisting Raft snapshots uses extra CPU. The default Docker Swarm configuration will do.
Thousands of nodes require a huge set of resources to manage, both in terms of CPU and network bandwidth. Try to keep services and the managers' topology geographically compact.
Hundreds of thousands of tasks require high-memory nodes.
Currently, a maximum of 500-1000 nodes is recommended for stable production setups.
If managers seem to be stuck, wait; they'll recover eventually.
The advertise-addr parameter is mandatory for Routing Mesh to work.
Put your compute nodes as close to your data nodes as possible. The overlay...