Does it come back online properly after a power failure? If you don’t explicitly design for it, then it probably won’t.
In the mid-1990s, I worked with both HP and DEC workstations. There was an interesting difference when they booted up: all the hard drives in the HPs spun up immediately, while in the DEC machines you heard each hard drive spin up individually, with a whoosh and a click. The HPs were built by engineers who assumed power would always be available. The DECs were built by people who knew power could go out, and that if you had a herd of these workstations, you didn’t want them all drawing peak power the moment the power came back on; they might just trip the circuit breaker and go down again.
When I worked on a new customer email system at Ziggo, we kept in mind that if the system went down during dinner and needed to be rebooted, it had to be up and running as soon as possible, and certainly that evening.
We politely declined when the storage vendor offered to speed up their system with RAM caches. Those caches are useless just after startup, because at that point they are still empty. If the mail system relied on them for speed, you couldn’t start it that evening without it immediately becoming overloaded. Startup would only be possible at a quiet time, like the following night…
A fellow architect spent several weeks camped out in the vendor’s lab to determine which configuration of hard drives and SSDs would work well enough for our application. Since this was in 2012 and we were going to buy many petabytes of storage, it was a significant investment that we wanted to get right.
Whether it’s power consumption, storage, bandwidth, network usage, computing power, or any other shared resource, what happens when it fails and comes back? Every generation of engineers and every technology seems to have to learn this lesson anew.
Today’s cloud services are fragile. Have you prepared your service for this scenario? Is your cloud infrastructure provider well-designed? Consider the recent outages at AWS and Cloudflare, or at (mobile) internet providers: this will happen.
If your cloud is good, are your clients polite enough to wait a random amount of time before connecting to your service? Or do they all connect at the same time, and connect again 30 seconds later, etc., so you never get a proper restart?
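What a "polite" client could look like, as a minimal sketch: before every connection attempt it waits a random time, and the upper bound of that wait doubles after each failure (often called exponential backoff with full jitter). The names `reconnect_delay` and `connect_with_backoff`, the 1-second base, and the 300-second cap are assumptions for illustration, not taken from any particular system.

```python
import random
import time


def reconnect_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def connect_with_backoff(connect, max_attempts=8, sleep=time.sleep, rng=random):
    """Call connect() until it succeeds, waiting a random time before each try.

    `connect` is a hypothetical callable that raises ConnectionError on failure.
    """
    last_error = None
    for attempt in range(max_attempts):
        # Wait first, so clients that lost the service at the same moment
        # do not all come back at the same moment.
        sleep(reconnect_delay(attempt, rng=rng))
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

Because each client draws its own random delay, the reconnects are spread over a window that grows with each failure, instead of arriving as one synchronized wave every 30 seconds.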
You cannot build connected systems piece by piece; you have to oversee the entire system, end-to-end. That overview doesn’t come naturally.
But people are often busy enough with their own little piece. If you cannot see the entire system end-to-end, you are guaranteed to fail. Your system won’t come back quickly and elegantly: perhaps not quickly, perhaps not elegantly, perhaps not at all.
You always have to think about the bigger picture. If you don’t have someone who can fit that into their head, find someone who can! Suggestion: DM me. 😉