Distributed can of worms

How a silly classifieds ad morphed into a distributed system spanning multiple countries, hundreds of edge devices, and on-prem offline deployments.

As a kid I used to chuckle at my old man going through pages and pages of Classifieds every month, carefully underlining, then calling, and finally driving off to pick up whatever trash-treasure he had found. Little did I expect to become even worse than him - while he collected tractor parts and machines, I collect all he did, plus electronics, office stuff, books and my favorite - service ads.

Anyway, as it goes, I was doomscrolling ads late one night when I spotted an interesting one - a guy wanted a Raspberry Pi based music player he could drive from a web interface to play his music library. My GF at the time was just starting to learn Python, so I figured this could be a cute practical learning project for her - shot an email to the sketchy address in the ad and forgot about it. Ten days later I got an email back with a phone number for the guy who was actually ordering "the thing". March, 2020.

1 Deception

.

2 Repentance

.

3 Gaslighting

.

4 Silence

.


5 Words of advice

When you get from hundreds to potentially thousands (or orders of magnitude more) of devices, you cannot count on being able to log in and fix things. Even more so with edge devices. To an innocent programmer this might come as a surprise, but when dealing with customers, especially when hardware is involved, a big and potentially crucial part of your operations becomes customer support - you wouldn't believe the situations that can happen. From random people unplugging something somewhere, rodents, competitor sabotage, to plain old non-reproducible once-in-a-blue-moon showstopping bugs. So you have to automate beyond directly touching things. I suggest focusing on:

  1. redundancy, modularity and compartmentalization
  2. things that fix themselves, and that alert you when they can't (preventive maintenance)
  3. documentation for everything and then some, ideally in self-service form
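Point 2 is essentially "try to heal, escalate only on failure". A minimal sketch of that loop, where `check_health`, `restart_service` and `send_alert` are hypothetical stand-ins for whatever your device actually probes, restarts and pages with:

```python
# Sketch of the "fix yourself, then alert" pattern. The three callables are
# placeholders for device-specific probes/actions, not a real API.
import time

def self_heal(check_health, restart_service, send_alert,
              max_restarts=3, backoff_s=5):
    """Try restarting a failing service a few times before paging a human."""
    for attempt in range(max_restarts):
        if check_health():
            return True                        # healthy, nothing to do
        restart_service()
        time.sleep(backoff_s * (attempt + 1))  # linear backoff between tries
    if check_health():
        return True                            # last restart fixed it
    send_alert("service still unhealthy after %d restarts" % max_restarts)
    return False
```

The important property is that the human gets involved only after the device has exhausted its own options, and the alert carries enough context to skip the first round of triage.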

One common defect in the field was cheap 12 V wall-plug PSUs burning out. Another was SD cards failing from too many writes. Both can be mitigated if you are aware they are happening and know why: put in telemetry, analyze the data, and replace that PSU or SD card preventively.
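For the SD card case, the write telemetry is already sitting in `/proc/diskstats` on Linux. A minimal sketch of the preventive-replacement check - the 100 TBW endurance figure and the `mmcblk0` device name are assumptions; check your card's spec and your board's device naming:

```python
# Estimate total bytes written to the SD card from /proc/diskstats and flag
# the card for preventive replacement past a wear threshold.

SECTOR_BYTES = 512          # diskstats counts 512-byte sectors
ENDURANCE_BYTES = 100e12    # ASSUMED card endurance (~100 TBW) - check spec

def written_bytes(diskstats_text, device="mmcblk0"):
    """Per-line layout: major minor name, then stats; the 7th stat field
    (index 9 overall) is sectors written."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9]) * SECTOR_BYTES
    raise ValueError("device %r not found" % device)

def needs_replacement(diskstats_text, device="mmcblk0", threshold=0.8):
    """Flag the card once cumulative writes pass e.g. 80% of rated endurance."""
    return written_bytes(diskstats_text, device) >= threshold * ENDURANCE_BYTES
```

In practice you would ship the counter to your telemetry backend and alert from there, since the counter resets on reboot and needs to be accumulated fleet-side.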

Basically, define every case in which a human would need to touch something, document it, and automate it away. Script it so things pull their own configs, restart services, check their own health, etc. Create self-service apps to help users troubleshoot things before calling in. Human involvement should really be required only when the device itself is completely dead, and once you start scaling, just keeping enough replacements flowing will likely be a full-time job.

If you are assembling devices yourself, DO NOT mess up your supply chain. I have seen my clients sweat bullets because they sold devices they hadn't even ordered components for, let alone assembled, and the supplier had no SoC in stock. Supporting a new SoC was a lot of burned time and effort; the new device didn't go through enough testing, was unstable, and eventually got phased out in favor of the old devices. So the client incurred multiple costs due to bad planning (and ignoring my pleas):

  1. paying for new hardware R&D
  2. paying for new software for that hardware
  3. paying the cost of skipped testing: a slow rollout, losing part of the deal, and shipping devices back and forth
  4. paying the cost of replacing that whole deployment with the old device version once it became available again
  5. paying to REMOVE the new software and setup

Gosh.