Distributed can of worms

How a silly classifieds ad morphed into a distributed system spanning multiple countries, hundreds of edge devices, and on-prem offline deployments.

As a kid I used to chuckle at my old man going through pages and pages of Classifieds every month, carefully underlining, then calling, and finally driving off to pick up whatever trash-treasure he had found. Little did I expect to become even worse than him - while he collected tractor parts and machines, I collect all he did, plus electronics, office stuff, books and my favorite - service ads.

Anyway, as it goes, I was doomscrolling ads late one night when I spotted an interesting one - a guy wanted a Raspberry Pi based music player he could drive from a web interface to play his music library. My GF at the time was just starting to learn Python, so I figured this could be a cute practical learning project for her - shot an email to the sketchy address in the ad and forgot about it. Ten days later I got an email back with a phone number for the guy who was actually ordering "the thing". March, 2020.

1 Deception

.

2 Repentance

.

3 Gaslighting

.

4 Silence

.


5 Words of advice

When you get from hundreds to potentially thousands (or orders of magnitude more) of devices, you cannot count on being able to log in and fix things. Even more so with edge devices. To an innocent programmer this might come as a surprise, but when dealing with customers, especially when hardware is involved, a big and potentially crucial part of your operations becomes customer support - you wouldn't believe the situations that can happen. From random people unplugging something somewhere, rodents, competitor sabotage, to plain old non-reproducible once-in-a-blue-moon showstopping bugs. So you have to automate beyond directly touching things. I suggest focusing on:

  1. redundancy, modularity and compartmentalization
  2. things that fix themselves, and that alert you when they can't (preventive maintenance)
  3. documentation for everything and then some, ideally in self-service form
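Point 2 is essentially "try to heal, escalate only on failure". A minimal sketch of that loop, where `check_health`, `restart_service` and `send_alert` are hypothetical stand-ins for whatever your device actually probes, restarts and pages with:

```python
# Sketch of the "fix yourself, then alert" pattern. The three callables are
# placeholders for device-specific probes/actions, not a real API.
import time

def self_heal(check_health, restart_service, send_alert,
              max_restarts=3, backoff_s=5):
    """Try restarting a failing service a few times before paging a human."""
    for attempt in range(max_restarts):
        if check_health():
            return True                        # healthy, nothing to do
        restart_service()
        time.sleep(backoff_s * (attempt + 1))  # linear backoff between tries
    if check_health():
        return True                            # last restart fixed it
    send_alert("service still unhealthy after %d restarts" % max_restarts)
    return False
```

The important property is that the human gets involved only after the device has exhausted its own options, and the alert carries enough context to skip the first round of triage.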

One common defect in the field was cheap 12 V wall-plug PSUs burning out. Another was SD cards failing from too many writes. Both can be mitigated if you are aware they are happening and know why: put in telemetry, analyze the data, and replace that PSU or SD card preventively.
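For the SD card case, the write telemetry is already sitting in `/proc/diskstats` on Linux. A minimal sketch of the preventive-replacement check - the 100 TBW endurance figure and the `mmcblk0` device name are assumptions; check your card's spec and your board's device naming:

```python
# Estimate total bytes written to the SD card from /proc/diskstats and flag
# the card for preventive replacement past a wear threshold.

SECTOR_BYTES = 512          # diskstats counts 512-byte sectors
ENDURANCE_BYTES = 100e12    # ASSUMED card endurance (~100 TBW) - check spec

def written_bytes(diskstats_text, device="mmcblk0"):
    """Per-line layout: major minor name, then stats; the 7th stat field
    (index 9 overall) is sectors written."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9]) * SECTOR_BYTES
    raise ValueError("device %r not found" % device)

def needs_replacement(diskstats_text, device="mmcblk0", threshold=0.8):
    """Flag the card once cumulative writes pass e.g. 80% of rated endurance."""
    return written_bytes(diskstats_text, device) >= threshold * ENDURANCE_BYTES
```

In practice you would ship the counter to your telemetry backend and alert from there, since the counter resets on reboot and needs to be accumulated fleet-side.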

Basically, define every case in which a human would need to touch something, document it, and automate it away. Script it so things pull their own configs, restart services, check their own health, etc. Create self-service apps to help users troubleshoot things before calling in. Human involvement should really be required only when the device itself is completely dead, and once you start scaling, just keeping enough replacements flowing will likely be a full-time job.

If you are assembling devices yourself, DO NOT mess up your supply chain. I have seen my clients sweat bullets because they sold devices they hadn't even ordered components for, let alone assembled, and the supplier had no SoC in stock. Supporting a new SoC was a lot of burned time and effort; the new device didn't go through enough testing, was unstable, and eventually got phased out in favor of the old devices. So the client incurred multiple costs due to bad planning (and ignoring my pleas):

  1. paying for new hardware R&D
  2. paying for new software for that hardware
  3. paying the cost of skipped testing: a slow rollout, losing part of the deal, and shipping devices back and forth
  4. paying the cost of replacing that whole deployment with the old device version once it became available again
  5. paying to REMOVE the new software and setup

Gosh.