Fail-safes Applied to Code
All engineers are trained in this; it should be shouted about more in the tech world
Photo credit: Photo by Sarah Kilian on Unsplash
As some may know, my background and degree are in engineering, and by no means did I run away from the profession. Engineering is a wonderful discipline and culture that centers on designing systems and things that are safe, effective and efficient. We see the world rated against these criteria; it's all about how the solution performs, regardless of marketing material.
We see safety as "well, it's got to be safe, doesn't it": no questions, no ifs, no buts; we won't and shouldn't put our name to anything that isn't.
"Safe" is a tricky objective, as a design should be safe by virtue of the fact it works. You wouldn't design something intentionally unsafe, well, unless you are designing weapons etc. Even then, the safety of the owner is still an objective!
Most lapses of safety in a system happen when the unexpected happens. How do you make the unexpected safe? You guessed it, encase everything in a 2 meter thick layer of concrete and make sure everyone is at least 5 miles away.
No, fail-safes, of course.
Formula 1 is a shining example of many engineering concepts, including fail-safes. A good example is wheel tethers: these are "ropes" (very fancy ropes) that run inside the wishbones (part of the car's suspension), connecting the wheel to the body of the car. When the unexpected happens and the suspension fails, the wheel fails safe in that it doesn't fly off and injure a marshal or a member of the crowd. In the case of the unexpected, a great deal of the risk is removed.
Engineers were working on these systems long before Formula 1. Take automatic vacuum brakes on trains as an example, a development of the vacuum brake introduced in the mid-1860s. Very simply, the brakes were released by increasing the amount of vacuum in a tube. There were a few technical reasons why these were beneficial given the technology at the time, but they also provided a handy fail-safe mechanism.
Imagine that the brakes were operated by a steel cable: much simpler in concept, and it would probably work fine. However, if the steel cable snaps ("fails"), the brakes cannot be applied and you have a very unsafe system. If your vacuum tube snaps, or even springs a leak, the train's brakes will be applied and the train will come to a stop.
Fail-safes are there so that a portion of an engineering system must be correct before the system will operate. Or, if the criterion fails (no constant vacuum in the tube, or no wheels on the car!), something initiates to make the system safe, or safer.
Building on the last example of the automatic vacuum brake, think of a scenario in a rail yard: you just want to move a train slowly from one workshop to the next, and maybe the engine is disassembled. Cue much puffing and ranting; the train won't move because the brakes aren't connected up.
This is where many fail-safes are snubbed, designed out or bypasses installed. I'm going to be very frank; you learn of countless examples during your engineering degree where this has led to severe and fatal "accidents".
The common theme is that in normal times, fail-safes are more expensive to implement and maintain, both in terms of time and money. It's only when the unexpected happens and disaster is averted that everyone goes "good thing we had X". This is unfortunately why it's taken many disasters to change the status quo across many industries.
I don't believe engineering and coding are much different. However, engineering has ~250 years of experience under its belt, and many people have died as a result of bad practice. Coding hasn't been around as long and has been seen as harmless; maybe engineering started this way.
The mentality of many / most coders and their organizations is "if it works, let's move on". There is no legislation in place to prevent this mentality, no real standards to adhere to and not even much accreditation of expertise. The coding world is maybe somewhere around the 1920s in the engineering timeline: the bright dawn of the new future, albeit with a lot still going really wrong.
It's interesting that engineers have created fail-safe systems and checks for software systems in many applications. Car steering could be done entirely by electric motors for various benefits. This is rarely done because, if everything else fails, you want the driver to be able to steer the car; electronic assistance with a mechanical fallback is much safer. It's well documented how difficult driverless cars are from a safety perspective.
When creating pure software however, like Mashoom, we have to work out and create these systems ourselves. We are relied upon for critical applications that are outside the safety-focussed arms of traditional engineers.
Bugs are closely analogous to unexpected scenarios, where fail-safes would kick in in the engineering world, so this is where we must turn our attention. So really we could re-ask this question as: how do we ensure a bug fails safe?
Code is very predictable, so we have a much better starting point than a lot of engineering systems. However, there are two broad ways an "unexpected" thing happens in software: unexpected user input and the addition of new code.
The first is simply a case of making sure all user input is validated. I'm not going to dwell on this as it's well documented and known about, even if many applications still don't do it.
To start to tackle the latter point you must create single points of functionality; this comes from well written and / or refactored code. Think of the F1 car example: you must have a single method of attaching the wheels to the car. It's no good putting a fail-safe on 4 wheels if you suddenly discover you have a 5th you didn't know about.
Then it's making sure that these single points of functionality are always used, that they can't be used incorrectly and that, if anything is "unusual", they fail safe. In this way, a coder going quickly, late at night with a deadline just round the corner, can't deploy anything that would cause a catastrophic failure.
Something not working, and clients / customers being unhappy, is still preferable to data deletion or a security issue.
This is a big mentality shift for a lot of coders. Many don't write their own errors, and suppress or ignore any that do get generated to avoid fixing them. PHP even gives you a syntax for suppressing errors (the @ operator). Titled hopefully as "Error Control", it's actually known as the "STFU symbol" colloquially and is a great example of a tool that allows very bad things to happen.
This feeds into another bugbear of mine; errors are there for a reason, read them, fix them... yes, even "notices"! Also, don't just find the quickest way to stop them happening, you can often catch a bug before it's a problem by asking "why didn't I think of this".
Another good general rule is checking the inputs of a function or method. Whenever you start a function, just write errors that define what you expect the inputs to be, even if you're not sure.
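As a rough illustration of this rule, here is a minimal sketch in Python (chosen purely for illustration, it is not Mashoom's code). The function name load_record and the in-memory records dictionary are hypothetical stand-ins for a real database lookup; the point is only that the expectations are written as errors at the very top of the function.

```python
def load_record(record_id: int, records: dict[int, str]) -> str:
    """Return the record stored under record_id, refusing unexpected input."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(record_id, int) or isinstance(record_id, bool):
        raise TypeError(f"record_id must be an int, got {type(record_id).__name__}")
    # Assumption for this sketch: a database ID is never negative.
    if record_id < 0:
        raise ValueError(f"record_id must not be negative, got {record_id}")
    # A missing record is an error too, not a silent empty result.
    if record_id not in records:
        raise KeyError(f"no record with id {record_id}")
    return records[record_id]
```

If one of these checks later turns out to be too strict, loosening it is a deliberate, visible change rather than a silent assumption.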
"Should a database ID be less than 0", don't bother with "maybe" or "this function would still work"; just put in an error. You can always go "I was wrong, that is possible and OK", that is a lovely bug to fix. Apart from that, it explains to the next reader of your code what you were anticipating this function to be used for in a way that can't be ignored.
When I started writing Mashoom I was keeping up a university degree whilst providing data management for a race engineering team at my university. Bugs were a massive pain, data loss even worse, and nothing quite beats your first "customers" being your course mates and friends relying on it for university work.
Very early on I started thinking about all of this and thankfully, got lucky by building what have turned out to be robust fail-safes into the foundations.
Every error is fatal. Oh yes, put a foot wrong on Mashoom as you are developing and it stops everything. Not just display, stops; it has to be fixed. Look at the only real difference between the React framework made public and the one used at Facebook; the error messages pop up in an annoying box rather than quietly logging in the console. This stops a developer ignoring something that needs fixing. As a small token for the extra effort, it also forces you into good habits.
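To give a feel for the "every error is fatal" stance, here is a small Python sketch using the standard warnings module, where a warning plays roughly the role of a PHP "notice". This is an analogy only, not how Mashoom does it; parse_quantity and strict_parse are made-up names.

```python
import warnings


def parse_quantity(text: str) -> int:
    """Parse a quantity, warning about input that is suspicious but usable."""
    value = int(text)
    if value == 0:
        warnings.warn("quantity of zero is unusual", UserWarning)
    return value


def strict_parse(text: str) -> int:
    """Run parse_quantity with every warning promoted to an exception."""
    with warnings.catch_warnings():
        # "error" turns any warning raised inside this block into an exception.
        warnings.simplefilter("error")
        return parse_quantity(text)
```

With the strict wrapper, the unusual value stops execution and has to be dealt with, instead of scrolling past in a log nobody reads.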
Database calls are controlled centrally. This brings everything mentioned above together. Firstly, we only have one method to connect to the database; if we had a huge team, we could do a find-all for any functions attempting a database connection to cross-check this. Input validation can then be applied to protect against and handle stupid values before they hit the database. Every query is prepared to abolish SQL injection attacks and every output is HTML encoded by default.
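A minimal sketch of what a single point of database access can look like, written in Python with the standard sqlite3 and html modules. The Database class and its method names are hypothetical and only illustrate the idea: values are always passed as bound parameters, and text coming out is HTML encoded unless someone deliberately changes that.

```python
import html
import sqlite3


class Database:
    """Hypothetical single entry point for all database access."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)

    def execute(self, sql: str, params: tuple = ()) -> None:
        """Run a write statement; values travel as bound parameters only."""
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(sql, params)

    def fetch(self, sql: str, params: tuple = ()) -> list[tuple]:
        """Run a read query and HTML-encode every text value by default."""
        rows = self._conn.execute(sql, params).fetchall()
        return [
            tuple(html.escape(v) if isinstance(v, str) else v for v in row)
            for row in rows
        ]


db = Database()
db.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
db.execute("INSERT INTO notes (body) VALUES (?)", ("<b>hi</b>",))
rows = db.fetch("SELECT body FROM notes WHERE id = ?", (1,))
# rows now holds the encoded text "&lt;b&gt;hi&lt;/b&gt;", not raw markup
```

Because nothing else in the codebase is meant to open a connection, a simple search for sqlite3.connect outside this class is enough to spot a bypass.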
"By default" is important to touch on before explaining the final party trick. A default that is overly safe, and that can subsequently be turned off, is a great position to be in when coding. If you're going quickly, you are coding as safely as possible; you can then audit any changes to that known and safe stance.
Data separation is performed automatically. Data separation is making sure that one account / user's data can't be viewed by someone else; this boils down to having to add something like "WHERE UserID = X" to every database query. In a hurry, would you remember to do this every time? No. How bad is it if you forget? Really. Very really.
We have spent a painstaking amount of time creating a coding process where these clauses are separately managed, so you don't even have to think about it when writing most database queries. You can of course write individual "security degrades" to override these defaults, but they are well controlled and auditable.
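The idea can be sketched as follows, again in Python with sqlite3 and with every name invented for illustration (ScopedDatabase, SecurityTrigger, the SCOPED_TABLES allow-list and the notes table are assumptions, not Mashoom's real design). The wrapper always adds the user filter itself, and anything it doesn't recognise raises an error rather than being guessed at.

```python
import sqlite3
from typing import Optional

# Hypothetical allow-list: tables that carry a user_id column, and their columns.
SCOPED_TABLES = {"notes": {"id", "user_id", "body"}}


class SecurityTrigger(Exception):
    """Raised when a query conflicts with the default separation rules."""


class ScopedDatabase:
    """Hypothetical wrapper that adds the user filter to every read."""

    def __init__(self, conn: sqlite3.Connection, user_id: int) -> None:
        self._conn = conn
        self._user_id = user_id

    def select(self, table: str, columns: list[str],
               filters: Optional[dict[str, object]] = None) -> list[tuple]:
        filters = filters or {}
        allowed = SCOPED_TABLES.get(table)
        if allowed is None:
            raise SecurityTrigger(f"table {table!r} has no separation rule")
        if not columns:
            raise SecurityTrigger("no columns requested")
        unknown = (set(columns) | set(filters)) - allowed
        if unknown:
            raise SecurityTrigger(f"unknown columns: {sorted(unknown)}")
        if "user_id" in filters:
            raise SecurityTrigger("user_id is applied automatically")
        # The separation clause always comes first and cannot be omitted.
        clauses = ["user_id = ?"] + [f"{col} = ?" for col in filters]
        sql = (f"SELECT {', '.join(columns)} FROM {table} "
               f"WHERE {' AND '.join(clauses)}")
        return self._conn.execute(sql, (self._user_id, *filters.values())).fetchall()
```

An audited override for the rare query that genuinely needs to cross accounts would be a separate, explicit code path, which this sketch leaves out.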
Finally, I would like to give a shout out to all my customers and clients who have seen the error message "Minion security trigger"; my apologies. To cut a complex explanation short, this error is triggered when a database call is written which conflicts with the standard setup. It's one of the most common errors because Mashoom's account separation etc. is very complex and, of course, it fails rather than in any way guesses.
However, whilst my customers are understandably frustrated when they see this error, I know I'm hitting the buffers of a safety device, which is a much safer and preferable place to be than having fewer errors of a much more severe nature.
Bugs are never good, but I take confidence, and some smugness, in knowing that 99.9% of the time Mashoom either errors or fails safe via methods I've written, rather than waiting for a user to say "I don't recognize this data" or even "where has my data gone".