Sunday, 7 October 2012

Software – the growing need for a zero failure design.



A recent ‘glitch’ in CityLink’s computer system caused Melbourne road users serious frustration. The Burnley and Domain tunnels were both shut down, and CityLink was unable to communicate with its incident detection and safety systems. The suspected cause has been narrowed down to network connectivity, with engineers working since 5:30 AM to find the root cause.
 
Most software is tested heavily before being released into production, a process often known as ‘QA’ (Quality Assurance). Software developers rely on test engineers, who try to break the software by emulating real-life scenarios such as overloading, typical user behaviour, power-user behaviour and input validation. Test engineers also account for external factors that can cause failure, for example a power interruption or a network link outage.

In many cases, software has built-in protection mechanisms to safeguard data. If an ATM (Automated Teller Machine) loses its connection to the bank’s mainframe, it does not alert users to the outage; it simply continues to dispense money and queues the transactions until connectivity to the bank’s database is restored. Customers cannot print receipts, and online banking logs do not reflect the transactions until later. Banks have assessed this as an acceptable level of risk in exchange for not inconveniencing the customer, and that’s great for you and me.
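The queueing behaviour described above is often called store-and-forward. Here is a minimal sketch of the idea in Python. The class and method names (`OfflineCapableATM`, `withdraw`, `reconnect`) are made up for illustration, and the `posted` list merely stands in for the bank’s database; this is not how any particular bank implements it.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Withdrawal:
    account: str
    amount: int  # whole dollars


class OfflineCapableATM:
    """Hypothetical ATM that queues withdrawals while the bank link is down."""

    def __init__(self):
        self.online = True
        self.pending = deque()  # transactions awaiting reconnection
        self.posted = []        # stand-in for the bank's database

    def withdraw(self, account, amount):
        tx = Withdrawal(account, amount)
        if self.online:
            self.posted.append(tx)   # link is up: record immediately
        else:
            self.pending.append(tx)  # link is down: queue for later
        return amount                # cash is dispensed either way

    def reconnect(self):
        """Restore the link and forward queued transactions in order."""
        self.online = True
        while self.pending:
            self.posted.append(self.pending.popleft())


atm = OfflineCapableATM()
atm.online = False          # simulate losing the mainframe connection
atm.withdraw("12345", 50)   # customer still gets their cash
atm.reconnect()             # queued transaction reaches the database
```

The risk the bank accepts is visible in `withdraw`: while offline, cash goes out before the balance can be checked.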

On the other hand, there are aspects of society in which there must be a 0% chance of failure. Life support systems and air traffic control towers are examples of sophisticated technology that depends on defect-free software design. A failure in either can lead to significant loss of human life, and at the end of the day that is the ultimate price no one is willing to pay. Organisations invest heavily in ‘defect-free software’, and the level of engineering involved means developing such software takes approximately double the time.

For software developers to ensure that there is a 0% chance of failure, every state (or scenario) in the state space the software may execute under is tested for a predictable outcome. Reducing the number of unpredictable outcomes to zero almost eliminates any chance of failure, and by ‘almost’ I mean external factors still need to be considered (backup power, natural disaster recovery etc.).
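To make the idea of testing every state concrete, here is a small Python sketch. The controller (`tunnel_state`) and its three inputs are entirely hypothetical and are not based on CityLink’s actual system; the point is only that every combination of inputs is run and checked for a defined outcome.

```python
from itertools import product

ALLOWED = {"OPEN", "CLOSED"}  # the only outcomes we consider predictable


def tunnel_state(link_up, incident_detected, ventilation_ok):
    """Hypothetical controller: close whenever safety cannot be confirmed."""
    if not link_up:
        return "CLOSED"  # no contact with safety systems: fail safe
    if incident_detected or not ventilation_ok:
        return "CLOSED"
    return "OPEN"


def check_all_states():
    """Run the controller over every input combination and collect results."""
    results = {}
    for state in product([False, True], repeat=3):
        outcome = tunnel_state(*state)
        assert outcome in ALLOWED, f"unpredictable outcome for {state}"
        results[state] = outcome
    return results


results = check_all_states()  # covers all 2 ** 3 = 8 input states
```

With three yes/no inputs there are only eight states to cover. Real systems have vastly larger state spaces, which is a big part of why this level of assurance costs so much time.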

Software developers and test engineers are often under pressure and strict timelines to complete modules, and this leads to more unpredictable outcomes (glitches). I strongly believe today’s gridlock and traffic chaos in Melbourne should prompt organisations to invest more in defect-free software methodology.
