I first read an article about this on GigaOM, and it really got me thinking about the ways companies approach downtime. Every line of business is different: some can accept downtime, others cannot. For the services we depend on and use daily, such as Google, Amazon and eBay, we would not accept downtime. This is a rapid change of thinking, as a few years ago we might have accepted it, but not now, not in our online 24/7/365 world, and as I always allude to, this is due to the user experience and choice that we as end users now have.
An example of this is Google Mail, which went down between 8:45 AM PT and 9:13 AM PT while Google was upgrading some of its load-balancing software; the upgrade turned out to be flawed, so they had to revert to a previous version. Now why would Google deploy code at that time? The answer is simple when you think about it: in the online world we now live in there is no acceptable window of downtime, so companies like this are constantly rolling out code upgrades to deliver more benefits to end users and the business.
So a particular section of this article intrigued me, which was how Netflix works; the full article can be found here. Netflix employs a service they created called "Chaos Monkey", which is an open invitation to break systems and cause downtime, because their philosophy is that "the best defense against major unexpected failures is to fail often". They learn from failures, and by doing this their systems become more resilient.
Netflix was quoted as saying: "Systems that contain and absorb many small failures without breaking and get more resilient over time are 'anti fragile' as described in [Nassim] Taleb's latest book," explains Adrian Cockcroft of Netflix. "We run chaos monkeys and actively try to break our systems regularly so we find the weak spots. Most of the time our end users don't notice the breakage we induce, and as a result we tend to survive large-scale outages better than more fragile services."
So the Chaos Monkey seeks out and terminates virtual machine instances (on AWS, in Netflix's case) on a schedule, usually within quiet hours. This means Netflix learns where its application is weak and can identify ways to keep the service running despite whatever goes down.
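To make the idea concrete, here is a minimal sketch of that "pick a random victim and kill it" loop in Python. This is not Netflix's actual Chaos Monkey code; the instance records, the `terminate` callback and the field names are all hypothetical, standing in for whatever your cloud API (AWS or otherwise) actually provides.

```python
import random

def pick_victim(instances, rng=random):
    """Randomly select one running instance as the termination target.

    `instances` is a list of dicts with (assumed) "id" and "state" keys.
    Returns None if nothing is running, so a quiet fleet is left alone.
    """
    running = [i for i in instances if i["state"] == "running"]
    if not running:
        return None
    return rng.choice(running)

def chaos_run(instances, terminate, rng=random):
    """One scheduled chaos run: choose a victim and terminate it.

    `terminate` is a callback (in real life, a cloud API call) that
    receives the chosen instance record.
    """
    victim = pick_victim(instances, rng)
    if victim is not None:
        terminate(victim)
    return victim
```

A real implementation would run this on a timer restricted to quiet hours and log every kill, so that any resulting user-visible breakage can be traced straight back to the induced failure.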
The thing is, failures happen; everyone accepts this. But knowing how your application behaves under failure is critical, because in the case of Netflix the videos must keep on streaming!
I really like the philosophy of the "Chaos Monkey", and it has really intrigued me because it is a different perspective from what I have viewed and experienced with scheduled DR testing: what Netflix is essentially doing is constantly trying to bring its own service down.
This got me thinking about EMC VPLEX, which is designed to give you an active-active data center, but more importantly to give you outage avoidance through mediums such as a stretched HA cluster spanning geographic distances. When I think of Netflix and in particular their infrastructure, if automated VMware High Availability restarted their services on the other cluster, the outage windows would be smaller, as only an application restart would be needed, and they could maintain online services while still hunting for errors.
Everything I seem to read these days is about availability, cloud, users and demand. VPLEX addresses this zero-tolerance approach to availability. I will post an article explaining more about VPLEX soon; in the meantime, have a look here.
So to sign off, enjoy your Xmas and New Year!