@DevOpsSummit Authors: Yeshim Deniz, Pat Romanski, Liz McMillan, Mehdi Daoudi, Elizabeth White

Related Topics: @DevOpsSummit

@DevOpsSummit: Blog Feed Post

Checklist for Cloud Service Operations

I knew one software company that failed their SaaS transition because they chose to cut a few corners

I knew one software company that failed their SaaS transition because they chose to cut a few corners with the operations. Since they were software engineers, they did not really want to spend time on such mundane tasks as security, auditing, backups and so on. One day they let a disgruntled employee go, the guy went to an internet cafe, logged into the hosting account with the shared admin credentials, and deleted all customer data.

There were no backups or data replicas to bring the data back, no personal admin accounts or procedures to prevent such an incident from happening, and even no monitoring to learn about the issue before customers did. This was the end of this SaaS application – it just never recovered.

Agile DevOps in the Cloud - Session recording from WSO2Con Asia 2014

Cloud business is more than just putting some code online
(and collecting money ;)) Whether you are offering Software-as-a-Service (SaaS) web application, Platform-as-a-Service (PaaS) or Infrastructure-as-a-Service (IaaS) – what you are offering is more than just your code – it is your service.

Even if you do not offer a formal service level agreement (SLA) and have a statement in your Terms of Service that you are not liable for anything, your online application or platform is still a service so your customers expect it to be reliable and secure.

At our recent WSO2Con, Chamith Kumarage delivered an excellent session on how our Cloud DevOps team works. If you are delivering a service online (or considering doing so) – make sure to watch the recording (quick registration required).

Here’s my quick summary of Chamith’s advice:

1. Automate everything: repetitive tasks not only are inefficient and mundane, and eat your time. When done manually they are unreliable. Humans tend to do things slightly differently each time they do them, or not do them at all.

2. Tasks are really parts of processes: when you come up with something that needs to be done, ask yourself what is the process flow for this task? For example, a data backup is really a part of a process that includes:

  • Scheduled (e.g. at 1 a.m. every day) script which creates a backup,
  • Some sort of monitoring system which verifies that the script ran and the backup got created,
  • Notifications on failures and procedures that need to be followed in not,
  • Backup testing: automated and/or regular manual recovery drills (if manual then documented and performed by different team members).

3. Design for failure: everything will be failing so make sure that your system can sustain the failures. For example, if your system uses multiple virtual machines in the cloud, keep running a “chaos monkey” script which keeps randomly killing the instances and automated tests which ensure that these instance failures do not affect the overall system (by the way, see how Netflix does that.)

4. Self-healing and success verification are critical for all tasks. Any task and operation can fail (see above) so the system should not get “surprised” but should always automatically validate the action results and if something didn’t go right – implement the healing procedures (start new instances, retry, and so on).

5. Enforce discipline, processes, automation, checklists. Document everything. This will make your processes repeatable and reliable.

Bus monkey test” (related to the above) if one of your team members gets hit by a bus – all operations should keep working: everything needs to be documented and tried by other team members. (* This is a mental experiment – do not actually hit your team-members by busses :))

6. Monitoring and analytics: the key is not to collect and show tons of data and alerts, but be able to quickly detect abnormal behavior.

7. Communications: your dashboards should quickly and clearly give you the big picture and relevant details. Key metrics and system state should be something that everyone sees and understands, effective drill-downs should make it easy to understand and fix stuff.

8. Agile delivery: waterfall processes in the cloud are bad and stressful.The smaller the changes and the more often and in more automated fashion they are – the more mundane they become: which lowers the risks and improves the skills and reliability. Cloud and big-bang releases do not go well together.

9. Use standard tools and native systems of underlying platforms – do not reinvent the wheels. For example, if the platform gives you SQL-as-a-service (Amazon RDS, Azure SQL and so on) – use those and not your own MySQL running on a virtual machine.

10. Post-mortem analysis is a must. If something did get wrong after all, you need a formal investigation process:

  • What happened?
  • Why and what needs to be done to prevent this in the future?
  • If automated monitoring didn’t catch it, why and what needs to be done to prevent this in the future?
  • If validation and self-healing didn’t catch it, why and what needs to be done to prevent this?

Full session recording and slides are available here.

Read the original blog entry...

More Stories By Dmitry Sotnikov

Dmitry Sotnikov is VP of Cloud at WSO2, building the cloud business for this leading middleware provider. Check out the WSO2 Cloud platform at http://CloudPreview.WSO2.com

@DevOpsSummit Stories
DXWorldEXPO | CloudEXPO are the world's most influential, independent events where Cloud Computing was coined and where technology buyers and vendors meet to experience and discuss the big picture of Digital Transformation and all of the strategies, tactics, and tools they need to realize their goals. Sponsors of DXWorldEXPO | CloudEXPO benefit from unmatched branding, profile building and lead generation opportunities.
@DevOpsSummit New York 2018, colocated with CloudEXPO | DXWorldEXPO New York 2018 will be held November 11-13, 2018, in New York City. From showcase success stories from early adopters and web-scale businesses, DevOps is expanding to organizations of all sizes, including the world's largest enterprises - and delivering real results.
Dion Hinchcliffe is an internationally recognized digital expert, bestselling book author, frequent keynote speaker, analyst, futurist, and transformation expert based in Washington, DC. He is currently Chief Strategy Officer at the industry-leading digital strategy and online community solutions firm, 7Summits.
DXWorldEXPO LLC announced today that Dez Blanchfield joined the faculty of CloudEXPO's "10-Year Anniversary Event" which will take place on November 11-13, 2018 in New York City. Dez is a strategic leader in business and digital transformation with 25 years of experience in the IT and telecommunications industries developing strategies and implementing business initiatives. He has a breadth of expertise spanning technologies such as cloud computing, big data and analytics, cognitive computing, machine learning and the Internet of Things.
Enterprises have taken advantage of IoT to achieve important revenue and cost advantages. What is less apparent is how incumbent enterprises operating at scale have, following success with IoT, built analytic, operations management and software development capabilities - ranging from autonomous vehicles to manageable robotics installations. They have embraced these capabilities as if they were Silicon Valley startups.