Welcome!

@DevOpsSummit Authors: Yeshim Deniz, Zakia Bouachraoui, Pat Romanski, Liz McMillan, Elizabeth White

Related Topics: @DevOpsSummit

@DevOpsSummit: Blog Feed Post

Checklist for Cloud Service Operations

I knew one software company that failed their SaaS transition because they chose to cut a few corners

I knew one software company that failed their SaaS transition because they chose to cut a few corners with the operations. Since they were software engineers, they did not really want to spend time on such mundane tasks as security, auditing, backups and so on. One day they let a disgruntled employee go, the guy went to an internet cafe, logged into the hosting account with the shared admin credentials, and deleted all customer data.

There were no backups or data replicas to bring the data back, no personal admin accounts or procedures to prevent such an incident from happening, and even no monitoring to learn about the issue before customers did. This was the end of this SaaS application – it just never recovered.

Agile DevOps in the Cloud - Session recording from WSO2Con Asia 2014

Cloud business is more than just putting some code online
(and collecting money ;)) Whether you are offering Software-as-a-Service (SaaS) web application, Platform-as-a-Service (PaaS) or Infrastructure-as-a-Service (IaaS) – what you are offering is more than just your code – it is your service.

Even if you do not offer a formal service level agreement (SLA) and have a statement in your Terms of Service that you are not liable for anything, your online application or platform is still a service so your customers expect it to be reliable and secure.

At our recent WSO2Con, Chamith Kumarage delivered an excellent session on how our Cloud DevOps team works. If you are delivering a service online (or considering doing so) – make sure to watch the recording (quick registration required).

Here’s my quick summary of Chamith’s advice:

1. Automate everything: repetitive tasks not only are inefficient and mundane, and eat your time. When done manually they are unreliable. Humans tend to do things slightly differently each time they do them, or not do them at all.

2. Tasks are really parts of processes: when you come up with something that needs to be done, ask yourself what is the process flow for this task? For example, a data backup is really a part of a process that includes:

  • Scheduled (e.g. at 1 a.m. every day) script which creates a backup,
  • Some sort of monitoring system which verifies that the script ran and the backup got created,
  • Notifications on failures and procedures that need to be followed in not,
  • Backup testing: automated and/or regular manual recovery drills (if manual then documented and performed by different team members).

3. Design for failure: everything will be failing so make sure that your system can sustain the failures. For example, if your system uses multiple virtual machines in the cloud, keep running a “chaos monkey” script which keeps randomly killing the instances and automated tests which ensure that these instance failures do not affect the overall system (by the way, see how Netflix does that.)

4. Self-healing and success verification are critical for all tasks. Any task and operation can fail (see above) so the system should not get “surprised” but should always automatically validate the action results and if something didn’t go right – implement the healing procedures (start new instances, retry, and so on).

5. Enforce discipline, processes, automation, checklists. Document everything. This will make your processes repeatable and reliable.

Bus monkey test” (related to the above) if one of your team members gets hit by a bus – all operations should keep working: everything needs to be documented and tried by other team members. (* This is a mental experiment – do not actually hit your team-members by busses :))

6. Monitoring and analytics: the key is not to collect and show tons of data and alerts, but be able to quickly detect abnormal behavior.

7. Communications: your dashboards should quickly and clearly give you the big picture and relevant details. Key metrics and system state should be something that everyone sees and understands, effective drill-downs should make it easy to understand and fix stuff.

8. Agile delivery: waterfall processes in the cloud are bad and stressful.The smaller the changes and the more often and in more automated fashion they are – the more mundane they become: which lowers the risks and improves the skills and reliability. Cloud and big-bang releases do not go well together.

9. Use standard tools and native systems of underlying platforms – do not reinvent the wheels. For example, if the platform gives you SQL-as-a-service (Amazon RDS, Azure SQL and so on) – use those and not your own MySQL running on a virtual machine.

10. Post-mortem analysis is a must. If something did get wrong after all, you need a formal investigation process:

  • What happened?
  • Why and what needs to be done to prevent this in the future?
  • If automated monitoring didn’t catch it, why and what needs to be done to prevent this in the future?
  • If validation and self-healing didn’t catch it, why and what needs to be done to prevent this?

Full session recording and slides are available here.

Read the original blog entry...

More Stories By Dmitry Sotnikov

Dmitry Sotnikov is VP of Cloud at WSO2, building the cloud business for this leading middleware provider. Check out the WSO2 Cloud platform at http://CloudPreview.WSO2.com

@DevOpsSummit Stories
Dion Hinchcliffe is an internationally recognized digital expert, bestselling book author, frequent keynote speaker, analyst, futurist, and transformation expert based in Washington, DC. He is currently Chief Strategy Officer at the industry-leading digital strategy and online community solutions firm, 7Summits.
Addteq is a leader in providing business solutions to Enterprise clients. Addteq has been in the business for more than 10 years. Through the use of DevOps automation, Addteq strives on creating innovative solutions to solve business processes. Clients depend on Addteq to modernize the software delivery process by providing Atlassian solutions, create custom add-ons, conduct training, offer hosting, perform DevOps services, and provide overall support services.
Contino is a global technical consultancy that helps highly-regulated enterprises transform faster, modernizing their way of working through DevOps and cloud computing. They focus on building capability and assisting our clients to in-source strategic technology capability so they get to market quickly and build their own innovation engine.
The standardization of container runtimes and images has sparked the creation of an almost overwhelming number of new open source projects that build on and otherwise work with these specifications. Of course, there's Kubernetes, which orchestrates and manages collections of containers. It was one of the first and best-known examples of projects that make containers truly useful for production use. However, more recently, the container ecosystem has truly exploded. A service mesh like Istio addresses many of the challenges faced by developers and operators as monolithic applications transition towards a distributed microservice architecture. A tracing tool like Jaeger analyzes what's happening as a transaction moves through a distributed system. Monitoring software like Prometheus captures time-series events for real-time alerting and other uses. Grafeas and Kritis provide security polic...
DevOpsSUMMIT at CloudEXPO will expand the DevOps community, enable a wide sharing of knowledge, and educate delegates and technology providers alike. Recent research has shown that DevOps dramatically reduces development time, the amount of enterprise IT professionals put out fires, and support time generally. Time spent on infrastructure development is significantly increased, and DevOps practitioners report more software releases and higher quality. Sponsors of DevOpsSUMMIT at CloudEXPO will benefit from unmatched branding, profile building and lead generation opportunities.