The worlds of space mission operations and SaaS businesses are converging

DevOps and Apollo Mission Control
By Todd Vernon

Lately I have been reading the excellent book Digital Apollo. It explores the evolution of digital control systems and the man-machine interface that evolved during the development of space flight and ultimately the Apollo missions. It’s a fantastic book, more technical than most, but very approachable to those not familiar with flight control, embedded software, or the challenges of building such systems. As I read the book, I could not help but compare the way space missions were executed with the role of DevOps in modern SaaS businesses.

I started my career at NASA testing digital flight controls for an experimental aircraft, the X-29. The X-29 flight test program was just the latest in a series of one-off aircraft that started with the Bell X-1 and moved on to the X-15, which laid the groundwork for Apollo. As a result, flight test was executed in a very similar fashion in all these programs. Nearly every switch, surface, actuator, and probe was instrumented, and that data was downlinked in real time to a control room as the airplane or spacecraft flew.

As the vehicles became more fly-by-wire and had digital computers at their core, those computers also downlinked many of their internal state variables to the ground. There, teams of engineers could keep track of every button push, flight mode, and acceleration in real time, helping the pilot look for anything that could end the mission or end his life.

The worlds of space mission operations and SaaS businesses are converging. While generally no one dies if your SaaS service fails to operate, the implications of downtime get more real every year. If you operate a platform that serves customers who collectively pay millions of dollars a day for your product or service, that is serious business.

Like state variables downlinked from Apollo, we now watch the equivalent using tools like New Relic as our systems support millions or billions of customer transactions through the services we have built. While Apollo’s AGC had to work for several hundred hours at a time, our SaaS services get turned on once when we launch our company and the mission goes on forever. As a result, we are replacing rooms of engineers there for days with systems that connect them to the technology all the time.

Modern monitoring tools are starting to approach the quality of observation we had back at NASA for immediacy of data, and at the same time they now far surpass those relatively crude tools for spontaneity of exploration and discovery. Today, I get an alert on my iPhone when some part of our system is acting inconsistently, and I can interact with our engineers in real time regardless of location.

Like the rooms of engineers that supported an Apollo mission, today we are on the verge of supporting our complex systems with a virtual room of engineers using tools like VictorOps. As systems become more complex, it becomes more likely that a problem needs to be solved by the person who wrote the code in the first place. Very often only that person has (or ever had) knowledge of the system intimate enough to fix it or work around it and keep the mission (the business) functioning.

On Apollo 14, while the spacecraft was in lunar orbit, an engineer noticed that a software bit buried deep in the guidance and navigation computer inside the Lunar Module (LM) indicated that a descent abort had been initiated. The cause was a loose bead of solder that effectively kept “pushing” the abort button. It was not yet a problem, or even noticed by the crew, because the descent program that would land the astronauts on the moon was not running yet.

Had that program been initiated, as was scheduled only minutes later, the mission would have been aborted and quite likely the crew would have been lost. That engineer's knowledge of how the system would misbehave mattered only because advanced monitoring provided a connection to the software, along with the ability to act on that data in real time. Remove any part of that equation and Apollo 14 would have turned out very differently.
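
For readers who think in code, the Apollo 14 catch can be loosely sketched as a check on downlinked state: look for an abort flag that is already raised before the descent program starts. This is only an illustration in Python under invented assumptions. The `Frame` layout, the `ABORT_BIT` position, and the function names are hypothetical and have nothing to do with the real Apollo Guidance Computer software.

```python
from dataclasses import dataclass

# Hypothetical bit position for the abort flag, chosen for illustration only.
ABORT_BIT = 0x01


@dataclass
class Frame:
    """One downlinked telemetry sample (hypothetical layout)."""
    status_word: int       # packed status flags
    descent_running: bool  # True once the landing program has started


def abort_flag_set(frame: Frame) -> bool:
    """Return True if the abort bit is raised in this frame."""
    return bool(frame.status_word & ABORT_BIT)


def check_before_descent(frames):
    """Return a warning for the first frame that shows the abort flag
    raised while the descent program is not yet running, else None."""
    for index, frame in enumerate(frames):
        if abort_flag_set(frame) and not frame.descent_running:
            return f"frame {index}: abort flag set before descent start"
    return None


# Example: the flag flickers on in the second frame, before descent begins.
frames = [Frame(0x00, False), Frame(0x01, False), Frame(0x00, False)]
print(check_before_descent(frames))
```

The point of the sketch is the ordering: the flag is harmless while the program is idle, so the window for catching it is before the program that reads it is started.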

This is the basic DNA of how we look at our product at VictorOps. We connect engineers to the mission-critical machines that run your business. If you expect the unexpected and outfit your teams accordingly, you can be ready to respond to any problem faster and more accurately than your competition.

The post DevOps and Apollo Mission Control appeared first on VictorOps.
