


Seven Ways To Build A Robust Testing in Production Practice By @Neotys | @DevOpsSummit [#DevOps]

Testing in Production can be an extremely valuable tool in your QA arsenal, when used properly

7 Ways To Build A Robust Testing In Production (TiP) Practice

Here's a short anecdote: a QA team maintained a test environment separate from their production environment, and they had gone to great lengths to ensure it matched production in every respect. It used the same hardware, and all the software was identical down to the patch level. They had a strong process to ensure that any change made to one environment was mirrored in the other.

Perhaps you can see where this story is going. Sure enough, the environments turned out not to be identical. A BIOS setting differed on the servers, which caused the production machines to run slower than the test machines. As a result, the production system couldn't handle the volume of traffic the test system could, and one day the whole live site unexpectedly crashed under a load it was supposed to be able to manage.

What's the point of this story? Any difference between test and production environments, no matter how small, can have a serious impact.

And let's face it - test environments are almost never maintained with the same consistency as in our anecdote. Yet, overall, production systems largely behave the way we expect them to. Why?

One reason has been a growing practice of Testing in Production (TiP). Essentially, once code is released into the production environment, it is put through a battery of tests to make sure it works. These tests continue as part of ongoing operations, and we're alerted when there are problems. TiP is steadily becoming a standard and critical part of any modern web development organization.

TiP: Testing In Production

A little while ago I wrote about how QA fits into the DevOps culture. The basic idea of that post was that QA professionals' jobs are changing, especially when it comes to cloud or web-based apps. Instead of finding bugs in a particular release of software, the job of a tester is to be the guardian and steward of the entire development process, ensuring that defects are identified and removed before they get to the production environment.

That's why TiP is so important. It doesn't take the place of traditional testing; rather, it enhances it with a set of test procedures that simply make sense to run in the production environment. As the story above illustrates, it can be very difficult to create and maintain a test environment that's truly an exact clone of production - so much so that there is a class of tests that don't make sense to execute in any environment other than production.

TiP provides a structured way of conducting tests using the live site and real users - because for those tests, that is the only way of getting meaningful results.

There are a number of different types of TiP that any software tester should know about. Here's a summary of some of the most important ones.

Canary Testing

Back in the days before PETA, coal miners would bring a caged canary into the mines with them. If dangerous gas such as carbon monoxide built up in the shaft, the fragile canary would succumb before the humans did, giving the miners an early warning. Put simply: Dead Bird = Danger.

In TiP, Canary Testing refers to the process of deploying new code to a small subset of your production machines before releasing it widely. It's kind of like a smoke test for SaaS. If those machines continue to operate as expected against live traffic, it gives you confidence that there is no poisonous gas lurking, and you can greenlight a full deployment.
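The gating logic behind a canary release can be sketched in a few lines. This is a minimal illustration, not any particular platform's API: the function names, the 5% traffic split, and the error-rate tolerance are all assumptions chosen for the example.

```python
import random

CANARY_FRACTION = 0.05  # send roughly 5% of live traffic to the canary pool

def pick_pool(canary_fraction=CANARY_FRACTION):
    """Route an incoming request to the canary pool with small probability."""
    return "canary" if random.random() < canary_fraction else "stable"

def canary_healthy(canary_errors, canary_requests,
                   stable_errors, stable_requests, tolerance=1.5):
    """Greenlight the full rollout only if the canary's error rate is
    no worse than `tolerance` times the stable pool's error rate."""
    canary_rate = canary_errors / max(canary_requests, 1)
    stable_rate = stable_errors / max(stable_requests, 1)
    return canary_rate <= stable_rate * tolerance
```

In practice the comparison would also cover latency and resource metrics, but the shape is the same: divert a small slice of real traffic, compare the canary against the stable fleet, and only then deploy widely.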

Controlled Test Flight

In a Canary Test you are testing hardware, but in a Controlled Test Flight you are testing users. In this kind of TiP, you expose a select group of real users to software changes to see if they behave as expected. For example, let's say your release involves a change to your app's navigation structure. You've gone through your usability tests, but want to do a little better than that before everyone sees the change.

That's where a Controlled Test Flight comes in. You make the change but only expose it to a specific slice of your users. See how they behave. If things go as expected, you can roll the change out to the wider audience.
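Exposing a change to only a slice of users is usually done with a deterministic bucketing function, so the same user always sees the same experience. Here's a minimal sketch under that assumption; the function name and the hash-based scheme are illustrative, not a specific product's API.

```python
import hashlib

def in_test_flight(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically place a user in the test flight for `feature`.

    Hashing (feature + user_id) maps each user to a stable bucket 0-99,
    and gives each feature its own independent slice of users.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent
```

Because the bucketing is deterministic, a user who sees the new navigation today will still see it tomorrow, and widening the rollout from 10% to 50% only adds users without flipping anyone back.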

A/B Split Testing

Sometimes you aren't sure what users will prefer, and the only way to know is to observe their behavior. A/B Split Tests are very common in web-based apps because they are a great way to use behavioral data to make decisions. In this case, you develop two (or more) experiences - the "A" experience and the "B" experience - expose an equivalent set of users to each, and then measure the results.

A/B Testing is an incredibly powerful tool when used properly, because it truly allows a development organization to follow its users. It does involve more work and coordination, but the benefits can be substantial.
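The two halves of an A/B test - stable assignment and measuring the results - can be sketched as follows. This is a generic illustration, not the source's tooling: the hash-based 50/50 split and the rough two-proportion z-score (where |z| > 1.96 suggests a difference beyond chance at roughly 95% confidence) are standard techniques chosen for the example.

```python
import hashlib
from math import sqrt

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Stable even split: hash each user into one of the variants."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Rough z-score for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```

For example, 100 conversions out of 1,000 visitors for "A" versus 150 out of 1,000 for "B" gives a z-score above 1.96, so you'd have reasonable grounds to roll out "B".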

Synthetic User Testing

Synthetic user testing involves creating and monitoring fake users that interact with the real site. These users run predefined scripts that execute various functions and transactions within the web app. For example, they could visit the site, navigate to an e-commerce store, add some items to their cart, and check out. As the script executes, you track the synthetic user's performance metrics so you know what kind of end-user experience your real users are having.
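A synthetic transaction script is essentially a timed walk through the app. Here's a minimal sketch; the `client` object, its `.get`/`.post` interface, and the URL paths are hypothetical stand-ins for whatever HTTP client and routes your app actually uses.

```python
import time

def run_synthetic_transaction(client):
    """Walk a scripted purchase flow and record per-step latency.

    `client` is a hypothetical HTTP client exposing .get() and .post();
    each step's duration is captured so alerts can fire on regressions.
    """
    timings = {}
    steps = [
        ("visit_home", lambda: client.get("/")),
        ("open_store", lambda: client.get("/store")),
        ("add_to_cart", lambda: client.post("/cart", item="sku-123")),
        ("checkout", lambda: client.post("/checkout")),
    ]
    for name, action in steps:
        start = time.monotonic()
        action()
        timings[name] = time.monotonic() - start
    return timings
```

Run on a schedule (say, every minute from several geographic locations), the per-step timings become a continuous baseline of real-world user experience, independent of whether any real user happens to be checking out at that moment.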

Synthetic monitoring, using a product like NeoSense, is a key component of any website's application performance monitoring strategy.

Fault Injection

Here's an interesting, and perhaps unsettling, idea: create a problem in your production environment just to see how gracefully it's handled. That's the idea behind fault injection. You have built all this infrastructure to protect yourself from specific errors - you should actually test those protections.

Netflix is famous among testing circles for its Chaos Monkey routine. This is a service that will randomly shut down a virtual machine or terminate a process. It creates errors that the service is supposed to be able to handle, and in the process has drastically improved the reliability of the application. Plus, it keeps the operational staff on its toes.
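The core of a Chaos-Monkey-style tool fits in a few lines. This is a hedged sketch of the pattern, not Netflix's implementation: `terminate` stands in for whatever your platform uses to kill an instance (a cloud API call, a process signal), and the per-round probability is an arbitrary example value.

```python
import random

def chaos_round(instances, terminate, probability=0.1, rng=random):
    """One round of chaos: with the given probability, pick a random
    instance and terminate it, then let monitoring confirm the system
    absorbed the failure. Returns the victim, or None if spared."""
    if instances and rng.random() < probability:
        victim = rng.choice(instances)
        terminate(victim)
        return victim
    return None
```

The value isn't in the termination itself but in what follows: if the service degrades when one instance dies, you've found a resilience gap on your schedule rather than at 4am.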

Recovery Testing

As with fault injection, you want to know that your app and your organization can recover from a serious problem when called upon. Some procedures are rarely exercised in production, like failing over to a secondary site or restoring from a backup. Recovery testing exercises these processes.

Run fire drills for your app. Select a time when usage is low and put your environment through the paces that it is supposedly designed to handle. Make sure that your technology and your people are able to handle real problems in a controlled way, so you are confident they will be handled properly when it's truly a surprise.
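A fire drill is easier to learn from when each step is timed. Here's a minimal sketch of a drill runner; the step names and the idea of reporting a total recovery time (your effective RTO for the drill) are illustrative assumptions.

```python
import time

def run_drill(steps):
    """Run a recovery fire drill: execute each (name, action) step in
    order, time it, and report per-step and total recovery time."""
    report = {}
    start = time.monotonic()
    for name, action in steps:
        step_start = time.monotonic()
        action()  # e.g. trigger failover, restore a backup, verify health
        report[name] = time.monotonic() - step_start
    report["total_recovery_time"] = time.monotonic() - start
    return report
```

Comparing the recorded times against your stated recovery objectives, drill after drill, turns "we think we can fail over" into a measured, rehearsed fact.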

Data Driven Quality

Finally - and this may go without saying - put in place systems that will help your QA team receive and review operational data to measure quality. Make sure that testers have access to logs, performance metrics, alerts, and other information from the production environment, so they can be proactive in identifying and fixing problems.
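Even a simple summary over production logs gives testers a quality signal to act on. The sketch below assumes a hypothetical space-delimited log format where the HTTP status code is the fourth field; adapt the parsing to whatever your servers actually emit.

```python
from collections import Counter

def status_summary(log_lines):
    """Tally HTTP status classes (2xx, 4xx, 5xx, ...) from log lines.

    Assumes a simple hypothetical format like:
        2019-01-10T12:00:00 GET /cart 500 80ms
    where the status code is the 4th whitespace-separated field.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 4 and fields[3].isdigit():
            counts[fields[3][0] + "xx"] += 1
    return counts

def server_error_rate(counts):
    """Fraction of responses that were server errors (5xx)."""
    total = sum(counts.values())
    return counts.get("5xx", 0) / total if total else 0.0
```

Tracked over time and wired into alerts, a number as simple as the 5xx rate lets QA spot a quality regression in production before users start filing complaints.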


Testing in Production can be an extremely valuable tool in your QA arsenal when used properly. Sure, there are always risks to testing with live users, but let's face it - there are risks to NOT testing with live users as well. If you build the right procedures, TiP can deliver a huge boost to your app's overall quality.

If you want to learn more, check out our post Tips for Testing in Production.

More Stories By Tim Hinds

Tim Hinds is the Product Marketing Manager for NeoLoad at Neotys. He has a background in Agile software development, Scrum, Kanban, Continuous Integration, Continuous Delivery, and Continuous Testing practices.

Previously, Tim was Product Marketing Manager at AccuRev, a company acquired by Micro Focus, where he worked with software configuration management, issue tracking, Agile project management, continuous integration, workflow automation, and distributed version control systems.
