Welcome!

DevOps Journal Authors: Trevor Parsons, Carmen Gonzalez, Roger Strukhoff, Jackie Kahle, Lori MacVittie

Related Topics: DevOps Journal, Java, Linux, Web 2.0, Cloud Expo, Big Data Journal

DevOps Journal: Article

How to Approach Application Failures in Production

The wealth of out-of-the-box insights you could obtain from a single urgent, albeit unspecific log message

In my recent article, "Software Quality Metrics for your Continuous Delivery Pipeline - Part III - Logging," I wrote about the good parts and the not-so-good parts of logging and concluded that logging usually fails to deliver what it is so often mistakenly used for: as a mechanism for analyzing application failures in production. In response to the heated debates on reddit.com/r/devops and reddit.com/r/programing, I want to demonstrate the wealth of out-of-the-box insights you could obtain from a single urgent, albeit unspecific log message if you only are equipped with the magic ingredient; full transaction context:

Examples of insights you could obtain from full transaction context on a single log message

Bear with me until I get to explain what this actually means and how it helps you get almost immediate answers to the most urgent questions when your users are struck by an application failure:

  • "How many users are affected and who are they?"
  • "Which tiers are affected by which errors and what is the root cause?"

Operator: I'm here because you broke something. (courtesy of ThinkGeek.com)

When All You Have Is a Lousy Log Message
Does this story sound familiar to you? It's a Friday afternoon and you just received the release artifacts from the development team belatedly, which need to be released by Monday morning. After spending the night and another day in operations to get this release out into production timely, you notice the Monday after that everything you have achieved in the end was some lousy log message:

08:55:26 SEVERE com.company.product.login.LoginLogic - LoginException occurred when processing Login transaction

While this scenario hopefully does not reflect a common case for you, it still shows an important aspect in the life of development and operations: working as an operator involves monitoring the production environment and providing assistance in troubleshooting application failures mainly with the help of log messages - things that developers have baked into their code. While certainly not all log messages need to be as poor as this one, getting down to the bottom of a production failure is often a tedious endeavor (see this comment on reddit by RecklessKelly who sometimes needs weeks to get his "Eureka moment") - if at all possible.

Why There Is No Such Thing as a 100% Error-Free Code
Production failures can become a major pain for your business with long-term effects: they will not only make your visitors buy elsewhere, but depending on the level of frustrations, your customers may choose to stay at your competition instead of giving you another chance.

As we all know, we just cannot get rid of application failures in production entirely. Agile methodologies, such as Extreme Programming or Scrum, aim to build quality into our processes; however, there is still no such thing as a 100% error-free application. "We need to write more tests!" you may argue and I would agree: disciplines such as TDD and ATDD should be an integral part of your software development process since they, if applied correctly, help you produce better code and fewer bugs. Still, it is simply impossible to test each and every corner of your application for all possible combinations of input parameters and application state. Essentially, we can run only a limited subset of all possible test scenarios. The common goal of developers and test automation engineers, hence, must be to implement a testing strategy, which allows them to deliver code of sufficient quality. Consequently, there is always a chance that something can go wrong, and, as a serious business, you will want to be prepared for the unpredictable and, additionally, have as much control over it as possible:

Why you cannot get rid of application failures in production: remaining failure probability

Without further ado, let's examine some precious out-of-the-box insights you could obtain if you are equipped with full transaction context and are able to capture all transactions.

Why this is important? Because it enables you to see the contributions of input parameters, processes, infrastructure and users at all times whenever a failure occurred, solve problems faster, and additionally use the presented information such as unexpected input parameters to further improve your testing strategy.

Initial Situation: Aggregated Log Messages
Instead of crawling a bunch of possibly distributed log files to determine the count of particular log messages, we may, first of all, want to have this done automatically for us just as they happen. This gives a good overview on the respective message frequencies and facilitates prioritization:

Aggregated log events: severity, logger name, message and count

What we see here (analysis view based on our PurePath technology) is that there have been 104 occurrences of the same log message in the application. We could also observe other captured event data, such as the severity level and the name of the logger instance (usually the name of the class that created the logger).

Question #1: How many users are affected and who are they?

Failed Business Transactions: "Logins" and "Logins by Username"

Having the full transactional context and not just the log message allows us to figure out which critical Business Transactions of our application are impacted. From the dashboard above we can observe that "Logins" and "Logins by Username" have failed: we see that 61 users attempted the 104 logins and who these users were by their username.

For questions 2 and 3, and for further insight, click here for the full article.

More Stories By Martin Etmajer

Martin Etmajer has several years of experience as a Software Architect as well as in monitoring and managing performance in highly available cluster environments. In his current role as a Technology Strategist at the Compuware APM Center of Excellence, he contributes to the strategic development of Compuware’s dynaTrace APM solution, where he focuses on performance monitoring in Cloud technologies and along the Continuous Delivery deployment pipeline. Reach him at @metmajer

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@DevOpsSummit Stories
DevOps Summit at Cloud Expo Silicon Valley announced today a limited time free "Expo Plus" registration option. On site registration price of $1,95 will be set to 'free' for delegates who register during this offer perios. To take advantage of this opportunity, attendees can use the coupon code, and secure their registration to attend all keynotes, DevOps Summit sessions at Cloud Expo, expo floor, and SYS-CON.tv power panels. Registration page is located at the DevOps Summit site.
The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long development cycles that produce software that is obsolete at launch. DevOps may be disruptive, but it is essential. The DevOps Summit at Cloud Expo--to be held November 4-6 at the Santa Clara Convention Center in the heart of Silicon Valley--will expand the DevO...
We had three quick questions for Mike Kail, and he had three quick answers. Mike is SVP of Infrastructure at Yahoo!, and formerly VP of IT Operations at Netflix. He'll be speaking at @DevOpsSummit about his experiences in integrating DevOps on a big scale in big-scale projects. Here's what we asked and what he said: DevOps Journal: You mention “eventual consistency” as a goal. Is there a timeframe? Mike Kail: It is really a strategy for successful transformation instead of a strict ...
The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long development cycles that produce software that is obsolete at launch. DevOps may be disruptive, but it is essential. The DevOps Summit at Cloud Expo--to be held November 4-6 at the Santa Clara Convention Center in the heart of Silicon Valley--will expand the DevO...
Having just joined a large technology company with 20 years of history, it would be suicidal to believe that I can immediately move the entire organization to the DevOps mindset and model. For those not familiar with the term, “Eventual Consistency” is a model used in distributed computing to ensure high availability. In this context, it’s a model for replicating best practices and automation across IT teams and business units. The logical place to start with automation is the on-boarding of a ...
Software is eating the world. Companies that were not previously in the technology space now find themselves competing with Google and Amazon on speed of innovation. As the innovation cycle accelerates, companies must embrace rapid and constant change to both applications and their infrastructure, and find a way to deliver speed and agility of development without sacrificing reliability or efficiency of operations. In her keynote DevOps Summit, Victoria Livschitz, CEO of Qubell, will discuss ho...
DevOps Summit at Cloud Expo Silicon Valley announced today a limited time free "Expo Plus" registration option through September. On site registration price of $1,95 will be set to 'free' for delegates who register during special offer. To take advantage of this opportunity, attendees can use the coupon code, and secure their registration to attend all keynotes, DevOps Summit sessions at Cloud Expo, expo floor, and SYS-CON.tv power panels. Registration page is located at the DevOps Summit site. ...
Despite the fact that majority of developers firmly believe that “it worked on my laptop” is a poor excuse for production failures, most don’t truly understand why it is virtually impossible to make your development environment representative of production. When asked, the primary reason for the production/development difference everyone mentions is technology stack spec/configuration differences. While it’s true, thanks to the black magic of Cloud (capitalization intended) with a bit of wizard...
SYS-CON Events announced today that AppDynamics will exhibit at DevOps Summit Silicon Valley, which will take place November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. Digital businesses like yours need a way to turn data into actual results. AppDynamics is ushering in the next digital age – the age of the software-defined business. AppDynamics’ mission is to deliver true application intelligence that helps your software-defined business run faster, leaner, and more ef...

BOULDER, Colo., Sept. 24, 2014 /PRNewswire/ -- VictorOps, the leading collaboration and incident management platform for DevOps teams, is hosting a webinar that will discuss how to take full advantage of your project post-mortems with or without a template.

DevOps Journal: Cloud, Big Data, and the IoT all carry disruption within enterprise IT. The same goes with DevOps. Which of these is the major disruptor, in your opinion? Andi Mann: It may well be cloud, because it fundamentally enables all the rest. Cloud scale is why we are now considering Big Data; cloud connectivity is a key enabler of IoT; cloud agility has enabled DevOps to take hold. But in the end, the cloud is “just” a platform, while the results of DevOps speak for themselves--l...
These days, implementing automatic deployment for .NET web projects is easier than ever. Drastic improvements started in Visual Studio 2010 when basic deployment strategies and tools were incorporated into VS itself. Yet, documentation was quite poor at that time, so you had to scour the Internet to find good tutorials in blogs or conference videos. Things have been constantly improving since then; now, we have even more functionality available out-of-the-box and documentation provided in a way ...
Azul Systems Inc. (Azul), the award-winning leader in Java runtime solutions, today announced that its OpenJDK-based Zulu 8 offering is now freely available on Docker. Zulu 8 is a 100% open source, fully tested, compatibility verified, and trusted binary distribution of the OpenJDK 8 platform. Azul has also made Zulu versions compliant with earlier Java SE 7 and Java SE 6 standards available on Docker in the same format.
Founded in 1997, ActiveState is a global leader providing software application development and management solutions. The Company's products include: Stackato, a commercially supported Platform-as-a-Service (PaaS) that harnesses open source technologies such as Cloud Foundry and Docker; dynamic language distributions ActivePerl, ActivePython and ActiveTcl; and developer tools such as the popular Komodo Edit and Komodo IDE. Headquartered in Vancouver, Canada, ActiveState is trusted by customers an...
DevOps Summit at Cloud Expo Silicon Valley announced today a limited time free "Expo Plus" registration option. On site registration price of $1,95 will be set to 'free' for delegates who register during this offer perios. To take advantage of this opportunity, attendees can use the coupon code, and secure their registration to attend all keynotes, DevOps Summit sessions at Cloud Expo, expo floor, and SYS-CON.tv power panels. Registration page is located at the DevOps Summit site.
Leading provider of Continuous Delivery and DevOps software XebiaLabs today announced enhanced integration between Puppet and XebiaLabs' XL Deploy, the deployment automation solution that supports DevOps and Continuous Delivery teams. XL Deploy in combination with Puppet means one seamless automation process to deploy your apps.
PagerDuty, the leader in operations performance management, announced the public release of its Advanced Analytics tools, which provide insights IT teams can use to improve team and system performance. Leveraging PagerDuty’s robust data on key operational metrics like incident frequency and time to respond and resolve, companies can now drive even faster incident resolution. The new capabilities further expand PagerDuty’s operations performance platform by giving managers the ability to analy...
In today's application economy, enterprise organizations realize that it's their applications that are the heart and soul of their business. If their application users have a bad experience, their revenue and reputation are at stake. In his session at 15th Cloud Expo, Anand Akela, Senior Director of Product Marketing for Application Performance Management at CA Technologies, will discuss how a user-centric Application Performance Management solution can help inspire your users with every appli...
SYS-CON Events announced today that Serena Software will exhibit at DevOps Summit Silicon Valley, which will take place on November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. Serena Software supports DevOps and Continuous Delivery by providing application deployment automation and software release management solutions to replace slow and error-prone manual processes. 2,500 enterprises around the world trust Serena to help them develop and deploy better software.
Qubell, an innovator in application deployment and configuration management, empowers online companies to do what they have never been able to do before: put into consumers' hands innovative new features and services, almost as fast as they can conceive them, without sacrificing control, reliability or uptime. Qubell emerged from stealth in the summer of 2013 (see related press release) and announced that Kohl's completed its initial implementation (see press release). Founded by pioneers in ent...