Performance Monitor All Your Apps By @Dynatrace | @DevOpsSummit [#DevOps]

A solution for monitoring large IT infrastructures that contain several hundred components

How to Performance Monitor All Your Applications on a Single Dashboard

It's become easy to monitor applications that are deployed on hundreds of servers, thanks to advances in application performance management tools. But the more data you collect, the harder it is to visualize the health state so that a single dashboard shows you both the overall status and the problematic component.

Eugene Turetsky (Dynatrace) and Stephan Levesque (SSQ Financial Group) shared their solution for monitoring large IT infrastructures that contain several hundred components. These components support SSQ's most critical applications, which run on a variety of technology stacks including WebLogic, Oracle Databases, Ingres Databases, and WebSphere MQs. When Stephan showed me his SSQ dashboards, I knew I had to write a blog post about this.

Stephan agreed to share these details with a larger audience - and eventually to upload the plugins that Eugene Turetsky designed, developed and built for this to our Dynatrace GitHub Organization. All Dynatrace dashboards are designed for a wide audience - from senior management to the support engineering teams responsible for maintaining the health of specific components. For example, the following screenshot shows one of SSQ's dashboards: application health is arranged vertically; cluster, server and component health horizontally. The names of the apps and servers are sanitized for privacy reasons:

Each dot represents the health status of a component, aggregated to a cluster or an individual server and then aggregated on an application level. If an app goes red or yellow, it's easy to spot which component is causing it.

Stephan and his colleagues read this dashboard from top left to bottom right: the big red dot in the top left means that at least one of the applications is unhealthy. Spotting which apps are unhealthy is easy - just look for red. On those application rows it's easy to find the red dots that show which component (Web Server, App Server, Message Queue, etc.) the root cause analysis should focus on.

Let's look a little deeper into how he calculates the health status of each individual component and how he aggregates the data so that you can rebuild this for your own environment in case you find this useful:

Health Status of Components
A component can be an application server, a database, a message queue or a device such as a Load Balancer. Stephan uses Dynatrace to monitor each component and has one or more metrics for each component that tell him whether it's healthy or not. Here are some examples:

  • Application: Application status is red if one or more clusters or individual unclustered components are red. Application status is yellow (degraded) if some of an application's clustered components (i.e., nodes) are down but surviving nodes in the cluster can manage the application load. Otherwise the application status is green.
  • WebLogic: If all clustered WebLogic components are down (i.e., cluster is down) then the status of WebLogic is red. If some nodes in the cluster are down but surviving nodes can manage the application load, the status of WebLogic is yellow. Otherwise the status of WebLogic is green.
  • Database: If all clustered database components are down (i.e., cluster is down) then the status of the database is red. If some nodes in the cluster are down but surviving nodes can manage the application load, the status of the database is yellow. Otherwise the status of the database is green.
  • MQ: If all clustered MQ components are down (i.e., cluster is down) then the status of MQ is red. If some nodes in the cluster are down but surviving nodes can manage the application load, the status of MQ is yellow. Otherwise the status of MQ is green.
  • Dynatrace agents: The state, or availability, of the Dynatrace agents is also monitored. If a critical agent is unavailable, an alert will be triggered and a red dot will be shown.
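
The cluster rules above can be sketched in a few lines of Python. This is only an illustration, not SSQ's or Dynatrace's implementation: the names Health and cluster_health, and the min_nodes_for_load parameter (how many nodes are assumed to be enough to carry the load), are hypothetical. The article also does not say what happens when some nodes are down and the survivors cannot carry the load; the sketch assumes that case is red.

```python
from enum import IntEnum
from typing import Sequence


class Health(IntEnum):
    """Health states ordered by severity (hypothetical names)."""
    GREEN = 0
    YELLOW = 1
    RED = 2


def cluster_health(nodes_up: Sequence[bool], min_nodes_for_load: int = 1) -> Health:
    """Derive a cluster's status from the up/down state of its nodes.

    min_nodes_for_load is an assumed parameter: the number of nodes
    considered sufficient to manage the application load.
    """
    if not nodes_up:
        raise ValueError("a cluster needs at least one node")
    up = sum(1 for node in nodes_up if node)
    if up == len(nodes_up):
        return Health.GREEN   # every node is up
    if up == 0:
        return Health.RED     # the whole cluster is down
    if up >= min_nodes_for_load:
        return Health.YELLOW  # degraded, but survivors carry the load
    return Health.RED         # assumption: survivors cannot carry the load
```

For example, cluster_health([True, False, True]) reports a degraded (yellow) cluster, while the same cluster with min_nodes_for_load=3 would be treated as red under the assumption stated above.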

Whether you use Dynatrace or another APM tool, make sure you capture both system metrics, such as Availability, CPU, and Memory, and performance-relevant metrics, such as Response Time, and combine these metrics into your health states.
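
One possible way to combine such metrics into a single health state is a simple threshold check, sketched below. Everything here is an assumption made for illustration: the Sample fields, the threshold values and the health_state function are hypothetical and are not taken from Dynatrace or from SSQ's setup.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One measurement of a component; field names are illustrative."""
    available: bool
    cpu_percent: float
    memory_percent: float
    response_time_ms: float


# Hypothetical (warning, critical) thresholds; tune them per component.
THRESHOLDS = {
    "cpu_percent": (80.0, 95.0),
    "memory_percent": (80.0, 95.0),
    "response_time_ms": (500.0, 2000.0),
}


def health_state(sample: Sample) -> str:
    """Return 'red', 'yellow' or 'green' for a single sample."""
    if not sample.available:
        return "red"  # unavailable is unhealthy regardless of other metrics
    state = "green"
    for metric, (warning, critical) in THRESHOLDS.items():
        value = getattr(sample, metric)
        if value >= critical:
            return "red"      # any critical metric makes the component red
        if value >= warning:
            state = "yellow"  # remember the warning, keep checking for critical
    return state
```

The point of the sketch is that availability alone is not enough: a component that is up but answers slowly still shows as yellow or red.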

Aggregating Performance Data from Component to Server to Application
Besides monitoring the health of each component individually, the dashboard also aggregates data "upwards." Stephan calculates an overall health state per component type, e.g., overall WebLogic health in the cluster is calculated based on the states of each individual WebLogic instance. The overall Application Health is then calculated from the application's availability as well as the aggregated state of all supporting components. The final overall system health shows whether any application is currently suffering an issue. The following screenshot shows how this works in a simple example.

Health States get aggregated to Health Groups which eventually end up being aggregated to the Application and the Overall System Status
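
The upward aggregation from health groups to application to overall system status can be sketched as a "worst state wins" roll-up. This is one reading of the description above, not the plugin's actual logic; the function names and the group and application names used in the usage note are hypothetical. The group states passed in are assumed to be already computed per component type.

```python
from enum import IntEnum
from typing import Dict, Iterable


class Health(IntEnum):
    """Health states ordered by severity so max() picks the worst one."""
    GREEN = 0
    YELLOW = 1
    RED = 2


def worst(states: Iterable[Health]) -> Health:
    """Aggregate by taking the most severe state; an empty group counts as green."""
    return max(states, default=Health.GREEN)


def application_health(availability: Health, groups: Dict[str, Health]) -> Health:
    """Combine the application's own availability with its health groups."""
    return worst([availability, *groups.values()])


def system_health(applications: Dict[str, Health]) -> Health:
    """Overall status: is any application currently suffering an issue?"""
    return worst(applications.values())
```

With a hypothetical application "app-a" whose WebLogic group is yellow and "app-b" whose Database group is red, application_health yields yellow and red respectively, and system_health over both yields red, which corresponds to the big dot in the top left of the dashboard.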

More Stories By Andreas Grabner

Andreas Grabner has been helping companies improve their application performance for 15+ years. He is a regular contributor within Web Performance and DevOps communities and a prolific speaker at user groups and conferences around the world. Reach him at @grabnerandi
