10 Things We Forgot to Monitor

After publishing Ben’s blog post about “Memory Monitoring with LXC” we realized there is a lot of interest in articles about Monitoring. I got in touch with Jehiah Czebotar, Head of Engineering at bitly, and asked him if we could republish his blog post about things bitly forgot to monitor.

Jehiah originally published his blog post on the bitly engineering blog. You can find Jehiah on Twitter and on his personal page. Definitely check out his “Personal Annual Reports”.


There is always a set of standard metrics that are universally monitored (Disk Usage, Memory Usage, Load, Pings, etc). Beyond that, there are a lot of lessons that we’ve learned from operating our production systems that have helped shape the breadth of monitoring that we perform at bitly.

One of my favorite all-time tweets is from @DevOps_Borat

What follows is a small list of things we monitor at bitly that have grown out of those (sometimes painful!) experiences and, where possible, short snippets of the stories behind them.

1 – Fork Rate

We once had a problem where IPv6 was intentionally disabled on a box via options ipv6 disable=1 and alias ipv6 off in /etc/modprobe.conf. This caused a large issue for us: each time a new curl object was created, modprobe would spawn, checking net-pf-10 to evaluate IPv6 status. This fork bombed the box, and we eventually tracked it down by noticing that the processes counter in /proc/stat was increasing by several hundred per second. Normally you would expect a fork rate of only 1-10/sec on a production box with steady traffic.
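A fork rate check along these lines can be sketched in Python. This is only an illustration, not bitly's actual check: the function names and the 5-second sampling interval are assumptions, and it relies on the cumulative "processes" line that Linux exposes in /proc/stat.

```python
import time


def read_fork_count(stat_text):
    """Return the cumulative fork count from the text of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The line looks like "processes 123456" and counts forks since boot.
        if len(fields) == 2 and fields[0] == "processes":
            return int(fields[1])
    raise ValueError("no 'processes' line found")


def fork_rate(count_before, count_after, elapsed_seconds):
    """Forks per second between two samples of the counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (count_after - count_before) / elapsed_seconds


def sample_fork_rate(interval=5.0, path="/proc/stat"):
    """Sample the live counter twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = read_fork_count(f.read())
    start = time.monotonic()
    time.sleep(interval)
    with open(path) as f:
        after = read_fork_count(f.read())
    return fork_rate(before, after, time.monotonic() - start)
```

Against the numbers above, a result in the hundreds per second would be the kind of value worth alerting on, while 1-10/sec would be normal.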

2 – Flow Control Packets

TL;DR: If your network configuration honors flow control packets and isn’t configured to disable them, they can temporarily cause dropped traffic. (If this doesn’t sound like an outage, you need your head checked.)

$ /usr/sbin/ethtool -S eth0 | grep flow_control
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0

Note: Read this to understand how these flow control frames can cascade to switch-wide loss of connectivity if you use certain Broadcom NICs. You should also trend these metrics on your switch gear. While you’re at it, watch your dropped frames.
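To trend these counters rather than eyeball them, the ethtool output can be parsed and compared between runs. The sketch below is an assumption about how one might do that; the helper names are hypothetical, and counter names vary by NIC driver, so the "flow_control" substring match only fits output shaped like the example above.

```python
def parse_ethtool_stats(output):
    """Parse `ethtool -S` style "name: value" lines into a dict of ints."""
    stats = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip headers such as "NIC statistics:" that carry no number.
        if sep and value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats


def flow_control_deltas(previous, current):
    """Change in each flow control counter since the previous sample."""
    return {
        name: current[name] - previous.get(name, 0)
        for name in current
        if "flow_control" in name
    }
```

Any non-zero delta between two samples means the interface sent or received pause frames in that window.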

3 – Swap In/Out Rate

It’s common to check for swap usage above a threshold, but even when only a small amount of memory is swapped, it is the rate at which pages are swapped in and out that impacts performance, not the quantity. Monitoring the swap in/out rate is a much more direct check for that state.
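One way to measure that rate on Linux is to sample the cumulative pswpin and pswpout counters in /proc/vmstat and divide the difference by the elapsed time. The following is a minimal sketch under that assumption; the function names are hypothetical.

```python
def read_swap_counters(vmstat_text):
    """Extract cumulative pages swapped in/out from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in ("pswpin", "pswpout"):
            counters[fields[0]] = int(fields[1])
    return counters


def swap_rates(before, after, elapsed_seconds):
    """Pages per second swapped in and out between two samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return {
        name: (after[name] - before[name]) / elapsed_seconds
        for name in ("pswpin", "pswpout")
    }
```

A box with a lot of swap in use but rates near zero is usually fine; a sustained non-zero rate is the state this check is meant to catch.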

4 – Server Boot Notification

Unexpected reboots are part of life. Do you know when they happen on your hosts? Most people don’t. We use a simple init script that triggers an ops email on system boot. This is valuable for communicating the provisioning of new servers, and it helps capture state changes even if services handle the failure gracefully without alerting.
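The script itself is not shown, so the following is only a guess at what such a boot notifier could look like if written in Python and called from an init script. The addresses, the subject format, and the use of a local SMTP relay on localhost are all assumptions.

```python
import smtplib
import socket
from email.message import EmailMessage


def read_uptime_seconds(path="/proc/uptime"):
    """Seconds since boot, from the first field of /proc/uptime (Linux)."""
    with open(path) as f:
        return float(f.read().split()[0])


def build_boot_message(hostname, uptime_seconds, sender, recipient):
    """Compose the notification; the subject format is an arbitrary choice."""
    msg = EmailMessage()
    msg["Subject"] = f"[boot] {hostname} came up"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"{hostname} booted about {uptime_seconds:.0f} seconds ago.\n")
    return msg


def send_boot_notification(sender, recipient, smtp_host="localhost"):
    """Send the message through an SMTP relay assumed to listen on smtp_host."""
    msg = build_boot_message(
        socket.gethostname(), read_uptime_seconds(), sender, recipient
    )
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```

Calling send_boot_notification once at the end of the boot sequence would produce one email per boot, whether the boot was a planned provisioning or an unexpected restart.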

5 – NTP Clock Offset

If you don’t monitor it, then yes, one of your servers is probably off. If you’ve never thought about clock skew, you might not even be running ntpd on your servers. Generally there are three things to check for: 1) that ntpd is running, 2) clock skew inside your datacenter, and 3) clock skew from your master time servers to an external source.

We use check_ntp_time for this check.
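For readers who want to see what such an offset check computes, here is a small sketch of the standard NTP offset calculation from the four timestamps of one request/response exchange, plus a threshold classification. It is not the check_ntp_time implementation, and the 0.5 and 1.0 second thresholds are placeholder assumptions.

```python
def ntp_offset(t1, t2, t3, t4):
    """Estimated clock offset in seconds from one NTP exchange.

    t1: client send time      (client clock)
    t2: server receive time   (server clock)
    t3: server transmit time  (server clock)
    t4: client receive time   (client clock)
    """
    return ((t2 - t1) + (t3 - t4)) / 2.0


def classify_offset(offset_seconds, warn=0.5, crit=1.0):
    """Map an offset to a Nagios-style status; thresholds are assumptions."""
    magnitude = abs(offset_seconds)
    if magnitude >= crit:
        return "CRITICAL"
    if magnitude >= warn:
        return "WARNING"
    return "OK"
```

The sign of the offset shows whether the local clock is behind (positive) or ahead (negative) of the server, which is why the classification uses the absolute value.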

6 – DNS Resolutions

Internal DNS – It’s a hidden part of your infrastructure that you rely on more than you realize. The things to check for are 1) Local resolutions from each server, 2) If you have local DNS servers in your datacenter, you want to check resolution, and quantity of queries, 3) Check availability of each upstream DNS resolver you use.

External DNS – It’s good to verify your external domains resolve correctly against each of your published external nameservers. At bitly we also rely on several ccTLDs, and we monitor those authoritative servers directly as well (yes, it’s happened that all authoritative nameservers for a TLD have been offline).
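The first of these checks, local resolution from each server, can be sketched with nothing but the standard library. The function name and result layout below are assumptions. The sketch only exercises the system resolver; querying a specific upstream or authoritative nameserver would need a DNS client library, which is not shown here.

```python
import socket
import time


def check_resolution(name, resolver=socket.gethostbyname, clock=time.monotonic):
    """Resolve `name` through the system resolver and time the lookup.

    `resolver` and `clock` are injectable so the check can be exercised
    without touching the network.
    """
    start = clock()
    try:
        address = resolver(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return {
            "name": name,
            "ok": False,
            "address": None,
            "error": str(exc),
            "seconds": clock() - start,
        }
    return {
        "name": name,
        "ok": True,
        "address": address,
        "error": None,
        "seconds": clock() - start,
    }
```

Recording the elapsed time alongside success or failure makes slow resolvers visible before they start failing outright.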

7 – SSL Expiration

It’s the thing everyone forgets about because it happens so infrequently. The fix is easy: check it, and get alerted with enough lead time to renew your SSL certificates.

define command{
    command_name    check_ssl_expire
    command_line    $USER1$/check_http --ssl -C 14 -H $ARG1$
}
define service{
    host_name               virtual
    service_description     bitly_com_ssl_expiration
    use                     generic-service
    check_command           check_ssl_expire!bitly.com
    contact_groups          email_only
    normal_check_interval   720
    retry_check_interval    10
    notification_interval   720
}
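If you are not running Nagios, the same idea can be approximated with Python's standard ssl module. This is a sketch rather than a replacement for check_http; the function names are hypothetical, and the 14-day threshold in the usage note simply mirrors the -C 14 in the command above.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def fetch_not_after(hostname, port=443, timeout=10.0):
    """Connect to the host and return its certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

A check would then alert when days_until_expiry(fetch_not_after("bitly.com")) drops below 14. Note that an already expired certificate fails the TLS handshake in fetch_not_after, so that exception should be treated as critical too.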

8 – DELL OpenManage Server Administrator (OMSA)

We run bitly split across two data centers: one is a managed environment with DELL hardware, and the second is Amazon EC2. For our DELL hardware it’s important for us to monitor the outputs from OMSA. This alerts us to RAID status, failed disks (predictive or hard failures), RAM issues, power supply states, and more.

9 – Connection Limits

You probably run things like memcached and mysql with connection limits, but do you monitor how close you are to those limits as you scale out application tiers?

Related to this is addressing the issue of processes running into file descriptor limits. We make a regular practice of running services with ulimit -n 65535 in our run scripts to minimize this. We also set Nginx worker_rlimit_nofile.
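Both questions come down to comparing a current count against a limit. The sketch below shows that comparison and, for file descriptors, where the two numbers can be read for the current process on Linux. The 80% and 90% thresholds and the function names are assumptions; the connection counts for memcached or mysql would have to come from those services' own stats, which is not shown.

```python
import os
import resource


def limit_usage(current, limit, warn=0.80, crit=0.90):
    """Return (fraction of limit used, status) for any counted resource."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = current / limit
    if ratio >= crit:
        status = "CRITICAL"
    elif ratio >= warn:
        status = "WARNING"
    else:
        status = "OK"
    return ratio, status


def open_fd_usage():
    """(open descriptors, soft limit) for the current process (Linux only)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds, soft
```

For example, limit_usage(850, 1000) reports 85% used and a WARNING, which leaves time to raise the limit before clients start being refused.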

10 – Load Balancer Status

We configure our load balancers with a health check that we can easily force to fail in order to have any given server removed from rotation. We’ve found it important to have visibility into the health check state, so we monitor and alert based on the same health check. (If you use EC2 load balancers, you can monitor the ELB state from Amazon’s APIs.)
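The post does not say how the health check is forced to fail. One common pattern, shown here purely as an assumption, is a flag file whose presence makes the health endpoint return an error; the path and function name are hypothetical.

```python
import os


def health_check(flag_path="/tmp/out_of_rotation", exists=os.path.exists):
    """Return (HTTP status, message) for a load balancer health probe.

    Creating the file at `flag_path` (a hypothetical location) forces the
    check to fail so the load balancer drops this server from rotation.
    """
    if exists(flag_path):
        return 503, "forced out of rotation"
    return 200, "ok"
```

Pointing the monitoring system at the same endpoint means an operator who forgets to remove the flag file gets an alert instead of a silently smaller pool.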

Various Other Things to Watch

New entries written to Nginx error logs, service restarts (assuming you have something in place to auto-restart them on failure), NUMA stats, and new process core dumps (great if you run any C code).


We want to thank Jehiah for making his original article available to our readers. Is there something missing from this list in your opinion? Let us know in the comments!

More Stories By Manuel Weiss

I am the cofounder of Codeship – a hosted Continuous Integration and Deployment platform for web applications. On the Codeship blog we love to write about Software Testing, Continuous Integration and Deployment. Also check out our weekly screencast series 'Testing Tuesday'!
