Click here to close now.

Welcome!

DevOps Journal Authors: Carmen Gonzalez, Elizabeth White, Liz McMillan, Pat Romanski, Yeshim Deniz

Related Topics: Cloud Expo, Java, Microservices Journal, Virtualization, Big Data Journal, SDN Journal

Cloud Expo: Book Excerpt

Book Excerpt: Systems Performance: Enterprise and the Cloud | Part 1

CPUs drive all software and are often the first target for systems performance analysis

"This excerpt is from the book, "Systems Performance: Enterprise and the Cloud", authored by Brendan Gregg, published by Prentice Hall Professional, Oct. 2013, ISBN 9780133390094, Copyright © 2014 Pearson Education, Inc. For more info, please visit the publisher site:

CPUs drive all software and are often the first target for systems performance analysis. Modern systems typically have many CPUs, which are shared among all running software by the kernel scheduler. When there is more demand for CPU resources than there are resources available, process threads (or tasks) will queue, waiting their turn. Waiting can add significant latency during the runtime of applications, degrading performance.

The usage of the CPUs can be examined in detail to look for performance improvements, including eliminating unnecessary work. At a high level, CPU usage by process, thread, or task can be examined. At a lower level, the code path within applications and the kernel can be profiled and studied. At the lowest level, CPU instruction execution and cycle behavior can be studied.

This chapter consists of five parts:

  • Background introduces CPU-related terminology, basic models of CPUs, and key CPU performance concepts.
  • Architecture introduces processor and kernel scheduler architecture.
  • Methodology describes performance analysis methodologies, both observa- tional and experimental.
  • Analysis describes CPU performance analysis tools on Linux- and Solaris- based systems, including profiling, tracing, and visualizations.
  • Tuning includes examples of tunable parameters.

The first three sections provide the basis for CPU analysis, and the last two show its practical application to Linux- and Solaris-based systems.

The effects of memory I/O on CPU performance are covered, including CPU cycles stalled on memory and the performance of CPU caches. Chapter 7, Memory, continues the discussion of memory I/O, including MMU, NUMA/UMA, system interconnects, and memory busses.

Terminology
For reference, CPU-related terminology used in this chapter includes the following:

  • Processor: the physical chip that plugs into a socket on the system or pro- cessor board and contains one or more CPUs implemented as cores or hard- ware threads.
  • Core: an independent CPU instance on a multicore processor. The use of cores is a way to scale processors, called chip-level multiprocessing (CMP).
  • Hardware thread: a CPU architecture that supports executing multiple threads in parallel on a single core (including Intel's Hyper-Threading Tech- nology), where each thread is an independent CPU instance. One name for this scaling approach is multithreading.
  • CPU instruction: a single CPU operation, from its instruction set. There are instructions for arithmetic operations, memory I/O, and control logic.
  • Logical CPU: also called a virtual processor,1 an operating system CPU instance (a schedulable CPU entity). This may be implemented by the processor as a hardware thread (in which case it may also be called a virtual core), a core, or a single-core processor.
  • Scheduler: the kernel subsystem that assigns threads to run on CPUs.
  • Run queue: a queue of runnable threads that are waiting to be serviced by
  • CPUs. For Solaris, it is often called a dispatcher queue.

Other terms are introduced throughout this chapter. The Glossary includes basic terminology for reference, including CPU, CPU cycle, and stack. Also see the terminology sections in Chapters 2 and 3.

Models
The following simple models illustrate some basic principles of CPUs and CPU per- formance. Section 6.4, Architecture, digs much deeper and includes implementation- specific details.

CPU Architecture
Figure 1 shows an example CPU architecture, for a single processor with four cores and eight hardware threads in total. The physical architecture is pictured, along with how it is seen by the operating system.

Figure 1: CPU architecture

Each hardware thread is addressable as a logical CPU, so this processor appears as eight CPUs. The operating system may have some additional knowledge of topology, such as which CPUs are on the same core, to improve its scheduling decisions.

CPU Memory Caches
Processors provide various hardware caches for improving memory I/O perfor- mance. Figure 2 shows the relationship of cache sizes, which become smaller and faster (a trade-off) the closer they are to the CPU.

The caches that are present, and whether they are on the processor (integrated) or external to the processor, depend on the processor type. Earlier processors pro- vided fewer levels of integrated cache.

Figure 2: CPU cache sizes

CPU Run Queues
Figure 3 shows a CPU run queue, which is managed by the kernel scheduler.

Figure 3: CPU run queue

The thread states shown in the figure, ready to run and on-CPU, are covered in Figure 3.7 in Chapter 3, Operating Systems.

The number of software threads that are queued and ready to run is an impor- tant performance metric indicating CPU saturation. In this figure (at this instant) there are four, with an additional thread running on-CPU. The time spent waiting on a CPU run queue is sometimes called run-queue latency or dispatcher-queue latency. In this book, the term scheduler latency is used instead, as it is appropri- ate for all dispatcher types, including those that do not use queues (see the discus- sion of CFS in Section 6.4.2, Software).

For multiprocessor systems, the kernel typically provides a run queue for each CPU and aims to keep threads on the same run queue. This means that threads are more likely to keep running on the same CPUs, where the CPU caches have cached their data. (These caches are described as having cache warmth, and the approach to favor CPUs is called CPU affinity.) On NUMA systems, memory locality may also be improved, which also improves performance (this is described in Chapter 7, Memory).

It also avoids the cost of thread synchronization (mutex locks) for queue operations, which would hurt scalability if the run queue was global and shared among all CPUs.

Concepts
The following are a selection of important concepts regarding CPU performance, beginning with a summary of processor internals: the CPU clock rate and how instructions are executed. This is background for later performance analysis, particularly for understanding the cycles-per-instruction (CPI) metric.

Clock Rate
The clock is a digital signal that drives all processor logic. Each CPU instruction may take one or more cycles of the clock (called CPU cycles) to execute. CPUs exe- cute at a particular clock rate; for example, a 5 GHz CPU performs 5 billion clock cycles per second.

Some processors are able to vary their clock rate, increasing it to improve performance or decreasing it to reduce power consumption. The rate may be varied on request by the operating system, or dynamically by the processor itself. The ker- nel idle thread, for example, can request the CPU to throttle down to save power.

Clock rate is often marketed as the primary feature of the processor, but this can be a little misleading. Even if the CPU in your system appears to be fully utilized (a bottleneck), a faster clock rate may not speed up performance-it depends on what those fast CPU cycles are actually doing. If they are mostly stall cycles while waiting on memory access, executing them more quickly doesn't actually increase the CPU instruction rate or workload throughput.

Instruction
CPUs execute instructions chosen from their instruction set. An instruction includes the following steps, each processed by a component of the CPU called a functional unit:

  1. Instruction fetch
  2. Instruction decode
  3. Execute
  4. Memory access
  5. Register write-back

The last two steps are optional, depending on the instruction. Many instructions operate on registers only and do not require the memory access step.

Each of these steps takes at least a single clock cycle to be executed. Memory access is often the slowest, as it may take dozens of clock cycles to read or write to main memory, during which instruction execution has stalled (and these cycles while stalled are called stall cycles). This is why CPU caching is important, as described in Section 6.4: it can dramatically reduce the number of cycles needed for memory access.

Instruction Pipeline
The instruction pipeline is a CPU architecture that can execute multiple instructions in parallel, by executing different components of different instructions at the same time. It is similar to a factory assembly line, where stages of production can be executed in parallel, increasing throughput.

Consider the instruction steps previously listed. If each were to take a single clock cycle, it would take five cycles to complete the instruction. At each step of this instruction, only one functional unit is active and four are idle. By use of pipe- lining, multiple functional units can be active at the same time, processing differ- ent instructions in the pipeline. Ideally, the processor can then complete one instruction with every clock cycle.

Instruction Width
But we can go faster still. Multiple functional units can be included of the same type, so that even more instructions can make forward progress with each clock cycle. This CPU architecture is called superscalar and is typically used with pipe- lining to achieve a high instruction throughput.

The instruction width describes the target number of instructions to process in parallel. Modern processors are 3-wide or 4-wide, meaning they can complete up to three or four instructions per cycle. How this works depends on the processor, as there may be different numbers of functional units for each stage.

CPI, IPC
Cycles per instruction (CPI) is an important high-level metric for describing where a CPU is spending its clock cycles and for understanding the nature of CPU utilization. This metric may also be expressed as instructions per cycle (IPC), the inverse of CPI.

A high CPI indicates that CPUs are often stalled, typically for memory access. A low CPI indicates that CPUs are often not stalled and have a high instruction throughput. These metrics suggest where performance tuning efforts may be best spent.

Memory-intensive workloads, for example, may be improved by installing faster memory (DRAM), improving memory locality (software configuration), or reducing the amount of memory I/O. Installing CPUs with a higher clock rate may not improve performance to the degree expected, as the CPUs may need to wait the same amount of time for memory I/O to complete. Put differently, a faster CPU may mean more stall cycles but the same rate of completed instructions.

The actual values for high or low CPI are dependent on the processor and processor features and can be determined experimentally by running known work- loads. As an example, you may find that high-CPI workloads run with a CPI at ten or higher, and low CPI workloads run with a CPI at less than one (which is possi- ble due to instruction pipelining and width, described earlier).

It should be noted that CPI shows the efficiency of instruction processing, but not of the instructions themselves. Consider a software change that added an inefficient software loop, which operates mostly on CPU registers (no stall cycles): such a change may result in a lower overall CPI, but higher CPU usage and utilization.

Utilization
CPU utilization is measured by the time a CPU instance is busy performing work during an interval, expressed as a percentage. It can be measured as the time a CPU is not running the kernel idle thread but is instead running user-level application threads or other kernel threads, or processing interrupts.

High CPU utilization may not necessarily be a problem, but rather a sign that the system is doing work. Some people also consider this an ROI indicator: a highly utilized system is considered to have good ROI, whereas an idle system is considered wasted. Unlike with other resource types (disks), performance does not degrade steeply under high utilization, as the kernel supports priorities, preemption, and time sharing. These together allow the kernel to understand what has higher priority, and to ensure that it runs first.

The measure of CPU utilization spans all clock cycles for eligible activities, including memory stall cycles. It may seem a little counterintuitive, but a CPU may be highly utilized because it is often stalled waiting for memory I/O, not just executing instructions, as described in the previous section.

CPU utilization is often split into separate kernel- and user-time metrics.

More Stories By Brendan Gregg

Brendan Gregg, Lead Performance Engineer at Joyent, analyzes performance and scalability throughout the software stack. As Performance Lead and Kernel Engineer at Sun Microsystems (and later Oracle), his work included developing the ZFS L2ARC, a pioneering file system technology for improving performance using flash memory. He has invented and developed many performance tools, including some that ship with Mac OS X and Oracle® Solaris™ 11. His recent work has included performance visualizations for Linux and illumos kernel analysis. He is coauthor of DTrace (Prentice Hall, 2011) and Solaris™ Performance and Tools (Prentice Hall, 2007).

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@DevOpsSummit Stories
SYS-CON Events announced today that Plutora, Inc., the leading global provider of enterprise release management and test environment management SaaS solutions, will exhibit at SYS-CON's DevOps Summit 2015 New York, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Headquartered in Mountain View, California, Plutora provides enterprise release management and test environment SaaS solutions to clients in North America, Europe and Asia Pacific. Leading companies across a variety of industries, including financial services, telecommunications, retail, pharmaceut...
SYS-CON Events announced today the DevOps Foundation Certification Course, being held June ?, 2015, in conjunction with DevOps Summit and 16th Cloud Expo at the Javits Center in New York City, NY. This sixteen (16) hour course provides an introduction to DevOps – the cultural and professional movement that stresses communication, collaboration, integration and automation in order to improve the flow of work between software developers and IT operations professionals. Improved workflows will result in an improved ability to design, develop, deploy and operate software and services faster.
SYS-CON Events announced today that Akana, formerly SOA Software, has been named “Bronze Sponsor” of SYS-CON's 16th International Cloud Expo® New York, which will take place June 9-11, 2015, at the Javits Center in New York City, NY. Akana’s comprehensive suite of API Management, API Security, Integrated SOA Governance, and Cloud Integration solutions helps businesses accelerate digital transformation by securely extending their reach across multiple channels – mobile, cloud and Internet of Things. Akana enables enterprises to share data as APIs, connect and integrate applications, drive part...
SYS-CON Events announced today that SafeLogic has been named “Bag Sponsor” of SYS-CON's 16th International Cloud Expo® New York, which will take place June 9-11, 2015, at the Javits Center in New York City, NY. SafeLogic provides security products for applications in mobile and server/appliance environments. SafeLogic’s flagship product CryptoComply is a FIPS 140-2 validated cryptographic engine designed to secure data on servers, workstations, appliances, mobile devices, and in the Cloud.
SYS-CON Events announced today that FierceDevOps will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. FierceDevOps keeps software developers and IT operations personnel updated on the latest news and trends around the rapidly evolving role of the traditional IT worker.
SOASTA, the leader in performance analytics, today reported record growth of the CloudTest community, exceeding 30,000 registered users of the CloudTest platform in Q1 2015. SOASTA also announced widespread adoption of its Web and mobile testing solutions, with more than 1,600 customers completing more than 285,000 tests using CloudTest during the quarter. This rapid growth shows that DevOps-driven digital businesses are embracing a more continuous approach to testing, and CloudTest is meeting their needs for fast and efficient global performance measurement.
ProfitBricks, the provider of painless cloud infrastructure for IaaS, today announced the release of a Node.js SDK written against its recently launched REST API. This new JavaScript based library provides coverage for all existing ProfitBricks REST API functions. With additional libraries set to release this month, ProfitBricks continues to prove its dedication to the DevOps community and commitment to making cloud migrations and cloud management painless. Node.js is an open source, cross-platform runtime environment. It offers an asynchronous event-driven framework designed to build scala...
SYS-CON Events announced today that WSM International (WSM), the world’s leading cloud and server migration services provider, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. WSM is a solutions integrator with a core focus on cloud and server migration, transformation and DevOps services.
SYS-CON Events announced today that Site24x7, the cloud infrastructure monitoring service, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Site24x7 is a cloud infrastructure monitoring service that helps monitor the uptime and performance of websites, online applications, servers, mobile websites and custom APIs. The monitoring is done from 50+ locations across the world and from various wireless carriers, thus providing a global perspective of the end-user experience. Site24x7 supports monitoring H...
SYS-CON Events announced today that Gridstore™, the leader in hyper-converged infrastructure purpose-built to optimize Microsoft workloads, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Gridstore™ is the leader in hyper-converged infrastructure purpose-built for Microsoft workloads and designed to accelerate applications in virtualized environments. Gridstore’s hyper-converged infrastructure is the industry’s first all flash version of HyperConverged Appliances that include both compute and storag...
SYS-CON Events announced today the IoT Bootcamp – Jumpstart Your IoT Strategy, being held June 9–10, 2015, in conjunction with 16th Cloud Expo and Internet of @ThingsExpo at the Javits Center in New York City. This is your chance to jumpstart your IoT strategy. Combined with real-world scenarios and use cases, the IoT Bootcamp is not just based on presentations but includes hands-on demos and walkthroughs. We will introduce you to a variety of Do-It-Yourself IoT platforms including Arduino, Raspberry Pi, BeagleBone, Spark and Intel Edison. You will also get an overview of cloud technologies s...
SYS-CON Events announced today that Soha will exhibit at SYS-CON's DevOps Summit New York, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Soha delivers enterprise-grade application security, on any device, as agile as the cloud. This turnkey, cloud-based service enables customers to solve secure application access and delivery challenges that traditional or virtualized network solutions cannot solve because they are too expensive, inflexible and operationally complex. Organizations can now take full advantage of the cloud without sacrificing security or a...
Containers and microservices have become topics of intense interest throughout the cloud developer and enterprise IT communities. Accordingly, attendees at the upcoming 16th Cloud Expo at the Javits Center in New York June 9-11 will find fresh new content in a new track called PaaS | Containers & Microservices Containers are not being considered for the first time by the cloud community, but a current era of re-consideration has pushed them to the top of the cloud agenda. With the launch of Docker's initial release in March of 2013, interest was revved up several notches. Then late last...
of cloud, colocation, managed services and disaster recovery solutions, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. TierPoint, LLC, is a leading national provider of information technology and data center services, including cloud, colocation, disaster recovery and managed IT services, with corporate headquarters in St. Louis, MO. TierPoint was formed through the strategic combination of some of the country's most respected and innovative data centers and IT solution providers whose origins date...
ProfitBricks, the provider of painless cloud infrastructure IaaS, today released its SDK for Ruby, written against the company's new RESTful API. The new SDK joins ProfitBricks' previously announced support for the popular multi-cloud open-source Fog project. This new Ruby SDK, which exposes advanced functionality to take advantage of ProfitBricks' simplicity and productivity, aligns with ProfitBricks' mission to provide a painless way to automate infrastructure in the cloud. Ruby is a general-purpose programming language, characterized as dynamic, reflective and object-oriented. Designed t...
The 17th International Cloud Expo has announced that its Call for Papers is open. 17th International Cloud Expo, to be held November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA, brings together Cloud Computing, APM, APIs, Microservices, Security, Big Data, Internet of Things, DevOps and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal today!
The 5th International DevOps Summit, co-located with 17th International Cloud Expo – being held November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA – announces that its Call for Papers is open. Born out of proven success in agile development, cloud computing, and process automation, DevOps is a macro trend you cannot afford to miss. From showcase success stories from early adopters and web-scale businesses, DevOps is expanding to organizations of all sizes, including the world's largest enterprises – and delivering real results. Among the proven benefits, DevOps is cor...
ProfitBricks has launched its new DevOps Central and REST API, along with support for three multi-cloud libraries and a Python SDK. This, combined with its already existing SOAP API and its new RESTful API, moves ProfitBricks into a position to better serve the DevOps community and provide the ability to automate cloud infrastructure in a multi-cloud world. Following this momentum, ProfitBricks has also introduced several libraries that enable developers to use their favorite language to code against its RESTful and SOAP APIs. This development provides users with multi-cloud support and enha...
Chef and Canonical announced a partnership to integrate and distribute Chef with Ubuntu. Canonical is integrating the Chef automation platform with Canonical's Machine-As-A-Service (MAAS), enabling users to automate the provisioning, configuration and deployment of bare metal compute resources in the data center. Canonical is packaging Chef 12 server in upcoming distributions of its Ubuntu open source operating system and will provide commercial support for Chef within its user base.
In 2015, 4.9 billion connected "things" will be in use. By 2020, Gartner forecasts this amount to be 25 billion, a 410 percent increase in just five years. How will businesses handle this rapid growth of data? Hadoop will continue to improve its technology to meet business demands, by enabling businesses to access/analyze data in real time, when and where they need it. Cloudera's Chief Technologist, Eli Collins, will discuss how Big Data is keeping up with today's data demands and how in the future, data and analytics will be pervasive, embedded into every workflow, application and infra...
SYS-CON Media announced today that @ThingsExpo Blog launched with 7,788 original stories. @ThingsExpo Blog offers top articles, news stories, and blog posts from the world's well-known experts and guarantees better exposure for its authors than any other publication. @ThingsExpo Blog can be bookmarked. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago.
The world's leading Cloud event, Cloud Expo has launched Microservices Journal on the SYS-CON.com portal, featuring over 19,000 original articles, news stories, features, and blog entries. DevOps Journal is focused on this critical enterprise IT topic in the world of cloud computing. Microservices Journal offers top articles, news stories, and blog posts from the world's well-known experts and guarantees better exposure for its authors than any other publication. Follow new article posts on Twitter at @MicroservicesE
DevOps tasked with driving success in the cloud need a solution to efficiently leverage multiple clouds while avoiding cloud lock-in. Flexiant today announces the commercial availability of Flexiant Concerto. With Flexiant Concerto, DevOps have cloud freedom to automate the build, deployment and operations of applications consistently across multiple clouds. Concerto is available through four disruptive pricing models aimed to deliver multi-cloud at a price point everyone can afford.
SYS-CON Events announced today that Vicom Computer Services, Inc., a provider of technology and service solutions, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. They are located at booth #427. Vicom Computer Services, Inc. is a progressive leader in the technology industry for over 30 years. Headquartered in the NY Metropolitan area. Vicom provides products and services based on today’s requirements around Unified Networks, Cloud Computing strategies, Virtualization around Software defined Data Ce...
While DevOps most critically and famously fosters collaboration, communication, and integration through cultural change, culture is more of an output than an input. In order to actively drive cultural evolution, organizations must make substantial organizational and process changes, and adopt new technologies, to encourage a DevOps culture. Moderated by Andi Mann, panelists will discuss how to balance these three pillars of DevOps, where to focus attention (and resources), where organizations might slip up with the wrong focus, how to manage change and risk in all three areas, what is possibl...