@DevOpsSummit Authors: Zakia Bouachraoui, Liz McMillan, Yeshim Deniz, Pat Romanski, Elizabeth White

Related Topics: @DevOpsSummit, Microservices Expo, @CloudExpo, @DXWorldExpo

@DevOpsSummit: Blog Feed Post

Why We Switched to Cassandra By @DanJones914 | @DevOpsSummit [#DevOps]

Cassandra was the best option to help deliver on these extreme high availability and reliability requirements

Why We Switched to Cassandra

By Dan Jones

Due to the nature of our business, high availability is extremely important to VictorOps and something we take very seriously. We know our customers rely on our service to be always up so that we can process and deliver their alerts and notifications. One of the key components that is critical to the functioning and availability of any SaaS service is the datastore.

At VictorOps we have historically used MySQL in high availability Percona Xtradb Clusters for operational and analytical uses. While MySQL is a mature and reliable relational database and has performed well, we had planned from early on to move to a more horizontally scalable datastore in order to meet our scalability and high availability requirements (including multi-datacenter failover capabilities).

Last fall we began to evaluate datastore alternatives that could help improve scalability, both relational and NoSQL, before deciding to use Cassandra. After evaluating these options we decided that Cassandra was the best option to help deliver on these extreme high availability and reliability requirements.


Some of Cassandra’s strengths that influenced this decision include:

- High Availability – Cassandra is a distributed database where all nodes are equivalent (i.e. there is no master node so clients can connect to any available node). Data is replicated at a configurable number of nodes, so that failure of some number of nodes (depending on the replication factor) will not result in loss of data. From the CAP theorem perspective (Consistency, Availability, Partition tolerance), Cassandra’s design provides tunable consistency at the read/write request level, which allows you to increase availability at the expense of consistency where it makes sense.

- Scalability – Cassandra has been shown to be linearly scalable. Since each node adds processing power as well as data capacity, it is possible to scale incrementally to very large data volumes and high throughputs by simply adding new nodes to the cluster.

- “Self-healing” – Cassandra’s eventually consistent data model and node repair features ensure that the consistency of the cluster will be automatically maintained over time. This also makes it very easy to recover failed nodes, increase or decrease the size of the cluster as needed, and even do in place version upgrades (in most cases).

- Multi-datacenter replication – Cassandra’s node replication and eventual consistency features are core to the functioning of this distributed system. These features were designed from the outset and have been improved and battle tested throughout its lifetime and are now considered highly reliable. These features were therefore easily extended to clusters that contain nodes in different geographical locations, and due to the eventual consistency model this includes support for true Active-Active clusters. In fact, Cassandra has the reputation of having the most robust, reliable multi-datacenter replication of any datastore in the industry. This is an important part of our multi-datacenter failover capability at VictorOps and was one of the major factors in the decision to go with Cassandra.

- Large community – Cassandra is an Apache project with a very large, active community including influential companies like Netflix. In addition DataStax continues to drive development and continual improvements of the Cassandra core as well as operational components (they also provide support subscriptions).


While Cassandra has many advantages including those described above, it is very different than most other datastores. Cassandra is not a relational database and while the interface to retrieve data (CQL) is very similar to SQL, the underlying data storage and access model is very different. As a result, the performance and operational characteristics of Cassandra are very dependent on the application data model. Therefore, it is important to understand how data is accessed and to design the data model so that it will perform well on the common queries that the application uses.

One data model on which Cassandra performs particularly well is log structured (or time series) data. In this type of model, the data represents a series of measurements or events that happen over time, rather than a set of updates to existing data items. Cassandra allows storing these “immutable” events contiguously on disk ordered by a clustering key (which is often insertion time). It is therefore very efficient to return the set of items based on this clustering key, using serial rather than random disk I/O.

There are many parts of VictorOp’s data model that naturally map to this log structured approach. For example, an incident’s lifecycle is comprised of a set of events that cause the state of the incident to change (e.g. a Critical alert, Creation of an Incident, a Paging escalation, an Acknowledgement, a Recovery, etc). VictorOps surfaces this in the notion of the main Timeline as well as an Incident Timeline.

Obviously the choice of a datastore is an important decision that has a major affect on the scalability, reliability, availability, maintainability and extensibility of a SaaS service. While Cassandra requires more awareness of the underlying data access patterns and the operational characteristics when designing the system, we feel that the benefits it provides in terms of availability, linear scalability and seamless, reliable multi-datacenter replication are a great fit for our business requirements, and will scale to meet our needs in the future.

The post Why we switched to Cassandra appeared first on VictorOps.

More Stories By VictorOps Blog

VictorOps is making on-call suck less with the only collaborative alert management platform on the market.

With easy on-call scheduling management, a real-time incident timeline that gives you contextual relevance around your alerts and powerful reporting features that make post-mortems more effective, VictorOps helps your IT/DevOps team solve problems faster.

@DevOpsSummit Stories
Dynatrace is an application performance management software company with products for the information technology departments and digital business owners of medium and large businesses. Building the Future of Monitoring with Artificial Intelligence. Today we can collect lots and lots of performance data. We build beautiful dashboards and even have fancy query languages to access and transform the data. Still performance data is a secret language only a couple of people understand. The more business becomes digital the more stakeholders are interested in this data including how it relates to business. Some of these people have never used a monitoring tool before. They have a question on their mind like "How is my application doing" but no idea how to get a proper answer.
DXWorldEXPO LLC announced today that Nutanix has been named "Platinum Sponsor" of CloudEXPO | DevOpsSUMMIT | DXWorldEXPO New York, which will take place November 12-13, 2018 in New York City. Nutanix makes infrastructure invisible, elevating IT to focus on the applications and services that power their business. The Nutanix Enterprise Cloud Platform blends web-scale engineering and consumer-grade design to natively converge server, storage, virtualization and networking into a resilient, software-defined solution with rich machine intelligence.
Hackers took three days to identify and exploit a known vulnerability in Equifax’s web applications. I will share new data that reveals why three days (at most) is the new normal for DevSecOps teams to move new business /security requirements from design into production. This session aims to enlighten DevOps teams, security and development professionals by sharing results from the 4th annual State of the Software Supply Chain Report -- a blend of public and proprietary data with expert research and analysis.Attendees can join this session to better understand how DevSecOps teams are applying lessons from W. Edwards Deming (circa 1982), Malcolm Goldrath (circa 1984) and Gene Kim (circa 2013) to improve their ability to respond to new business requirements and cyber risks.
So the dumpster is on fire. Again. The site's down. Your boss's face is an ever-deepening purple. And you begin debating whether you should join the #incident channel or call an ambulance to deal with his impending stroke. Yes, we know this is a developer's fault. There's plenty of time for blame later. Postmortems have a macabre name because they were once intended to be Viking-like funerals for someone's job. But we're civilized now. Sort of. So we call them post-incident reviews. Fires are never going to stop. We're human. We miss bugs. Or we fat finger a command - deleting dozens of servers and bringing down S3 in US-EAST-1 for hours - effectively halting the internet. These things happen.
The digital transformation is real! To adapt, IT professionals need to transform their own skillset to become more multi-dimensional by gaining both depth and breadth of a wide variety of knowledge and competencies. Historically, while IT has been built on a foundation of specialty (or "I" shaped) silos, the DevOps principle of "shifting left" is opening up opportunities for developers, operational staff, security and others to grow their skills portfolio, advance their careers and become "T"-shaped.