Why We Switched to Cassandra By @DanJones914 | @DevOpsSummit [#DevOps]

By Dan Jones

Due to the nature of our business, high availability is extremely important to VictorOps and something we take very seriously. We know our customers rely on our service to be always up so that we can process and deliver their alerts and notifications. One of the key components that is critical to the functioning and availability of any SaaS service is the datastore.

At VictorOps we have historically used MySQL in high-availability Percona XtraDB Cluster deployments for operational and analytical uses. While MySQL is a mature and reliable relational database and has performed well, we had planned from early on to move to a more horizontally scalable datastore in order to meet our scalability and high availability requirements (including multi-datacenter failover capabilities).

Last fall we began to evaluate datastore alternatives, both relational and NoSQL, that could help improve scalability. After evaluating these options we decided that Cassandra was the best option to help deliver on these extreme high availability and reliability requirements.

Some of Cassandra’s strengths that influenced this decision include:

- High Availability – Cassandra is a distributed database where all nodes are equivalent (i.e., there is no master node, so clients can connect to any available node). Data is replicated to a configurable number of nodes, so that the failure of some number of nodes (depending on the replication factor) will not result in loss of data. From the CAP theorem perspective (Consistency, Availability, Partition tolerance), Cassandra’s design provides tunable consistency at the read/write request level, which allows you to increase availability at the expense of consistency where it makes sense.

- Scalability – Cassandra has been shown to be linearly scalable. Since each node adds processing power as well as data capacity, it is possible to scale incrementally to very large data volumes and high throughputs by simply adding new nodes to the cluster.

- “Self-healing” – Cassandra’s eventually consistent data model and node repair features ensure that the consistency of the cluster will be automatically maintained over time. This also makes it very easy to recover failed nodes, increase or decrease the size of the cluster as needed, and even do in-place version upgrades (in most cases).

- Multi-datacenter replication – Cassandra’s node replication and eventual consistency features are core to the functioning of this distributed system. They were part of the design from the outset, have been improved and battle-tested throughout its lifetime, and are now considered highly reliable. These features were therefore easily extended to clusters that contain nodes in different geographical locations, and due to the eventual consistency model this includes support for true Active-Active clusters. In fact, Cassandra has the reputation of having the most robust, reliable multi-datacenter replication of any datastore in the industry. This is an important part of our multi-datacenter failover capability at VictorOps and was one of the major factors in the decision to go with Cassandra.

- Large community – Cassandra is an Apache project with a very large, active community including influential companies like Netflix. In addition, DataStax continues to drive development and continual improvement of the Cassandra core as well as operational components (they also provide support subscriptions).
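To make the tunable consistency trade-off from the first bullet a little more concrete, here is a minimal Python sketch of the replica arithmetic behind it. It is not Cassandra code and does not talk to a cluster; the function names are ours, chosen for illustration. It relies only on the standard rules that a quorum is a majority of replicas, and that a read is guaranteed to see the latest write when the number of replicas written plus the number of replicas read exceeds the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that make up a majority for this replication factor."""
    return replication_factor // 2 + 1


def replicas_overlap(write_replicas: int, read_replicas: int,
                     replication_factor: int) -> bool:
    """True when every read must touch at least one replica that acknowledged
    the latest write (write + read > replication factor)."""
    return write_replicas + read_replicas > replication_factor


def tolerable_failures(required_replicas: int, replication_factor: int) -> int:
    """How many replicas can be down while a request that needs
    `required_replicas` acknowledgements can still succeed."""
    return replication_factor - required_replicas


if __name__ == "__main__":
    rf = 3  # illustrative replication factor, not a recommendation
    q = quorum(rf)
    print("quorum size:", q)
    print("quorum writes + quorum reads overlap:", replicas_overlap(q, q, rf))
    print("single-replica writes + reads overlap:", replicas_overlap(1, 1, rf))
    print("failures tolerated at quorum:", tolerable_failures(q, rf))
    print("failures tolerated at one replica:", tolerable_failures(1, rf))
```

With a replication factor of 3, requiring a quorum on both reads and writes keeps them overlapping while tolerating one unavailable replica; requiring only a single replica tolerates two unavailable replicas but gives up the overlap guarantee. That is the availability-versus-consistency dial in miniature.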

While Cassandra has many advantages, including those described above, it is very different from most other datastores. Cassandra is not a relational database, and while the interface to retrieve data (CQL) is very similar to SQL, the underlying data storage and access model is not. As a result, the performance and operational characteristics of Cassandra depend heavily on the application data model. It is therefore important to understand how data is accessed and to design the data model so that it will perform well on the common queries that the application uses.

One data model on which Cassandra performs particularly well is log-structured (or time series) data. In this type of model, the data represents a series of measurements or events that happen over time, rather than a set of updates to existing data items. Cassandra allows storing these “immutable” events contiguously on disk, ordered by a clustering key (which is often insertion time). It is therefore very efficient to return the set of items based on this clustering key, using serial rather than random disk I/O.
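The idea can be sketched with a toy in-memory model in Python. This is an illustration only, not how Cassandra is implemented: the `Partition` and `TimeSeriesTable` classes and the sample keys are hypothetical. Rows that share a partition key are kept sorted by a clustering key (here an integer timestamp), so a time-range query becomes one contiguous slice rather than a scattered lookup.

```python
from bisect import bisect_left, bisect_right


class Partition:
    """Rows for one partition key, kept sorted by clustering key."""

    def __init__(self):
        self.keys = []  # clustering keys (timestamps), always sorted
        self.rows = []  # payloads, stored in the same order as self.keys

    def insert(self, timestamp, payload):
        # Find the sorted position; rows with equal timestamps keep insert order.
        i = bisect_right(self.keys, timestamp)
        self.keys.insert(i, timestamp)
        self.rows.insert(i, payload)

    def slice(self, start, end):
        # Half-open range [start, end): two binary searches, one contiguous slice.
        lo = bisect_left(self.keys, start)
        hi = bisect_left(self.keys, end)
        return list(zip(self.keys[lo:hi], self.rows[lo:hi]))


class TimeSeriesTable:
    """Toy table: partition key -> Partition ordered by clustering key."""

    def __init__(self):
        self.partitions = {}

    def insert(self, partition_key, timestamp, payload):
        self.partitions.setdefault(partition_key, Partition()).insert(timestamp, payload)

    def range_query(self, partition_key, start, end):
        partition = self.partitions.get(partition_key)
        return partition.slice(start, end) if partition else []


if __name__ == "__main__":
    table = TimeSeriesTable()
    # Events arrive out of order but are stored in clustering-key order.
    table.insert("incident-1", 30, "Acknowledgement")
    table.insert("incident-1", 10, "Critical alert")
    table.insert("incident-1", 20, "Paging escalation")
    table.insert("incident-1", 40, "Recovery")
    table.insert("incident-2", 15, "Critical alert")
    print(table.range_query("incident-1", 10, 40))
```

The point of the sketch is the shape of the query: once the partition is located, everything in the requested time window sits next to each other in clustering-key order.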

There are many parts of VictorOps’ data model that naturally map to this log-structured approach. For example, an incident’s lifecycle consists of a set of events that cause the state of the incident to change (e.g. a Critical alert, Creation of an Incident, a Paging escalation, an Acknowledgement, a Recovery, etc.). VictorOps surfaces this in the notion of the main Timeline as well as an Incident Timeline.
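As a rough illustration of treating an incident as a log of events rather than a row that gets updated, the Python sketch below derives an incident's current state by replaying its events in time order. The event names come from the examples above; the state names ("triggered", "acknowledged", "resolved"), the `Event` type, and the transition table are assumptions made for this example and are not VictorOps' actual schema or code.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    CRITICAL_ALERT = "critical_alert"
    INCIDENT_CREATED = "incident_created"
    PAGING_ESCALATION = "paging_escalation"
    ACKNOWLEDGEMENT = "acknowledgement"
    RECOVERY = "recovery"


@dataclass(frozen=True)  # frozen: events are immutable once written
class Event:
    incident_id: str
    timestamp: int
    type: EventType


# Hypothetical mapping from event type to the state it moves an incident into.
# Event types that are not listed leave the state unchanged.
STATE_AFTER = {
    EventType.INCIDENT_CREATED: "triggered",
    EventType.ACKNOWLEDGEMENT: "acknowledged",
    EventType.RECOVERY: "resolved",
}


def current_state(events):
    """Replay events in timestamp order and return the resulting state."""
    state = "none"
    for event in sorted(events, key=lambda e: e.timestamp):
        state = STATE_AFTER.get(event.type, state)
    return state


if __name__ == "__main__":
    timeline = [
        Event("incident-1", 10, EventType.CRITICAL_ALERT),
        Event("incident-1", 11, EventType.INCIDENT_CREATED),
        Event("incident-1", 20, EventType.PAGING_ESCALATION),
        Event("incident-1", 30, EventType.ACKNOWLEDGEMENT),
        Event("incident-1", 40, EventType.RECOVERY),
    ]
    print(current_state(timeline[:4]))  # state after the acknowledgement
    print(current_state(timeline))      # state after the recovery
```

Nothing is ever updated in place here: the timeline only grows, and the state at any point in time can be recomputed from the events recorded up to that point.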

Obviously, the choice of a datastore is an important decision that has a major effect on the scalability, reliability, availability, maintainability and extensibility of a SaaS service. While Cassandra requires more awareness of the underlying data access patterns and operational characteristics when designing the system, we feel that the benefits it provides in terms of availability, linear scalability and seamless, reliable multi-datacenter replication are a great fit for our business requirements, and will scale to meet our needs in the future.

The post Why we switched to Cassandra appeared first on VictorOps.

VictorOps is making on-call suck less with the only collaborative alert management platform on the market.

With easy on-call scheduling management, a real-time incident timeline that gives you contextual relevance around your alerts and powerful reporting features that make post-mortems more effective, VictorOps helps your IT/DevOps team solve problems faster.
