Solr vs. Elasticsearch — How to Decide?

By Otis Gospodnetić

[Otis is a Lucene, Solr, and Elasticsearch expert and co-author of “Lucene in Action” (1st and 2nd editions).  He is also the founder and CEO of Sematext. See full bio below.]

“Solr or Elasticsearch?”…well, at least that is the common question I hear from Sematext’s consulting services clients and prospects.  Which one is better, Solr or Elasticsearch?  Which one is faster?  Which one scales better?  Which one can do X, and Y, and Z?  Which one is easier to manage?  Which one should we use?  Which one do you recommend? etc., etc.

These are all great questions, though they do not always have clear, definite, universally applicable answers. So which one do we recommend you use? How do you choose in the end?  Well, let me share how I see Solr and Elasticsearch past, present, and future.  Let’s do a bit of comparing and contrasting, and hopefully that will help you make the right choice for your particular needs.

Early Days: Youth vs. Experience

Apache Solr is a mature project with a large and active development and user community behind it, as well as the Apache brand.  First released as open source in 2006, Solr has long dominated the search engine space and was the go-to engine for anyone needing search functionality.  Its maturity translates to rich functionality beyond vanilla text indexing and searching, such as faceting, grouping (aka field collapsing), powerful filtering, pluggable document processing, pluggable search chain components, language detection, etc.

Solr dominated the search scene for several years.  Then, around 2010, Elasticsearch appeared as another option on the market.  Back then it was nowhere near as stable as Solr, and it had neither Solr’s feature depth nor its mindshare and brand.  But it had a few other things going for it: Elasticsearch was young and built on more modern principles, aimed at more modern use cases, and was built to make handling of large indices and high query rates easier.  Moreover, because it was so young and without a community to work with, it had the freedom to move forward in leaps and bounds, without requiring any sort of consensus or cooperation with others (users or developers), backwards compatibility, or anything else that more mature software typically has to handle.  As such it exposed certain highly sought-after functionality (e.g., Near Real-Time Search) before Solr did.  Technically speaking, the ability to have NRT Search really came from Lucene, the underlying search library that both Solr and Elasticsearch use.  The irony is that because Elasticsearch exposed NRT Search first, people associated NRT Search with Elasticsearch, even though Solr and Lucene are both part of the same Apache project and, as such, one would expect Solr to have such highly demanded functionality first.

Elasticsearch, being more modern, appealed to several groups of people and organizations:

  • those who didn’t yet have a search engine and hadn’t invested a lot of time, money, and energy in its adoption, integration, etc.
  • those who had to deal with large volumes of data and needed to more easily shard and replicate data (search indices) and shrink or grow their search cluster

Of course, let’s admit it, there will always be those who like jumping on new shiny objects, too.

Evening the Search Playing Field

Fast forward to 2014 and now 2015.  Elasticsearch is no longer new, but it’s still shiny.  It closed the feature gap with Solr and, in some cases, surpassed it.  It certainly has more buzz around it.  At this point both projects are very mature.  Both have lots of features.  Both are stable.  I have to say, though, that I do see more Elasticsearch clusters with issues, and I think there are a few main reasons for that:

  • Elasticsearch, traditionally being easier to get started with, made it possible for anyone to start using it out of the box, without too much understanding of how things work.  That’s great to get started, but dangerous when data/cluster grows.
  • Elasticsearch, lending itself to easier scaling, attracts use cases demanding larger clusters with more data and more nodes.
  • Elasticsearch is more dynamic – data can easily move around the cluster as its nodes come and go, and this can impact stability and performance of the cluster.
  • While Solr has traditionally been more geared toward text search, Elasticsearch is aiming to handle analytical types of queries, too, and such queries come at a price.

Although this may sound scary, let me put it this way — Elasticsearch exposes a ton of control knobs one can play with to control the beast.  Of course, the key bit is that one has to be aware of all possible knobs, know what they do, and make use of that.  For example, despite what you just read about Elasticsearch, we rely on it in our organization for several different products, even though we know Solr just as well as we know Elasticsearch.

Solr: Not Totally Eclipsed

What about Solr?  Solr hasn’t exactly stood still.  The appearance of Elasticsearch was actually great for Solr and its community of developers and users.  Despite being almost 10 years old, Solr development is going faster than ever.  It, too, has a friendly API now.  It, too, has the ability to more easily grow and shrink clusters, create indices more dynamically, shard them on the fly, route documents and queries, etc., etc. Note: when people refer to SolrCloud they specifically mean this form of very distributed, Elasticsearch-like Solr deployment.
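To make "route documents and queries" a little more concrete, here is a minimal sketch of hash-based routing across shards. It is an illustration only: the function names are hypothetical, and CRC32 is a stand-in for whatever hash function a real engine uses internally, so the shard numbers it produces will not match those of an actual Solr or Elasticsearch cluster.

```python
import zlib


def pick_shard(routing_key: str, num_shards: int) -> int:
    """Map a routing key to a shard number in the range [0, num_shards)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # CRC32 is only a stand-in; real engines use their own hash functions.
    digest = zlib.crc32(routing_key.encode("utf-8"))
    return digest % num_shards


def shards_for_query(routing_key, num_shards: int) -> list:
    """Return the shards a query has to visit.

    With a routing key the query can go to a single shard;
    without one it has to fan out to every shard.
    """
    if routing_key is None:
        return list(range(num_shards))
    return [pick_shard(routing_key, num_shards)]
```

The point of the sketch is the trade-off: documents indexed with the same routing key land on the same shard, so a query that carries that key touches one shard instead of all of them.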

I recently attended a Lucene/Solr Revolution conference in Washington, DC and was pleasantly surprised by what I saw: a strong community, healthy project, lots of big name companies not only using Solr, but investing in it through adoption, contribution through development/engineering time, etc.  If you follow just the news you’d be led to believe Solr is dead and everyone is just flocking to Elasticsearch.  That is actually not the case.  Elasticsearch, being newer, is naturally more interesting to write about.  Solr was news 5+ years ago.  And of course there were some people going from Solr to Elasticsearch when Elasticsearch appeared – in the beginning there were simply no Elasticsearch users.

So which is better?  Which one should you use?  Where do Solr and Elasticsearch differ?  What does the future hold?

Here are some other things you should keep in mind:

  • Both are released under the Apache Software License
  • Solr is truly open-source — community over code.  Anyone can contribute to Solr and new Solr developers (aka committers) are elected based on merit.  Elasticsearch is technically open-source, but less so in spirit.  Anyone can see the source, anyone can change it and offer a contribution, but only employees of Elasticsearch can actually make changes to Elasticsearch.
  • Solr contributors and committers come from a number of different organizations, while Elasticsearch committers are from a single company.
  • A number of organizations have chosen Solr over Elasticsearch as their horses in the search race (e.g. Cloudera, Hortonworks, MapR, etc.) even though they’ve also partnered with Elasticsearch.
  • Both Solr and Elasticsearch have lively user and developer communities and are rapidly being developed.
  • If you need to add certain missing functionality to either Solr or Elasticsearch, you may have more luck with Solr.  True, there are ancient Solr JIRA issues that are still open, but at least they are still open and not closed.  In Solr world the community has a bit more say even though at the end of the day it’s one of the Solr developers who has to accept and handle the contribution.
  • Both have good commercial support (consulting, production support, integration, etc.)
  • Both have good operational tools around them, although Elasticsearch has, because of its easier-to-work-with API, attracted the DevOps crowd a lot more, thus enabling a livelier ecosystem of tools around it.
  • Elasticsearch dominates the open-source log management use case — lots of organizations index their logs in Elasticsearch to make them searchable.  While Solr can now be used for this, too (see Solr for Indexing and Searching Logs and Tuning Solr for Logs), it just missed the mindshare boat on this one.
  • Solr is still much more text-search-oriented.  On the other hand, Elasticsearch is often used for filtering and grouping – the analytical query workload – and not necessarily text search.  Elasticsearch developers are putting a lot of effort into making such queries more efficient (lowering the memory footprint and CPU usage) at both the Lucene and Elasticsearch levels.  As such, at this point in time, Elasticsearch is a better choice for applications that need to do not just text search, but also complex search-time aggregations.
  • Elasticsearch is a bit easier to get started with – a single download and a single command to get everything started.  Solr has traditionally required a bit more work and knowledge, but Solr has recently made great strides to eliminate this and now just has to work on changing its reputation.
  • Performance-wise, they are roughly the same.  I say “roughly”, because nobody has ever done comprehensive and non-biased benchmarks.  For 95% of use cases either choice will be just fine in terms of performance, and the remaining 5% need to test both solutions with their particular data and their particular access patterns.
  • Operationally speaking, Elasticsearch is a bit simpler to work with – it has just a single process.  Solr, in its Elasticsearch-like fully distributed deployment mode known as SolrCloud, depends on Apache ZooKeeper.  ZooKeeper is super mature, super widely used, etc. etc., but it’s still another moving part.  That said, if you are using Hadoop, HBase, Spark, Kafka, or a number of other newer distributed systems, you are likely already running ZooKeeper somewhere in your organization.
  • While Elasticsearch has a built-in ZooKeeper-like component called Zen, ZooKeeper is better at preventing the dreaded split-brain problem sometimes seen in Elasticsearch clusters.  To be fair, Elasticsearch developers are aware of this problem and are working on improving this aspect of Elasticsearch.
  • If you love monitoring and metrics, with Elasticsearch you’ll be in heaven.  The thing has more metrics than people you can squeeze in Times Square on New Year’s Eve!  Solr exposes the key metrics, but nowhere near as many as Elasticsearch.  Regardless, having comprehensive monitoring and centralized logging tools like Sematext’s SPM Performance Monitoring and Logsene Log Management and Analytics — especially when they work seamlessly together like these two do — is essential if you want to have a handle on metrics and other operational data.
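The split-brain problem mentioned above comes down to majority arithmetic, which can be sketched in a few lines. This is a simplified illustration under stated assumptions, not how either Zen or ZooKeeper is implemented; the function names are hypothetical. In Zen-based Elasticsearch versions the corresponding knob is the `discovery.zen.minimum_master_nodes` setting, which is usually set to the majority value computed here.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def can_elect_master(visible_nodes: int, master_eligible_nodes: int) -> bool:
    """A partition may elect a master only if it can see a majority."""
    return visible_nodes >= quorum(master_eligible_nodes)


# A 3-node cluster split into partitions of 2 and 1 nodes:
# only the 2-node side reaches the quorum of 2, so at most one master exists.
majority_side = can_elect_master(2, 3)
minority_side = can_elect_master(1, 3)
```

Because two disjoint partitions cannot both hold a strict majority, requiring a quorum means at most one side can elect a master. If the required number is set lower than the majority, both sides of a network split can elect their own master, and that is the split-brain scenario.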

Here are a few charts to demonstrate what I mean:

Elasticsearch user mailing list traffic: 36,127 (source: Search-Lucene)

Solr user mailing list traffic is about two thirds of the Elasticsearch mailing list traffic.  Of course, this could be because there are more Elasticsearch users, or because there are more problems with Elasticsearch and users are in need of more help, or perhaps they are just a chattier bunch.

Elasticsearch vs. Solr Contributors (source: Open Hub)

As you can see, Elasticsearch numbers are trending sharply upward, and are now more than double Solr’s with regard to commit activity.  This is not a very precise or absolutely correct way to compare open-source projects, but it gives us some data points.  For example, Elasticsearch is developed on GitHub, which makes it very easy to merge others’ Pull Requests, while Solr contributors tend to create patches and upload them to JIRA, where they get reviewed by Solr committers before being applied – a less streamlined process.  Moreover, the Elasticsearch repository contains documentation, not just code, while Solr keeps its documentation in a Wiki.  This contributes to higher numbers for both commits and contributors for Elasticsearch.

Boil It Down For Me

In conclusion, here are the bits that I think make the most difference for anyone having to make a choice:

  • If you’ve already invested a lot of time in Solr, stick with it, unless there are specific use cases that it just doesn’t handle well.  If you think that is the case, speak to somebody close to both the Solr and Elasticsearch projects to save yourself time, guesswork, and research, and to avoid mistakes.
  • If you are a strong believer in true open-source, Solr is closer to that than Elasticsearch, and having one company control Elasticsearch may be a turn-off.
  • If you need a data store that can handle analytical queries in addition to text searching, Elasticsearch is a better choice for that today.
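As an illustration of what "analytical queries in addition to text searching" can look like, here is a minimal sketch that builds an Elasticsearch search request body combining a filter with a grouping aggregation. The field names (`level`, `host`, `response_ms`) and the function name are hypothetical, and the sketch assumes an Elasticsearch version whose `bool` query accepts a `filter` clause. It only builds the JSON body; sending it to a cluster is left out.

```python
import json


def build_analytics_request(level: str = "error") -> dict:
    """Build a search body that filters documents, then groups them.

    All field names are hypothetical examples, not part of any real schema.
    """
    return {
        # size 0: return only aggregation results, no individual hits.
        "size": 0,
        # Filtering: keep only documents with the requested level.
        "query": {"bool": {"filter": [{"term": {"level": level}}]}},
        # Grouping: one bucket per host, with an average inside each bucket.
        "aggs": {
            "per_host": {
                "terms": {"field": "host"},
                "aggs": {
                    "avg_response": {"avg": {"field": "response_ms"}}
                },
            }
        },
    }


body = build_analytics_request()
payload = json.dumps(body)  # the string you would POST to the _search endpoint
```

The nested `aggs` is the part that makes this an analytical query: the engine buckets the filtered documents by host and computes a metric per bucket at search time, rather than returning a ranked list of matching documents.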

If you expected a single definitive winner, I’m sorry to disappoint.  We don’t have one here.  However, I hope this quick comparison of the two leading open-source search engines provides enough information and guidance to help you make the right choice for your organization.

About the author: in addition to being a Lucene, Solr, and Elasticsearch expert and author, Otis Gospodnetić is the founder and CEO of Sematext. Sematext is a globally distributed organization that builds innovative Cloud and On Premises solutions for performance monitoring, alerting and anomaly detection of Solr, Elasticsearch, Hadoop, Spark and many other applications (SPM), log management and analytics (Logsene), site search analytics (SSA), and search enhancement. The company also provides Search and Big Data consulting services and offers 24/7 production support for Solr and Elasticsearch to clients worldwide.

[Note: the original version of this article appeared at Datanami.com]
