Solr vs. Elasticsearch — How to Decide? By @Sematext | @DevOpsSummit [#DevOps]

By Otis Gospodnetić

[Otis is a Lucene, Solr, and Elasticsearch expert and co-author of “Lucene in Action” (1st and 2nd editions).  He is also the founder and CEO of Sematext. See full bio below.]

“Solr or Elasticsearch?”…well, at least that is the common question I hear from Sematext’s consulting services clients and prospects.  Which one is better, Solr or Elasticsearch?  Which one is faster?  Which one scales better?  Which one can do X, and Y, and Z?  Which one is easier to manage?  Which one should we use?  Which one do you recommend? etc., etc.

These are all great questions, though they don’t always have clear, definite, universally applicable answers. So which one do we recommend you use? How do you choose in the end?  Let me share how I see Solr and Elasticsearch past, present, and future, do a bit of comparing and contrasting, and hopefully help you make the right choice for your particular needs.

Early Days: Youth vs. Experience

Apache Solr is a mature project with a large and active development and user community behind it, as well as the Apache brand.  First released as open source in 2006, Solr has long dominated the search engine space and was the go-to engine for anyone needing search functionality.  Its maturity translates to rich functionality beyond vanilla text indexing and searching, such as faceting, grouping (aka field collapsing), powerful filtering, pluggable document processing, pluggable search chain components, language detection, etc.

Solr dominated the search scene for several years.  Then, around 2010, Elasticsearch appeared as another option on the market.  Back then it was nowhere near as stable as Solr, did not have Solr’s feature depth, did not have the mindshare, brand, and so on.  But it had a few other things going for it: Elasticsearch was young and built on more modern principles, aimed at more modern use cases, and was built to make handling of large indices and high query rates easier.  Moreover, because it was so young and without a community to work with, it had the freedom to move forward in leaps and bounds, without requiring any sort of consensus or cooperation with others (users or developers), backwards compatibility, or anything else that more mature software typically has to handle.  As such it exposed certain highly sought-after functionality (e.g., Near Real-Time Search) before Solr did.  Technically speaking, the ability to have NRT Search really came from Lucene, the underlying search library that both Solr and Elasticsearch use.  The irony is that because Elasticsearch exposed NRT Search first, people associated NRT Search with Elasticsearch, even though Solr and Lucene are both part of the same Apache project and, as such, one would expect Solr to have such highly demanded functionality first.

Elasticsearch, being more modern, appealed to several groups of people and organizations:

  • those who didn’t yet have a search engine and hadn’t invested a lot of time, money, and energy in its adoption, integration, etc.
  • those who had to deal with large volumes of data and needed to more easily shard and replicate data (search indices) and shrink or grow their search cluster

Of course, let’s admit it, there will always be those who like jumping on new shiny objects, too.

Evening the Search Playing Field

Fast forward to 2014 and now 2015.  Elasticsearch is no longer new, but it’s still shiny.  It closed the feature gap with Solr and, in some cases, surpassed it.  It certainly has more buzz around it.  At this point both projects are very mature.  Both have lots of features.  Both are stable.  I have to say, though, that I do see more Elasticsearch clusters with issues, and I think there are a few reasons for that:

  • Elasticsearch, traditionally being easier to get started with, made it possible for anyone to start using it out of the box, without too much understanding of how things work.  That’s great to get started, but dangerous when data/cluster grows.
  • Elasticsearch, lending itself to easier scaling, attracts use cases demanding larger clusters with more data and more nodes.
  • Elasticsearch is more dynamic – data can easily move around the cluster as its nodes come and go, and this can impact stability and performance of the cluster.
  • While Solr has traditionally been more geared toward text search, Elasticsearch is aiming to handle analytical types of queries, too, and such queries come at a price.

Although this may sound scary, let me put it this way — Elasticsearch exposes a ton of control knobs one can play with to control the beast.  Of course, the key bit is that one has to be aware of all possible knobs, know what they do, and make use of that.  For example, despite what you just read about Elasticsearch, we rely on it in our organization for several different products, even though we know Solr just as well as we know Elasticsearch.

Solr: Not Totally Eclipsed

What about Solr?  Solr hasn’t exactly stood still.  The appearance of Elasticsearch was actually great for Solr and its community of developers and users.  Despite being almost 10 years old, Solr development is going faster than ever.  It, too, has a friendly API now.  It, too, has the ability to more easily grow and shrink clusters, create indices more dynamically, shard them on the fly, route documents and queries, etc., etc. Note: when people refer to SolrCloud they specifically mean this form of very distributed, Elasticsearch-like Solr deployment.
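To make "shard" and "route documents" a little more concrete, here is a minimal sketch of hash-based document routing, the general idea behind how a distributed index decides which shard a document lives on. Everything in it is an assumption for illustration: the function name `shard_for`, the document IDs, and the use of MD5 as a stand-in hash. Neither Solr nor Elasticsearch is implemented this way line for line.

```python
import hashlib
from collections import Counter


def shard_for(routing_key: str, num_shards: int) -> int:
    """Map a routing key (for example a document ID) to a shard number."""
    # MD5 is only a stable stand-in hash for this sketch; real engines
    # use their own hash functions and routing rules.
    digest = hashlib.md5(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical document IDs spread over a 4-shard index.
doc_ids = [f"doc-{i}" for i in range(1000)]
placement = {doc_id: shard_for(doc_id, 4) for doc_id in doc_ids}

# How many documents landed on each shard.
per_shard = Counter(placement.values())
print(sorted(per_shard.items()))

# With plain modulo routing, changing the shard count relocates many documents.
moved = sum(1 for doc_id in doc_ids if shard_for(doc_id, 5) != placement[doc_id])
print(f"{moved} of {len(doc_ids)} documents change shard when going from 4 to 5 shards")
```

The same key always maps to the same shard, which is what lets a query for a known key go to one shard instead of all of them. The second print shows why resizing an index is not free under a naive scheme, and why easier growing and shrinking of clusters is a feature worth calling out.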

I recently attended a Lucene/Solr Revolution conference in Washington, DC and was pleasantly surprised by what I saw: a strong community, healthy project, lots of big name companies not only using Solr, but investing in it through adoption, contribution through development/engineering time, etc.  If you follow just the news you’d be led to believe Solr is dead and everyone is just flocking to Elasticsearch.  That is actually not the case.  Elasticsearch, being newer, is naturally more interesting to write about.  Solr was news 5+ years ago.  And of course there were some people going from Solr to Elasticsearch when Elasticsearch appeared; in the beginning there were simply no Elasticsearch users.

So which is better?  Which one should you use?  Where do Solr and Elasticsearch differ?  What does the future hold?

Here are some other things you should keep in mind:

  • Both are released under the Apache License, Version 2.0
  • Solr is truly open-source — community over code.  Anyone can contribute to Solr and new Solr developers (aka committers) are elected based on merit.  Elasticsearch is technically open-source, but less so in spirit.  Anyone can see the source, anyone can change it and offer a contribution, but only employees of Elasticsearch can actually make changes to Elasticsearch.
  • Solr contributors and committers come from a number of different organizations, while Elasticsearch committers are from a single company.
  • A number of organizations have chosen Solr over Elasticsearch as their horses in the search race (e.g. Cloudera, Hortonworks, MapR, etc.) even though they’ve also partnered with Elasticsearch.
  • Both Solr and Elasticsearch have lively user and developer communities and are rapidly being developed.
  • If you need to add certain missing functionality to either Solr or Elasticsearch, you may have more luck with Solr.  True, there are ancient Solr JIRA issues that are still open, but at least they are still open and not closed.  In Solr world the community has a bit more say even though at the end of the day it’s one of the Solr developers who has to accept and handle the contribution.
  • Both have good commercial support (consulting, production support, integration, etc.)
  • Both have good operational tools around them, although Elasticsearch has, because of its easier-to-work-with API, attracted the DevOps crowd a lot more, thus enabling a livelier ecosystem of tools around it.
  • Elasticsearch dominates the open-source log management use case — lots of organizations index their logs in Elasticsearch to make them searchable.  While Solr can now be used for this, too (see Solr for Indexing and Searching Logs and Tuning Solr for Logs), it just missed the mindshare boat on this one.
  • Solr is still much more text-search-oriented.  On the other hand, Elasticsearch is often used for filtering and grouping – the analytical query workload – and not necessarily text search.  Elasticsearch developers are putting a lot of effort into making such queries more efficient (lowering of the memory footprint and CPU usage) at both the Lucene and Elasticsearch levels.  As such, at this point in time, Elasticsearch is a better choice for applications that need to do not just text search, but also complex search-time aggregations.
  • Elasticsearch is a bit easier to get started with – a single download and a single command to get everything started.  Solr has traditionally required a bit more work and knowledge, but Solr has recently made great strides to eliminate this and now just has to work on changing its reputation.
  • Performance-wise, they are roughly the same.  I say “roughly”, because nobody has ever done comprehensive and non-biased benchmarks.  For 95% of use cases either choice will be just fine in terms of performance, and the remaining 5% need to test both solutions with their particular data and their particular access patterns.
  • Operationally speaking, Elasticsearch is a bit simpler to work with – it has just a single process.  Solr, in its Elasticsearch-like fully distributed deployment mode known as SolrCloud, depends on Apache ZooKeeper.  ZooKeeper is super mature, super widely used, etc. etc., but it’s still another moving part.  That said, if you are using Hadoop, HBase, Spark, Kafka, or a number of other newer distributed systems, you are likely already running ZooKeeper somewhere in your organization.
  • While Elasticsearch has a built-in ZooKeeper-like component called Zen, ZooKeeper is better at preventing the dreaded split-brain problem sometimes seen in Elasticsearch clusters.  To be fair, Elasticsearch developers are aware of this problem and are working on improving this aspect of Elasticsearch.
  • If you love monitoring and metrics, with Elasticsearch you’ll be in heaven.  The thing has more metrics than people you can squeeze in Times Square on New Year’s Eve!  Solr exposes the key metrics, but nowhere near as many as Elasticsearch.  Regardless, having comprehensive monitoring and centralized logging tools like Sematext’s SPM Performance Monitoring and Logsene Log Management and Analytics — especially when they work seamlessly together like these two do — is essential if you want to have a handle on metrics and other operational data.
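The split-brain point in the list above comes down to majority quorum: if a master can only be elected by more than half of the master-eligible nodes, a network partition can leave at most one side able to elect one. Below is a minimal sketch of that arithmetic. The function names and the 5-node cluster are hypothetical; in Zen-era Elasticsearch the corresponding knob is the `discovery.zen.minimum_master_nodes` setting, which is commonly recommended to be set to this majority value.

```python
def majority(master_eligible_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible_nodes // 2 + 1


def sides_able_to_elect(partition_sizes, quorum):
    """Return the partition sizes that could still elect a master."""
    return [size for size in partition_sizes if size >= quorum]


# A hypothetical 5-node cluster split 3/2 by a network partition.
total_nodes = 5
split = [3, 2]

# With a majority quorum, only the 3-node side can elect a master.
print(sides_able_to_elect(split, majority(total_nodes)))  # [3]

# With a quorum of 1 (no protection), both sides can: that is split brain.
print(sides_able_to_elect(split, 1))  # [3, 2]
```

Note that an even split (for example 2/2 in a 4-node cluster) leaves no side with a majority, which is one reason odd numbers of master-eligible nodes, and odd-sized ZooKeeper ensembles, are the usual advice.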

Here are a few charts to demonstrate what I mean:

Elasticsearch user mailing list traffic: 36,127 (source: Search-Lucene)

Solr user mailing list traffic is about two thirds of the Elasticsearch mailing list traffic.  Of course, this could be because there are more Elasticsearch users, or because there are more problems with Elasticsearch and users are in need of more help, or perhaps they are just a chattier bunch.

Elasticsearch vs. Solr Contributors (source: Open Hub)

As you can see, Elasticsearch numbers are trending sharply upward, and are now more than double Solr’s with regard to commit activity.  This is not a very precise or absolutely correct way to compare open-source projects, but it gives us some data points.  For example, Elasticsearch is developed on GitHub, which makes it very easy to merge others’ Pull Requests, while Solr contributors tend to create patches and upload them to JIRA, where they get reviewed by Solr committers before being applied, a less streamlined process.  Moreover, the Elasticsearch repository contains documentation, not just code, while Solr keeps its documentation in a Wiki.  This contributes to higher numbers for both commits and contributors for Elasticsearch.

Boil It Down For Me

In conclusion, here are the bits that I think make the most difference for anyone having to make a choice:

  • If you’ve already invested a lot of time in Solr, stick with it, unless there are specific use cases that it just doesn’t handle well.  If you think that is the case, speak to somebody close to both the Solr and Elasticsearch projects to save time, guesswork, and research, and to avoid mistakes.
  • If you are a strong believer in true open-source, Solr is closer to that than Elasticsearch, and having one company control Elasticsearch may be a turn-off.
  • If you need a data store that can handle analytical queries in addition to text searching, Elasticsearch is a better choice for that today.
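For readers who have not seen an "analytical query in addition to text searching", here is a small sketch of what such a request can look like in each engine. It only builds the request payloads and does not contact a server. The field names `message` and `host` and the aggregation name `errors_per_host` are hypothetical; the request shapes (a `terms` aggregation in the Elasticsearch query DSL, and `facet` / `facet.field` parameters in Solr) are standard features of each engine, but treat this as an illustration and not a tuned query.

```python
import json
from urllib.parse import urlencode

# Elasticsearch: a full-text match plus a terms aggregation in one JSON body.
# "size": 0 asks for the aggregation buckets only, with no individual hits.
es_body = {
    "size": 0,
    "query": {"match": {"message": "timeout"}},
    "aggs": {"errors_per_host": {"terms": {"field": "host"}}},
}

# Solr: a roughly comparable request using classic field faceting parameters.
solr_params = {
    "q": "message:timeout",
    "rows": 0,
    "facet": "true",
    "facet.field": "host",
}

print(json.dumps(es_body, indent=2))
print(urlencode(solr_params))
```

Both requests answer the same question: among documents matching "timeout", how many come from each host? The difference the article points at shows up once the grouping becomes more complex, for example aggregations nested inside other aggregations.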

If you expected a single definitive winner, I’m sorry to disappoint.  We don’t have one here.  However, I hope this quick comparison of the two leading open-source search engines provides enough information and guidance to help you make the right choice for your organization.

About the author: in addition to being a Lucene, Solr, and Elasticsearch expert and author, Otis Gospodnetić is the founder and CEO of Sematext. Sematext is a globally distributed organization that builds innovative Cloud and On Premises solutions for performance monitoring, alerting and anomaly detection of Solr, Elasticsearch, Hadoop, Spark and many other applications (SPM), log management and analytics (Logsene), site search analytics (SSA), and search enhancement. The company also provides Search and Big Data consulting services and offers 24/7 production support for Solr and Elasticsearch to clients worldwide.

[Note: the original version of this article appeared at Datanami.com.]
