
Five years ago Manticore began as a fork of an open source version of the once popular search engine Sphinx Search. We had ~~two bags of grass, seventy-five pellets of mescaline~~, three C++ developers, a support engineer, a power user of Sphinx Search / backend team lead, an experienced manager, a mother of five helping us part-time, and a ton of bugs, crashes, and technical debt. So we got a shovel and other digging tools and started working to get it up to search engine industry standards. Not that Sphinx was impossible to use, but many things were missing, and existing features weren’t quite stable or mature. And we had pushed it about as far as we could. So after 5 years and hundreds of new users, we’re ready to say that Manticore Search can be used as an alternative to Elasticsearch for both full-text search and (now) data analytics too.

In this article, I want to:

⭐⭐⭐ Your star on GitHub supports the project and makes us think we are on the right path!⭐⭐⭐

A little of the history

2001 - just Lucene and Sphinx

The first Apple store opened, Windows XP, iTunes and Mac OS X were released.

The genius Andrey Aksyonoff started working on Sphinx Search, for which I want to thank him very much! There was no SOLR or Elasticsearch yet, but there was already Lucene, on which they were both subsequently built. Sphinx Search started slowly coming together, and in a few years became quite a popular technology, used by thousands of websites.

2010 - Elasticsearch appeared

Retina display, systemd, iPad, and Elasticsearch appeared.

By this time Sphinx was already a popular full-text search engine, but Sphinx’s concept of “source data has to be somewhere and we just make a full-text index that needs to be rebuilt regularly” was not as attractive as Elasticsearch’s “give me any JSON via HTTP in real-time, I will find a node to place it on”. SOLR wasn’t very good with data distribution, and JSON was gaining popularity while XML was losing its appeal. Soon Elasticsearch started to rapidly gain popularity.

2017 - Manticore appeared

Elastic had firmly established itself as a standard tool for full-text search and log and data analytics.
Sphinx ceased its development as an open source project. Development, in general, slowed down significantly, and for some time was completely suspended.
Many Sphinx users who loved it and knew how to deal with it were not pleased about this and it was painful for them to migrate to Elasticsearch. In addition, by then, Elasticsearch’s conceptual flaws had surfaced: excessive memory consumption, difficulty in maintaining large clusters, and some performance issues.

As a result, the frustrated users and some former Sphinx developers teamed up and built a fork - Manticore Search. Our primary goals were as follows:

Continue developing the project as open source
Look at everything from just a regular everyday normal user’s point of view and add the functionality they need in today’s environment
Strengthen Sphinx’s strong sides and eliminate obvious weaknesses

2022: Five more years later

“Okay, who wants to find out if this thing works?”

🙁 Sphinx 2: The main use case is indexing data from an external database: Sphinx returns id, then by id you have to go to the database and search there for the source document. The data schema can only be declared in the config.

✅ Manticore: The basic way to work with it is exactly the same as in MySQL / Postgres and Elasticsearch: a table can be created on the fly, data can be modified by a single/bulk INSERT/REPLACE/DELETE query, the data gets automatically compacted in the background. There is no need to look up the original document in an external source. Auto ID supported.

🙁 Sphinx 2: No replication.

✅ Manticore: Replication based on Galera, which is also used by MariaDB and Percona Server.

🙁 Sphinx 2: Queries can be done via SQL (MySQL wire protocol) or Sphinx binary protocol, there are clients for a few programming languages.

✅ Manticore: Added a JSON interface very similar to Elasticsearch’s. Based on the new protocol, new clients for PHP, Python, Java, JavaScript, and Elixir were built. The clients are generated automatically, making new functionality available in the client sooner after it appears in the engine.

🙁 Sphinx 2: Difficult to configure text tokenization for most languages

✅ Manticore: Simplified: made aliases cjk and non_cjk. Made tokenization of Chinese based on ICU. Added many new stemmers, including Ukrainian.

🙁 Sphinx 2: No official Docker image and no support in the Kubernetes ecosystem

✅ Manticore: Built and maintain an official Docker image and a Helm chart for Kubernetes

🙁 Sphinx 2: No APT/YUM/Homebrew repositories

✅ Manticore: Added APT/YUM/Homebrew repositories. Nightly builds are also available in the development repository. Each new commit becomes available as a package.

🙁 Sphinx 2: Novice users had a hard time understanding what’s what.

✅ Manticore: Made platform with interactive courses — https://play.manticoresearch.com/

🙁 Sphinx 2: Few examples in the documentation

✅ Manticore: Rewrote the documentation and made our own rendering engine for it - https://manual.manticoresearch.com/. It’s also available in a simple markdown format for contributions and easy editing.

🙁 Sphinx 2: Bugs that often lead to crashes

✅ Manticore: Crashes are now rare. Hundreds of old bugs have been fixed.

🙁 Sphinx 2: Running search queries in parallel is limited

✅ Manticore: Migrated to coroutines. Made it possible to parallelize any search query, so as to fully load the CPU and reduce the response time to a minimum

🙁 Sphinx 2: Cannot be used without full-text fields

✅ Manticore: Can be used without full-text, like any other database.

🙁 Sphinx 2: Non-full-text data is stored row-wise, it must be in memory to work efficiently.

✅ Manticore: Implemented and open-sourced Manticore Columnar Library, an external fully independent library that allows storing data column-oriented in blocks with support for different codecs for compressing different types of data efficiently. Requires almost no memory. You can now handle much larger amounts of data on the same server.

🙁 Sphinx 2: No secondary indexes

✅ Manticore: The second important feature of Manticore Columnar Library is support for secondary indexes based on the modern and innovative PGM algorithm.

🙁 Sphinx 2: No percolate indexes for reverse search (when there are queries in the index and documents are used as input to find out which queries would match them)

✅ Manticore: Added percolate type indexes.

This is only about a third of the changes - the ones you can easily see. On top of that, there have been many months of refactoring different parts of the system, resulting in much simpler, more reliable, and faster code. We hope this will attract new developers to the project.

What about Elasticsearch?

Elasticsearch is fine: it’s not very hard to use up to a certain amount of data, there’s replication, fault tolerance, and rich functionality. But there are nuances.

Let’s take a look at those nuances and what Manticore is like compared to Elasticsearch now (July 2022). Future reader, we’ve already bolted something else on, check out our Changelog.

Search Speed

Performance, namely low response time, is important in many cases, especially in log and data analytics, when there is a lot of data and not many search queries. You don’t want to wait 30 seconds instead of two for a response, do you? So here are the nuances: Elasticsearch is considered a standard for log management, but, for example, it can’t effectively parallelize a query to a single index shard. And Elasticsearch has only 1 shard by default, while modern servers have many more CPU cores. Making too many shards is also bad. All this doesn’t make life any easier for a devops engineer who cares about response time: you have to think about what hardware Elasticsearch will run on and make changes accordingly.

Manticore, on the contrary, is able to parallelize the search query to all CPU cores unconditionally and by default. It would be more correct to say that Manticore itself decides when to parallelize and when not, but in most cases it does, which allows you to efficiently load the CPU cores (which are often idle in cases of logging and data analytics) and significantly reduce response time.
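The idea of splitting one query across cores can be pictured as a scatter-gather: cut the data into chunks, scan the chunks concurrently, and merge the partial results. The sketch below only illustrates that pattern and is not how Manticore is implemented internally; the function names are hypothetical, and Python threads will not show a real CPU speedup here, so read it for the shape of the approach only.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matches(chunk, term):
    """Scan one chunk of documents and count those containing the term."""
    return sum(1 for doc in chunk if term in doc)


def parallel_count(docs, term, workers=4):
    """Split docs into up to `workers` chunks, scan them concurrently, merge counts."""
    size = max(1, -(-len(docs) // workers))  # ceiling division, at least 1
    chunks = [docs[i:i + size] for i in range(0, len(docs), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(lambda chunk: count_matches(chunk, term), chunks)
        return sum(partial_counts)


docs = ["error timeout", "ok", "error disk", "ok", "warning", "error net"]
total = parallel_count(docs, "error", workers=3)
print(total)
```

With a single shard and no intra-shard parallelism, the whole scan would run in one worker, which is the situation the paragraph above describes for a default Elasticsearch index.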

But even if you make as many shards in Elasticsearch as there are CPU cores on the server, Manticore turns out to be significantly faster, specifically: here’s a test for 1.7 billion documents, from which you can see that overall Manticore is 4 times faster than Elasticsearch. If you are interested in the details or want to reproduce that on your own hardware, here is an article https://db-benchmarks.com/test-taxi/ (all examples below are also supported by scripts and links, etc., you won’t find any idle talking in this blog)

Here is a different case: no big data, just 1.1 million comments from Hacker News. In this test, Manticore is 15x faster than Elasticsearch. All the details here.

And another test indicative for Elasticsearch as a standard log analytics tool - 10 million Nginx logs and various quite realistic analytical queries - Manticore is 22 times faster than Elasticsearch here. All the details here

Data ingestion performance

There are also nuances with Elasticsearch’s write speed. For example, the dataset for the 1.7 billion-document test discussed above was loaded:

to Elasticsearch - in 28 hours and 33 minutes
to Manticore Search - 1 hour and 8 minutes.

This was on a 32-core server with SSD. The amounts of data after indexing are about the same. To learn more about how exactly the load was handled read here.

In brief:

Source - csv
Logstash was used to put data to Elasticsearch with PIPELINE_BATCH_SIZE=10000 and PIPELINE_WORKERS=32 in 32 shards.
Manticore Search used its built-in indexer tool to load data into 32 shards in parallel.

Here is the log of the data loading to Elasticsearch and Manticore: https://gist.github.com/sanikolaev/678dd862a7668921e3417321be0a2513

It turns out that in this test Manticore is 25 times faster in terms of data ingestion. Maybe I don’t know how to cook Logstash and Elasticsearch properly, but importing the same dataset (of a slightly smaller size) took Mark Litwintschik even longer: 4 days and 16 hours.
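The 25x figure follows directly from the two load times quoted above. A quick check of the arithmetic, using only those two durations (the variable names are mine):

```python
from datetime import timedelta

# Load times reported above for the 1.7 billion-document dataset.
elasticsearch_load = timedelta(hours=28, minutes=33)
manticore_load = timedelta(hours=1, minutes=8)

# Dividing one timedelta by another yields a plain float ratio.
speedup = elasticsearch_load / manticore_load
print(f"{speedup:.1f}x")
```

That is 1713 minutes against 68 minutes, which rounds to the 25x stated in the text.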

Maybe the problem is in Logstash, not Elasticsearch? Let’s find out by writing directly to Elasticsearch. The index schema is as follows:

"properties": {
  "name": {"type": "text"},
  "email": {"type": "keyword"},
  "description": {"type": "text"},
  "age": {"type": "integer"},
  "active": {"type": "integer"}
}

Starting Manticore and Elasticsearch using their official docker images like this:

docker run --name manticore --rm -p 9308:9308 -v $(pwd)/manticore_idx:/var/lib/manticore manticoresearch/manticore:5.0.2

docker run --name elasticsearch --rm -p 9200:9200 -e discovery.type=single-node -e xpack.security.enabled=false -v $(pwd)/es_idx/:/usr/share/elasticsearch/data docker.elastic.co/elasticsearch/elasticsearch:8.3.2

Let’s now load 50 million random docs like this into both:

{
  "active": 1,
  "age": 84,
  "description": "Aut corporis qui necessitatibus architecto est. Harum laboriosam temporibus praesentium quis et nulla. Consequuntur quia neque et repellat.",
  "email": "[email protected]",
  "name": "Keely Doyle Sr."
}

We’ll use simple PHP scripts with a batch size of 10,000 and a concurrency of 32 (the server has 16 physical CPU cores with hyper-threading).

root@perf3 ~ # php load_elasticsearch.php 10000 32 1000000 50
preparing...
found in cache
querying...
finished inserting
Total time: 178.24096798897
280519 docs per sec

root@perf3 ~ # php load_manticore.php 10000 32 1000000 50
preparing...
found in cache
querying...
finished inserting
Total time: 215.7572619915
231742 docs per sec

OK, now Elastic is 21% faster, but again there is an interesting nuance: Elasticsearch by default buffers new documents for one second, which means the last batch will not be available for searching right away. This is ok in many cases, but to make things fair let’s do /bulk?refresh=1 in Elasticsearch and see what it gives:

root@perf3 ~ # php load_elasticsearch.php 10000 32 1000000 50
preparing...
found in cache
querying...
finished inserting
Total time: 307.47588610649
162614 docs per sec

In this case Manticore is again faster by 43%.
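The percentages above can be re-derived from the reported wall-clock times and the 50 million documents loaded in each run. The sketch below just redoes that arithmetic; the dictionary keys and helper name are mine.

```python
TOTAL_DOCS = 50_000_000

# "Total time" values (seconds) printed by the three runs above.
runs = {
    "elasticsearch_default": 178.24096798897,
    "manticore": 215.7572619915,
    "elasticsearch_refresh": 307.47588610649,
}

# Throughput in documents per second for each run.
rates = {name: TOTAL_DOCS / seconds for name, seconds in runs.items()}


def faster_by(a, b):
    """Percentage by which throughput a exceeds throughput b."""
    return (a / b - 1) * 100


es_lead = faster_by(rates["elasticsearch_default"], rates["manticore"])
manticore_lead = faster_by(rates["manticore"], rates["elasticsearch_refresh"])
print(round(es_lead), round(manticore_lead))
```

The first number is Elasticsearch's lead with default buffering, the second is Manticore's lead once Elasticsearch is asked to refresh on every bulk request.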

If we want to test the maximum performance, we can:

Use sharding in both Elasticsearch and Manticore
Let Elasticsearch buffer incoming documents at maximum
Use MySQL interface to put data to Manticore Search (it’s slightly faster)
Disable binlog in Manticore Search (unfortunately, you can’t do that in Elasticsearch)

Here’s what it gives:

Manticore:

// docker run -p9306:9306 --name manticore --rm -v $(pwd)/manticore_idx:/var/lib/manticore -e searchd_binlog_path= manticoresearch/manticore:5.0.2

root@perf3 ~ # php load_manticore_sharded.php 10000 32 1000000 32 50
preparing...
found in cache /tmp/bc9719fb0d26e18fc53d6d5aaaf847b4_10000_1000000
querying...
finished inserting
Total time: 55.874907970428
894856 docs per sec

Elasticsearch:

root@perf3 ~ # php load_elasticsearch_sharded.php 10000 32 1000000 32 50
preparing...
found in cache
querying...
finished inserting
Total time: 119.96515393257
416788 docs per sec

But, remember the nuance: you have to spend another 13 seconds to make the documents searchable:

root@perf3 ~ # curl -s -X POST "localhost:9200/_sql?format=json&pretty" -H 'Content-Type: application/json' -d'{"query": "select count(*) from user"}'
{
  "columns" : [
    {
      "name" : "count(*)",
      "type" : "long"
    }
  ],
  "rows" : [
    [
      0
    ]
  ]
}

root@perf3 ~ # time curl -XPOST "localhost:9200/user/_refresh"
{"_shards":{"total":64,"successful":32,"failed":0}}
real    0m13.505s
user    0m0.003s
sys     0m0.000s

root@perf3 ~ # curl -s -X POST "localhost:9200/_sql?format=json&pretty" -H 'Content-Type: application/json' -d'{"query": "select count(*) from user"}'
{
  "columns" : [
    {
      "name" : "count(*)",
      "type" : "long"
    }
  ],
  "rows" : [
    [
      50000000
    ]
  ]
}

All in all, Manticore is 2x faster than Elasticsearch in terms of data ingestion performance. And the data is searchable immediately after the batch is loaded, not 2 minutes later. The scripts used for this test can be found here.
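For readers who have not seen a bulk loader, here is a rough idea of what "batch size 10,000" means on the wire. This is not the PHP script used in the test, only a minimal sketch of how documents might be cut into batches and serialized as newline-delimited JSON for each engine's bulk endpoint. The `insert`/`index`/`doc` layout for Manticore and the `index`/`_index` action line for Elasticsearch reflect my understanding of their bulk formats at the time and should be checked against the current manuals; the table name `user` and the helper names are assumptions.

```python
import json


def batches(docs, batch_size):
    """Yield consecutive slices of at most batch_size documents."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]


def manticore_bulk_body(table, docs):
    """NDJSON body with one insert action per line (assumed Manticore /bulk layout)."""
    lines = [json.dumps({"insert": {"index": table, "doc": doc}}) for doc in docs]
    return "\n".join(lines) + "\n"


def elasticsearch_bulk_body(index, docs):
    """NDJSON body with an action line followed by a source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


docs = [{"name": f"user {i}", "age": 20 + i, "active": 1} for i in range(25)]
chunks = list(batches(docs, 10))
print(len(chunks), [len(chunk) for chunk in chunks])
```

Each batch body would then be sent in a single HTTP request, with several such requests in flight at once to get the concurrency of 32 used in the test.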

What it’s written in

Elasticsearch itself is written in Java, and the Lucene library it uses and depends on is also written in Java.
Manticore is written in C++. What this gives us:
- The code is harder to write, yes.
- But we are closer to the hardware, so we can write more optimized code.
- No need to think about JVM heap size.
- No risk of the JVM garbage collector kicking in at an inappropriate moment, which can greatly affect performance.
- No need to run a heavy JVM on startup, which takes quite some time.

Open source

Elasticsearch is no longer purely open source. The license was changed from Apache 2 to the Elastic License in 2021.
Manticore is purely open source with GPLv2 license for the daemon and the Apache 2 license for the columnar library.

JSON vs SQL

Both Elasticsearch and Manticore can do both SQL and JSON, but the difference is:

Elasticsearch is based on JSON by default while Manticore is SQL-first. What we love in SQL is that if you use it, many things are much easier to do at the proof of concept stage. For example, here are 2 queries that do the same thing. Do you wanna spend a minute counting { and } brackets or … ?
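The two queries originally shown with this paragraph are not reproduced in this text, so the pair below is a hypothetical stand-in rather than the author's example: a full-text match with a grouped count, written once as Manticore-style SQL and once as Elasticsearch Query DSL. The table `user` and fields `name` and `age` are borrowed from the ingestion test above; treating the two as roughly equivalent is my assumption.

```python
import json

# Manticore-style SQL: match a word in the name field, count hits per age.
sql_query = (
    "SELECT age, COUNT(*) AS cnt FROM user "
    "WHERE MATCH('@name doyle') GROUP BY age ORDER BY cnt DESC LIMIT 5"
)

# Elasticsearch Query DSL expressing roughly the same request.
json_query = {
    "size": 0,
    "query": {"match": {"name": "doyle"}},
    "aggs": {"by_age": {"terms": {"field": "age", "size": 5}}},
}

json_text = json.dumps(json_query)
opening_braces = json_text.count("{")
print(len(sql_query), len(json_text), opening_braces)
```

Even for a request this small, the JSON form nests six levels of braces, which is the bracket counting the paragraph above is joking about.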

SQL is very limited in Elasticsearch, for example:
- you can’t do SELECT id
- you can’t INSERT/UPDATE/DELETE
- you can’t run service commands (create cluster, see status, etc.).
In Manticore it’s the other way around:
- You can do everything via SQL
- JSON covers only basic functionality: search and data modification queries.

Startup time

In some cases, you need to be able to launch a service quickly. For example, in IoT (Internet of things) or some ETL scenarios.

Elasticsearch takes a long time to start up.
Manticore takes just a couple of seconds to start up with a table of 1.1 million documents

Near-real-time vs real-time

As mentioned above, by default, when you put data to Elasticsearch, it becomes searchable only after a second. This can be adjusted, but then the ingestion rate will become significantly slower, as you can see above.

Manticore always works in real-time mode.

Full-text search

Probably worth another article to explain it all. In short: both Manticore and Elasticsearch are good at full-text search and have a lot in common, but there are a lot of differences, too. According to these objective tests (objectivity is important when evaluating relevance), on almost default settings Manticore can give higher relevance than Elasticsearch. Here is the relevant pull request in BEIR (information retrieval benchmark).

Aggregations

Both Manticore and Elasticsearch provide rich aggregation functionality. You probably know what Elasticsearch can do, here’s what can be done in Manticore for you to compare:

Just grouping: SELECT release_year FROM films GROUP BY release_year LIMIT 5
Get aggregates: SELECT release_year, AVG(rental_rate) FROM films GROUP BY release_year LIMIT 5
Sort buckets: SELECT release_year, count(*) from films GROUP BY release_year ORDER BY release_year asc limit 5
Group by multiple fields at the same time: SELECT category_id, release_year, count(*) FROM films GROUP BY category_id, release_year ORDER BY category_id ASC, release_year ASC
Get N records from each bucket, not 1: SELECT release_year, title FROM films GROUP 2 BY release_year ORDER BY release_year DESC LIMIT 6
Sort inside a bucket: SELECT release_year, title, rental_rate FROM films GROUP BY release_year WITHIN GROUP ORDER BY rental_rate DESC ORDER BY release_year DESC LIMIT 5
Filter buckets: SELECT release_year, avg(rental_rate) avg FROM films GROUP BY release_year HAVING avg > 3
Use GROUPBY() to access aggregation key: SELECT release_year, count(*) FROM films GROUP BY release_year HAVING GROUPBY() IN (2000, 2002)
Group by array value: SELECT groupby() gb, count(*) FROM shoes GROUP BY sizes ORDER BY gb asc
Group by json node: SELECT groupby() color, count(*) from products GROUP BY meta.color
Get count of distinct values: SELECT major, count(*), count(distinct age) FROM students GROUP BY major
Use GROUP_CONCAT(): SELECT major, count(*), count(distinct age), group_concat(age) FROM students GROUP BY major
Use FACET after your main query and it will group the main query’s results: SELECT *, price AS aprice FROM facetdemo LIMIT 10 FACET price LIMIT 10 FACET brand_id LIMIT 5
Faceting by aggregation over another attribute: SELECT * FROM facetdemo FACET brand_name by brand_id
Faceting without duplicates: SELECT brand_name, property FROM facetdemo FACET brand_name distinct property
Facet over expressions: SELECT * FROM facetdemo FACET INTERVAL(price,200,400,600,800) AS price_range
Facet over multi-level grouping: SELECT *,INTERVAL(price,200,400,600,800) AS price_range FROM facetdemo FACET price_range AS price_range, brand_name ORDER BY brand_name asc;

Sorting of facet results:

SELECT * FROM facetdemo
FACET brand_name BY brand_id ORDER BY FACET() ASC
FACET brand_name BY brand_id ORDER BY brand_name ASC
FACET brand_name BY brand_id ORDER BY COUNT(*) DESC

Pagination in facet results:

SELECT * FROM facetdemo
FACET brand_name BY brand_id ORDER BY FACET() ASC  LIMIT 0,1
FACET brand_name BY brand_id ORDER BY brand_name ASC LIMIT 2,4
FACET brand_name BY brand_id ORDER BY COUNT(*) DESC LIMIT 4;
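If FACET is unfamiliar, it may help to see what such a query returns: the main result set plus one grouped count per facet. The sketch below emulates that in plain Python over a tiny made-up table. It is an illustration of the semantics as I understand them (facets computed over all matching rows, independent of the main LIMIT), not Manticore code, and the ascending bucket order is my own choice.

```python
from collections import Counter

# A tiny stand-in for the facetdemo table used in the queries above.
rows = [
    {"id": 1, "price": 120, "brand_id": 1},
    {"id": 2, "price": 120, "brand_id": 2},
    {"id": 3, "price": 450, "brand_id": 1},
    {"id": 4, "price": 450, "brand_id": 1},
    {"id": 5, "price": 900, "brand_id": 3},
]


def facet(matching_rows, attribute, limit=None):
    """Group rows by one attribute and count rows per distinct value."""
    counts = Counter(row[attribute] for row in matching_rows)
    buckets = sorted(counts.items())  # ascending by facet value
    return buckets if limit is None else buckets[:limit]


def select_with_facets(table, limit, facets):
    """Return the limited main rows plus one bucket list per requested facet."""
    main = table[:limit]
    return main, {attr: facet(table, attr, lim) for attr, lim in facets}


main, facet_results = select_with_facets(rows, 10, [("price", 10), ("brand_id", 5)])
print(facet_results["price"])
print(facet_results["brand_id"])
```

This mirrors the shape of `SELECT * FROM facetdemo LIMIT 10 FACET price LIMIT 10 FACET brand_id LIMIT 5`: one call, several result sets.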

Schemaless

Elasticsearch is famous for the fact that you can write anything into it. With Manticore Search, you have to create a schema beforehand. Many Elasticsearch experts recommend using static mapping, for example, https://octoperf.com/blog/2018/09/21/optimizing-elasticsearch/#index-mapping:

One of the very first things you can do is to define your indice mapping statically.

But we find dynamic mapping important in the area of log management and analysis. Since we want Manticore to be easy to use for that, we have plans to enable dynamic mapping in Manticore, too.

Integrations

Both Elasticsearch and Manticore have clients for different programming languages.
MySQL wire protocol support:
- An important advantage of Manticore is the possibility to use MySQL clients to work with the server. Even if there is no official Manticore client for some language, there is definitely a MySQL client you can use. Using the command line MySQL client for administration is more convenient than using curl, because the commands are much more compact and the session persists between commands.
- The support for the MySQL protocol has also made it possible to support the MySQL/MariaDB FEDERATED engine for tight integration between those and Manticore.
- In addition, Manticore can be used via ProxySQL.
HTTP JSON API is supported in both Elasticsearch and Manticore.
Logstash, Kibana: Manticore supports Kibana, but it’s a work in progress and in a beta stage. We’ll get those integrations up to speed soon. This is how you can try Manticore with Kibana:

# download manticore beta version with support for Kibana, check https://repo.manticoresearch.com/repository/kibana_beta/ for different OS versions
wget https://repo.manticoresearch.com/repository/kibana_beta/ubuntu/jammy.zip

# unarchive it
unzip jammy.zip

# install the packages
dpkg -i build/*

# switch Manticore to the mode supporting Kibana
mysql -P9306 -h0 -e "set global log_management = 0; set global log_management = 1;"

# start Kibana pointing it to Manticore Search instance listening on port 9308
docker run -d --name kibana --rm -e ELASTICSEARCH_HOSTS=http://127.0.0.1:9308 -p 5601:5601 --network=host docker.elastic.co/kibana/kibana:7.4.2

# install php and composer, download loading script and put into Manticore 1 million docs of fake users
apt install php composer php8.1-mysql
wget https://gist.githubusercontent.com/sanikolaev/13bf61bbe6c39350bded7c577216435f/raw/8d8029c0d99998c901973fd9ac66a6fb920deda7/load_manticore_sharded.php
composer require fakerphp/faker
php load_manticore_sharded.php 10000 16 1000000 16 1

# don't forget to create an index pattern in Kibana (user*)

# run `docker stop kibana` to stop the Kibana server

If all went well you should see:

Replication

Both Elasticsearch and Manticore Search use synchronous replication. At Manticore we decided not to reinvent the wheel and integrated the Galera library, which is also used by MariaDB and Percona XtraDB Cluster.
An important difference in managing replication and clustering in Manticore and Elasticsearch is that with Elasticsearch you need to edit the config to set up a replica, while in Manticore you don’t: replication is always enabled and it’s very easy to connect to and sync up with another node:
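The listing that accompanied this paragraph is not present in this text. As a hedged sketch, the statements involved look roughly like the ones generated below: `CREATE CLUSTER` and `JOIN CLUSTER` are named later in this article, `ALTER CLUSTER ... ADD` is the statement I understand is used to put a table under replication, and the cluster name, table name, host, and port 9312 are all hypothetical.

```python
def create_cluster_statements(cluster, tables):
    """SQL for the first node: create the cluster, then add tables to it."""
    statements = [f"CREATE CLUSTER {cluster}"]
    statements += [f"ALTER CLUSTER {cluster} ADD {table}" for table in tables]
    return statements


def join_cluster_statement(cluster, host, port=9312):
    """SQL for any other node: join the cluster via an existing member."""
    return f"JOIN CLUSTER {cluster} AT '{host}:{port}'"


# Hypothetical names: a cluster called "posts" replicating one table.
for statement in create_cluster_statements("posts", ["user"]):
    print(statement + ";")
print(join_cluster_statement("posts", "10.0.0.1") + ";")
```

The statements would be run over the MySQL interface, the first two on the initial node and the last one on each node that should receive a synced copy.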

Sharding and distributed indexes

Unlike Elasticsearch, Manticore does not yet have automatic sharding, but combining multiple indexes into one for manual sharding is easier than in Elasticsearch:

Adding an index located on a remote node is also supported, just specify the remote host, port, and index name.
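The example for this section is also missing from this text. A minimal sketch of what manual sharding can look like, assuming the `CREATE TABLE ... type='distributed'` form with `local=` and `agent=` options as I recall it from the Manticore manual; the table, shard, and host names are hypothetical.

```python
def distributed_table_sql(name, local_shards=(), remote_shards=()):
    """Build a CREATE TABLE statement combining local and remote shards.

    remote_shards is an iterable of (host, port, table) tuples.
    """
    parts = [f"CREATE TABLE {name} type='distributed'"]
    parts += [f"local='{shard}'" for shard in local_shards]
    parts += [f"agent='{host}:{port}:{table}'" for host, port, table in remote_shards]
    return " ".join(parts)


sql = distributed_table_sql(
    "user_all",
    local_shards=["user1", "user2"],
    remote_shards=[("10.0.0.2", 9312, "user3")],
)
print(sql)
```

Queries against `user_all` would then fan out to the two local shards and the remote one, which matches the "host, port, and index name" description above.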

Ease of use and learning

Our thinking is that we don’t want our users, whether developers or devops engineers, to have to become experts in databases or search engines or have a PhD to be able to use Manticore products. We assume you have other things to do rather than spending hours trying to understand how this or that setting affects this or that functionality. Hence, Manticore Search should work fine in most cases even on defaults.

Our ultimate goal is to make Manticore Search as easy to use and learn as possible.

As mentioned previously, Manticore is SQL-first which we find important while you are just getting started with Manticore compared to Elasticsearch.
Manticore provides interactive courses - play.manticoresearch.com to walk you through the essential steps to get familiar with Manticore.
There is a guide on how to get started with examples for different OSes and programming languages - https://manual.manticoresearch.com/Quick_start_guide .
You can talk directly to the developers in public channels: Slack, Telegram, Forum.
We have a special short domain mnt.cr integrated with the documentation so that mnt.cr/<keyword> takes you to the search results in the documentation in special mode - it immediately rewinds to the most relevant section. This is especially handy when you need to recall some details on some setting, e.g. mnt.cr/max_packet_size.

Cloud native

Elasticsearch provides a Kubernetes operator.
Manticore Search provides a Helm chart.

Imperative and declarative usage modes

In Elasticsearch, most things are only done through the API. There is no way (anymore) to add mappings to a configuration file so that they are available immediately after startup.

Manticore, like Kubernetes, supports two usage modes:

Imperative: when everything can be managed online using CREATE TABLE/DROP TABLE/ALTER TABLE, CREATE CLUSTER/JOIN CLUSTER/DELETE CLUSTER etc.
Declarative: when you can define mappings in a configuration file, which gives greater portability and easier integration of Manticore into CI/CD, ETL, and other processes.

Percolate

Percolate or Persistent Query is when a table contains queries, not documents, and the search is performed with documents, not queries. The search results are the queries that match the documents. This type of search is useful for users’ subscriptions: if you subscribed, for example, to the query TV > 42 inches, then as soon as a matching item appears on the site, you will be notified about it. Manticore provides this functionality, as does Elasticsearch. According to the tests we did a few years ago, the throughput of this type of search in Manticore is significantly higher than in Elasticsearch.
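To make the reversal concrete, here is a toy emulation of percolation in plain Python: queries are stored up front, and each incoming document is checked against them. It only illustrates the idea; the term-based matching rule, the query ids, and the sample document are all made up, and a real engine uses its full query language and an index over the stored queries instead of a linear scan.

```python
def matches(query_terms, document):
    """A stored query matches when every one of its terms occurs in the document."""
    words = set(document.lower().split())
    return all(term in words for term in query_terms)


# Stored "subscriptions": query id -> required terms (lowercase).
stored_queries = {
    "q1": ["tv", "oled"],
    "q2": ["laptop"],
    "q3": ["tv"],
}


def percolate(document):
    """Return the ids of all stored queries that the incoming document satisfies."""
    return sorted(
        query_id
        for query_id, terms in stored_queries.items()
        if matches(terms, document)
    )


print(percolate("New 55 inch OLED TV in stock"))
```

Every subscriber whose query id comes back would be notified about the new item.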

What’s next?

We are now developing the project in the following directions:

Drop-in replacement for Elasticsearch in the ELK stack, so Kibana and Logstash (or the Opensearch alternatives) can work with it fine. We want the low latency that’s easier to achieve with Manticore to be available to people for log analysis. We already have a beta.
Schemaless mode. When you use Manticore as a log analysis solution you don’t have to think about the schemas.
Automatic sharding and orchestration of shards, so you can load data into Manticore even faster and the shards will be spread out in an optimal order for better fault tolerance.
Further performance optimizations. We just want even lower latency and higher throughput, so you can run Manticore on cheaper hardware and make the Earth greener.

Conclusions

So, at the end of it all, what do we have? Manticore may now be of interest to those who:

care about low response times on both small and large amounts of data,
like SQL,
want something simpler than Elasticsearch to integrate search into their application faster,
want something more lightweight that starts fast,
care about using purely open source software.

We are continuing!