Five years ago Manticore began as a fork of an open source version of the once popular search engine Sphinx Search. We had two bags of grass, seventy-five pellets of mescaline, three C++ developers, a support engineer, a power user of Sphinx Search / backend team lead, an experienced manager, a mother of five helping us part-time, and a ton of bugs, crashes, and technical debt. So we got a shovel and other digging tools and started working to bring it up to search engine industry standards. Not that Sphinx was impossible to use, but many things were missing, and existing features weren’t quite stable or mature. And we had pushed it about as far as we could. So after 5 years and hundreds of new users, we’re ready to say that Manticore Search can be used as an alternative to Elasticsearch for both full-text search and (now) data analytics too.

In this article, I want to:

⭐⭐⭐ Your star on GitHub supports the project and makes us think we are on the right path!⭐⭐⭐

A little of the history

2001 - just Lucene and Sphinx

The first Apple store opened, Windows XP, iTunes and Mac OS X were released.

The genius Andrey Aksyonoff started working on Sphinx Search, for which I want to thank him very much! There was no Solr or Elasticsearch yet, but there was already Lucene, on which they were both subsequently built. Sphinx Search slowly came together and in a few years became quite a popular technology, used by thousands of websites.

2010 - Elasticsearch appeared

Retina display, systemd, iPad, and Elasticsearch appeared.

By this time Sphinx was already a popular full-text search engine, but Sphinx’s concept of “source data has to be somewhere and we just make a full-text index that needs to be rebuilt regularly” was not as interesting as Elasticsearch’s “give me any JSON via HTTP in real-time, I will find a node to place it on”. Solr wasn’t very good with data distribution, and JSON was gaining popularity, while XML was losing its appeal. Elasticsearch soon started to gain popularity rapidly.

2017 - Manticore appeared

As a result, frustrated users and some former Sphinx developers teamed up and built a fork: Manticore Search. Our primary goals were as follows:

2022: Five more years later

“Okay, who wants to find out if this thing works?”


🙁 Sphinx 2: The main use case is indexing data from an external database: Sphinx returns id, then by id you have to go to the database and search there for the source document. The data schema can only be declared in the config.

✅ Manticore: The basic way to work with it is exactly the same as in MySQL / Postgres and Elasticsearch: a table can be created on the fly, data can be modified by a single/bulk INSERT/REPLACE/DELETE query, the data gets automatically compacted in the background. There is no need to look up the original document in an external source. Auto ID supported.


🙁 Sphinx 2: No replication.

✅ Manticore: Replication based on Galera, which is also used by MariaDB and Percona Server.


🙁 Sphinx 2: Queries can be made via SQL (MySQL wire protocol) or the Sphinx binary protocol; there are clients for a few programming languages.

✅ Manticore: Added a JSON interface very similar to Elasticsearch’s. Based on the new protocol, new clients for PHP, Python, Java, JavaScript, and Elixir were built. The clients are generated automatically, making new functionality available in the client sooner after it appears in the engine.


🙁 Sphinx 2: Difficult to configure text tokenization for most languages

✅ Manticore: Simplified: made aliases cjk and non_cjk. Made tokenization of Chinese based on ICU. Added many new stemmers, including Ukrainian.


🙁 Sphinx 2: No official Docker image and no support in the Kubernetes ecosystem

✅ Manticore: Built and maintain an official Docker image and a Helm chart for Kubernetes


🙁 Sphinx 2: No APT/YUM/Homebrew repositories

✅ Manticore: Added APT/YUM/Homebrew repositories. Nightly builds are also available in the development repository. Each new commit becomes available as a package.


🙁 Sphinx 2: Novice users had a hard time understanding what’s what.

✅ Manticore: Made platform with interactive courses — https://play.manticoresearch.com/


🙁 Sphinx 2: Few examples in the documentation

✅ Manticore: Rewrote the documentation and built our own rendering engine for it: https://manual.manticoresearch.com/. It’s also available in a simple markdown format for contributions and easy editing.


🙁 Sphinx 2: Bugs that often led to crashes

✅ Manticore: Crashes are now rare. Hundreds of old bugs have been fixed.


🙁 Sphinx 2: Running search queries in parallel is limited

✅ Manticore: Migrated to coroutines. Made it possible to parallelize any search query, so as to fully load the CPU and reduce the response time to a minimum


🙁 Sphinx 2: Cannot be used without full-text fields

✅ Manticore: Can be used without full-text, like any other database.


🙁 Sphinx 2: Non-full-text data is stored row-wise, it must be in memory to work efficiently.

✅ Manticore: Implemented and open-sourced Manticore Columnar Library, an external fully independent library that allows storing data column-oriented in blocks with support for different codecs for compressing different types of data efficiently. Requires almost no memory. You can now handle much larger amounts of data on the same server.


🙁 Sphinx 2: No secondary indexes

✅ Manticore: The second important feature of Manticore Columnar Library is support for secondary indexes based on the modern and innovative PGM algorithm.


🙁 Sphinx 2: No percolate indexes for reverse search (when there are queries in the index and documents are used as input to find out which queries would match them)

✅ Manticore: Added percolate type indexes.

This is only about a third of the changes: the ones you can easily see. On top of that, there have been many months of refactoring different parts of the system, resulting in much simpler, more reliable, and more performant code. We hope this will attract new developers to the project.

What about Elasticsearch?

Elasticsearch is fine: it’s not very hard to use up to a certain amount of data, there’s replication, fault tolerance, and rich functionality. But there are nuances.

Let’s take a look at those nuances and what Manticore is like compared to Elasticsearch now (July 2022). Future reader, we’ve already bolted something else on, check out our Changelog.

Search Speed

Performance, namely low response time, is important in many cases, especially in log and data analytics, when there is a lot of data and not many search queries. You don’t want to wait 30 seconds instead of two for a response, do you? So here’s to the nuances: Elasticsearch is considered a standard for log management, but, for example, it can’t effectively parallelize a query to a single index shard. And Elasticsearch has only 1 shard by default, while modern servers have many more CPU cores. Making too many shards is also bad. None of this makes life easier for a DevOps engineer who cares about response time: you have to think about what hardware Elasticsearch will run on and make changes accordingly.

Manticore, on the contrary, is able to parallelize the search query to all CPU cores unconditionally and by default. It would be more correct to say that Manticore itself decides when to parallelize and when not, but in most cases it does, which allows you to efficiently load the CPU cores (which are often idle in cases of logging and data analytics) and significantly reduce response time.
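To make the idea concrete, here is a toy sketch of fanning one query out over several chunks of data and merging the partial results. It is not Manticore's implementation: the `search_chunk` and `parallel_count` names and the in-memory chunks are assumptions made purely for illustration, and Python threads show the fan-out/merge structure rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
import os


def search_chunk(chunk, term):
    """Count the documents in one chunk that contain the term as a word."""
    return sum(1 for doc in chunk if term in doc.split())


def parallel_count(chunks, term, workers=None):
    """Run the same query against every chunk concurrently and merge the counts."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(search_chunk, chunks, [term] * len(chunks))
        return sum(partial)


# Hypothetical data: three chunks of a single table.
chunks = [
    ["error in nginx", "ok"],
    ["nginx restarted", "disk error"],
    ["error error error"],
]
print(parallel_count(chunks, "error"))
```

The point of the sketch is that the caller issues one query and gets one merged answer; how the work is split is decided inside the engine rather than by the person configuring shards.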

But even if you make as many shards in Elasticsearch as there are CPU cores on the server, Manticore turns out to be significantly faster. In a test on 1.7 billion documents, Manticore is overall 4 times faster than Elasticsearch. If you are interested in the details or want to reproduce that on your own hardware, here is an article: https://db-benchmarks.com/test-taxi/ (all the examples below are also backed by scripts and links; you won’t find any idle talk in this blog).

Here is a different case: no big data, just 1.1 million comments from Hacker News. In this test, Manticore is 15x faster than Elasticsearch. All the details here.

And another test, relevant to Elasticsearch as a standard log analytics tool: 10 million Nginx logs and various quite realistic analytical queries. Manticore is 22 times faster than Elasticsearch here. All the details here.

Data ingestion performance

There are also nuances with Elasticsearch’s write speed. For example, the dataset for the 1.7 billion-document test discussed above was loaded:

This was on a 32-core server with SSD. The amounts of data after indexing are about the same. To learn more about how exactly the load was handled read here.

In brief:

Here is the log of the data loading to Elasticsearch and Manticore: https://gist.github.com/sanikolaev/678dd862a7668921e3417321be0a2513

It turns out that in this test Manticore is 25 times faster in terms of data ingestion. Maybe I don’t know how to bake Logstash and Elasticsearch, but the import of the same dataset (but of a slightly smaller size) took Mark Litwintschik even longer - 4 days and 16 hours.

Maybe the problem is in Logstash, not Elasticsearch? Let’s find out by writing directly to Elasticsearch. The index schema is as follows:

"properties": {
  "name": {"type": "text"},
  "email": {"type": "keyword"},
  "description": {"type": "text"},
  "age": {"type": "integer"},
  "active": {"type": "integer"}
}
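For Manticore, a roughly equivalent table can be declared in SQL. The sketch below derives such a statement from the mapping above. The table name `user` is taken from the queries later in this article; the type correspondence (`text` to `text`, `keyword` to `string`, `integer` to `int`) is an assumption about a reasonable mapping, not necessarily what the test scripts use.

```python
# Elasticsearch field types from the mapping above.
ES_PROPERTIES = {
    "name": "text",
    "email": "keyword",
    "description": "text",
    "age": "integer",
    "active": "integer",
}

# Assumed correspondence between Elasticsearch and Manticore column types.
ES_TO_MANTICORE = {"text": "text", "keyword": "string", "integer": "int"}


def create_table_sql(table, properties):
    """Build a CREATE TABLE statement for the given field-to-type mapping."""
    columns = ", ".join(
        f"{field} {ES_TO_MANTICORE[es_type]}" for field, es_type in properties.items()
    )
    return f"CREATE TABLE {table}({columns})"


print(create_table_sql("user", ES_PROPERTIES))
```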

Starting Manticore and Elasticsearch using their official docker images like this:

docker run --name manticore --rm -p 9308:9308 -v $(pwd)/manticore_idx:/var/lib/manticore manticoresearch/manticore:5.0.2

docker run --name elasticsearch --rm -p 9200:9200 -e discovery.type=single-node -e xpack.security.enabled=false -v $(pwd)/es_idx/:/usr/share/elasticsearch/data docker.elastic.co/elasticsearch/elasticsearch:8.3.2

Let’s now load 50 million random docs like this into both:

{
  "active": 1,
  "age": 84,
  "description": "Aut corporis qui necessitatibus architecto est. Harum laboriosam temporibus praesentium quis et nulla. Consequuntur quia neque et repellat.",
  "email": "[email protected]",
  "name": "Keely Doyle Sr."
}

We’ll use simple PHP scripts with a batch size of 10,000 and a concurrency of 32 (the server has 16 physical CPU cores with hyper-threading).
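The batching itself is simple. The sketch below (Python rather than the PHP used in the test, with hypothetical helper names and made-up documents) splits documents into batches and renders each batch as a newline-delimited JSON body in the shape accepted by Elasticsearch's `_bulk` endpoint and by Manticore's `/bulk` endpoint. It only builds the payloads; sending them over HTTP from 32 concurrent workers is left out.

```python
import json


def batches(docs, size):
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def elasticsearch_bulk_body(index, batch):
    """Two NDJSON lines per document: an action line, then the document."""
    lines = []
    for doc in batch:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


def manticore_bulk_body(index, batch):
    """One NDJSON line per document, with the document nested under "doc"."""
    lines = [json.dumps({"insert": {"index": index, "doc": doc}}) for doc in batch]
    return "\n".join(lines) + "\n"


# Hypothetical documents following the schema above.
docs = [
    {"name": f"user {i}", "email": f"u{i}@example.com", "description": "lorem",
     "age": 20 + i, "active": 1}
    for i in range(5)
]

for batch in batches(docs, 2):
    print(len(batch), len(elasticsearch_bulk_body("user", batch).splitlines()))
```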

root@perf3 ~ # php load_elasticsearch.php 10000 32 1000000 50
preparing...
found in cache
querying...
finished inserting
Total time: 178.24096798897
280519 docs per sec

root@perf3 ~ # php load_manticore.php 10000 32 1000000 50
preparing...
found in cache
querying...
finished inserting
Total time: 215.7572619915
231742 docs per sec

OK, now Elastic is 21% faster, but again there is an interesting nuance: Elasticsearch by default buffers new documents for one second, which means the last batch will not be available for searching right away. This is ok in many cases, but to make things fair let’s do /bulk?refresh=1 in Elasticsearch and see what it gives:

root@perf3 ~ # php load_elasticsearch.php 10000 32 1000000 50
preparing...
found in cache
querying...
finished inserting
Total time: 307.47588610649
162614 docs per sec

In this case Manticore is again faster by 43%.

If we want to test the maximum performance, we can:

Here’s what it gives:

Manticore:

// docker run -p9306:9306 --name manticore --rm -v $(pwd)/manticore_idx:/var/lib/manticore -e searchd_binlog_path= manticoresearch/manticore:5.0.2

root@perf3 ~ # php load_manticore_sharded.php 10000 32 1000000 32 50
preparing...
found in cache /tmp/bc9719fb0d26e18fc53d6d5aaaf847b4_10000_1000000
querying...
finished inserting
Total time: 55.874907970428
894856 docs per sec

Elasticsearch:

root@perf3 ~ # php load_elasticsearch_sharded.php 10000 32 1000000 32 50
preparing...
found in cache
querying...
finished inserting
Total time: 119.96515393257
416788 docs per sec

But, remember the nuance: you have to spend another 13 seconds to make the documents searchable:

root@perf3 ~ # curl -s -X POST "localhost:9200/_sql?format=json&pretty" -H 'Content-Type: application/json' -d'{"query": "select count(*) from user"}'
{
  "columns" : [
    {
      "name" : "count(*)",
      "type" : "long"
    }
  ],
  "rows" : [
    [
      0
    ]
  ]
}

root@perf3 ~ # time curl -XPOST "localhost:9200/user/_refresh"
{"_shards":{"total":64,"successful":32,"failed":0}}
real    0m13.505s
user    0m0.003s
sys     0m0.000s

root@perf3 ~ # curl -s -X POST "localhost:9200/_sql?format=json&pretty" -H 'Content-Type: application/json' -d'{"query": "select count(*) from user"}'
{
  "columns" : [
    {
      "name" : "count(*)",
      "type" : "long"
    }
  ],
  "rows" : [
    [
      50000000
    ]
  ]
}

All in all, Manticore is 2x faster than Elasticsearch in terms of data ingestion performance. And the data is searchable immediately after the batch is loaded, not 2 minutes later. The scripts used for this test can be found here.
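The ratios quoted in this section follow directly from the timings in the logs above. As a quick check, the snippet below recomputes them from the 50 million documents and the reported total times. The last line adds the 13.505 seconds of the explicit refresh to Elasticsearch's load time; that is an interpretation of the numbers above, not a figure produced by the test scripts.

```python
DOCS = 50_000_000  # documents loaded in every run


def docs_per_sec(seconds):
    """Throughput for a run that loaded DOCS documents in `seconds`."""
    return DOCS / seconds


# Total times copied from the logs above.
es_default = docs_per_sec(178.24096798897)
manticore_default = docs_per_sec(215.7572619915)
es_refresh = docs_per_sec(307.47588610649)
manticore_sharded = docs_per_sec(55.874907970428)
es_sharded = docs_per_sec(119.96515393257)

print(f"ES vs Manticore, defaults: {es_default / manticore_default:.2f}x")
print(f"Manticore vs ES with refresh: {manticore_default / es_refresh:.2f}x")
print(f"Manticore vs ES, sharded: {manticore_sharded / es_sharded:.2f}x")
# Counting the extra refresh (13.505 s) as part of Elasticsearch's load time:
print(f"Sharded, refresh included: {(119.96515393257 + 13.505) / 55.874907970428:.2f}x")
```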

What it’s written in

Open source

JSON vs SQL

Both Elasticsearch and Manticore can do both SQL and JSON, but the difference is:

Startup time

In some cases, you need to be able to launch a service quickly. For example, in IoT (Internet of things) or some ETL scenarios.

Near-real-time vs real-time


As mentioned above, by default, when you put data into Elasticsearch, it becomes searchable only after a second. This can be adjusted, but then the ingestion rate will become significantly slower, as you can see above.

Manticore always works in real-time mode.
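The difference can be illustrated with a toy model. The two classes below are hypothetical and do not reflect either engine's internals: one makes documents visible only after an explicit refresh (standing in for a periodic refresh), the other makes them visible as soon as the write returns.

```python
class NearRealTimeIndex:
    """Writes land in a buffer and become searchable only after refresh()."""

    def __init__(self):
        self._searchable = []
        self._buffer = []

    def add(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def count(self):
        return len(self._searchable)


class RealTimeIndex:
    """Writes are searchable as soon as add() returns."""

    def __init__(self):
        self._searchable = []

    def add(self, doc):
        self._searchable.append(doc)

    def count(self):
        return len(self._searchable)


nrt, rt = NearRealTimeIndex(), RealTimeIndex()
for doc in ("a", "b", "c"):
    nrt.add(doc)
    rt.add(doc)

print(nrt.count(), rt.count())  # buffered writes are not visible yet
nrt.refresh()
print(nrt.count(), rt.count())
```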

Full-text search

Probably worth another article to explain it all. In short: both Manticore and Elasticsearch are good at full-text search and have a lot in common, but there are a lot of differences, too. According to these objective tests (objectivity matters when evaluating relevance), on almost default settings Manticore can give higher relevance than Elasticsearch. Here is the relevant pull request in BEIR (information retrieval benchmark).

Aggregations

Both Manticore and Elasticsearch provide rich aggregation functionality. You probably know what Elasticsearch can do, here’s what can be done in Manticore for you to compare:

Schemaless

Elasticsearch is famous for the fact that you can write anything into it. With Manticore Search, you have to create a schema beforehand. Many Elasticsearch experts recommend using static mapping, for example, https://octoperf.com/blog/2018/09/21/optimizing-elasticsearch/#index-mapping:

One of the very first things you can do is to define your indice mapping statically.


But we find dynamic mapping important in the area of log management and analysis. Since we want Manticore to be easy to use for that, we have plans to enable dynamic mapping in Manticore, too.

Integrations

# download manticore beta version with support for Kibana, check https://repo.manticoresearch.com/repository/kibana_beta/ for different OS versions
wget https://repo.manticoresearch.com/repository/kibana_beta/ubuntu/jammy.zip

# unarchive it
unzip jammy.zip

# install the packages
dpkg -i build/*

# switch Manticore to the mode supporting Kibana
mysql -P9306 -h0 -e "set global log_management = 0; set global log_management = 1;"

# start Kibana pointing it to Manticore Search instance listening on port 9308
docker run -d --name kibana --rm -e ELASTICSEARCH_HOSTS=http://127.0.0.1:9308 -p 5601:5601 --network=host docker.elastic.co/kibana/kibana:7.4.2

# install php and composer, download loading script and put into Manticore 1 million docs of fake users
apt install php composer php8.1-mysql
wget https://gist.githubusercontent.com/sanikolaev/13bf61bbe6c39350bded7c577216435f/raw/8d8029c0d99998c901973fd9ac66a6fb920deda7/load_manticore_sharded.php
composer require fakerphp/faker
php load_manticore_sharded.php 10000 16 1000000 16 1

# don't forget to create an index pattern in Kibana (user*)

# run `docker stop kibana` to stop the Kibana server

If all went well you should see:


Replication


Sharding and distributed indexes

Unlike Elasticsearch, Manticore does not yet have automatic sharding, but combining multiple indexes into one for manual sharding is easier than in Elasticsearch:

Adding an index located on a remote node is also supported, just specify the remote host, port, and index name.
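As a rough illustration, the helper below composes a `CREATE TABLE ... type='distributed'` statement from local table names and remote agents given as host, port, and table name. The helper, the table names, and the host are hypothetical, and the exact statement should be checked against the Manticore manual for your version.

```python
def distributed_table_sql(name, local=(), agents=()):
    """Compose a CREATE TABLE statement for a distributed table.

    local  -- names of tables on this node
    agents -- (host, port, table) tuples for tables on remote nodes
    """
    parts = [f"CREATE TABLE {name} type='distributed'"]
    parts += [f"local='{table}'" for table in local]
    parts += [f"agent='{host}:{port}:{table}'" for host, port, table in agents]
    return " ".join(parts)


# Hypothetical layout: two local shards plus one shard on a remote node.
print(distributed_table_sql(
    "user_all",
    local=["user_1", "user_2"],
    agents=[("10.0.0.2", 9312, "user_3")],
))
```

Queries then go to the combined table (`user_all` in this sketch) as if it were a single table.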

Ease of use and learning

Our thinking is that we don’t want our users, be they developers or DevOps engineers, to have to become experts in databases or search engines, or to have a PhD, in order to use Manticore products. We assume you have better things to do than spend hours trying to understand how this or that setting affects this or that functionality. Hence, Manticore Search should work fine in most cases even on defaults.


Our ultimate goal is to make Manticore Search as easy to use and learn as possible.

Cloud native

Imperative and declarative usage modes

In Elasticsearch, most things are only done through the API. There is no way (anymore) to add mappings to a configuration file so that they are available immediately after startup.

Manticore, like Kubernetes, supports two usage modes:

Percolate

Percolate, or persistent query, is when a table contains queries rather than documents, and the search is run with documents as input. The search results are the stored queries that match those documents. This type of search is useful for user subscriptions: if you have subscribed, for example, to the query TV > 42 inches, you will be notified as soon as a matching item appears on the site. Both Manticore and Elasticsearch provide this functionality. According to tests we did a few years ago, the throughput of this type of search in Manticore is significantly higher than in Elasticsearch.
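Conceptually, reverse search looks like the toy sketch below. It is plain Python with a rule format invented for the example (a keyword plus a minimum screen size), not the percolate API of either engine: the stored items are queries, and each incoming document is checked against all of them.

```python
# Stored queries: subscriber -> rule. The rule format is invented for this sketch.
stored_queries = {
    "alice": {"keyword": "tv", "min_inches": 42},
    "bob": {"keyword": "tv", "min_inches": 60},
    "carol": {"keyword": "laptop", "min_inches": 15},
}


def percolate(document, queries):
    """Return the names of the stored queries that the document satisfies."""
    words = document["title"].lower().split()
    return sorted(
        name
        for name, rule in queries.items()
        if rule["keyword"] in words and document["inches"] > rule["min_inches"]
    )


new_item = {"title": "Smart TV 55", "inches": 55}
print(percolate(new_item, stored_queries))
```

Each name returned identifies a subscriber whose saved query matches the new item and who should therefore be notified.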

What’s next?

We are now developing the project in the following directions:

Conclusions

So, at the end of it all, what do we have? Manticore may now be of interest to those:

We are continuing!
