RegExum is a Python wrapper that simplifies text search for terabyte- and petabyte-scale textual datasets stored in one of the following DBMS:
- MongoDB - a modern (yet mature) distributed document DB with good performance in most workloads,
- ElasticSearch - the go-to DBMS-complementary indexing software for texts and categorical data,
- PostgreSQL - the most feature-rich open-source relational DB,
- MySQL - the most commonly used relational DB.
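
To make the "wrapper" idea more concrete, here is a minimal sketch of how one regular expression could be translated into the native query form of each backend listed above. It is not RegExum's actual code: the function name `native_regex_query`, the `backend` labels, and the table name `texts` are hypothetical and chosen only for illustration. The query shapes themselves (`$regex` in MongoDB, the `regexp` query in ElasticSearch, `~` in PostgreSQL, `REGEXP` in MySQL) are the standard ones, but every engine uses its own regex dialect, so a pattern may need adjusting per backend.

```python
import re


def native_regex_query(backend: str, field: str, pattern: str):
    """Return a backend-native representation of "field matches pattern".

    Hypothetical helper: `field` is assumed to be a trusted identifier,
    because it is interpolated into the SQL text. The pattern itself is
    always passed as a bound parameter or as a plain value.
    """
    # Fail early on a malformed pattern. This only checks Python's regex
    # syntax, which is an approximation of what each engine accepts.
    re.compile(pattern)

    if backend == "mongodb":
        # Filter document using the `$regex` query operator.
        return {field: {"$regex": pattern}}
    if backend == "elasticsearch":
        # Request body for a `regexp` query.
        return {"query": {"regexp": {field: {"value": pattern}}}}
    if backend == "postgresql":
        # `~` is the case-sensitive POSIX regex match operator.
        return (f"SELECT * FROM texts WHERE {field} ~ %s", (pattern,))
    if backend == "mysql":
        return (f"SELECT * FROM texts WHERE {field} REGEXP %s", (pattern,))
    raise ValueError(f"unsupported backend: {backend}")
```

The two document stores receive a dictionary, while the two relational DBs receive a parameterized SQL string plus its arguments, which is the form most Python DB drivers expect.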

The repository contains:
- regexum - Python wrappers for searchable containers backed by persistent DBs.
- benchmarks - benchmarking tools and performance results.
- assets - tiny datasets for testing purposes.
Some common databases have licences that prohibit sharing of benchmark results, so they were excluded from comparisons.
Name | Purpose | Implementation Language | Lines of Code (in /src/) |
---|---|---|---|
MongoDB | Documents | C++ | 3'900'000 |
PostgreSQL | Tables | C | 1'300'000 |
ElasticSearch | Text | Java | 730'000 |
Unum | Graphs, Tables, Text | C++ | 80'000 |
- ElasticSearch is a Java-based document store built on top of the Lucene text index.
- Widely considered a high-performance solution, largely due to the lack of competition.
- Lucene has been ported to multiple languages, in projects such as CLucene and LucenePlusPlus.
- Very popular open-source project backed by the $ESTC publicly traded company.
- MongoDB is a distributed ACID document store.
- Internally uses the BSON binary format.
- Very popular open-source project backed by the $MDB publicly traded company.
- Provides bindings for most programming languages (including PyMongo for Python).
- PostgreSQL and MySQL are the most common open-source SQL databases.
- They work well in a single-node environment, but scale poorly out of the box.
- They mostly store search indexes in the form of a B-Tree. They generally provide good read performance, but are slow to update.
- New `re.pattern`-like object for queries and a more `list`-like interface for DBs:
  - Finding the first match via `.index(re.pattern)`.
  - Streaming all matches via `.indexes(re.pattern)`.
  - Classical methods `.append(iterable)` and `.extend(iterable)` for index extension.
- Mixed Multithreaded Read/Write benchmarks.
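
The `list`-like interface described above could look roughly like the following in-memory sketch. It is only an illustration of the intended semantics, not the RegExum implementation: the class name `InMemoryTexts` is hypothetical, the data lives in a plain Python list instead of a persistent DB, and `.append` is assumed here to take a single text (as `list.append` does).

```python
import re
from typing import Iterable, Iterator, List


class InMemoryTexts:
    """Hypothetical in-memory stand-in for a DB-backed, list-like text container."""

    def __init__(self, texts: Iterable[str] = ()) -> None:
        self._texts: List[str] = list(texts)

    def append(self, text: str) -> None:
        # Assumption: one text per call, mirroring `list.append`.
        self._texts.append(text)

    def extend(self, texts: Iterable[str]) -> None:
        self._texts.extend(texts)

    def indexes(self, pattern) -> Iterator[int]:
        """Lazily yield the position of every text that matches `pattern`."""
        # `re.compile` accepts both a string and an already compiled pattern.
        compiled = re.compile(pattern)
        for position, text in enumerate(self._texts):
            if compiled.search(text):
                yield position

    def index(self, pattern) -> int:
        """Return the position of the first match, like `list.index`."""
        for position in self.indexes(pattern):
            return position
        raise ValueError(f"no text matches {pattern!r}")


texts = InMemoryTexts(["alpha", "beta"])
texts.append("gamma 42")
texts.extend(["delta 7", "epsilon"])

first_match = texts.index(r"\d+")          # position of the first text with digits
all_matches = list(texts.indexes(r"\d+"))  # positions of all texts with digits
```

Making `.indexes` a generator keeps the "streaming" property: matches are produced one at a time, so a DB-backed version could page through results instead of materializing them all.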