Skip to content
This repository has been archived by the owner on Dec 19, 2023. It is now read-only.
/ RegExum Public archive

A Python wrapper for persistent DBMS that simplifies large-scale text search

Notifications You must be signed in to change notification settings

unum-cloud/RegExum

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RegExum

RegExum is Python wrapper that simplifies text search for Terabyte and Petabyte-scale textual datasets stored in one of the following DBMS:

  • MongoDB - modern (yet mature) distributed document DB with good performance in most workloads
  • ElasticSearch - the go-to DBMS-complementary indexing software for texts and categorical data,
  • PostgreSQL - most feature-rich open-source relational DB,
  • MySQL - the most commonly-used relational DB.

Project Structure

  • regexum - Python wrappers for search-able containers backed by persistent DBs.
  • benchmarks - benchmarking tools and performance results.
  • assets - tiny datasets for testing purposes.

Implementation Details & Included DBs

Some common databases have licences that prohibit sharing of benchmark results, so they were excluded from comparisons.

Name Purpose Implementation Language Lines of Code (in /src/)
MongoDB Documents C++ 3'900'000
Postgre Tables C 1'300'000
ElasticSearch Text Java 730'000
Unum Graphs, Table, Text C++ 80'000

ElasticSearch

  • Java-based document store built on top of Lucene text index.
  • Widely considered high-performance solutions due to the lack of competition.
  • Lucene was ported to multiple languages including projects like: CLucene and LucenePlusPlus.
  • Very popular open-source project backed by the $ESTC publicly traded company.

MongoDB

  • A distributed ACID document store.
  • Internally uses the BSON binary format.
  • Very popular open-source project backed by the $MDB publicly traded company.
  • Provides bindings for most programming languages (including PyMongo for Python).

Postgre, MySQL and other SQLs

  • Most common open-source SQL databases.
  • Work well in single-node environment, but scale poorly out of the box.
  • Mostly store search indexes in a form of a B-Tree. They generally provide good read performance, but are slow to update.

TODO

  • New re.pattern-like object for queries and more list-like interface for DBs:
    • Finding the first match via .index(re.pattern).
    • Streaming all matches via .indexes(re.pattern).
    • Classical methods .append(iterable) and .extend(iterable) for index extension.
  • Mixed Multithreaded Read/Write benchmarks.

About

A Python wrapper for persistent DBMS that simplifies large-scale text search

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages