
I use pymongo (4.8.0), and I have a case where, after my web crawler finishes scraping products, it runs a pipeline in which I set indexes:

utils.py:

from decimal import Decimal

from bson.codec_options import CodecOptions, TypeCodec, TypeRegistry
from bson.decimal128 import Decimal128
from pymongo import MongoClient


class DecimalCodec(TypeCodec):
    python_type = Decimal
    bson_type = Decimal128

    def transform_python(self, value):
        # Encode Python Decimal as BSON Decimal128 on writes
        return Decimal128(value)

    def transform_bson(self, value):
        # Decode BSON Decimal128 back to Python Decimal on reads
        return value.to_decimal()

decimal_codec = DecimalCodec()
type_registry = TypeRegistry([decimal_codec])
codec_options = CodecOptions(type_registry=type_registry)

def get_mongo_client():
    return MongoClient(settings.MONGODB_SERVER, settings.MONGODB_PORT)

def get_mongo_db():
    return get_mongo_client().get_database(
        settings.MONGODB_DB, codec_options=codec_options
    )
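
With this registry in place, Decimal values round-trip transparently between Python and BSON. A quick sketch of the behavior (the codec_demo collection name is made up purely for illustration):

from decimal import Decimal

db = get_mongo_db()
db.codec_demo.insert_one({"price": Decimal("19.99")})  # encoded as Decimal128
doc = db.codec_demo.find_one()
print(type(doc["price"]))  # <class 'decimal.Decimal'>, via transform_bson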

This is the crawler pipeline:

class BasePipeline:

    def open_spider(self, spider):
        self.product_collection = get_mongo_db()[settings.PRODUCT_COLLECTION_NAME]
        ...
        self.product_collection.create_index([("import_run_id", -1)])

Now, the problem appears when I try to filter the collection by the run id:

First I check that the index exists:

In [8]: list(get_mongo_db()[settings.PRODUCT_COLLECTION_NAME].list_indexes()) 
Out[8]: [ SON([('v', 2), ('key', SON([('_id', 1)])), ('name', '_id_')]), SON([('v', 2), ('key', SON([('import_run_id', -1)])), ('name', 'import_run_id_-1')]) ]

Cool, we can see that it does!

Now, this collection contains about 300k products; let's see how fast the query performs:

In [7]: t1 = time.time()
...: x = [p for p in get_mongo_db()[settings.PRODUCT_COLLECTION_NAME].find(
...:     {"import_run_id": 245})]; t2 = time.time(); print(t2-t1)

56.496100664138794 secs (WAIT WHAT???) Shouldn't it be way faster, like a second or a few? Is it really the list comprehension that is so slow?
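
To separate the server-side query from the client-side fetching and BSON decoding, here is a rough timing sketch (a sketch only, reusing one client and assuming the same get_mongo_db() and settings as above):

import time

coll = get_mongo_db()[settings.PRODUCT_COLLECTION_NAME]  # build the client/db once, outside the timing

t0 = time.time()
cursor = coll.find({"import_run_id": 245})
first = next(cursor)   # time to the first document (query + first batch over the wire)
t1 = time.time()
rest = list(cursor)    # time to stream and decode the remaining documents
t2 = time.time()

print(f"first doc: {t1 - t0:.3f}s, remaining {len(rest)} docs: {t2 - t1:.3f}s")

If the second number dominates, the time is going into transferring and decoding documents on the client, not into the index lookup.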

I've tried extracting the explain() executionStats (which I converted through GPT because it returns RawBSON):

{
    'executionSuccess': True,  # Indicates if the query execution was successful
    'nReturned': 5885,  # Number of documents returned
    'executionTimeMillis': 61,  # Time taken to execute the query in milliseconds
    'totalKeysExamined': 5885,  # Total number of keys examined
    'totalDocsExamined': 5885,  # Total number of documents examined
    'executionStages': {  # Details of query execution stages
        'stage': 'FETCH',  # Stage of the execution
        'inputStage': {  # Details of the input stage
            'stage': 'IXSCAN',  # Index scan stage
            'keyPattern': {  # Index pattern used for the scan
                'import_run_id': -1
            },
            'indexName': 'import_run_id_-1',  # Name of the index
            'isMultiKey': False,  # Multi-key index status
            'multiKeyPaths': {  # Multi-key paths
                'import_run_id': []
            },
            'isUnique': False,  # Index uniqueness
            'isSparse': False,  # Index sparseness
            'isPartial': False,  # Partial index status
            'indexVersion': 2,  # Index version
            'direction': 'forward',  # Scan direction
            'indexBounds': {  # Index bounds for the scan
                'import_run_id': [245, 245]
            },
            'keysExamined': 5885,  # Number of keys examined
            'seeks': 1,  # Number of index seeks
            'dupsTested': 0,  # Duplicate keys tested
            'dupsDropped': 0  # Duplicate keys dropped
        }
    },
    'executionTimeMillisEstimate': 61,  # Estimated execution time
    'works': 5885,  # Total work done
    'advanced': 5885,  # Documents advanced
    'needTime': 0,  # Number of times the system needed more time
    'needYield': 0,  # Number of times the system yielded
    'saveState': 194,  # State saved
    'restoreState': 194,  # State restored
    'isEOF': 1,  # End of file status
    'docsExamined': 5885,  # Documents examined
    'alreadyHasObj': 0,  # Number of documents that were already in the cache
    'allPlansExecution': []  # List of all plans execution details (empty here)
}
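
For reference, a minimal sketch of pulling executionStats directly from pymongo (assuming the same get_mongo_db() and settings as above), so the raw BSON doesn't need to be converted by hand:

db = get_mongo_db()
explain_result = db.command(
    {
        # Wrap the find command in an explain command with executionStats verbosity
        "explain": {
            "find": settings.PRODUCT_COLLECTION_NAME,
            "filter": {"import_run_id": 245},
        },
        "verbosity": "executionStats",
    }
)
stats = explain_result["executionStats"]
print(stats["nReturned"], stats["executionTimeMillis"])
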
  • It doesn't look like a database indexing issue to me. The output you shared shows that it used the index efficiently and identified the 5,885 desired documents in 61 milliseconds. How large are the documents? When you run the code, are there any relevant entries in the database log file? (By the way, there's no particular reason that I'm aware of to have a single-field index defined in descending order, but I don't think that has anything to do with your problem.)
    – user20042973
    Commented Jul 8 at 0:37
  • Agree with @user20042973 that this does not seem like a database indexing issue. The totalKeysExamined to totalDocsExamined ratio is at the ideal 1:1, which means the index is optimal, and the IXSCAN indicates the index is already leveraged. I would suggest diagnosing in 2 directions: 1. network latency and 2. the function get_mongo_db(). For 2, it is possible that you are creating the mongo client every time you do a query. Ideally, the mongo client should be created only once and re-used in future queries (a minimal version of that pattern is sketched after these comments).
    – ray
    Commented Jul 8 at 0:58
  • What happens with collection = get_mongo_db()[settings.PRODUCT_COLLECTION_NAME]; x = [p for p in collection.find({"import_run_id": 245})] ... or even x = list(collection.find({"import_run_id": 245}))?
    – rickhg12hs
    Commented Jul 8 at 1:48
  • @ray There can't be any network latency because this is running on localhost. I also don't think that creating the mongo client every time is a problem, because it's thread safe, there's a connection pool, and there aren't any issues with that.
    Commented Jul 8 at 12:54
  • IMO something fishy is going on, because the executionStats identifies only 5885 items, but there are way more (~300k): In [21]: t1 = time.time(); print(len([p for p in get_mongo_db()[settings.MONGODB_PRODUCT_COLLECTION_NAME].find({"import_run_id": 245})])); t2 = time.time(); print(t2-t1, 'sec') prints 323093.
    Commented Jul 8 at 12:56
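
A minimal sketch of the client-reuse pattern suggested by ray (an assumption about how utils.py could be restructured, using the same settings names as in the question):

from pymongo import MongoClient

# Create the client once at import time; MongoClient is thread-safe and
# keeps its own connection pool, so it can be shared across queries.
_client = MongoClient(settings.MONGODB_SERVER, settings.MONGODB_PORT)

def get_mongo_db():
    return _client.get_database(settings.MONGODB_DB, codec_options=codec_options)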

