I'm using pymongo (4.8.0). After my web crawler finishes scraping products, it runs a pipeline in which I set indexes:
utils.py:
from decimal import Decimal

from bson.codec_options import CodecOptions, TypeCodec, TypeRegistry
from bson.decimal128 import Decimal128
from pymongo import MongoClient

from myproject import settings  # settings module; import path assumed

class DecimalCodec(TypeCodec):
    python_type = Decimal
    bson_type = Decimal128

    def transform_python(self, value):
        # Decimal -> Decimal128 when writing
        return Decimal128(value)

    def transform_bson(self, value):
        # Decimal128 -> Decimal when reading
        return value.to_decimal()

decimal_codec = DecimalCodec()
type_registry = TypeRegistry([decimal_codec])
codec_options = CodecOptions(type_registry=type_registry)

def get_mongo_client():
    return MongoClient(settings.MONGODB_SERVER, settings.MONGODB_PORT)

def get_mongo_db():
    return get_mongo_client().get_database(
        settings.MONGODB_DB, codec_options=codec_options
    )
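To illustrate what this codec setup buys (a sketch; the prices collection name is only for illustration, and a reachable MongoDB is assumed): any database handle obtained through get_mongo_db() transparently stores Python Decimal values as BSON Decimal128 and decodes them back on reads.

from decimal import Decimal

db = get_mongo_db()
db.prices.insert_one({"sku": "A1", "price": Decimal("19.99")})  # encoded as Decimal128
doc = db.prices.find_one({"sku": "A1"})
print(type(doc["price"]), doc["price"])  # <class 'decimal.Decimal'> 19.99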
This is the crawler pipeline:

class BasePipeline:
    def open_spider(self, spider):
        self.product_collection = get_mongo_db()[settings.PRODUCT_COLLECTION_NAME]
        ...
        self.product_collection.create_index([("import_run_id", -1)])
Now, the problem appears when I try to filter the collection by the run id. First I check whether the indexes exist:
In [8]: list(get_mongo_db()[settings.PRODUCT_COLLECTION_NAME].list_indexes())
Out[8]:
[SON([('v', 2), ('key', SON([('_id', 1)])), ('name', '_id_')]),
 SON([('v', 2), ('key', SON([('import_run_id', -1)])), ('name', 'import_run_id_-1')])]
Cool, we can see that it does!
This collection contains about 300k products; let's see how fast the query performs:
In [7]: t1 = time.time()
   ...: x = [p for p in get_mongo_db()[settings.PRODUCT_COLLECTION_NAME].find(
   ...:     {"import_run_id": 245})]; t2 = time.time(); print(t2 - t1)
56.496100664138794

That's ~56 seconds (WAIT WHAT???). Shouldn't it be way faster, like a second or a few? Is it really the list comprehension that is so slow?
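For context, find() is lazy: it returns a cursor without contacting the server, and documents are fetched and BSON-decoded in batches while you iterate. So virtually all of the elapsed time sits inside the iteration, not in building the list per se. A sketch that separates the two (reusing the question's collection and filter):

import time

collection = get_mongo_db()[settings.PRODUCT_COLLECTION_NAME]

t0 = time.time()
cursor = collection.find({"import_run_id": 245})  # lazy: no server round trip yet
t1 = time.time()
docs = list(cursor)  # batches are fetched and decoded here
t2 = time.time()

print(f"cursor creation: {t1 - t0:.4f}s")  # effectively instant
print(f"iteration:       {t2 - t1:.4f}s")  # network transfer + BSON decoding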
I've also extracted the executionStats from .explain() (converted through GPT into the annotated form below, since it comes back as raw BSON):
{
    'executionSuccess': True,  # Whether the query executed successfully
    'nReturned': 5885,  # Number of documents returned
    'executionTimeMillis': 61,  # Total execution time in milliseconds
    'totalKeysExamined': 5885,  # Total number of index keys examined
    'totalDocsExamined': 5885,  # Total number of documents examined
    'executionStages': {  # Details of the query execution stages
        'stage': 'FETCH',  # Fetches the full documents pointed at by the index
        'inputStage': {  # Details of the input stage
            'stage': 'IXSCAN',  # Index scan stage
            'keyPattern': {  # Index pattern used for the scan
                'import_run_id': -1
            },
            'indexName': 'import_run_id_-1',  # Name of the index
            'isMultiKey': False,  # Multi-key index status
            'multiKeyPaths': {  # Multi-key paths
                'import_run_id': []
            },
            'isUnique': False,  # Index uniqueness
            'isSparse': False,  # Index sparseness
            'isPartial': False,  # Partial index status
            'indexVersion': 2,  # Index version
            'direction': 'forward',  # Scan direction
            'indexBounds': {  # Index bounds for the scan
                'import_run_id': [245, 245]
            },
            'keysExamined': 5885,  # Number of index keys examined
            'seeks': 1,  # Number of index seeks
            'dupsTested': 0,  # Duplicate keys tested
            'dupsDropped': 0  # Duplicate keys dropped
        }
    },
    'executionTimeMillisEstimate': 61,  # Estimated execution time in milliseconds
    'works': 5885,  # Work cycles performed
    'advanced': 5885,  # Work cycles that produced a result
    'needTime': 0,  # Work cycles that did not produce a result
    'needYield': 0,  # Times the stage requested a yield
    'saveState': 194,  # Times execution state was saved (around yields)
    'restoreState': 194,  # Times execution state was restored
    'isEOF': 1,  # Whether the stage reached end of stream
    'docsExamined': 5885,  # Documents examined by the FETCH stage
    'alreadyHasObj': 0,  # Documents FETCH already had, needing no extra lookup
    'allPlansExecution': []  # Per-plan execution details (empty here)
}
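As an aside, executionStats can also be requested directly through the explain command in pymongo (a sketch using the question's collection and filter), which sidesteps post-processing the cursor's raw output:

db = get_mongo_db()
stats = db.command(
    "explain",
    {
        "find": settings.PRODUCT_COLLECTION_NAME,
        "filter": {"import_run_id": 245},
    },
    verbosity="executionStats",
)
exec_stats = stats["executionStats"]
print(exec_stats["nReturned"], exec_stats["executionTimeMillis"])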
Answer:

The totalKeysExamined / totalDocsExamined ratio is at the ideal 1:1, which means the index is optimal, and the IXSCAN stage shows the index is already being leveraged. I would suggest diagnosing in two directions: 1. network latency, and 2. the function get_mongo_db().

For 2, it is possible that you are creating the Mongo client every time you run a query. Ideally, the client should be created only once and reused for future queries:

collection = get_mongo_db()[settings.PRODUCT_COLLECTION_NAME]
x = [p for p in collection.find({"import_run_id": 245})]

... or even:

x = list(collection.find({"import_run_id": 245}))
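A minimal sketch of that reuse, keeping the utils.py layout from the question (caching via functools.lru_cache is just one way to do it):

from functools import lru_cache

from pymongo import MongoClient

@lru_cache(maxsize=None)
def get_mongo_client():
    # First call creates the client; later calls return the cached instance.
    # MongoClient is thread-safe and maintains its own connection pool.
    return MongoClient(settings.MONGODB_SERVER, settings.MONGODB_PORT)

One client per process is the intended usage; constructing a new MongoClient per query pays connection setup (and, depending on configuration, server discovery) each time.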
Comment (from the asker):

Explain says 5885 items, but there's way more, ~300k:

In [21]: t1 = time.time(); print(len([p for p in get_mongo_db()[settings.MONGODB_PRODUCT_COLLECTION_NAME].find({"import_run_id": 245})])); t2 = time.time(); print(t2 - t1, 'sec')
323093
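That count likely explains the ~56 seconds: the query is pulling roughly 323k full documents across the network and decoding each one client-side (the explain above was presumably captured when far fewer documents matched). A way to confirm the match count without transferring any documents, sketched with the question's collection and filter:

collection = get_mongo_db()[settings.PRODUCT_COLLECTION_NAME]

# Counted server-side; no documents cross the network
print(collection.count_documents({"import_run_id": 245}))  # 323093 per the comment above

If only a few fields are needed downstream, a projection (e.g. collection.find(filter, {"name": 1}), field name illustrative) cuts both the transfer and the decode cost.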