
I have a big zip file which contains 10,000,000+ lines of JSON objects (one object per line). Every day I receive a new zip file which contains mostly exactly the same data, but some of it (maybe around 2%) is different: a few JSON properties have changed, etc. I need to store the data efficiently and it needs to provide fast reads. The application users themselves don't write to the database; they only perform reads.

As I said, I receive new data daily and most of it is exactly the same, but some of it differs. Another caveat is that some of the data is in a different order. I need to update the database with the new data, but discard the data which is the same. Example:

{"address": "123 Main St", "organization": "Real Estate Co", "building": {"type": "House", "leased": false}, "residents": [{"name": "John Doe", "age": 30}, {"name": "Jane Doe", "age": 28}], "facilities": ["Garden"], "location": {"city": "Metropolis", "country": "US"}, "risks": ["Burglary"]}
{"address": "456 Oak St", "organization": "Urban Apartments", "building": {"type": "Apartment", "leased": true}, "residents": [{"name": "Alice Smith", "age": 25}, {"name": "Bob Johnson", "age": 30}], "facilities": ["Fitness Center", "Swimming Pool"], "location": {"city": "Downtownville", "country": "US"}, "risks": ["Noise Complaints", "Fire Hazard"]}

The sample data here is not the real data I'm dealing with, but it is quite similar. The address property in my data is actually unique.

For example, if the data I receive tomorrow contains the following:

{"address": "123 Main St", "organization": "Real Estate Co", "building": {"type": "House", "leased": false}, "residents": [{"name": "John Doe", "age": 30}, {"name": "Jane Doe", "age": 28}], "facilities": ["Garden"], "location": {"city": "Metropolis", "country": "US"}, "risks": ["Burglary"]}
{"address": "456 Oak St", "organization": "Urban Apartments", "building": {"type": "Apartment", "leased": false}, "residents": [{"name": "Alice Smith", "age": 25}, {"name": "Bob Johnson", "age": 30}], "facilities": ["Fitness Center", "Swimming Pool", "Burger Shop", "Basketball court"], "location": {"city": "Downtownville", "country": "US"}, "risks": ["Noise Complaints", "Fire Hazard"]}

then I would need to store the second JSON object because it's different from yesterday's JSON object (NOTE: the 'unique' address is still the same, but it has different properties).
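To make "different" concrete: I consider two records the same when their content matches regardless of key order (since, as mentioned, some of the data arrives in a different order). A rough Python sketch of the comparison I have in mind (the function name is just for illustration):

import json

def same_record(old: dict, new: dict) -> bool:
    # Canonical string form: keys are sorted at every nesting level, so key
    # order doesn't matter. Array order still matters; if arrays can also be
    # reordered, they would need to be normalized too.
    return json.dumps(old, sort_keys=True) == json.dumps(new, sort_keys=True)

With the two "456 Oak St" objects above this returns False (leased changed and facilities gained two entries), so that record would be stored again, while the unchanged "123 Main St" record would be discarded.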


The best solution I came up with is to store the data in one database table and to create a composite key which consists of the unique address and an automatically generated date. So the sample data looks like this (MongoDB example):

{"_id":{"address":"123 Main St","date":"06-01-2024"},"organization":"ABC Inc","as":{"number":12345,"organization":"Internet Services Ltd"},"client":{},"tunnels":[{"operator":"PROXY_OPERATOR","type":"VPN","anonymous":false}],"location":{"city":"Metropolis","country":"US"},"risks":["DATA_LEAK"]}

(MySQL example):

-- CURRENT_DATE as a column default needs the parentheses (expression defaults, MySQL 8.0.13+)
CREATE TABLE table_name (
    address VARCHAR(40) NOT NULL,
    date DATE DEFAULT (CURRENT_DATE) NOT NULL,
    address_data JSON NOT NULL,
    PRIMARY KEY (address, date)
);

The import would then check whether the incoming data differs from what is already stored (not taking the date field into account) and, if it does, perform an insertion (not an update) into the table.

So in a nutshell: every day I receive a new zip file which contains around 10 million lines of JSON data. I need to add the new/changed data to the database and discard the rest of it.

What I tried?

At first I tried to use a MySQL database, which I ran using Docker. I created a table like this:

CREATE TABLE table_name (
    address VARCHAR(40) NOT NULL,
    date DATE DEFAULT (CURRENT_DATE) NOT NULL,
    address_data JSON NOT NULL,
    PRIMARY KEY (address, date)
);

Then I used the LOAD DATA command for bulk insertion:

LOAD DATA INFILE '/var/lib/mysql-files/data.json'
INTO TABLE table_name
FIELDS TERMINATED BY '\n'
LINES TERMINATED BY '\n'
(@json)
SET address = JSON_UNQUOTE(JSON_EXTRACT(@json, '$.address')),
    date = CURRENT_DATE,
    address_data = @json;

But with this some issues arose, namely, as per the MySQL docs: "It is not possible to load data files that use the ucs2, utf16, utf16le, or utf32 character set."

Some of the data contains characters from these character sets, or Unicode escapes that MySQL doesn't like, which results in insertion errors such as: ERROR 3141 (22032): Invalid JSON text in argument 1 to function json_extract: "Missing a comma or '}' after an object member." at position 86.

After some headaches I figured: why not use MongoDB instead, as it should store the JSON objects more effectively? I modified the data with jq like this:

jq -c '. + { "_id": { "address": .address, "date": (now | strftime("%d-%m-%Y")) } }' data.json > formatted_data.json

To receive this:

{"_id":{"address":"123 Main St","date":"06-01-2024"},"organization": "Real Estate Co", "building": {"type": "House", "leased": false}, "residents": [{"name": "John Doe", "age": 30}, {"name": "Jane Doe", "age": 28}], "facilities": ["Garden"], "location": {"city": "Metropolis", "country": "US"}, "risks": ["Burglary"]}

I used the mongoimport command, which inserted all the data successfully, but now the issue is: how can I insert the new data in bulk? As far as I know this is not possible with mongoimport, as it doesn't support conditional operations. One option is to use Python or something similar to compare the data and add only the new records, but this can take quite a while.
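Roughly, the Python route I'm imagining looks like this (untested sketch using pymongo; the database/collection names are placeholders, and it assumes an extra index on "_id.address", since the default _id index won't help with lookups on a sub-field):

import json
from pymongo import MongoClient, InsertOne

def canonical(doc):
    # Stable string form of a record: the generated _id (address + date) is
    # ignored and keys are sorted, so key order doesn't matter.
    return json.dumps({k: v for k, v in doc.items() if k != "_id"}, sort_keys=True)

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["addresses"]   # placeholder database/collection names

ops = []
with open("formatted_data.json", encoding="utf-8") as f:
    for line in f:
        new_doc = json.loads(line)   # already has the {"address", "date"} _id from jq
        # Latest stored version of this address, if any. (Sorting a "DD-MM-YYYY"
        # string is not chronological; an ISO "YYYY-MM-DD" string or a real Date
        # would sort correctly.)
        latest = coll.find_one({"_id.address": new_doc["_id"]["address"]},
                               sort=[("_id.date", -1)])
        if latest is None or canonical(latest) != canonical(new_doc):
            ops.append(InsertOne(new_doc))
        if len(ops) >= 1000:
            coll.bulk_write(ops, ordered=False)
            ops = []

if ops:
    coll.bulk_write(ops, ordered=False)

This does one find_one per input line, i.e. roughly 10 million lookups per day, which is exactly the "can take quite a while" part, hence the question about a proper bulk approach.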

  • generate db update commands from a json diff patch or merge patch tool?
    – jhnc
    Commented Jan 5 at 14:01
  • Recommending tools/software is off-topic for this forum, as is anything opinion-based, so unless you have a specific code-related issue then it's difficult to help you. The one point I would make is that processing 10M+ records every day, when only a few have changed, is incredibly inefficient. I would therefore focus on your source system providing a delta file, if at all possible
    – NickW
    Commented Jan 5 at 15:56

