
Given below is the code for importing a pipe-delimited CSV file into MongoDB.

import csv
import json
from pymongo import MongoClient

url = "mongodb://localhost:27017"
client = MongoClient(url)
db = client.Office
customer = db.Customer
jsonArray = []

with open("Names.txt", "r") as csv_file:
    csv_reader = csv.DictReader(csv_file, dialect='excel', delimiter='|', quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        jsonArray.append(row)
    jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
    jsonfile = json.loads(jsonString)
    customer.insert_many(jsonfile)

Below is the error I get when running the above code.

Traceback (most recent call last):
  File "E:\Anaconda Projects\Mongo Projects\Office Tool\csvtojson.py", line 16, in <module>
    jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
  File "C:\Users\Predator\anaconda3\lib\json\__init__.py", line 234, in dumps
    return cls(
  File "C:\Users\Predator\anaconda3\lib\json\encoder.py", line 201, in encode
    chunks = list(chunks)
MemoryError

If I modify the code by indenting the last three lines under the for loop, the same data gets imported into MongoDB over and over again without stopping.

import csv
import json
from pymongo import MongoClient

url = "mongodb://localhost:27017"
client = MongoClient(url)
db = client.Office
customer = db.Customer
jsonArray = []

with open("Names.txt", "r") as csv_file:
    csv_reader = csv.DictReader(csv_file, dialect='excel', delimiter='|', quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        jsonArray.append(row)
        jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
        jsonfile = json.loads(jsonString)
        customer.insert_many(jsonfile)
  • Using mongoimport might be the better option. Note that mongoimport also accepts input from STDIN, so you could convert the lines in Python and pipe them to mongoimport instead of writing a separate file. Commented Jan 15, 2022 at 10:54
  • I don't know Python, but perhaps insert in batches, like if (row % 1000 == 0) customer.insert_many(jsonfile), i.e. insert documents in batches of 1000. Commented Jan 15, 2022 at 11:00
  • @WernfriedDomscheit this is the result: Traceback (most recent call last): File "E:\Anaconda Projects\Mongo Projects\SDR Tool\csvtojson.py", line 18, in <module> if row % 1000 == 0: TypeError: unsupported operand type(s) for %: 'dict' and 'int' Commented Jan 15, 2022 at 11:18
  • Can you add a small sample of the file format - just 3 or 4 lines? Commented Jan 15, 2022 at 11:48
  • ^^ Plus a header row (if you have one). Commented Jan 15, 2022 at 12:03

2 Answers


I would recommend you use pandas; it provides a "chunked" mode via the chunksize parameter, which you can tune to your memory limitations. insert_many() is also more efficient than inserting one document at a time.

Plus the code becomes much simpler:

import pandas as pd
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").Office
filename = "Names.txt"

with pd.read_csv(filename, chunksize=1000, delimiter='|') as reader:
    for chunk in reader:
        db.mycollection.insert_many(chunk.to_dict('records'))

If you post a file sample I can update to match.


9 Comments

  • File "C:\Users\Predator\anaconda3\lib\site-packages\pandas\io\parsers\python_parser.py", line 722, in _alert_malformed raise ParserError(msg) pandas.errors.ParserError: Expected 48 fields in line 117143, saw 49
  • Given below is the code I used: with pd.read_csv(csv_file, chunksize=1000, delimiter='|', engine='python', encoding='latin-1', quoting=csv.QUOTE_NONE) as reader: for chunk in reader:
  • I can't really help more without seeing a file sample.
  • Also, that error is pretty straightforward: you have 48 fields in your headers and a line with 49 entries.
  • This is because some text might have a comma inserted in between.
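The ParserError discussed above (a line with more fields than the header) can be worked around with read_csv's on_bad_lines='skip' option (available since pandas 1.3; older versions used error_bad_lines=False). A minimal sketch with an in-memory sample standing in for the real file:

```python
import io
import pandas as pd

# Two-column header; the "Bob" line has a stray third field,
# which would otherwise raise ParserError.
sample = "name|city\nAlice|Paris\nBob|Berlin|extra\nCarol|Rome\n"

df = pd.read_csv(io.StringIO(sample), delimiter="|", on_bad_lines="skip")
# Malformed rows are silently dropped; only Alice and Carol remain.
```

The same option combines with chunksize, so malformed lines are skipped chunk by chunk. Note that skipping discards data; if the extra delimiter comes from unquoted text, fixing the export is the safer long-term solution.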

The memory issue can be solved by inserting one record at a time.

import csv
import json

from pymongo import MongoClient

url_mongo = "mongodb://localhost:27017"
client = MongoClient(url_mongo)
db = client.Office
customer = db.Customer
jsonArray = []
file_txt = "Text.txt"
rowcount = 0
with open(file_txt, "r") as txt_file:
    csv_reader = csv.DictReader(txt_file, dialect="excel", delimiter="|", quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        rowcount += 1
        jsonArray.append(row)
    for i in range(rowcount):
        jsonString = json.dumps(jsonArray[i], indent=1, separators=(",", ":"))
        jsonfile = json.loads(jsonString)
        customer.insert_one(jsonfile)
print("Finished")

Thank You All for Your Ideas
