
Given below is the code for importing a pipe-delimited CSV file into MongoDB.

import csv
import json
from pymongo import MongoClient

url = "mongodb://localhost:27017"
client = MongoClient(url)
db = client.Office
customer = db.Customer
jsonArray = []

with open("Names.txt", "r") as csv_file:
    csv_reader = csv.DictReader(csv_file, dialect='excel', delimiter='|', quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        jsonArray.append(row)
    jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
    jsonfile = json.loads(jsonString)
    customer.insert_many(jsonfile)

Below is the error I get when running the above code.

Traceback (most recent call last):
  File "E:\Anaconda Projects\Mongo Projects\Office Tool\csvtojson.py", line 16, in <module>
    jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
  File "C:\Users\Predator\anaconda3\lib\json\__init__.py", line 234, in dumps
    return cls(
  File "C:\Users\Predator\anaconda3\lib\json\encoder.py", line 201, in encode
    chunks = list(chunks)
MemoryError

If I modify the code by indenting the last three lines under the for loop, the same data gets imported into MongoDB over and over again without stopping.

import csv
import json
from pymongo import MongoClient

url = "mongodb://localhost:27017"
client = MongoClient(url)
db = client.Office
customer = db.Customer
jsonArray = []

with open("Names.txt", "r") as csv_file:
    csv_reader = csv.DictReader(csv_file, dialect='excel', delimiter='|', quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        jsonArray.append(row)
        jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
        jsonfile = json.loads(jsonString)
        customer.insert_many(jsonfile)
  • Using mongoimport might be the better option. Note that mongoimport also accepts input from STDIN, so you could convert the lines in Python and pipe them to mongoimport instead of writing a separate file. Commented Jan 15, 2022 at 10:54
  • I don't know Python, but perhaps insert in batches, like if (row % 1000 == 0) customer.insert_many(jsonfile), i.e. insert documents in batches of 1000. Commented Jan 15, 2022 at 11:00
  • @WernfriedDomscheit this is the result: Traceback (most recent call last): File "E:\Anaconda Projects\Mongo Projects\SDR Tool\csvtojson.py", line 18, in <module> if row % 1000 == 0: TypeError: unsupported operand type(s) for %: 'dict' and 'int' Commented Jan 15, 2022 at 11:18
  • Can you add a small sample of the file format - just 3 or 4 lines? Commented Jan 15, 2022 at 11:48
  • ^^ Plus a header row (if you have one). Commented Jan 15, 2022 at 12:03

2 Answers


I would recommend you use pandas; it provides a "chunked" mode via the chunksize parameter, which you can tune to your memory limitations. insert_many() is also more efficient than inserting one document at a time.

Plus the code becomes much simpler:

import pandas as pd
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").Office
filename = "Names.txt"

with pd.read_csv(filename, chunksize=1000, delimiter='|') as reader:
    for chunk in reader:
        db.mycollection.insert_many(chunk.to_dict('records'))

If you post a file sample I can update to match.


9 Comments

  • File "C:\Users\Predator\anaconda3\lib\site-packages\pandas\io\parsers\python_parser.py", line 722, in _alert_malformed raise ParserError(msg) pandas.errors.ParserError: Expected 48 fields in line 117143, saw 49
  • Given below is the code I used: with pd.read_csv(csv_file, chunksize=1000, delimiter='|', engine='python', encoding='latin-1', quoting=csv.QUOTE_NONE) as reader: for chunk in reader:
  • I can't really help more without seeing a file sample.
  • Also, that error is pretty straightforward: you have 48 fields in your headers and a line with 49 entries.
  • This is because some text might have a comma inserted in between.
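The ParserError discussed above (a line with more fields than the header) can be worked around with read_csv's on_bad_lines='skip' option (available since pandas 1.3; older versions used error_bad_lines=False). A minimal sketch with an in-memory sample standing in for the real file:

```python
import io
import pandas as pd

# Two-column header; the "Bob" line has a stray third field,
# which would otherwise raise ParserError.
sample = "name|city\nAlice|Paris\nBob|Berlin|extra\nCarol|Rome\n"

df = pd.read_csv(io.StringIO(sample), delimiter="|", on_bad_lines="skip")
# Malformed rows are silently dropped; only Alice and Carol remain.
```

The same option combines with chunksize, so malformed lines are skipped chunk by chunk. Note that skipping discards data; if the extra delimiter comes from unquoted text, fixing the export is the safer long-term solution.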

The memory issue can be solved by inserting one record at a time.

import csv
import json

from pymongo import MongoClient

url_mongo = "mongodb://localhost:27017"
client = MongoClient(url_mongo)
db = client.Office
customer = db.Customer
jsonArray = []
file_txt = "Text.txt"
rowcount = 0
with open(file_txt, "r") as txt_file:
    csv_reader = csv.DictReader(txt_file, dialect="excel", delimiter="|", quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        rowcount += 1
        jsonArray.append(row)
    for i in range(rowcount):
        jsonString = json.dumps(jsonArray[i], indent=1, separators=(",", ":"))
        jsonfile = json.loads(jsonString)
        customer.insert_one(jsonfile)
print("Finished")

Thank You All for Your Ideas
