
I have a problem parsing a huge CSV file into a MySQL database.

The CSV file looks like this:

ref1  data1  data2  data3...
ref1  data4  data5  data6...
ref2  data1  data2  data3 data4 data5..
ref2  data12 data13 data14
ref2  data21 data22...
.
.
.

The CSV file has about 1 million lines: about 7MB zipped, or about 150MB unzipped.

My job is to parse the data from the CSV into MySQL, but only the lines whose reference matches. Another problem is that I must combine multiple CSV lines into a single MySQL row per reference.

I tried to do this with csv.reader and a for loop over each reference, but it is ultra slow.

with con:
    cur.execute("SELECT ref FROM users")
    user = cur.fetchall()
    for i in range(len(user)):
        with open('hugecsv.csv', mode='rb') as f:
            reader = csv.reader(f, delimiter=';')
            for row in reader:
                if str(user[i][0]) == row[0]:
                    writer.writerow(row)

So I have all the references I would like to match in my list user. What is the fastest way to parse?

Please help!

  • Please clarify "from multiple lines in csv i must parse it in only one line". Commented Dec 13, 2013 at 7:40

3 Answers


The first obvious bottleneck is that you are reopening and scanning the whole CSV file for each user in your database. Doing a single pass over the CSV would be faster:

import csv

# faster lookup on users: set membership is O(1)
cur.execute("SELECT ref FROM users")
users = set(row[0] for row in cur.fetchall())

with open("your/file.csv") as f:
    r = csv.reader(f, delimiter=';')
    for row in r:
        if row[0] in users:
            do_something_with(row)
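The question also asks to collapse multiple CSV lines into one row per reference. A minimal sketch of doing that in the same single pass (the flattening of `row[1:]` and the `;` delimiter are assumptions about your data layout):

```python
import csv
from collections import defaultdict

def collect_rows(path, refs):
    """Group all CSV lines whose first field is a wanted reference.

    One pass over the file; `refs` should be a set for O(1) lookup.
    Returns {reference: [all data fields from every matching line]}.
    """
    grouped = defaultdict(list)
    with open(path, newline='') as f:
        for row in csv.reader(f, delimiter=';'):
            if row[0] in refs:
                # flatten the data fields of each matching line
                grouped[row[0]].extend(row[1:])
    return grouped
```

Each reference then maps to one flat list, which you can write as a single database row, e.g. with `cur.executemany`.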

2 Comments

Sorry for my ignorance, but what exactly does set() do? Because Python does not return any errors, but the variable users doesn't exist when I run the code.
set is a builtin type, a collection of unique elements with fast (O(1)) lookup. But there was an error in my code (sorry, I answered from my phone), which I just fixed.

Use:

LOAD DATA INFILE 'EF_PerechenSkollekciyami.csv' INTO TABLE `TABLE_NAME` FIELDS TERMINATED BY ';';

This is a built-in query command in MySQL.

I don't recommend using tabs to separate columns; I recommend changing them with sed to ; or some other character. But you can try with tabs too.
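If you'd rather do the tab-to-semicolon conversion in Python instead of sed, a simple streaming rewrite works. This is a sketch with placeholder file names, and it assumes no data field itself contains a tab or a semicolon:

```python
def retab_to_semicolons(src_path, dst_path):
    """Rewrite a tab-separated file as semicolon-separated.

    Streams line by line, so the 150MB file is never
    held in memory all at once.
    """
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            dst.write(line.replace('\t', ';'))
```

If fields can contain the delimiter, use csv.reader/csv.writer with quoting instead of a bare string replace.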

2 Comments

Why don't you recommend tab-separated columns? MySQL uses this by default. And why terminate with ;?
I get a csv file every month from multiple companies and I would like to use Python to parse it, because I need control over the parsing (a time stamp showing the program runs automatically, log files with errors, other py programs to control resources...)

You haven't included all your logic. If you just want to import everything into a single table,

cur.execute("LOAD DATA INFILE 'path_to_file.csv' INTO TABLE my_table;")

MySQL does it directly. You can't get any faster than that.

Documentation

4 Comments

Basically I must filter my csv file and write into MySQL just the lines whose reference matches.
What about importing the CSV, and then running a SQL query to do the filtering?
This is an option, but I don't know how to do this because I have a very dynamic csv file. For example, I don't know how many lines I have for one user/reference.
@djpiky you can load your file into a temporary table with no indexes and then extract only the relevant records into the actual table.
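To make the staging-table idea concrete, here is a sketch using sqlite3 so it runs self-contained; with MySQL you would fill the staging table with LOAD DATA INFILE instead of executemany, but the JOIN extraction step is the same idea. Table and column names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# staging table: raw CSV rows, no indexes, so the bulk load is fast
cur.execute("CREATE TABLE staging (ref TEXT, data TEXT)")
cur.execute("CREATE TABLE users (ref TEXT)")
cur.execute("CREATE TABLE filtered (ref TEXT, data TEXT)")

cur.executemany("INSERT INTO users VALUES (?)", [("ref1",), ("ref2",)])
rows = [("ref1", "a"), ("ref1", "b"), ("ref2", "x"), ("ref9", "junk")]
cur.executemany("INSERT INTO staging VALUES (?, ?)", rows)

# extract only the records whose reference exists in users
cur.execute("""
    INSERT INTO filtered
    SELECT s.ref, s.data
    FROM staging s
    JOIN users u ON u.ref = s.ref
""")
con.commit()
```

The filtering then happens inside the database engine rather than in a Python loop, which is typically much faster for a million rows.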
