
In pandas read_csv, is there a way to specify e.g. col1, col15, wholeline?

I am trying to import about 700,000 rows of data from a text file that uses carets ('^') as delimiters, has no text qualifiers, and uses carriage returns as line delimiters.

From the text file I need column 1, column 15 and then the whole line in three columns of a table/dataframe.

I've searched for how to do this in pandas, but I don't know it well enough to work out the logic. I can import all 26 columns fine, but that doesn't solve my problem.

my_df = pd.read_csv("tablefile.txt", sep="^", lineterminator="\r",  low_memory=False)

Alternatively, I can use standard Python to put the data into a table, but this takes about 4 hours for the 700,000 rows, which is far too long for me.

import re

# cur and conn are an already-open sqlite3 cursor and connection (set up elsewhere)
count_1 = 0
for line in open('tablefile.txt'):
    if count_1 > 70:  # temporary cap while testing
        break
    else:
        col1id = re.findall(r'^(\d+)\^', line)
        col15id = re.findall(r'^.*\^.*\^(\d+)\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*', line)
        line = line.strip()

        count_1 = count_1 + 1

        cur.execute('''INSERT INTO mytable (mycol1id, mycol15id, wholeline) VALUES (?, ?, ?)''',
                    (col1id[0], col15id[0], line))

        conn.commit()
    print('row count_1=', count_1)

In pandas read_csv, is there a way to specify e.g. col1, col15, wholeline?

As above, col1 and col15 are digits and wholeline is a string.

  • I do not want to rebuild the string after import as I might lose some characters in the process.

Thanks

EDIT: Committing to the database for each line was burning time.

  • When using only Python, you should compile regular expressions once, outside your loop. This should speed things up. Commented Feb 9, 2017 at 11:03
  • I don't understand how this would work; I thought re.findall(regex, object) required object to be created before calling re.findall. Do you have an example? Commented Feb 9, 2017 at 14:24

3 Answers


Read the full lines in as one DataFrame using some quasi-separator that never appears in the data (below I'm using &), then read the file again with usecols to pull out columns 1 and 15, and add them to the first frame.

my_df_full = pd.read_csv("tablefile.txt", sep="&", lineterminator="\r", low_memory=False)
my_df_full.columns = ['full_line']

# usecols is zero-based, so [0, 14] selects columns 1 and 15 as counted in the question
my_df_cols = pd.read_csv("tablefile.txt", sep="^", lineterminator="\r", low_memory=False, usecols=[0, 14])

my_df_full[['col1', 'col15']] = my_df_cols
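
One caveat: by default read_csv treats the file's first line as a header row, so both reads above would consume a data line as column names. A sketch of the same approach assuming tablefile.txt has no header row (the question doesn't say either way):

import pandas as pd

# Assuming no header row in the file (not stated in the question):
# header=None keeps the first data line from being consumed as column names.
my_df_full = pd.read_csv("tablefile.txt", sep="&", lineterminator="\r", header=None, names=['full_line'])

my_df_cols = pd.read_csv("tablefile.txt", sep="^", lineterminator="\r", header=None, usecols=[0, 14])

my_df_full[['col1', 'col15']] = my_df_cols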

2 Comments

It's proving difficult to find a separator that isn't in the text, but I'll carry on looking.
@CArnold if you don't find any separator, you can concatenate all the columns; a bit tedious, but it should work. See this: stackoverflow.com/questions/19377969/…. I'm not sure if it works, but you can also try removing low_memory=False and using a string as the separator, e.g. sep="c_arnold_pandas".
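
If no safe separator can be found, a fallback (a sketch, not from the answers here) is to build the full-line frame from the raw lines yourself, so no separator is involved at all:

import pandas as pd

# Split on the carriage-return line delimiter described in the question;
# each whole line becomes a single cell, so no separator character can collide.
with open('tablefile.txt') as f:
    lines = [ln.strip() for ln in f.read().split('\r') if ln.strip()]

my_df_full = pd.DataFrame({'full_line': lines})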

First, you can compile your regular expressions to avoid re-parsing them for each line:

import re

# Compile once, outside the loop, so the patterns are not re-parsed for every line
reCol1id = re.compile(r'^(\d+)\^')
reCol15id = re.compile(r'^.*\^.*\^(\d+)\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*')

count_1 = 0
for line in open('tablefile.txt'):
    if count_1 > 70:
        break
    else:
        col1id = reCol1id.findall(line)[0]
        col15id = reCol15id.findall(line)[0]
        line = line.strip()

        count_1 += 1

        cur.execute('''INSERT INTO mytable (mycol1id, mycol15id, wholeline) VALUES (?, ?, ?)''',
                    (col1id, col15id, line))

        conn.commit()
    print('row count_1=', count_1)
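
Compiling the patterns removes the per-line parsing overhead, but the per-row conn.commit() (flagged in the question's edit) is likely the dominant cost; committing once after the loop should save far more time.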

Comments


I put the conn.commit() outside the for loop. It reduced the load time to a few minutes, though I'm guessing it's less safe.
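
For reference, a minimal sketch of that change, reusing the compiled patterns from the answer above and batching all inserts into a single executemany call with one commit (the database filename is hypothetical; the question never names it):

import re
import sqlite3

conn = sqlite3.connect('mydata.db')  # hypothetical filename, not given in the question
cur = conn.cursor()

reCol1id = re.compile(r'^(\d+)\^')
reCol15id = re.compile(r'^.*\^.*\^(\d+)\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*')

rows = []
for line in open('tablefile.txt'):
    rows.append((reCol1id.findall(line)[0], reCol15id.findall(line)[0], line.strip()))

# One statement and one commit for all 700,000 rows instead of one commit per row
cur.executemany('INSERT INTO mytable (mycol1id, mycol15id, wholeline) VALUES (?, ?, ?)', rows)
conn.commit()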

Anyway thanks for the help.

Comments
