
In pandas read_csv, is there a way to specify e.g. col1, col15, wholeline?

I am trying to import about 700,000 rows of data from a text file that uses carets ('^') as delimiters, has no text qualifiers, and uses carriage returns as line delimiters.

From the text file I need column 1, column 15 and then the whole line in three columns of a table/dataframe.

I've searched for how to do this in pandas, but I don't know it well enough to work out the logic. I can import all 26 columns fine, but that doesn't solve my problem.

my_df = pd.read_csv("tablefile.txt", sep="^", lineterminator="\r",  low_memory=False)

Alternatively, I can use standard Python to put the data into a table, but this takes about 4 hours for the 700,000 rows, which is far too long for me.

import re

# cur and conn are an already-open sqlite3 cursor and connection (set up elsewhere)
count_1 = 0
for line in open('tablefile.txt'):
    if count_1 > 70:  # temporary cap while testing
        break
    else:
        col1id = re.findall(r'^(\d+)\^', line)
        col15id = re.findall(r'^.*\^.*\^(\d+)\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*', line)
        line = line.strip()

        count_1 = count_1 + 1

        cur.execute('''INSERT INTO mytable (mycol1id, mycol15id, wholeline) VALUES (?, ?, ?)''',
                    (col1id[0], col15id[0], line))

        conn.commit()
    print('row count_1=', count_1)

In pandas read_csv, is there a way to specify e.g. col1, col15, wholeline?

As above, col1 and col15 are digits and wholeline is a string.

  • I do not want to rebuild the string after import as I might lose some characters in the process.

Thanks

EDIT: Committing to the database for each line was burning time.

  • When using only Python, you should compile regular expressions once, outside your loop. This should speed things up. Commented Feb 9, 2017 at 11:03
  • I don't understand how this would work; I thought re.findall(regex, object) required object to be created before calling re.findall. Do you have an example? Commented Feb 9, 2017 at 14:24

3 Answers


Read the full lines in as one DataFrame using some quasi-separator that never appears in the data (below I'm using &), then read the file again with usecols to pull out columns 1 and 15, and add them to the first frame.

my_df_full = pd.read_csv("tablefile.txt", sep="&", lineterminator="\r", low_memory=False)
my_df_full.columns = ['full_line']

# usecols is zero-based, so [0, 14] selects columns 1 and 15 as counted in the question
my_df_cols = pd.read_csv("tablefile.txt", sep="^", lineterminator="\r", low_memory=False, usecols=[0, 14])

my_df_full[['col1', 'col15']] = my_df_cols
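
One caveat: by default read_csv treats the file's first line as a header row, so both reads above would consume a data line as column names. A sketch of the same approach assuming tablefile.txt has no header row (the question doesn't say either way):

import pandas as pd

# Assuming no header row in the file (not stated in the question):
# header=None keeps the first data line from being consumed as column names.
my_df_full = pd.read_csv("tablefile.txt", sep="&", lineterminator="\r", header=None, names=['full_line'])

my_df_cols = pd.read_csv("tablefile.txt", sep="^", lineterminator="\r", header=None, usecols=[0, 14])

my_df_full[['col1', 'col15']] = my_df_cols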

2 Comments

It's proving difficult to find a separator that isn't in the text, but I'll carry on looking.
@CArnold if you don't find any separator, you can concatenate all the columns; a bit tedious, but it should work. See this: stackoverflow.com/questions/19377969/…. I'm not sure if it works, but you can also try removing low_memory=False and using a string as the separator, e.g. sep="c_arnold_pandas".
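
If no safe separator can be found, a fallback (a sketch, not from the answers here) is to build the full-line frame from the raw lines yourself, so no separator is involved at all:

import pandas as pd

# Split on the carriage-return line delimiter described in the question;
# each whole line becomes a single cell, so no separator character can collide.
with open('tablefile.txt') as f:
    lines = [ln.strip() for ln in f.read().split('\r') if ln.strip()]

my_df_full = pd.DataFrame({'full_line': lines})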

First, you can compile your regular expressions to avoid re-parsing them for each line:

import re

# Compile once, outside the loop, so the patterns are not re-parsed for every line
reCol1id = re.compile(r'^(\d+)\^')
reCol15id = re.compile(r'^.*\^.*\^(\d+)\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*')

count_1 = 0
for line in open('tablefile.txt'):
    if count_1 > 70:
        break
    else:
        col1id = reCol1id.findall(line)[0]
        col15id = reCol15id.findall(line)[0]
        line = line.strip()

        count_1 += 1

        cur.execute('''INSERT INTO mytable (mycol1id, mycol15id, wholeline) VALUES (?, ?, ?)''',
                    (col1id, col15id, line))

        conn.commit()
    print('row count_1=', count_1)
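
Compiling the patterns removes the per-line parsing overhead, but the per-row conn.commit() (flagged in the question's edit) is likely the dominant cost; committing once after the loop should save far more time.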

Comments


I put the conn.commit() outside the for loop. It reduced the load time to a few minutes, though I'm guessing it's less safe.
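
For reference, a minimal sketch of that change, reusing the compiled patterns from the answer above and batching all inserts into a single executemany call with one commit (the database filename is hypothetical; the question never names it):

import re
import sqlite3

conn = sqlite3.connect('mydata.db')  # hypothetical filename, not given in the question
cur = conn.cursor()

reCol1id = re.compile(r'^(\d+)\^')
reCol15id = re.compile(r'^.*\^.*\^(\d+)\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*')

rows = []
for line in open('tablefile.txt'):
    rows.append((reCol1id.findall(line)[0], reCol15id.findall(line)[0], line.strip()))

# One statement and one commit for all 700,000 rows instead of one commit per row
cur.executemany('INSERT INTO mytable (mycol1id, mycol15id, wholeline) VALUES (?, ?, ?)', rows)
conn.commit()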

Anyway thanks for the help.

Comments
