
I am trying to read a text file using pd.read_csv

df = pd.read_csv('filename.txt', delimiter = "\t")

My text file (see below) has a few lines of text before the dataset I need to import begins. How do I skip the lines before the dataset headers? I don't want to use any solution that involves counting the number of lines I need to skip because I have to do this for multiple (similar, not same) text files. Any help is appreciated!

Note: I cannot upload the text file as it is confidential

========================================= 
hello 123
========================================= 
Dir: /x/y/z/RTchoice/release001/data 
Date: 17-Mar-2020 10:0:08 
Output File: /a/b/c/filename.txt 
N: 2842
-----------------------------------------
Subject col1    col2    col3    
001 10.00000    1.00000 3.00000 
002 11.00000    2.00000 4.00000
  • Use the skiprows argument: pd.read_csv('filename.txt', delimiter='\t', skiprows=8) Commented Mar 24, 2021 at 20:39
  • I don't want to use any solution that involves counting the number of lines I need to skip because I have to do this for multiple (similar, not same) text files. Do you think there is a way I can count the number of rows I need to skip without opening up the text files maybe? Thanks! Commented Mar 24, 2021 at 20:42
  • Then you have to identify the line somehow. Is it the '---------' that breaks the header from the data? You tell me. You can't just craft magic. There has to be some logic to it. Commented Mar 24, 2021 at 20:44
  • yeah I get that. I was just wondering if there is a more efficient way but maybe not. So I guess I could just index for the last '---------' and use that index in the skiprows argument Commented Mar 24, 2021 at 20:52

1 Answer


Here is an attempt to 'craft magic'. The idea is to call read_csv with increasing skiprows values until the parse succeeds and yields more than one column:

import pandas as pd
from io import StringIO
data = StringIO(
'''
========================================= 
hello 123
========================================= 
Dir: /x/y/z/RTchoice/release001/data 
Date: 17-Mar-2020 10:0:08 
Output File: /a/b/c/filename.txt 
N: 2842
-----------------------------------------
Subject col1    col2    col3    
001 10.00000    1.00000 3.00000 
002 11.00000    2.00000 4.00000
''')

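for n in range(1000):
    try:
        data.seek(0)  # rewind the buffer before every attempt
        df = pd.read_csv(data, delimiter=r"\s+", skiprows=n)  # raw string avoids an invalid-escape warning
    except Exception:  # parsing errors only; a bare except would also swallow KeyboardInterrupt
        print(f'skiprows = {n} failed (exception)')
    else:
        if len(df.columns) == 1:  # do not let it get away with a single-column df
            print(f'skiprows = {n} failed (single column)')
        else:
            break
print('\n', df)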

output:


skiprows = 0 failed (exception)
skiprows = 1 failed (exception)
skiprows = 2 failed (exception)
skiprows = 3 failed (exception)
skiprows = 4 failed (exception)
skiprows = 5 failed (exception)
skiprows = 6 failed (exception)
skiprows = 7 failed (exception)
skiprows = 8 failed (single column)

    Subject  col1  col2  col3
0        1  10.0   1.0   3.0
1        2  11.0   2.0   4.0
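
A less brute-force alternative follows the idea raised in the comments on the question: find the last line made up only of dashes and pass the position of the following line to skiprows. This is only a minimal sketch. It assumes the separator consists solely of '-' characters (at least three) and sits directly above the header row; the helper name find_header_row is hypothetical, and the sample text is the one from the question with trailing spaces removed.

```python
import re
from io import StringIO

import pandas as pd


def find_header_row(lines):
    """Return the 0-based index of the line following the last dashes-only line."""
    last_sep = None
    for i, line in enumerate(lines):
        # Assumption: the separator is a line containing nothing but '-' characters.
        if re.fullmatch(r"-{3,}", line.strip()):
            last_sep = i
    if last_sep is None:
        raise ValueError("no dashed separator line found")
    return last_sep + 1


text = """=========================================
hello 123
=========================================
Dir: /x/y/z/RTchoice/release001/data
Date: 17-Mar-2020 10:0:08
Output File: /a/b/c/filename.txt
N: 2842
-----------------------------------------
Subject col1    col2    col3
001 10.00000    1.00000 3.00000
002 11.00000    2.00000 4.00000
"""

# Number of leading lines to skip so that the header row is read first.
skip = find_header_row(text.splitlines())
df = pd.read_csv(StringIO(text), sep=r"\s+", skiprows=skip)
print(df)
```

For a file on disk, the same helper can be given an open file object (for example inside `with open('filename.txt') as f:`), since iterating over a file yields its lines. If the leading zeros in Subject matter, passing dtype={'Subject': str} to read_csv should keep that column as text.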

4 Comments

this is dark magic, I agree.. but was inspired by your comment!
I got the reference (-:
see if it works in 'real life' for you! I myself would describe this as more 'hacky' than 'beautiful' but thanks :-)
yup, it worked! I had never encountered try/except/else before so this was also a good learning opportunity :]
