2

I have a bunch of txt files that I need to compile into a single master file. I use read_csv to extract the information inside. There are some rows to drop, and I was wondering whether it's possible to use the skiprows feature without specifying the index numbers of the rows I want to drop, and instead tell it which rows to drop based on their content/value. Here's what the data looks like, to illustrate my point.

Index     Column 1          Column 2
0         Rows to drop      Rows to drop
1         Rows to drop      Rows to drop
2         Rows to drop      Rows to drop
3         Rows to keep      Rows to keep
4         Rows to keep      Rows to keep
5         Rows to keep      Rows to keep
6         Rows to keep      Rows to keep
7         Rows to drop      Rows to drop
8         Rows to drop      Rows to drop
9         Rows to keep      Rows to keep
10        Rows to drop      Rows to drop
11        Rows to keep      Rows to keep
12        Rows to keep      Rows to keep
13        Rows to drop      Rows to drop
14        Rows to drop      Rows to drop
15        Rows to drop      Rows to drop

What is the most effective way to do this?

4 Answers

4

Is this what you want to achieve?

import pandas as pd

# Sample data: the rows to drop are flagged by their content
df = pd.DataFrame({'A': ['row 1', 'row 2', 'drop row', 'row 4', 'row 5',
                         'drop row', 'row 6', 'row 7', 'drop row', 'row 9']})

# Keep only the rows whose content is not 'drop row'
df1 = df[df['A'] != 'drop row']

print(df)
print(df1)

Original DataFrame:

          A
0     row 1
1     row 2
2  drop row
3     row 4
4     row 5
5  drop row
6     row 6
7     row 7
8  drop row
9     row 9

New DataFrame with rows dropped:

       A
0  row 1
1  row 2
3  row 4
4  row 5
6  row 6
7  row 7
9  row 9
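
If you also want the surviving rows renumbered from 0 instead of keeping their original labels, you can reset the index afterwards (an optional extra step, sketched here):

df1 = df1.reset_index(drop=True)
# drop=True discards the old index labels instead of keeping them as a column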

While you cannot skip rows based on content, you can skip rows based on index. Here are some options for you:

Skip the first n rows:

df = pd.read_csv('xyz.csv', skiprows=2)
#this will skip 2 rows from the top

Skip specific rows:

df = pd.read_csv('xyz.csv', skiprows=[0,2,5])
#this will skip rows 1, 3, and 6 from the top
#remember row 0 is the 1st line

Skip every nth row in the file:

# You can also skip by counts.
# In the example below, skip the 0th row and every 5th row from there on.

def check_row(a):
    # a is the 0-based line number that pandas passes to the callable
    return a % 5 == 0

df = pd.read_csv('xyz.txt', skiprows=check_row)

More details can be found in the pandas documentation for skiprows.
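
Putting the content-based filter together with your multi-file setup, a minimal sketch might look like the following (the *.txt glob, the tab separator, and the literal column name 'Column 1' are assumptions here; adjust them to your actual files):

import glob
import pandas as pd

frames = []
for path in glob.glob('*.txt'):  # hypothetical: your bunch of text files
    part = pd.read_csv(path, sep='\t')  # adjust sep to your actual delimiter
    # keep only the rows whose content is not the drop marker
    frames.append(part[part['Column 1'] != 'Rows to drop'])

master = pd.concat(frames, ignore_index=True)
master.to_csv('master_file.csv', index=False)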


2 Comments

That's quite similar to what I did, except I threw in some string slicing for the rows that I wanted to drop. But yes, that's what I want to achieve; I was just wondering whether skiprows could do it on its own.
You can skip specific indices like this: usersDf = pd.read_csv('users.csv', skiprows=[0,2,5]). In this case, it will skip rows 1, 3, and 6. Remember that 0 represents the 1st row, so you have to be very specific about which rows to skip.
1

No. skiprows will not allow you to drop based on the row content/value.

From the pandas documentation:

skiprows : list-like, int or callable, optional
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
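
To make that concrete, here is a quick sketch showing that the callable only ever receives the 0-based line number, never the line's content:

import pandas as pd
from io import StringIO

data = StringIO("A\nrow 1\ndrop row\nrow 3\n")

def skip(idx):
    # idx is the 0-based line number; the line's content is not available here
    print('skiprows callable saw index:', idx)
    return idx == 2  # skips 'drop row' only because we know it sits on line 2

df = pd.read_csv(data, skiprows=skip)
print(df)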

1 Comment

I see. Even with a lambda, it still only looks at indices? Is that correct?
1

Since you cannot do that using skiprows, I'd suggest this as an efficient way:

df = pd.read_csv(filePath)

# keep only the rows whose 'column1' value is "Rows to keep"
df = df.loc[df['column1'] == "Rows to keep"]

2 Comments

Does loc return the index of that row?
@Alv It will not return the index, but the whole dataframe filtered by the condition inside. .loc is a property of the dataframe through which you can access rows, index-wise (location-wise) or based on a filter condition. Read the pandas indexing docs for details.
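
A quick illustration of that point (a small sketch with made-up data):

import pandas as pd

df = pd.DataFrame({'column1': ['Rows to drop', 'Rows to keep', 'Rows to keep']})
kept = df.loc[df['column1'] == 'Rows to keep']

print(kept)        # a filtered DataFrame, not an index
print(kept.index)  # the original row labels (1 and 2) are preserved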
-1

Not a Python solution, but this would be absurdly simple to achieve using grep/bash.

printf "Index\tColumn\s1\tColumn\s2\n" > master_file
    for j in *.txt #bunch of text files
      do
        grep -v "Rows to drop" < "$j" >> master_file
    done

Depending on how many files you have and how large they are, this may take a while, but you don't have to read the files into memory first, which presumably is the main reason you want to cherry-pick your rows in the first place.

NOTE: You can use subprocess if you want the operation embedded into a Python workflow.
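
A minimal sketch of that subprocess approach (assuming grep is on your PATH and the *.txt glob matches your files):

import glob
import subprocess

txt_files = sorted(glob.glob('*.txt'))  # hypothetical: your bunch of text files

with open('master_file', 'w') as out:
    out.write('Index\tColumn 1\tColumn 2\n')  # header row

with open('master_file', 'a') as out:
    # -h suppresses filename prefixes when grep is given multiple files;
    # grep exits 1 when nothing matches, so check=False avoids raising on that
    subprocess.run(['grep', '-v', '-h', 'Rows to drop', *txt_files],
                   stdout=out, check=False)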

1 Comment

It would be helpful for me to improve my answers in future if I knew why I was downvoted. Just to add more context to this: given that this operation is also trivial in Python, the OP's interest in skiprows suggests that reading files into memory is an issue. Using grep to solve this problem is extremely memory-efficient and solves it without reading the files into memory first, which is very much what the OP appears to want. So while not a Python solution per se, I still think it achieves the OP's aim. Happy to be corrected.
