
I import data from a CSV where I am replacing the empty fields with an 'EMPTYFIELD' value.

df = pd.read_csv('myFile.csv', usecols=['AAA', 'BBB', 'CCC'])
df = df.fillna('EMPTYFIELD')

I am trying to create a dataframe that will have all the rows that contain an 'EMPTYFIELD' value. That implies that at least one column contains this value. I used the following, and it works, of course:

error = df[df.AAA.str.contains('EMPTYFIELD')]
error = error[error.BBB.str.contains('EMPTYFIELD')]
error = error[error.CCC.str.contains('EMPTYFIELD')] 
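Note that the chained filters above actually keep rows where *every* column contains 'EMPTYFIELD' (each filter narrows the previous result), not rows where at least one column does. A mask reduced with `any(axis=1)` matches the "at least one column" intent — a minimal sketch, assuming the same three column names and a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'AAA': ['x', 'EMPTYFIELD', 'y'],
                   'BBB': ['EMPTYFIELD', 'p', 'q'],
                   'CCC': ['r', 's', 't']})

# True wherever a cell equals the sentinel; any(axis=1) flags the whole row
mask = df[['AAA', 'BBB', 'CCC']].eq('EMPTYFIELD').any(axis=1)
error = df[mask]   # rows 0 and 1 each contain the sentinel in some column
```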

Now, I am trying to reduce the lines in my code. So, I was thinking of using a lambda instead, without referencing the columns (ideal):

error2 = df.apply(lambda x: 'EMPTYFIELD' if 'EMPTYFIELD' in x else x)

#error2 = df.apply(lambda x : any([ isinstance(e, 'EMPTYFIELD') for e in x ]), axis=1) 

and then I tried referencing the columns too:

error2 = df[usecols].apply(lambda x: 'EMPTYFIELD' if 'EMPTYFIELD' in x else x)

and

error2 = df[df[usecols].isin(['EMPTYFIELD'])]

None of the above works. I print the results to a new CSV file, and I can see all the rows, even the ones that contain the 'EMPTYFIELD' value.
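For what it's worth, two separate pitfalls are at play in the attempts above. `'EMPTYFIELD' in x` inside `apply` tests the Series *index* (the row labels), not its values, and indexing a frame with a 2-D `isin` mask keeps the frame's shape (inserting NaN) rather than dropping rows. A sketch with made-up data illustrating both, and the row-wise reduction that filters as intended:

```python
import pandas as pd

df = pd.DataFrame({'AAA': ['x', 'EMPTYFIELD'], 'BBB': ['y', 'z']})

# `in` on a Series checks index membership, not the values
print('EMPTYFIELD' in df['AAA'])             # checks labels 0, 1 -> False
print(df['AAA'].isin(['EMPTYFIELD']).any())  # checks the values -> True

# Reduce the 2-D isin mask to one boolean per row before filtering
errors = df[df[['AAA', 'BBB']].isin(['EMPTYFIELD']).any(axis=1)]
```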

UPD: This is my extended code. Some of the answers return an error, possibly because of the lines below:

varA      = 'AAA'
dfGrouped = df.groupby(varA, as_index=False).agg({'Start Date': 'min', 'End Date': 'max'}).copy()

varsToKeep = ['AAA', 'BBB', 'CCC', 'Start Date_grp', 'End Date_grp' ]
dfTemp = pd.merge(df, dfGrouped, how='inner', on='AAA', suffixes=(' ', '_grp'), copy=True)[varsToKeep]

errors = dfTemp[~np.logical_or.reduce([dfTemp[col].str.contains('EMPTYFIELD') for col in varsToKeep])]

  • "Now, I am trying to reduct the rows" What does this mean? Commented Feb 15, 2018 at 16:56
  • Please provide a Minimal, Complete, Verifiable Example. Commented Feb 15, 2018 at 16:56
  • If you want to delete the rows with the bad field, why not just do df[~df.AAA.str.contains('EMPTYFIELD')]? Commented Feb 15, 2018 at 16:56
  • @pault I do not want to reference the column names. AAA is one of the columns. Commented Feb 15, 2018 at 16:57
  • You can use dropna() to create a second DataFrame. Then filter the first one using the index of the second. Commented Feb 15, 2018 at 17:09

3 Answers


One way is to use np.logical_or.reduce. Here is an example:

import pandas as pd, numpy as np

df = pd.DataFrame([['A', 'B', 'C', 'D'],
                   ['E', 'F', 'G', 'H'],
                   ['G', 'A', 'D', 'I'],
                   ['L', 'K', 'A', 'J'],
                   ['S', 'T', 'U', 'V']],
                  columns=['COL1', 'COL2', 'COL3' ,'COL4'])

df[~np.logical_or.reduce([df[col].astype(str).str.contains('A') for col in df])]

#   COL1 COL2 COL3 COL4
# 1    E    F    G    H
# 4    S    T    U    V
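To collect the matching rows instead of excluding them, drop the `~` negation — the reduced mask is one boolean per row, so it can select either side. A small sketch on a made-up two-column frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([['A', 'B'], ['E', 'F']], columns=['COL1', 'COL2'])

# One boolean Series per column, OR-ed together into a single row mask
mask = np.logical_or.reduce([df[col].astype(str).str.contains('A') for col in df])
errors = df[mask]   # rows with 'A' somewhere
clean  = df[~mask]  # rows without it
```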

4 Comments

Tried that. I am getting an error, possibly because the extended code (see the UPDATE in my question) affects it: Can only use .str accessor with string values, which use np.object_ dtype in pandas
@ApoloRadomer OK, that explains it. I've updated the code now for you to try.
Neat and tidy. I will look up the exact logic behind this, i.e. what each "command" does.

Here's an illustration of how to use dropna() as I mentioned in the comments:

df = pd.DataFrame(
    {'A': [5,3,5,6], 
     'B': [None, "foo", "bar", "foobar"], 
     'C': ["foo","bar",None, "bat"]
    }
)
no_errors = df.dropna()
errors = df[~(df.index.isin(no_errors.index))]

Which results in the following 2 dataframes:

print(no_errors)
#   A       B    C
#1  3     foo  bar
#3  6  foobar  bat

print(errors)
#   A     B     C
#0  5  None   foo
#2  5   bar  None

Now, if you want, you can call fillna() on the errors DataFrame.

1 Comment

That works if I first use dropna() and then group by before printing. Based on one column (A), say we have an ID 11111: there are 5 rows for this ID without empty fields, and then 1 row in which one column (B) has an empty value. I want all 6 rows for this ID to count as errors.
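The follow-up in this comment (once any row of an ID has an empty field, flag every row of that ID) can be handled by broadcasting the row-level flag back over the group with `groupby(...).transform` — a sketch with made-up column names and data:

```python
import pandas as pd

df = pd.DataFrame({'A': [11111, 11111, 22222],
                   'B': ['foo', None, 'bar']})

# True for a row if *any* row sharing its 'A' value has a missing field
row_has_na = df.isnull().any(axis=1)
group_has_na = row_has_na.groupby(df['A']).transform('max')
errors = df[group_has_na]   # both 11111 rows count as errors
```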

As I mentioned, you can use apply (sample data borrowed from jp's answer):

df[~df.apply(lambda x: x.str.contains('A')).any(axis=1)]

#   COL1 COL2 COL3 COL4
# 1    E    F    G    H
# 4    S    T    U    V

1 Comment

That returned an error: ('Can only use .str accessor with string values, which use np.object_ dtype in pandas', 'occurred at index AAA'). Before using your line, I am grouping by and then I am using the following: pd.merge(df, dfGrouped, how='inner', on='AAA', suffixes=(' ', '_grp'), copy=True)[varsToKeep]
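The `.str accessor` error in this comment comes from calling `.str.contains` on a column that is not string-typed (e.g. a numeric or datetime column such as the merged date columns). Casting each column to string first avoids it — a sketch with a made-up frame containing a numeric column:

```python
import pandas as pd

df = pd.DataFrame({'AAA': [1, 2], 'BBB': ['EMPTYFIELD', 'x']})

# .str.contains on the numeric 'AAA' column would raise; cast each column first
mask = df.apply(lambda col: col.astype(str).str.contains('EMPTYFIELD')).any(axis=1)
errors = df[mask]
```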
