1

I have a csv file that contains large amount of data,but the data that contained in csv file is not cleaned.The example of csv data is as follows

country     branch      no_of_employee     total_salary    count_DOB   count_email
  x            a            30                 2500000        20            25
  x            b            20                 350000         15            20
  y            c            30                 4500000        30            30
  z            d            40                 5500000        40            40
  z            e            10                 1000000        10            10
  z            f            15                 1500000        15            15

after applying the group by i am not getting the proper result .

df = data_df.groupby(['country', 'customer_branch']).count()

the result is in the form of

country  branch    no of employees   
x          1           30   
x          1           20
y          1           30
z          3           65

country x is repeating twise.This is because of source file data,in source file the country field contains "X" and "X ". That is why it display X twise .How can i ignore this problem using pandas

4
  • df['country'] = df['country'].str.strip(' ') should work Commented Feb 24, 2016 at 9:45
  • @EdChum df['country'] = df['country'].str.strip(' ').count() will it work Commented Feb 24, 2016 at 9:46
  • No, the idea is you clean your data prior to the groupby Commented Feb 24, 2016 at 9:46
  • @EdCum Thank you it works Commented Feb 24, 2016 at 9:53

1 Answer 1

3

You can call the vectorised str.strip to trim leading and trailing whitespaces:

df['country'] = df['country'].str.strip(' ')

So the above should work to clean your data and then you can call groupby to get the desired results or set_index so you can sum on an index level which looks like what you really want

Example:

In [4]:
df = pd.DataFrame({'country':['x', 'x ','y','z','z','z'], 'branch':list('abcdef'), 'no_of_employee':[30,20,30,40,10,15]})
df

Out[4]:
  branch country  no_of_employee
0      a       x              30
1      b      x               20
2      c       y              30
3      d       z              40
4      e       z              10
5      f       z              15

In [9]:
df['country'] = df['country'].str.strip()
df.set_index(['country', 'branch']).sum(level=0)

Out[9]:
         no_of_employee
country                
x                    50
y                    30
z                    65
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.