How to edit source csv file data using pandas

Question

I have a csv file that contains large amount of data,but the data that contained in csv file is not cleaned.The example of csv data is as follows

country     branch      no_of_employee     total_salary    count_DOB   count_email
  x            a            30                 2500000        20            25
  x            b            20                 350000         15            20
  y            c            30                 4500000        30            30
  z            d            40                 5500000        40            40
  z            e            10                 1000000        10            10
  z            f            15                 1500000        15            15

after applying the group by i am not getting the proper result .

df = data_df.groupby(['country', 'customer_branch']).count()

the result is in the form of

country  branch    no of employees   
x          1           30   
x          1           20
y          1           30
z          3           65

country x is repeating twise.This is because of source file data,in source file the country field contains "X" and "X ". That is why it display X twise .How can i ignore this problem using pandas

@EdChum df['country'] = df['country'].str.strip(' ').count() will it work — Edwin Baby
– Edwin Baby, Commented Feb 24, 2016 at 9:46

EdChum · Accepted Answer · 2016-02-24 09:52:57Z

3

You can call the vectorised str.strip to trim leading and trailing whitespaces:

df['country'] = df['country'].str.strip(' ')

So the above should work to clean your data and then you can call groupby to get the desired results or set_index so you can sum on an index level which looks like what you really want

Example:

In [4]:
df = pd.DataFrame({'country':['x', 'x ','y','z','z','z'], 'branch':list('abcdef'), 'no_of_employee':[30,20,30,40,10,15]})
df

Out[4]:
  branch country  no_of_employee
0      a       x              30
1      b      x               20
2      c       y              30
3      d       z              40
4      e       z              10
5      f       z              15

In [9]:
df['country'] = df['country'].str.strip()
df.set_index(['country', 'branch']).sum(level=0)

Out[9]:
         no_of_employee
country                
x                    50
y                    30
z                    65

edited Feb 24, 2016 at 9:52

answered Feb 24, 2016 at 9:45

EdChum

397k204 gold badges836 silver badges583 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

How to edit source csv file data using pandas

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related