Changing CSV files in python

Question

I have a bunch of CSV files with 4 line headers. In these files, I want to change the values in the sixth column based on the values in the second column. For example, if the second column, under the name PRODUCT is Banana, I would want to change the value in the same row under TIME to 10m. If the the product was Apple I would want the time to be 15m and so on.

When    12:07           
Area    Produce             
Store Name   FF             
Eatfresh                    
PN  PRODUCT NUMBER  INV ENT TIME
1    Banana 600000                5m
2    Apple  400000       F4        8m                                              
3    Pair                       6m
4    Banana 4000     G3            7m
5    Watermelon 700000          13m
6    Orange 12000               2m
7    Apple  1650000         6m

Desired Output

When    12:07           
Area    Produce             
Store Name   FF             
Eatfresh                    
PN  PRODUCT NUMBER  INV ENT TIME
1    Banana 600000                10m
2    Apple  400000       F4        15m                                              
3    Pair                       6m
4    Banana 4000     G3            10m
5    Watermelon 700000          13m
6    Orange 12000               2m
7    Apple  1650000         15m

I want to output all of them to be outputed to a directory call NTime. Here is what I have thus far, but being new to coding, I don't really understand a great deal and have gotten stuck on how to make the actual changes. I found Python/pandas idiom for if/then/else and it seems similar to what I want to do, but I don't completely understand what is going on.

import pandas as pd
import glob
import os

fns = glob.glob('*.csv') 
colname1 = 'PRODUCT'
colname2 = 'TIME'

for csv in fns:
    s = pd.read_csv(csv, usecols=[colname1], squeeze=True, skiprows=4, header=0)



    with open(os.path.join('NTime', fn), 'wb') as f:

Can someone help me?

Phillip Cloud · Accepted Answer · 2013-08-29 23:56:10Z

You can do this with a combination of groupby, replace and a dict

In [76]: from pandas import DataFrame

In [77]: fruits = ['banana', 'apple', 'pear', 'banana', 'watermelon', 'orange', 'apple']

In [78]: times = ['5m', '8m', '6m', '7m', '13m', '2m', '6m']

In [79]: time_map = {'banana': '10m', 'apple': '15m', 'pear': '5m'}

In [80]: df = DataFrame({'fruits': fruits, 'time': times})
Out[80]:
       fruits time
0      banana   5m
1       apple   8m
2        pear   6m
3      banana   7m
4  watermelon  13m
5      orange   2m
6       apple   6m

In [81]: def replacer(g, time_map):
   ....:     tv = g.time.values
   ....:     return g.replace(to_replace=tv, value=time_map.get(g.name, tv))

In [82]: df.groupby('fruits').apply(replacer, time_map)
Out[82]:
       fruits time
0      banana  10m
1       apple  15m
2        pear   5m
3      banana  10m
4  watermelon  13m
5      orange   2m
6       apple  15m

You said you're new to programming so I'll explain what's going on.

df.groupby('fruits') splits the DataFrame into subsets (which are DataFrames or Series objects) using the values of the fruits column.
The apply method applies a function to each of the aforementioned subsets and concatenates the result (if needed).
replacer is where the "magic" happens: each group's time values get replaced (to_replace) with the new value that's defined in time_map. The get method of dicts allows you to provide a default value if the key you're searching for (the fruit name in this case) is not there. nan is commonly used for this purpose, but here I'm actually just using the time that was already there if there isn't a new one defined for it in the time_map dict.

One thing to note is my use of g.name. This doesn't normally exist as an attribute on DataFrames (you can of course define it yourself if you want to), but is there so you can perform computations that may require the group name. In this case that's the "current" fruit you're looking at when you apply your function.

If you have a new value for each fruit or you write in the old values manually you can shorten this to a one-liner:

In [130]: time_map = {'banana': '10m', 'apple': '15m', 'pear': '5m', 'orange': '10m', 'watermelon': '100m'
}

In [131]: s = Series(time_map, name='time')

In [132]: s[df.fruits]
Out[132]:
fruits
banana         10m
apple          15m
pear            5m
banana         10m
watermelon    100m
orange         10m
apple          15m
Name: time, dtype: object

In [133]: s[df.fruits].reset_index()
Out[133]:
       fruits  time
0      banana   10m
1       apple   15m
2        pear    5m
3      banana   10m
4  watermelon  100m
5      orange   10m
6       apple   15m

Paul Meinshausen · Accepted Answer · 2013-08-29 21:54:39Z

0

Assuming that your data is in a Pandas DataFrame and looks something like this:

PN  PRODUCT NUMBER  INV ENT TIME
1    Banana 600000                10m
2    Apple  400000       F4        15m                                              
3    Pair                       6m
4    Banana 4000     G3            10m
5    Watermelon 700000          13m
6    Orange 12000               2m
7    Apple  1650000         15m

Then you should be able to do manipulate values in one column based on values in another column (same row) using simple loops like this:

for numi, i in enumerate(df["PRODUCT"]):
    if i == "Banana":
        df["TIME"][numi] = "10m"
    if i == "Apple":
        df["TIME"][numi] = "15m"

The code first loops through the rows of the dataframe column "PRODUCT", with the row value stored as i and the row-number stored as numi. It then uses if statements to identify the different levels of interest in the Product column. For those rows with the levels of interest (eg "Banana" or "Apple"), it uses the row-numbers to change the value of another column in the same row.

There are lots of ways to do this, and depending on the size of your data and the number of levels (in this case "Products") you want to change, this isn't necessarily the most efficient way to do this. But since you're a beginner, this will probably be a good basic way of doing it for you to start with.

answered Aug 29, 2013 at 21:54

Paul Meinshausen

7991 gold badge8 silver badges13 bronze badges

1 Comment

Jeff Over a year ago

fyi: you shouldn't do this type of chained assignment, see here: pandas.pydata.org/pandas-docs/dev/…; also iterating over a frame is not efficient, this is an easily vectorized problem

Collectives™ on Stack Overflow

Changing CSV files in python

2 Answers 2

Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related