Name,USAF,NCDC,Date,HrMn,I,Type,Dir,Q,I,Spd,Q

OXNARD,723927,93110,19590101,0000,4,SAO,270,1,N,3.1,1, 
OXNARD,723927,93110,19590101,0100,4,SAO,338,1,N,1.0,1, 
OXNARD,723927,93110,19590101,0200,4,SAO,068,1,N,1.0,1, 
OXNARD,723927,93110,19590101,0300,4,SAO,068,1,N,2.1,1, 
OXNARD,723927,93110,19590101,0400,4,SAO,315,1,N,1.0,1, 
OXNARD,723927,93110,19590101,0500,4,SAO,999,1,C,0.0,1, 
....

OXNARD,723927,93110,19590102,0000,4,SAO,225,1,N,2.1,1,
OXNARD,723927,93110,19590102,0100,4,SAO,248,1,N,2.1,1,
OXNARD,723927,93110,19590102,0200,4,SAO,999,1,C,0.0,1,
OXNARD,723927,93110,19590102,0300,4,SAO,068,1,N,2.1,1,

Here is a snippet of a CSV file storing hourly wind speeds (Spd), one per row. What I'd like to do is select all hourly winds for each day in the file and store them in a temporary daily list holding that day's hourly values (24 if there are no missing values). Then I'll output the current day's list, create a new empty list for the next day, collect the next day's hourly speeds, output that daily list, and so forth until the end of the file.

I'm struggling to find a good method for this. One thought I have is to read in line i, determine the date (YYYY-MM-DD), then read in line i+1 and see if that date matches date i. If they match, we're in the same day; if they don't, we've moved on to the next day. But I can't even figure out how to read in the next line of the file...

Any suggestions for executing this method, or a completely new (and better?!) method, are most welcome. Thank you in advance!

import datetime

obs_in = open(csv_file).readlines()
for i in range(1, len(obs_in)):
    # Skip over the header lines
    if not obs_in[i].startswith("Identification") and not obs_in[i].startswith("Name"):
        name,usaf,ncdc,date,hrmn,i,type,dir,q,i2,spd,q2,blank = obs_in[i].split(',')
        current_dt  = datetime.date(int(date[0:4]), int(date[4:6]), int(date[6:8]))
        current_spd = spd
        # Read in next line's date: is it in the same day?
        # If in the same day, then append spd into tmp daily list
        # If not, then start a new list for the next day
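The compare-with-the-previous-line idea can be sketched without ever peeking ahead: remember the previous line's date and flush the day's list when it changes. A minimal, self-contained sketch, using a few hypothetical rows shaped like the snippet above (Date is the 4th comma-separated field, Spd the 11th):

```python
rows = [
    "OXNARD,723927,93110,19590101,0000,4,SAO,270,1,N,3.1,1,",
    "OXNARD,723927,93110,19590101,0100,4,SAO,338,1,N,1.0,1,",
    "OXNARD,723927,93110,19590102,0000,4,SAO,225,1,N,2.1,1,",
]

days = []          # finished (date, [speeds]) pairs
daily = []         # speeds for the day currently being read
prev_date = None
for line in rows:
    fields = line.split(',')
    date, spd = fields[3], float(fields[10])
    if prev_date is not None and date != prev_date:
        days.append((prev_date, daily))   # flush the finished day
        daily = []                        # start a fresh list
    daily.append(spd)
    prev_date = date
if daily:
    days.append((prev_date, daily))       # flush the last, unfinished day
```

Instead of collecting into `days`, each flush point could just as well write the list to a file or print it.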
4
  • Have a list and store lines until the date changes. When the date changes, dump what's in the list to file, refresh the list, then move on. Commented Dec 17, 2011 at 22:02
  • So at the end, do you want a whole bunch of files with 24 lines each, with filenames like spd19590101.csv, spd19590102.csv etc.? Commented Dec 17, 2011 at 22:20
  • How can I mark when the date changes? I don't know how to read in the next line and extract its date to see if it's different from the previous line's date. Ultimately, I want one list of 24 values per date (YYYYMMDD): output that list, then move on to the next day, create a new empty list, populate it with the next 24 values, output it... Commented Dec 17, 2011 at 22:38
  • You don't read the next line. You just read and put the data into a buffer, but remember the date of the previous line. Then, when you process a new line, compare its date to the previous line's. When the date changes, you flush the buffer to file, clear the buffer, then resume storing lines in the buffer. Commented Dec 17, 2011 at 22:42

3 Answers


You can take advantage of the well-ordered nature of the data file and use csv.DictReader. With it you can quite simply build up a dictionary of the wind speeds organized by date, which you can then process however you like. Note that the csv reader returns strings, so you may want to convert to other types as appropriate while you assemble the list.

import csv
from collections import defaultdict

bydate = defaultdict(list)
rdr = csv.DictReader(open('winds.csv', 'rt'))
for k in rdr:
    bydate[k['Date']].append(float(k['Spd']))

print(bydate)
# defaultdict(<class 'list'>, {'19590101': [3.1, 1.0, 1.0, 2.1, 1.0, 0.0],
#                              '19590102': [2.1, 2.1, 0.0, 2.1]})

You can obviously change the argument of the append call to a tuple, for instance append((float(k['Spd']), datetime.datetime.strptime(k['Date'] + k['HrMn'], '%Y%m%d%H%M'))), so that you can also collect the times.
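Spelled out as a runnable sketch (the two sample rows below are hypothetical, in the same layout as the question's file):

```python
import csv
import datetime
from collections import defaultdict
from io import StringIO

# Two hypothetical rows in the question's column layout.
data = """Name,USAF,NCDC,Date,HrMn,I,Type,Dir,Q,I,Spd,Q
OXNARD,723927,93110,19590101,0000,4,SAO,270,1,N,3.1,1,
OXNARD,723927,93110,19590101,0100,4,SAO,338,1,N,1.0,1,
"""

bydate = defaultdict(list)
for k in csv.DictReader(StringIO(data)):
    # Combine Date (YYYYMMDD) and HrMn (HHMM) into one timestamp.
    stamp = datetime.datetime.strptime(k['Date'] + k['HrMn'], '%Y%m%d%H%M')
    bydate[k['Date']].append((float(k['Spd']), stamp))
```

Each day then maps to a list of (speed, timestamp) tuples rather than bare speeds.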

If the file has extraneous spaces, you can use the skipinitialspace parameter: rdr = csv.DictReader(open('winds.csv', 'rt'), skipinitialspace=True). If that still doesn't work, you can pre-process the header line:

bydate = defaultdict(list)
with open('winds.csv', 'rt') as f:
    fieldnames = [k.strip() for k in f.readline().split(', ')]
    rdr = csv.DictReader(f, fieldnames=fieldnames, skipinitialspace=True)
    for k in rdr:
        bydate[k['Date']].append(k['Spd'])

bydate is accessed like a regular dictionary. To access a specific day's data, do bydate['19590101']. To get the list of dates that were processed, you can do bydate.keys().
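For example, with a hypothetical bydate filled in as the loop above would produce it:

```python
from collections import defaultdict

# Hypothetical result of reading the snippet from the question.
bydate = defaultdict(list)
bydate['19590101'].extend([3.1, 1.0, 1.0, 2.1, 1.0, 0.0])
bydate['19590102'].extend([2.1, 2.1, 0.0, 2.1])

day = bydate['19590101']           # one day's speeds
dates = sorted(bydate.keys())      # all processed dates
mean_spd = sum(day) / len(day)     # daily mean works once values are floats
```

This also shows why converting Spd to float while reading matters: summing or averaging a list of strings fails.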

If you want to convert them to Python datetime objects at the time of reading the file, you can import datetime, then replace the assignment line with bydate[datetime.datetime.strptime(k['Date'], '%Y%m%d')].append(k['Spd']).


6 Comments

Thanks for the suggestion, mtrw! Follow up question: I have some trailing and leading white space in the actual csv file (I deleted them manually when pasting the snippet above), so that in order for the above script to work, line 6 needs to be: bydate[k['Date ']].append(k[' Spd']). How can I remove the white space in the read-in, so I can just use 'Date' and 'Spd' in line 6?
Also, how do you then extract the speeds for just 19590101, for example? (I'm a total newbie to DictReader)
skipinitialspace=True appears to remove only leading whitespace - is there a corresponding option to remove trailing whitespace as well?
Did you try splitting fieldname line and stripping the whitespace, as shown in the second example?
The second example works perfectly; I realized that my header line was not separated by commas, so I changed that line to use .split(). The speeds are stored as a list, but I can't take the average of them: dates = bydate.keys(); for dt in dates: mean_spd = mean(bydate[dt]) raises TypeError: cannot perform reduce with flexible type.

It can be something like this.

def dump(buf, date):
    """dumps buffered line into file 'spdYYYYMMDD.csv'"""
    if len(buf) == 0: return
    with open('spd%s.csv' % date, 'w') as f:
        for line in buf:
            f.write(line)

import datetime

obs_in = open(csv_file).readlines()
# buf stores one day record
buf = []
# date0 is meant for time stamp for the buffer
date0 = None
for i in range(1,len(obs_in)):        
    # Skip over the header lines 
    if not str(obs_in[i]).startswith("Identification") and \
        not str(obs_in[i]).startswith("Name"):
        name,usaf,ncdc,date,hrmn,ii,type,dir,q,i2,spd,q2,blank = \
            obs_in[i].split(',')
        current_dt  = datetime.date(int(date[0:4]),int(date[4:6]),int(date[6:8]))
        current_spd = spd 
        # see if the time stamp of current record is different.  if it is different
        # dump the buffer, and also set the time stamp of buffer
        if date != date0:
            dump(buf, date0)
            buf = []
            date0 = date
        # you change this.  i am simply writing entire line
        buf.append(obs_in[i])

# when you get out the buffer should be filled with the last day's record.  
# so flush that too.
dump(buf, date0)

I also found that I had to use ii instead of i for the field "I" in the data, since you used i as the loop counter.
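Since the file is already ordered by day, the same buffer-and-flush pattern can also be expressed with itertools.groupby, which groups consecutive lines sharing a key. A sketch over a few hypothetical rows in the question's layout:

```python
import itertools

rows = [
    "OXNARD,723927,93110,19590101,0000,4,SAO,270,1,N,3.1,1,",
    "OXNARD,723927,93110,19590101,0100,4,SAO,338,1,N,1.0,1,",
    "OXNARD,723927,93110,19590102,0000,4,SAO,225,1,N,2.1,1,",
]

def day_of(line):
    return line.split(',')[3]            # the Date column

# One (date, speeds) entry per consecutive run of same-day lines.
groups = {date: [float(l.split(',')[10]) for l in grp]
          for date, grp in itertools.groupby(rows, key=day_of)}
```

Each group could be written to its own spdYYYYMMDD.csv inside the loop instead of being collected into a dict.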



I know this question is from years ago, but I just wanted to point out that a small bash script can neatly perform this task. I copied your example into a file called data.txt, and this is the script:

#!/bin/bash
date=19590101
date_end=19590102
while [[ $date -le $date_end ]] ; do
  grep ",${date}," data.txt > file_${date}.txt
  date=`date +%Y%m%d -d ${date}+1day` # NOTE: MAC-OSX date differs
done

Note that this won't work on macOS, where the date command implementation (BSD rather than GNU) differs, so on macOS you either need to use gdate (from coreutils) or change the options to match the BSD date.

If there are dates missing from the file, the grep command produces an empty file - this link shows ways to avoid that: how to stop grep creating empty file if no results
