I've got a CSV file that contains rows for every minute of the day for multiple days. It is generated by a data acquisition system that sometimes misses a few rows.
The data looks like this: a datetime field followed by some integers.
"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
"2017-01-07 03:08:02","6","3","2","12","1"
"2017-01-07 03:09:02","7","3","2","12","0"
"2017-01-07 03:10:02","6","3","2","11","1"
There are missing rows in the above (real data) example. As the data doesn't change very much between samples, I'd like to just copy the last valid data into the missing rows. The problem I'm having is detecting which rows are missing.
I'm processing the CSV with a Python program I've cobbled together (I'm very new to Python). It works for the data I have:
import csv
import datetime
with open("minutedata.csv", 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        date = datetime.datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")
        v1 = int(row[1])
        v2 = int(row[2])
        v3 = int(row[3])
        v4 = int(row[4])
        v5 = int(row[5])
        ...(process values)...
        ...(save data)...
I'm unsure how to check if the current row is next in sequence, or comes after some missing rows.
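One way to check this with the plain `csv` approach is to remember the previous row's timestamp and compare it with the current one. The following is only a sketch, written for Python 3 and assuming samples are nominally one minute apart; `fill_gaps`, `STEP` and `FMT` are names made up for illustration, and the half-step tolerance is an assumption to absorb a second or two of jitter.

```python
import csv
import datetime
import io

STEP = datetime.timedelta(minutes=1)  # assumed nominal sample interval
FMT = "%Y-%m-%d %H:%M:%S"


def fill_gaps(rows):
    """Yield (timestamp, values) pairs, repeating the last values for skipped minutes."""
    prev_time = None
    prev_values = None
    for row in rows:
        when = datetime.datetime.strptime(row[0], FMT)
        values = [int(v) for v in row[1:]]
        if prev_time is not None:
            expected = prev_time + STEP
            # A row is "next in sequence" if it arrives within half a step of
            # the expected time; otherwise emit a copy of the previous values
            # for each minute that was skipped.
            while when - expected > STEP / 2:
                yield expected, prev_values
                expected += STEP
        yield when, values
        prev_time, prev_values = when, values


# Two rows from the sample data with a four-minute hole between them.
sample = io.StringIO(
    '"2017-01-07 03:02:02","7","3","2","12","0"\n'
    '"2017-01-07 03:07:02","7","3","2","12","0"\n'
)
filled = list(fill_gaps(csv.reader(sample)))
for when, values in filled:
    print(when, values)
```

The generator leaves the original rows untouched and only inserts copies, so the existing processing loop could iterate over `fill_gaps(reader)` instead of `reader`.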
Edit to add:
I'm now trying to use pandas, thanks to jeremycg for the pointer.
I've added a header row to the CSV, so now it looks like:
time,v1,v2,v3,v4,v5
"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
"2017-01-07 03:08:02","6","3","2","12","1"
"2017-01-07 03:09:02","7","3","2","12","0"
"2017-01-07 03:10:02","6","3","2","11","1"
The processing code is now:
import pandas as pd
import io
z = pd.read_csv('minutedata.csv')
z['time'] = pd.to_datetime(z['time'])
z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']),freq="1min")).ffill()
for row in z:
    date = datetime.datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")
    v1 = int(row[1])
    v2 = int(row[2])
    v3 = int(row[3])
    v4 = int(row[4])
    v5 = int(row[5])
    ...(process values)...
    ...(save data)...
but this errors out:
Traceback (most recent call last):
  File "process_day.py", line 14, in <module>
    z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min")).ffill()
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2821, in reindex
    **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 2259, in reindex
    fill_value, copy).__finalize__(self)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2767, in _reindex_axes
    fill_value, limit, tolerance)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2778, in _reindex_index
    allow_dups=False)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 2371, in _reindex_with_indexers
    copy=copy)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 3839, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "/usr/local/lib/python2.7/site-packages/pandas/indexes/base.py", line 2494, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
I'm lost as to what it is now claiming is broken.
See the comment further down for the fix.
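For context, this error generally means the index being reindexed contains repeated labels, which here would be repeated timestamps in the file. A small sketch for listing them follows; it assumes the `time` header shown above, and the inline sample with a repeated row is made up purely for illustration.

```python
import io

import pandas as pd

# Made-up sample containing one repeated timestamp, for illustration only.
sample = io.StringIO(
    'time,v1,v2,v3,v4,v5\n'
    '"2017-01-07 03:00:02","7","3","2","13","0"\n'
    '"2017-01-07 03:01:02","7","3","2","13","0"\n'
    '"2017-01-07 03:01:02","7","3","2","13","0"\n'
)
z = pd.read_csv(sample)

# keep=False marks every occurrence of a repeated timestamp,
# not just the second and later ones, so both copies are listed.
dupes = z[z['time'].duplicated(keep=False)]
print(dupes)

# Keeping only the first occurrence of each timestamp leaves a
# unique column that can safely become the index for reindex().
unique = z[~z['time'].duplicated()]
print(len(unique))
```

Printing `dupes` against the real file should show which timestamps the acquisition system logged more than once.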
The working code is now:
import pandas as pd
import datetime
z = pd.read_csv('minutedata1.csv')
# Drop repeated timestamps so the index is unique before reindexing
z = z[~z.time.duplicated()]
z['time'] = pd.to_datetime(z['time'])
# reindex() and ffill() return a new DataFrame, so assign the result back to z
z = z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min")).ffill()
# After set_index, the timestamp is the row index and is already a datetime
for date, row in z.iterrows():
    v1 = int(row['v1'])
    v2 = int(row['v2'])
    v3 = int(row['v3'])
    v4 = int(row['v4'])
    v5 = int(row['v5'])
    ...(process values)...
    ...(save data)...
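As a sanity check on the gap filling, here is a sketch that runs the same reindex-and-ffill steps on a few of the sample rows and keeps the result in a separate variable. The names `full_range` and `filled` are introduced here for illustration. It assumes every timestamp lands on the same second of the minute, as in the sample data, because `reindex` only matches labels exactly.

```python
import io

import pandas as pd

sample = io.StringIO(
    'time,v1,v2,v3,v4,v5\n'
    '"2017-01-07 03:02:02","7","3","2","12","0"\n'
    '"2017-01-07 03:07:02","7","3","2","12","0"\n'
    '"2017-01-07 03:08:02","6","3","2","12","1"\n'
)
z = pd.read_csv(sample)
z['time'] = pd.to_datetime(z['time'])
z = z.set_index('time')

# One label per minute between the first and last sample.
full_range = pd.date_range(z.index.min(), z.index.max(), freq="1min")

# reindex() inserts NaN rows for the missing minutes and ffill()
# copies the last valid row into them; both return a new DataFrame.
filled = z.reindex(full_range).ffill()

for date, row in filled.iterrows():
    # The NaN rows turn the integer columns into floats, so convert back.
    v1 = int(row['v1'])
    print(date, v1)
```

If the seconds drift in the real data, rows that miss the one-minute grid would be dropped by `reindex`, so it is worth comparing `len(z)` with the number of original rows that survive in `filled`.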
My sincere thanks to everyone that helped. - David