
I've got a CSV file that contains rows for every minute of the day for multiple days. It is generated by a data acquisition system that sometimes misses a few rows.

The data looks like this - a datetime field followed by some integers

"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
"2017-01-07 03:08:02","6","3","2","12","1"
"2017-01-07 03:09:02","7","3","2","12","0"
"2017-01-07 03:10:02","6","3","2","11","1"

There are missing rows in the above (real data) example. Since the data doesn't change very much between samples, I'd like to just copy the last valid data into the missing rows. The problem I'm having is detecting which rows are missing.

I'm processing the CSV with a python program I've cobbled together (I'm very new to python). This works to process the data I have.

import csv
import datetime

with open("minutedata.csv", 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        date = datetime.datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")
        v1 = int(row[1])
        v2 = int(row[2])
        v3 = int(row[3])
        v4 = int(row[4])
        v5 = int(row[5])
    ...(process values)...

...(save data)...

I'm unsure how to check if the current row is next in sequence, or comes after some missing rows.

Edit to add :

I'm trying to use Pandas now thanks to jeremycg for the pointer to that.

I've added a header row to the CSV, so now it looks like:

time,v1,v2,v3,v4,v5
"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
"2017-01-07 03:08:02","6","3","2","12","1"
"2017-01-07 03:09:02","7","3","2","12","0"
"2017-01-07 03:10:02","6","3","2","11","1"

The processing code is now:

import pandas as pd
import datetime
z = pd.read_csv('minutedata.csv')
z['time'] = pd.to_datetime(z['time'])
z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']),freq="1min")).ffill()
for row in z:
    date = datetime.datetime.strptime (row [0],"%Y-%m-%d %H:%M:%S")
    v1 = int(row[1])
    v2 = int(row[2])
    v3 = int(row[3])
    v4 = int(row[4])
    v5 = int(row[5])
    ...(process values)...

...(save data)...

but this errors out:

Traceback (most recent call last):
  File "process_day.py", line 14, in <module>
    z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min")).ffill()
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2821, in reindex
    **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 2259, in reindex
    fill_value, copy).__finalize__(self)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2767, in _reindex_axes
    fill_value, limit, tolerance)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2778, in _reindex_index
    allow_dups=False)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 2371, in _reindex_with_indexers
    copy=copy)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 3839, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "/usr/local/lib/python2.7/site-packages/pandas/indexes/base.py", line 2494, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis

I'm lost as to what it is now claiming is broken.
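For what it's worth, the message comes from reindex refusing to map from an index that contains duplicate labels. A minimal sketch reproducing the same failure, using made-up duplicate timestamps (the question's real file evidently contains some):

```python
import pandas as pd

# Made-up example: two rows share the same timestamp, as the
# acquisition system apparently sometimes produces.
s = pd.Series([1, 2, 3],
              index=pd.to_datetime(["2017-01-07 03:00:02",
                                    "2017-01-07 03:00:02",
                                    "2017-01-07 03:01:02"]))
try:
    s.reindex(pd.date_range(s.index.min(), s.index.max(), freq="1min"))
except ValueError as e:
    print(e)  # complains about duplicate labels in the index
```

Dropping the duplicate labels first (as the fix below does) makes the reindex succeed.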

See the comments on the accepted answer below for the fix.

The working code is now :

import pandas as pd
import datetime

z = pd.read_csv('minutedata1.csv')
z = z[~z.time.duplicated()]
z['time'] = pd.to_datetime(z['time'])
z = z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min")).ffill()
for index, row in z.iterrows():
    date = index.to_pydatetime()
    v1 = int(row[0])
    v2 = int(row[1])
    v3 = int(row[2])
    v4 = int(row[3])
    v5 = int(row[4])
    ...(process values)...

...(save data)...

My sincere thanks to everyone that helped. - David

    As you iterate, keep track of the timestamp in the previous row by storing it in a variable and updating it at the end of each iteration. Commented Jan 12, 2017 at 20:42
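That suggestion can be sketched as follows; the two sample rows are taken from the question's data, and the arithmetic assumes one sample per minute:

```python
import datetime

# Two consecutive samples from the question, with a gap between them.
rows = [
    ["2017-01-07 03:02:02", "7", "3", "2", "12", "0"],
    ["2017-01-07 03:07:02", "7", "3", "2", "12", "0"],
]

prev = None
for row in rows:
    date = datetime.datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")
    if prev is not None:
        # Number of whole minutes skipped since the previous sample.
        gap = int((date - prev).total_seconds() // 60) - 1
        if gap > 0:
            print("missing", gap, "row(s) before", row[0])  # here: 4
    prev = date
```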

2 Answers


You should probably be using pandas for this, as it is made for this kind of stuff.

First read the csv:

import pandas as pd
import io
x = '''
time,a,b,c,d,e
"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
"2017-01-07 03:08:02","6","3","2","12","1"
"2017-01-07 03:09:02","7","3","2","12","0"
"2017-01-07 03:10:02","6","3","2","11","1"''' #your data, with added headers
z = pd.read_csv(io.StringIO(x)) #you can use your file name here

now z is a pandas dataframe:

z.head()

time    a   b   c   d   e
0   2017-01-07 03:00:02 7   3   2   13  0
1   2017-01-07 03:01:02 7   3   2   13  0
2   2017-01-07 03:02:02 7   3   2   12  0
3   2017-01-07 03:07:02 7   3   2   12  0
4   2017-01-07 03:08:02 6   3   2   12  1

First, convert the 'time' column to datetime:

z['time'] = pd.to_datetime(z['time'])

Set the 'index' of the dataframe to be the time, then reindex over our range:

z = z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min"))
z

a   b   c   d   e
2017-01-07 03:00:02 7.0 3.0 2.0 13.0    0.0
2017-01-07 03:01:02 7.0 3.0 2.0 13.0    0.0
2017-01-07 03:02:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:03:02 NaN NaN NaN NaN NaN
2017-01-07 03:04:02 NaN NaN NaN NaN NaN
2017-01-07 03:05:02 NaN NaN NaN NaN NaN
2017-01-07 03:06:02 NaN NaN NaN NaN NaN
2017-01-07 03:07:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:08:02 6.0 3.0 2.0 12.0    1.0
2017-01-07 03:09:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:10:02 6.0 3.0 2.0 11.0    1.0

Then use .ffill() to fill in from the previous value:

z.ffill()

a   b   c   d   e
2017-01-07 03:00:02 7.0 3.0 2.0 13.0    0.0
2017-01-07 03:01:02 7.0 3.0 2.0 13.0    0.0
2017-01-07 03:02:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:03:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:04:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:05:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:06:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:07:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:08:02 6.0 3.0 2.0 12.0    1.0
2017-01-07 03:09:02 7.0 3.0 2.0 12.0    0.0
2017-01-07 03:10:02 6.0 3.0 2.0 11.0    1.0

or, all together:

z = pd.read_csv(io.StringIO(x))
z['time'] = pd.to_datetime(z['time'])
z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min")).ffill()
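One caveat: the chained expression returns a new DataFrame rather than modifying z in place, so assign the result back before processing or saving it. A sketch under that assumption, using two of the question's rows (the output filename is made up):

```python
import io
import pandas as pd

# Two of the question's rows with a one-minute gap between them.
x = '''time,a,b,c,d,e
"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"'''

z = pd.read_csv(io.StringIO(x))
z['time'] = pd.to_datetime(z['time'])
# reindex/ffill return a new DataFrame, so assign the result back.
z = z.set_index('time').reindex(
    pd.date_range(min(z['time']), max(z['time']), freq="1min")).ffill()
z.to_csv("minutedata_filled.csv")  # hypothetical output filename
```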

5 Comments

Just tried this, see latest edit, but it dies in a messy way - any idea? Thanks for your help, Pandas looks very useful.
it looks like you have some duplicate timestamps in your file. You can try adding the line z[~z.time.duplicated()] before z['time'] = pd.to_datetime(z['time'])
That's the same thing isn't it? Thanks for helping me with this - I'm going to look at this with fresh eyes tomorrow.
ah, z = z[~z.time.duplicated()] — previously we did the filter, but not the assignment
Thanks very, much that fixed it. I'd completely missed it was being assigned.

Using pandas as suggested by jeremycg is recommended. Though if you are looking for a solution without pandas, here it goes:

import csv
import datetime

data = []

with open("minutedata.csv", newline='') as f:
    reader = csv.reader(f, delimiter=',')

    prev_date = None

    for row in reader:

        date = datetime.datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")

        if prev_date:
            diff = date - prev_date

            if diff > datetime.timedelta(minutes=1):

                for i in range(int(diff.total_seconds() / 60) - 1):
                    new_date = prev_date + datetime.timedelta(minutes=i + 1)
                    new_row = [str(new_date)] + row[1:]

                    data.append(",".join(new_row))

        prev_date = date

        data.append(",".join(row))

print(data)

Explanation: We iterate through each row and check the current row's date with the previous row's date

diff = date - prev_date

If we see the difference is greater than 1 minute we enter a loop that runs for the range of the missing data

if diff > datetime.timedelta(minutes=1):

    for i in range((int(diff.total_seconds() / 60) - 1)):
        ...

We calculate the missing values by adding minutes to the previous date

new_date = prev_date + datetime.timedelta(minutes=i + 1)
new_row = [str(new_date)] + row[1:]

And you are done!
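To finish the "...(save data)..." step from the question: since data already holds each row as a comma-joined string, writing it out is a plain line write. The sample rows and output filename below are made up:

```python
# `data` as built by the loop above: comma-joined row strings.
# These two sample rows and the filename are hypothetical.
data = [
    "2017-01-07 03:02:02,7,3,2,12,0",
    "2017-01-07 03:03:02,7,3,2,12,0",
]

with open("filled_rows.csv", "w") as f:
    f.write("\n".join(data) + "\n")
```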

1 Comment

Thank you. I'm going to try to get the pandas way going, as I have another use where pandas looks like it would be very useful. I like your solution too; I can see why Python is so popular, it's runnable pseudocode.
