I would like to read a csv file every month from the government census website here; specifically, the one named VIP-mf.zip. To save your eyes from a cumbersome df, you can download the zip file using this (it's 1.5 MB).
I only want to read the csv after the row that says 'DATA', which for this specific file is row 309. I can do that easily using:
import pandas as pd
df = pd.read_csv('VIP-mf.csv', skiprows=310)
The problem is that next month, when the new csv is published on the website, that skiprows parameter will have to be 311, or else the file is read incorrectly. I would like a dynamic skiprows parameter that captures this change every month, so I can automatically download the file and read it correctly.
I tried implementing a solution from this answer using this article by creating a function for the skiprows parameter using the following:
def fetch_skip(index):
    if index == 'DATA':
        return True
    return False
df = pd.read_csv('VIP-mf.csv', skiprows=lambda x: fetch_skip(x))
but I get this error:
ParserError: Error tokenizing data. C error: Expected 4 fields in line 311, saw 7
I assume this is because the csv contains several "mini-tables" within a single file, even though I only need the final "table", which has the column names:
['per_idx', 'cat_idx', 'dt_idx', 'et_idx', 'geo_idx', 'is_adj', 'val']
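A side note on the fetch_skip attempt: according to the pandas documentation, a callable passed to skiprows is evaluated against the row indices (integers), not against the text of each row, so a comparison with the string 'DATA' can never be true and no rows are skipped. The small check below is only a sketch to illustrate this; the helper name record_and_keep and the three-line sample are made up for the demonstration.

```python
import io
import pandas as pd

seen = []  # every value pandas hands to the callable

def record_and_keep(index):
    """Hypothetical helper: remember the argument and skip nothing."""
    seen.append(index)
    return False

# Made-up three-line CSV, just enough to trigger the callable.
pd.read_csv(io.StringIO("a,b\n1,2\n3,4\n"), skiprows=record_and_keep)

# The callable only ever receives integer row numbers, never row contents.
only_ints = all(isinstance(i, int) for i in seen)
print(only_ints, "DATA" in seen)
```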
Thank you for your help.
P.S. If there is another way to do this besides fiddling with the skiprows parameter, that works too.
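To make concrete what I mean by a dynamic skiprows, here is a minimal sketch of the kind of thing I am after: scan the text for the marker line first, then pass its position to read_csv. The function names find_marker_row and read_final_table are hypothetical, the sample text is invented (not real census data), and the sketch assumes the marker line contains only DATA (possibly with quotes or trailing commas), so that a header such as "DATA TYPES" does not match.

```python
import io
import pandas as pd

def find_marker_row(lines, marker="DATA"):
    """Return the 0-based index of the first line that consists only of `marker`."""
    for i, line in enumerate(lines):
        # Drop the newline, padding commas and quotes before comparing.
        if line.strip().strip(",").strip('"') == marker:
            return i
    raise ValueError(f"marker line {marker!r} not found")

def read_final_table(text, marker="DATA"):
    """Parse only the table that follows the marker line."""
    marker_row = find_marker_row(io.StringIO(text), marker)
    # Skip everything up to and including the marker line itself,
    # so the next line becomes the header row.
    return pd.read_csv(io.StringIO(text), skiprows=marker_row + 1)

# Tiny made-up sample mimicking the "mini-tables" layout.
sample = (
    "CATEGORIES\n"
    "cat_idx,cat_code,cat_desc,cat_indent\n"
    "1,A,Example,0\n"
    "\n"
    "DATA TYPES\n"
    "dt_idx,dt_code\n"
    "1,T\n"
    "\n"
    "DATA\n"
    "per_idx,cat_idx,dt_idx,et_idx,geo_idx,is_adj,val\n"
    "1,1,1,0,1,0,100\n"
    "2,1,1,0,1,0,105\n"
)

df = read_final_table(sample)
print(df)
```

For the real file, the text could be read with open('VIP-mf.csv').read() and passed to read_final_table. Because the row number is looked up on every run, nothing needs to change when the header section grows by a line next month.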