
I would like to read a csv file every month from the government census website here, specifically the one named VIP-mf.zip. To save your eyes from a cumbersome df, you can download the zip file using this (it's 1.5MB).

I only want to read the csv after the row that says 'DATA', which for this specific file is row 309. I can do that easily using:

import pandas as pd
df = pd.read_csv('VIP-mf.csv', skiprows=310)

The problem is that next month, when the new csv is published on the website, that skiprows parameter will have to be 311, or else the file is read incorrectly. I would like a dynamic skiprows parameter that captures this change every month, so I can automatically download and read the file correctly.
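For context, the download step itself is easy to automate; something like this sketch works for me (the URL here is just a placeholder for the actual census link, and I'm assuming the csv inside the zip is named VIP-mf.csv):

import io
import zipfile

import requests

# Placeholder URL -- substitute the real census link to VIP-mf.zip.
url = "https://www.census.gov/.../VIP-mf.zip"

resp = requests.get(url)
resp.raise_for_status()
with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    zf.extract("VIP-mf.csv")  # assumed name of the csv inside the zip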

I tried implementing a solution from this answer, using this article, by creating a function for the skiprows parameter:

def fetch_skip(index):
    if index == 'DATA':
        return True
    return False
df = pd.read_csv('VIP-mf.csv', skiprows=lambda x: fetch_skip(x))

but I get this error:

ParserError: Error tokenizing data. C error: Expected 4 fields in line 311, saw 7

I'm assuming this is because the csv has several "mini-tables" within a single file, even though I only need the final "table", which has the column names:

['per_idx', 'cat_idx', 'dt_idx', 'et_idx', 'geo_idx', 'is_adj', 'val']

Thank you for your help. P.S. If there is another way to do this besides fiddling with the skiprows parameter, that works too.

2 Answers


I found the answer in another question here; I only had to make a slight change. (As it turns out, the callable passed to skiprows receives integer row indices, not the row's text, which is why fetch_skip above never matched anything.)

import os
import pandas as pd

def skip_to(fle, line, **kwargs):
    """Read a csv starting from the first line that begins with `line`."""
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        # Scan forward until we hit the marker line.
        while not cur_line.startswith(line):
            if not cur_line:  # hit EOF without finding the marker
                raise ValueError(f"Marker {line!r} not found in {fle}")
            pos = f.tell()
            cur_line = f.readline()
        # Rewind to the start of the marker line and let pandas take over.
        f.seek(pos)
        return pd.read_csv(f, **kwargs)

Then I used:

df = skip_to('path_to_file.csv',"DATA", skiprows=1)
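If you'd rather keep a plain read_csv call, the same idea can be written as a first pass that just finds the marker's line number (a sketch, assuming as above that the header row comes right after the 'DATA' line):

import pandas as pd

# First pass: find the 0-based index of the 'DATA' marker line.
with open('path_to_file.csv') as f:
    for i, row in enumerate(f):
        if row.startswith('DATA'):
            break

# Skip everything up to and including the marker; the next line is the header.
df = pd.read_csv('path_to_file.csv', skiprows=i + 1)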



An alternative, if you have access to a Linux machine: you could parse the data through the shell, use grep to pull out the data, and read it in via pandas. Note that for the grep I used a count of 100_000 to get the rows after the match; you can vary that depending on how many rows you think the data will have.

I also assume (and I may be wrong) that the headers will always be per_idx,cat_idx.., which is what I grep for:

from io import StringIO
import pandas as pd
import subprocess

# grep -A100000 prints the matching header line plus up to 100,000 lines after it.
shell_string = """grep -A100000 "per_idx,cat_idx" /home/sam/Downloads/VIP-mf/VIP-mf.csv"""
data = subprocess.run(shell_string,
                      shell=True,
                      capture_output=True,
                      text=True).stdout
df = pd.read_csv(StringIO(data))
df.head(5)
   per_idx  cat_idx  dt_idx  et_idx  geo_idx  is_adj      val
0        1        1       1       0        1       0  59516.0
1        1        2       1       0        1       0  25972.0
2        1        3       1       0        1       0  33545.0
3        1        4       1       0        1       0    989.0
4        1        5       1       0        1       0   3763.0



df.shape
   (65472, 7)
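If grep isn't available (e.g. on Windows), the same filtering idea can be reproduced in pure Python, for example with itertools.dropwhile; this sketch also stops naturally at the end of the file, so no row cap is needed (again assuming the header starts with per_idx,cat_idx):

from io import StringIO
from itertools import dropwhile

import pandas as pd

# Drop every line before the header, then keep everything from it onwards.
with open('/home/sam/Downloads/VIP-mf/VIP-mf.csv') as f:
    data = ''.join(dropwhile(lambda ln: not ln.startswith('per_idx,cat_idx'), f))

df = pd.read_csv(StringIO(data))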

Comments

Thank you for this alternative solution. Unfortunately, I'm on a Windows machine and the script will also run on a Windows machine, so I'm unable to use this method. Just to clarify: is the shell_string basically going through 100,000 rows after finding the headers? Is there not a way to stop once the rows are finished? Thanks.
No, after it finds the header it returns a maximum of 100,000 rows, including and after the header. If there are fewer than 100,000 rows it returns just those; if there are more, it returns only 100,000. I used 100,000 rows as a test case; you could pass a higher number (just think of it as a max_nrows to return once the header is found).
