
I would like to read a csv file every month from the government census website here, specifically the one named VIP-mf.zip. To save your eyes from a cumbersome df, you can download the zip file using this (it's 1.5MB).

I only want to read the csv after the row that says 'DATA', which for this specific file is row 309. I can do that easily using:

import pandas as pd
df = pd.read_csv('VIP-mf.csv', skiprows=310)

The problem is that next month, when the new csv is published on the website, that skiprows parameter will have to be 311, or else the file is read incorrectly. I would like a dynamic skiprows parameter that captures this change every month, so I can automatically download and read the file correctly.
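For context, the download step itself is easy to automate; something like this sketch works for me (the URL here is just a placeholder for the actual census link, and I'm assuming the csv inside the zip is named VIP-mf.csv):

import io
import zipfile

import requests

# Placeholder URL -- substitute the real census link to VIP-mf.zip.
url = "https://www.census.gov/.../VIP-mf.zip"

resp = requests.get(url)
resp.raise_for_status()
with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    zf.extract("VIP-mf.csv")  # assumed name of the csv inside the zip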

I tried implementing a solution from this answer, using this article, by creating a function for the skiprows parameter:

def fetch_skip(index):
    if index == 'DATA':
        return True
    return False
df = pd.read_csv('VIP-mf.csv', skiprows=lambda x: fetch_skip(x))

but I get this error:

ParserError: Error tokenizing data. C error: Expected 4 fields in line 311, saw 7

I'm assuming this is because the csv has several "mini-tables" within a single file, even though I only need the final "table", which has the column names:

['per_idx', 'cat_idx', 'dt_idx', 'et_idx', 'geo_idx', 'is_adj', 'val']

Thank you for your help. P.S. If there is another way to do this besides fiddling with the skiprows parameter, that works too.

2 Answers


I found the answer in another question here; I only had to make a slight change. (As it turns out, the callable passed to skiprows receives integer row indices, not the row's text, which is why fetch_skip above never matched anything.)

import os
import pandas as pd

def skip_to(fle, line, **kwargs):
    """Read a csv starting from the first line that begins with `line`."""
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        # Scan forward until we hit the marker line.
        while not cur_line.startswith(line):
            if not cur_line:  # hit EOF without finding the marker
                raise ValueError(f"Marker {line!r} not found in {fle}")
            pos = f.tell()
            cur_line = f.readline()
        # Rewind to the start of the marker line and let pandas take over.
        f.seek(pos)
        return pd.read_csv(f, **kwargs)

Then I used:

df = skip_to('path_to_file.csv',"DATA", skiprows=1)
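If you'd rather keep a plain read_csv call, the same idea can be written as a first pass that just finds the marker's line number (a sketch, assuming as above that the header row comes right after the 'DATA' line):

import pandas as pd

# First pass: find the 0-based index of the 'DATA' marker line.
with open('path_to_file.csv') as f:
    for i, row in enumerate(f):
        if row.startswith('DATA'):
            break

# Skip everything up to and including the marker; the next line is the header.
df = pd.read_csv('path_to_file.csv', skiprows=i + 1)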



An alternative, if you have access to a Linux machine: you could parse the data through the shell, use grep to pull out the data, and read it in via pandas. Note that for the grep I used a count of 100_000 to get the rows after the match; you can vary that depending on how many rows you think the data will have.

I also assume (and I may be wrong) that the headers will always be per_idx,cat_idx.., which is what I grep for:

from io import StringIO
import pandas as pd
import subprocess

# grep -A100000 prints the matching header line plus up to 100,000 lines after it.
shell_string = """grep -A100000 "per_idx,cat_idx" /home/sam/Downloads/VIP-mf/VIP-mf.csv"""
data = subprocess.run(shell_string,
                      shell=True,
                      capture_output=True,
                      text=True).stdout
df = pd.read_csv(StringIO(data))
df.head(5)
   per_idx  cat_idx  dt_idx  et_idx  geo_idx  is_adj      val
0        1        1       1       0        1       0  59516.0
1        1        2       1       0        1       0  25972.0
2        1        3       1       0        1       0  33545.0
3        1        4       1       0        1       0    989.0
4        1        5       1       0        1       0   3763.0



df.shape
   (65472, 7)
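If grep isn't available (e.g. on Windows), the same filtering idea can be reproduced in pure Python, for example with itertools.dropwhile; this sketch also stops naturally at the end of the file, so no row cap is needed (again assuming the header starts with per_idx,cat_idx):

from io import StringIO
from itertools import dropwhile

import pandas as pd

# Drop every line before the header, then keep everything from it onwards.
with open('/home/sam/Downloads/VIP-mf/VIP-mf.csv') as f:
    data = ''.join(dropwhile(lambda ln: not ln.startswith('per_idx,cat_idx'), f))

df = pd.read_csv(StringIO(data))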

Comments

Thank you for this alternative solution. Unfortunately, I'm on a Windows machine and the script will also run on a Windows machine, so I'm unable to use this method. Just to clarify: is the shell_string basically going through 100,000 rows after finding the headers? Is there not a way to stop once the rows are finished? Thanks.
No, after it finds the header it returns a maximum of 100,000 rows, including and after the header. If there are fewer than 100,000 rows it returns just those; if there are more, it returns only 100,000. I used 100,000 rows as a test case; you could pass a higher number (just think of it as a max_nrows to return once the header is found).
