In R, data.table's fread is commonly used to read TSV/CSV and similar files. Its skip argument is very useful: you can pass it a string, and the line in which that string is found is then used as the header row (handy when you only know a substring of the column-names row).

Is there a similar function in Python? It seems extremely useful.

Cheers!

  • Have a look at pydatatable; it should offer the same functionality as fread. (Commented Oct 12, 2021 at 19:28)

1 Answer

A technique I sometimes use (e.g. to filter faulty data, when none of the other wonderful capabilities of pandas.read_csv() seem to address the case at hand) is to define a subclass of io.TextIOWrapper.

In your case, you could write:

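import io


class SkipUntilMatchWrapper(io.TextIOWrapper):
    """Text wrapper that discards lines until matcher(line) is true."""

    def __init__(self, f, matcher, include_matching=False):
        super().__init__(f, line_buffering=True)
        self.f = f
        self.matcher = matcher
        self.include_matching = include_matching
        self.has_matched = False

    def read(self, size=None):
        # On the first call(s), consume lines up to and including the match.
        while not self.has_matched:
            line = self.readline()
            if not line:
                # EOF without a match: stop instead of looping forever.
                break
            if self.matcher(line):
                self.has_matched = True
                if self.include_matching:
                    # Hand the matching line (e.g. the header) to the caller first.
                    return line
        return super().read(size)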

Let's try it on a simple example:

# make an example file: three junk lines, a header, then 5 rows of random digits
import numpy as np

with open('sample.csv', 'w') as f:
    print('garbage 1', file=f)
    print('garbage 2', file=f)
    print('and now for some data', file=f)
    print('a,b,c', file=f)
    x = np.random.randint(0, 10, size=(5, 3))
    np.savetxt(f, x, fmt='%d', delimiter=',')

Read:

import pandas as pd

# open in binary mode: io.TextIOWrapper expects a binary stream to wrap
with open('sample.csv', 'rb') as f_orig:
    with SkipUntilMatchWrapper(f_orig, lambda s: 'a,b,c' in s, include_matching=True) as f:
        df = pd.read_csv(f)
>>> df
   a  b  c
0  2  7  8
1  7  3  3
2  3  6  9
3  0  6  0
4  4  0  9

Another way is to match the line just before the header and drop it (include_matching defaults to False):

with open('sample.csv', 'rb') as f_orig:
    with SkipUntilMatchWrapper(f_orig, lambda s: 'for some data' in s) as f:
        df = pd.read_csv(f)
>>> df
   a  b  c
0  2  7  8
1  7  3  3
2  3  6  9
3  0  6  0
4  4  0  9
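
For simple cases, a lighter alternative to subclassing may be to drop the leading lines with itertools.dropwhile before parsing. The sketch below is not part of the approach above and uses only the standard library; rows_from_header and the inline sample are hypothetical names chosen for illustration. It assumes the header is the first line that contains the search string.

```python
import csv
import io
from itertools import dropwhile


def rows_from_header(lines, needle):
    """Return a csv.DictReader whose header is the first line containing `needle`.

    `lines` is any iterable of text lines (an open text file works).
    This helper is hypothetical, written only for this example.
    """
    # dropwhile discards lines until one contains the needle, then passes
    # that line and everything after it through unchanged.
    kept = dropwhile(lambda line: needle not in line, lines)
    return csv.DictReader(kept)


# Inline sample mirroring the layout of sample.csv (values are made up).
sample = io.StringIO(
    "garbage 1\n"
    "garbage 2\n"
    "and now for some data\n"
    "a,b,c\n"
    "2,7,8\n"
    "7,3,3\n"
)
rows = list(rows_from_header(sample, "a,b,c"))
print(rows[0])
```

If a DataFrame is needed, the kept lines can be joined and wrapped in io.StringIO before being passed to pd.read_csv. If no line contains the search string, the reader simply yields no rows.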