Parse non-homogeneous CSV file with Python

Question

I have a CSV file structured like this:

# Samples 1
1,58
2,995
3,585

# Samples 2
15,87
16,952
17,256

# Samples 1
4,89
5,63
6,27

Is there a way in Python 3.x to parse a file structured like this without manually going through it line by line?

I'd like a function that parses it automatically and merges the sections by label, like this:

>> parseLabeledCSV(['# Samples 1', '# Samples 2'], fileName)
[{1:58, 2:995, 3:585, 4:89, 5:63, 6:27}, {15:87, 16:952, 17:256}]

What do you mean by parse: split into columns? There are many Python packages that specialise in reading CSV data.
– nbryans, Commented Jun 23, 2016 at 17:25
What did you mean by non-homogeneous? The rows look homogeneous to me: each has two integers. Please update your post with the expected output. Have you looked into the csv module?
– Hai Vu, Commented Jun 23, 2016 at 17:26
The edit significantly changes the meaning of the question. Initially it was not at all clear that these were key-value pairs.
– Alyssa Haroldsen, Commented Jun 23, 2016 at 17:31
@Eenoku Considering this seems to be a custom format, I'd say the safest bet is to just go line by line.
– Alyssa Haroldsen, Commented Jun 23, 2016 at 17:33

Ivonet · Accepted Answer · 2016-06-23 17:44:30Z

Something like this?

data = """# Samples 1
1,58
2,995
3,585

# Samples 2
15,87
16,952
17,256

# Samples 1
4,89
5,63
6,27"""


def parse(text):
    """Group "key,value" rows under the most recent "# ..." header line."""
    parsed = {}
    key = "# Unknown"  # label for rows that appear before any header
    for line in text.splitlines():
        if not line.strip():  # ignore empty lines
            continue
        if line.startswith("#"):
            key = line
            # dict.has_key() was removed in Python 3; setdefault creates the
            # section once and reuses it when the same header appears again
            parsed.setdefault(key, {})
            continue
        left, right = line.split(",")
        parsed.setdefault(key, {})[left] = right
    return parsed


if __name__ == '__main__':
    output = parse(data)
    print(output)

This outputs (on Python 3.7+, where dicts keep insertion order):

{'# Samples 1': {'1': '58', '2': '995', '3': '585', '4': '89', '5': '63', '6': '27'}, '# Samples 2': {'15': '87', '16': '952', '17': '256'}}
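
For the exact shape the question asks for (one merged dict per requested label, with integer keys and values), a minimal sketch might look like the following. The name parse_labeled_csv and the choice to take an iterable of lines rather than a file name are assumptions made for illustration; they are not part of the answer above.

```python
def parse_labeled_csv(labels, lines):
    """Hypothetical helper: merge the rows under each label into one dict.

    Returns a list of dicts in the same order as `labels`.
    """
    merged = {label: {} for label in labels}
    current = None  # dict of the section being read; None if not requested
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank separator lines
        if line.startswith("#"):
            current = merged.get(line)  # None for labels we were not asked for
            continue
        if current is not None:
            left, right = line.split(",")
            current[int(left)] = int(right)
    return [merged[label] for label in labels]


sample = """# Samples 1
1,58
2,995
3,585

# Samples 2
15,87
16,952
17,256

# Samples 1
4,89
5,63
6,27""".splitlines()

print(parse_labeled_csv(["# Samples 1", "# Samples 2"], sample))
# [{1: 58, 2: 995, 3: 585, 4: 89, 5: 63, 6: 27}, {15: 87, 16: 952, 17: 256}]
```

Because the function only iterates over lines and strips each one, an open file object can be passed in place of the list, for example inside a with open(file_name) as f: block.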

edited Jun 23, 2016 at 17:44

answered Jun 23, 2016 at 17:32

Ivonet


PaulMcG · Accepted Answer · 2016-06-30 22:10:43Z

itertools.groupby will do all the iterating and grouping for you. In this case, you only care about those contiguous groups of lines that contain a ',' (or are composed only of ',' and digits, or whatever other filter predicate you care to define). Note that this yields one dict per contiguous block, so the two "# Samples 1" blocks are not merged:

input="""# Samples 1
1,58
2,995
3,585

# Samples 2
15,87
16,952
17,256

# Samples 1
4,89
5,63
6,27""".splitlines()

from itertools import groupby
import csv

results = []
for has_comma, data_lines in groupby(input, key=lambda s: ',' in s):
    if has_comma:
        results.append(dict(csv.reader(data_lines)))

This can even be collapsed to a single Python list comprehension statement:

results = [dict(csv.reader(data_lines)) 
            for has_comma, data_lines in groupby(input, key=lambda s: ',' in s) 
                if has_comma]

In both cases, print the results using:

for dd in results:
    print(dd)

to get:

{'1': '58', '2': '995', '3': '585'}
{'15': '87', '16': '952', '17': '256'}
{'4': '89', '5': '63', '6': '27'}
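
If the blocks should also be merged by their label, as the question requests, the same groupby idea can be adapted by grouping on header lines instead of on commas. This is only a sketch under that assumption; the names lines, merged, and label are illustrative and not from the answer above.

```python
import csv
from collections import defaultdict
from itertools import groupby

lines = """# Samples 1
1,58
2,995
3,585

# Samples 2
15,87
16,952
17,256

# Samples 1
4,89
5,63
6,27""".splitlines()

merged = defaultdict(dict)
label = None  # most recent header seen so far
for is_header, group in groupby(lines, key=lambda s: s.startswith("#")):
    if is_header:
        # If several header lines are adjacent, the last one labels the data
        label = list(group)[-1]
    elif label is not None:
        # Blank separator lines land in the data group, so drop them
        # before handing the rows to csv.reader
        rows = csv.reader(line for line in group if line.strip())
        merged[label].update(rows)

for name, values in merged.items():
    print(name, values)
```

Keys and values stay as strings here, as in the answer above; convert them with int() if numeric keys are needed.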
