Parsing a text file into a pandas DataFrame

Question

I have a .txt file that looks like this:

SHT1 E:   T1:30.45°C    H1:59.14 %RH
SHT2 S:   T2:29.93°C    H2:67.38 %RH

SHT1 E:   T1:30.49°C    H1:58.87 %RH
SHT2 S:   T2:29.94°C    H2:67.22 %RH

SHT1 E:   T1:30.53°C    H1:58.69 %RH
SHT2 S:   T2:29.95°C    H2:67.22 %RH

I want to have a DataFrame that looks like this:

      T1     H1     T2     H2
0  30.45  59.14  29.93  67.38
1  30.49  58.87  29.94  67.22
2  30.53  58.69  29.95  67.22

I parse this by:

1. Reading the text file line by line
2. Parsing the lines, e.g. matching only the parts with T1, T2, H1, and H2, splitting by :, and removing °C and %RH
3. The above produces a list of lists, each having two items
4. I flatten the list of lists
5. Just to chop it up into a list of four-item lists
6. Dump that to a df
7. Write to an Excel file

Here's the code:

import itertools

import pandas as pd


def read_lines(file_object) -> list:
    return [
        parse_line(line) for line in file_object.readlines() if line.strip()
    ]


def parse_line(line: str) -> list:
    return [
        i.split(":")[-1].replace("°C", "").replace("%RH", "")
        for i in line.strip().split()
        if i.startswith(("T1", "T2", "H1", "H2"))
    ]


def flatten(parsed_lines: list) -> list:
    return list(itertools.chain.from_iterable(parsed_lines))


def cut_into_pieces(flattened_lines: list, piece_size: int = 4) -> list:
    return [
        flattened_lines[i:i + piece_size] for i
        in range(0, len(flattened_lines), piece_size)
    ]


with open("your_text_data.txt") as data:
    df = pd.DataFrame(
        cut_into_pieces(flatten(read_lines(data))),
        columns=["T1", "H1", "T2", "H2"],
    )
    print(df)
    df.to_excel("your_table.xlsx", index=False)

This works and I get what I want, but I feel like points 3, 4, and 5 are a bit of redundant work, especially creating a list of lists just to flatten it and then chop it up again.

Question:

How could I simplify the whole parsing process? Or maybe most of the heavy-lifting can be done with pandas alone?

Also, any other feedback is more than welcome.

riskypenguin · Accepted Answer · 2021-03-26 23:00:55Z


Disclaimer: I know this is a very liberal interpretation of a code review since it suggests an entirely different approach. I still thought it might provide a useful perspective when thinking about such problems in the future and reducing coding effort.

I would suggest the following approach using regex to extract all the numbers that match the format "12.34".

import re
import pandas as pd

with open("your_text_data.txt") as data_file:
    data_list = re.findall(r"\d\d\.\d\d", data_file.read())

result = [data_list[i:i + 4] for i in range(0, len(data_list), 4)]

df = pd.DataFrame(result, columns=["T1", "H1", "T2", "H2"])
print(df)
df.to_excel("your_table.xlsx", index=False)

This will of course only work for the data format you provided, and the code will need to be adjusted if that format changes. For example, if the relevant numbers may contain a varying number of digits, you might use the regex "\d+\.\d+" to match all numbers that have at least one digit on either side of the decimal point.
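As a rough sketch of that adjustment, the snippet below applies the looser pattern to an inline sample. The sample string and its values are made up for illustration and are not from your file. It also converts the matches to float, since re.findall returns strings.

```python
import re

# Hypothetical sample with readings of varying width (illustration only).
sample = """\
SHT1 E:   T1:9.5°C    H1:100.0 %RH
SHT2 S:   T2:29.93°C    H2:67.38 %RH
"""

# \d+\.\d+ matches one or more digits on each side of the decimal point.
values = [float(v) for v in re.findall(r"\d+\.\d+", sample)]

# Group the flat list into rows of four: T1, H1, T2, H2.
rows = [values[i:i + 4] for i in range(0, len(values), 4)]
```

The resulting rows can be passed to pd.DataFrame exactly as before. Negative readings would not match this pattern; an optional sign such as "-?\d+\.\d+" would be one way to allow them.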

Also, please note how the context manager with open(...) as x: is used: only code that accesses the file object needs to be, and should be, inside the managed block.
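To illustrate the point about the managed context, here is a small self-contained sketch. It writes a throwaway file first (the temporary directory, file name, and contents are placeholders I chose), reads it inside the with block, and does the parsing after the file has been closed.

```python
import re
import tempfile
from pathlib import Path

# Create a small throwaway file so the sketch runs on its own
# (file name and contents are placeholders).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample.txt"
    path.write_text("SHT1 E:   T1:30.45°C    H1:59.14 %RH\n", encoding="utf-8")

    # Only the read needs the open file handle.
    with open(path, encoding="utf-8") as data_file:
        text = data_file.read()

    # The file is already closed here; parsing works on the string.
    file_was_closed = data_file.closed
    numbers = re.findall(r"\d\d\.\d\d", text)
```

Keeping the with block this small means the file handle is released as soon as the contents are in memory.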

edited Mar 26, 2021 at 23:00

answered Mar 26, 2021 at 18:47

riskypenguin


baduker, commented Mar 26, 2021 at 20:36:
I absolutely don't mind that you've offered a new approach. I totally forgot about regex, I was so much into those lists of lists. This is short, simple, and does the job. Nice! Thank you for your time and insight.

baduker, commented Mar 26, 2021 at 20:37:
PS. You've got your imports the other way round. re should be first and then pandas.

riskypenguin, commented Mar 26, 2021 at 23:01:
You're right, I fixed the import order!


RootTwo · Accepted Answer · 2021-03-26 23:55:25Z

You can use numpy.loadtxt() to read the data and numpy.reshape() to get the shape you want. By default, loadtxt splits on whitespace and uses a dtype of float. usecols selects the columns we want. converters is a dict mapping column numbers to functions that convert the column data; here they chop off the unwanted text. The .reshape() call converts the resulting numpy array from two columns to four columns (the -1 lets numpy calculate the number of rows).

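import numpy as np
import pandas as pd

# With an explicit encoding the converters receive str, not bytes.
data = np.loadtxt("your_text_data.txt", encoding="utf-8",
                  usecols=(2, 3),
                  converters={2: lambda s: float(s[3:-2]),  # "T1:30.45°C"
                              3: lambda s: float(s[3:])},   # "H1:59.14"
                 ).reshape(-1, 4)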

Then just load it in a dataframe and name the columns:

df = pd.DataFrame(data, columns='T1 H1 T2 H2'.split())
df

Output:

       T1      H1      T2      H2
0   30.45   59.14   29.93   67.38
1   30.49   58.87   29.94   67.22
2   30.53   58.69   29.95   67.22
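To see why reshape(-1, 4) puts each SHT1 line next to the SHT2 line that follows it, here is a small sketch on a hand-built array. The array name pairs is mine; the values are the first two blocks of the file in the question.

```python
import numpy as np

# Each row is one sensor line: (temperature, humidity).
pairs = np.array([
    [30.45, 59.14],  # SHT1
    [29.93, 67.38],  # SHT2
    [30.49, 58.87],  # SHT1
    [29.94, 67.22],  # SHT2
])

# reshape fills row by row, so two consecutive sensor lines share a row.
# -1 asks numpy to infer the row count: 8 values / 4 columns = 2 rows.
table = pairs.reshape(-1, 4)
```

This relies on the file strictly alternating SHT1 and SHT2 lines. If a line were missing, the total count would either stop being a multiple of four (and reshape would raise an error) or the columns would silently shift.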
