Python: How to read csv file with different separators?

Question

This is the first line of my .txt file:

0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00

There should be 8 columns, sometimes separated with '-', sometimes with '.'. It's very confusing, I just have to work with the file, I didn't generate it.

And second question: How can I work with the different columns? There is no header, so maybe:

df.iloc[:,0] .. ?

No CSV writer would produce a document like that, so it would have to be typed in. Also, the - and . do not look like separators. Could it be that - is just the sign of a negative number and . the decimal point? In that case there is no separator at all, just a fixed field width, in your case 12 characters (1.12E+01, -1.22E-02, 1.58E-04, 1.58E-04, 0.00E+00, 3.40E+02, 3.28E-02, 0.00E+00).
– Ma0, Commented Sep 14, 2016 at 8:32
It's not generated by a CSV writer. It comes from an Ansys code I didn't write. Yes, '-' is the negative sign of the exponent after E and '.' is the decimal point. How can I parse it anyway?
– s.ping, Commented Sep 14, 2016 at 8:39
That line seems hard enough for a human to parse! The whole line is data; there are no definite boundaries or delimiters.
– Ébe Isaac, Commented Sep 14, 2016 at 8:44
Yes, I know, but I need to work with it and have no idea how.
– s.ping, Commented Sep 14, 2016 at 8:47

Hugues Fontenelle · Accepted Answer · 2016-09-15 15:54:47Z


As stated in the comments, this is likely a list of numbers in scientific notation that aren't separated by anything but are simply glued together. It could be interpreted as:

0.112296E+02
-.121994E-010
.158164E-030
.158164E-030
.000000E+000
.340000E+030
.328301E-010
.000000E+00

or as

0.112296E+02
-.121994E-01
0.158164E-03
0.158164E-03
0.000000E+00
0.340000E+03
0.328301E-01
0.000000E+00

Assuming the second interpretation is better, the trick is to split evenly every 12 characters.

data = [line[i:i+12] for i in range(0, len(line), 12)]
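A minimal sketch of the fixed-width split applied to the sample line and converted to floats. It assumes that every field is exactly 12 characters wide; the names `WIDTH`, `fields` and `values` are illustrative. The trailing newline that a line read from a file carries is stripped first, because otherwise the slicing yields a final chunk containing only the newline, which `float()` rejects.

```python
# Sample line from the question, with the newline a file read would include.
line = '0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00\n'
WIDTH = 12  # assumed field width

line = line.rstrip('\r\n')  # drop the line ending before slicing
fields = [line[i:i + WIDTH] for i in range(0, len(line), WIDTH)]
values = [float(field) for field in fields]
print(values)
# [11.2296, -0.0121994, 0.000158164, 0.000158164, 0.0, 340.0, 0.0328301, 0.0]
```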

If the first interpretation really is the better one, then I'd use a regex:

import re
line = '0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00'
pattern = r'[+-]?\d??\.\d+E[+-]\d+'  # raw string, so the backslashes reach the regex engine unchanged
data = re.findall(pattern, line)
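For comparison, a sketch of a variant that fixes the exponent at exactly two digits, which makes the regex reproduce the second (fixed-width) interpretation. Only the end of the pattern changes (`\d{2}` instead of `\d+`); the name `pattern_two_digits` is illustrative.

```python
import re

line = '0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00'

# Same pattern as above, but the exponent is limited to two digits,
# so the leading 0 of the next number is no longer swallowed.
pattern_two_digits = r'[+-]?\d??\.\d+E[+-]\d{2}'
data = re.findall(pattern_two_digits, line)
values = [float(d) for d in data]
print(data)
```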

Edit

Obviously, you'd need to iterate over each line in the file and add it to your DataFrame, which is a rather inefficient thing to do in pandas. Therefore, if your preferred interpretation is the fixed-width one, I'd go with @Ev. Kounis' answer: df = pd.read_fwf(myfile, widths=[12]*8)

Otherwise, the inefficient way is:

import pandas as pd

df = pd.DataFrame(columns=range(8))
with open(myfile, 'r') as f_in:
    for i, line in enumerate(f_in):
        # extract the numbers from this line and store them as floats in row i
        data = re.findall(pattern, line)
        df.loc[i] = [float(d) for d in data]

The two things to notice here are that the DataFrame must be initialized with column names (here [0, 1, 2, 3..7], but perhaps you know of better identifiers), and that the regex gives us strings that must be cast to floats.
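A possible way to avoid growing the DataFrame row by row is to collect the parsed rows in a plain list and build the DataFrame once at the end. The sketch below makes several assumptions: `io.StringIO` stands in for the real file, the sample line is simply repeated twice, and the pattern uses the two-digit exponent variant (swap in whichever pattern matches your interpretation). It also shows the column access asked about in the question.

```python
import io
import re

import pandas as pd

line = '0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00'
pattern = r'[+-]?\d??\.\d+E[+-]\d{2}'  # assumed: exponent is always two digits

# Stand-in for open(myfile): the sample line repeated twice.
f_in = io.StringIO((line + '\n') * 2)

# Parse every line into a list of floats, then build the DataFrame in one go.
rows = [[float(d) for d in re.findall(pattern, text_line)] for text_line in f_in]
df = pd.DataFrame(rows)  # columns are labelled 0..7 by default

first_column = df.iloc[:, 0]  # positional access to a column
print(df.shape)
print(first_column)
```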

edited Sep 15, 2016 at 15:54

answered Sep 14, 2016 at 9:08

Hugues Fontenelle


4 Comments

Ma0 Over a year ago

After the E comes the sign, and after that only 2 digits. This is important since '-.121994E-010' and '-.121994E-01' don't convert to the same float. So yes, the second interpretation is better, but you are printing the first, right?

Dunes Over a year ago

You can use \d{2} instead of \d+ at the end to make the regex capture the second form.

s.ping Over a year ago

The first solution seems to work pretty great. But how can I do it if my file contains more than one row? I need the solution column-wise

Hugues Fontenelle Over a year ago

I extended my answer, but this starts to be outside the scope of the initial question :-) (search for other QA)

Ma0 · Accepted Answer · 2016-09-14 09:01:08Z

As I said in the comments, this is not a case of multiple separators; it is just a fixed-width format. Pandas has a function for reading such files. Try this:

df = pd.read_fwf(myfile, widths=[12]*8)
print(df)  # prints -> [0.112296E+02, -.121994E-01, 0.158164E-03, 0.158164E-03.1, 0.000000E+00, 0.340000E+03, 0.328301E-01, 0.000000E+00.1]

For widths you have to provide the cell width, which looks like it's 12, repeated once per column; as you say, there must be 8 columns.

As you might notice, the result of the read is not perfect: the .1 suffix on the 4th and last elements appears because read_fwf treats the first line as a header row and de-duplicates the repeated column names. Passing header=None makes it read that line as data instead.
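A sketch of the same call with `header=None`, so that the first line is read as data rather than as column names. Here `io.StringIO` stands in for the file, and the sample line is repeated twice purely for illustration.

```python
import io

import pandas as pd

line = '0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00'
myfile = io.StringIO(line + '\n' + line + '\n')  # stand-in for the real file path

# Eight fields of 12 characters each; header=None keeps the first line as data.
df = pd.read_fwf(myfile, widths=[12] * 8, header=None)
print(df.dtypes)
print(df.iloc[:, 0])  # first column, selected by position
```

Without a header the columns are labelled 0 to 7, so `df[0]` and `df.iloc[:, 0]` select the same column.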

Alternatively, you can do it "manually" like so:

myfile = r'C:\Users\user\Desktop\PythonScripts\a_file.csv'
width = 12
my_content = []
with open(myfile, 'r') as f_in:
    for lines in f_in:
        data = [float(lines[i * width:(i + 1) * width]) for i in range(len(lines) // width)]
        my_content.append(data)
print(my_content)  # prints -> [[11.2296, -0.0121994, 0.000158164, 0.000158164, 0.0, 340.0, 0.0328301, 0.0]]

and every row would be a nested list.
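If the values are needed column-wise rather than row-wise, one option is to transpose the nested list with `zip`. In this sketch `my_content` is filled in by hand with the row printed above, repeated twice for illustration; the name `columns` is an assumption.

```python
# Two identical rows, as the manual parser above would produce for a two-line file.
row = [11.2296, -0.0121994, 0.000158164, 0.000158164, 0.0, 340.0, 0.0328301, 0.0]
my_content = [row, list(row)]

# zip(*rows) pairs up the i-th element of every row, i.e. it yields the columns.
columns = [list(column) for column in zip(*my_content)]
print(columns[0])  # [11.2296, 11.2296]
```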

Salvatore · Accepted Answer · 2016-09-14 09:05:23Z


A possible solution is the following:

row = '0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00'
chunk_len = 12
# print each 12-character chunk on its own line
for i in range(0, len(row), chunk_len):
    print(row[i:i + chunk_len])

You can easily extend the code to handle more general cases.
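One possible generalization, sketched under the assumption that the field widths are known in advance but need not all be equal. The helper `split_fixed_width` is hypothetical and not part of any library.

```python
from itertools import accumulate


def split_fixed_width(row, widths):
    """Slice row into consecutive fields with the given widths."""
    ends = list(accumulate(widths))  # running totals give each field's end offset
    starts = [0] + ends[:-1]         # each field starts where the previous one ended
    return [row[start:end] for start, end in zip(starts, ends)]


row = '0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00'
fields = split_fixed_width(row, [12] * 8)
print(fields)
```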

answered Sep 14, 2016 at 9:05

Salvatore
