Convert data from .data file to .csv file and put data in columns using pandas

Question

I want to convert data from a .data file to a .csv file and put the data from the .data file in columns with values under them. However, the .data file has a specific format and I don't know how to put the text in it in columns. Here is how the .data file looks like:

column1  
column2  
column3  
column4  
column5  
column6  
column7  
column8  
column9  
column10  
column11  
column12  
column13  
........
column36

1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573  
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444  

1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573  
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444  

1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573  
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444  

1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573  
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444

The file as shown above has the names of 36 columns, each on 1 line. Under these are many datapoints, with 36 values in them that are separated by semicolons. The datapoints are 2 lines long and each datapoint is separated by a blank line. The .csv file must look like this:

column1,column2,column3,column4,column5,column6,column7,column8,column9,column10,column11,column12,column14,column15,column16,column17,column18,column20,column20,column21,column22,column23,column24,column25,column26,column27,column28,column29,column30,column31,column32,column33,column34,column35,column36
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444

The first line of the .csv as shown above file must consist of 36 columns with the names in it separated by commas. The next lines must consist of all datapoints, each on 1 line and in which the 36 values must be separated by commas.

Can you use the software library 'pandas' for this? Anyways, this is my starting code:

with open("file.data") as fIn, open("file.csv", "w") as fOut:
    for r, line in enumerate(fIn):
        if not line:
            break

Thanks

Have you tried using pd.read_csv, you can look at the docs its basic pandas.pydata.org/pandas-docs/stable/generated/… — Bharath M Shetty
– Bharath M Shetty, Commented Dec 15, 2017 at 13:00
I dont know i am pretty new to this I have tried this: data = pd.read_csv("file.csv"). but i dont know what to do with it then — Motambe Markumbo
– Motambe Markumbo, Commented Dec 15, 2017 at 13:01

9dogs · Accepted Answer · 2017-12-15 14:51:48Z

5

Sure you can do it with pandas. You just need to read first N lines (36 in your case) to use them as header and read rest of the file like normal csv (pandas good at it). Then you can save pandas.DataFrame object to csv.

Since your data splitted into adjacent lines, we should split DataFrame we've read on two and stack them one next to other (horizontaly).

Consider the following code:

import pandas as pd

COLUMNS_COUNT = 36
# read first `COLUMNS_COUNT` lines to serve as a header
with open('data.data', 'r') as f:
    columns = [next(f).strip() for line in range(COLUMNS_COUNT)]
# read rest of the file to temporary DataFrame
temp_df = pd.read_csv('data.data', skiprows=COLUMNS_COUNT, header=None, delimiter=';', skip_blank_lines=True)
# split temp DataFrame on even and odd rows
even_df = temp_df.iloc[::2].reset_index(drop=True)
odd_df = temp_df.iloc[1::2].reset_index(drop=True)
# stack even and odd DataFrames horizontaly
df = pd.concat([even_df, odd_df], axis=1)
# assign column names
df.columns = columns
# save result DataFrame to csv
df.to_csv('out.csv', index=False)

UPD: code updated to correctly process data splitted onto two lines

edited Dec 15, 2017 at 14:51

answered Dec 15, 2017 at 13:06

9dogs

1,46610 silver badges18 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Bharath M Shetty Over a year ago

OP might need csv itself as an output. Maybe you need to add to_csv too

9dogs Over a year ago

good point, missed that. Also noticed one need to strip() lines before using it as col names

Motambe Markumbo Over a year ago

thanks! only 1 thing, the code doesn't take into account that each datapoint in the .data file is 2 lines long. So after the first line of the datapoint, the next value in the second line of the datapoint, value '3567' gets printed under the 'column1', but its supposed to go under 'column19', i hope you understand my question :)

Motambe Markumbo Over a year ago

Oh and the format looks like a great tabular, but the columns are just supposed to be separated by a comma and each value under them as well.

Motambe Markumbo Over a year ago

The print is supposed to look just like the format i gave in my question

|

Collectives™ on Stack Overflow

Convert data from .data file to .csv file and put data in columns using pandas

1 Answer 1

6 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

6 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related