4

I want to convert data from a .data file to a .csv file and put the data from the .data file in columns with values under them. However, the .data file has a specific format and I don't know how to put the text in it in columns. Here is how the .data file looks like:

column1  
column2  
column3  
column4  
column5  
column6  
column7  
column8  
column9  
column10  
column11  
column12  
column13  
........
column36

1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573  
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444  

1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573  
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444  

1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573  
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444  

1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573  
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444

The file as shown above has the names of 36 columns, each on 1 line. Under these are many datapoints, with 36 values in them that are separated by semicolons. The datapoints are 2 lines long and each datapoint is separated by a blank line. The .csv file must look like this:

column1,column2,column3,column4,column5,column6,column7,column8,column9,column10,column11,column12,column14,column15,column16,column17,column18,column20,column20,column21,column22,column23,column24,column25,column26,column27,column28,column29,column30,column31,column32,column33,column34,column35,column36
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444

The first line of the .csv as shown above file must consist of 36 columns with the names in it separated by commas. The next lines must consist of all datapoints, each on 1 line and in which the 36 values must be separated by commas.

Can you use the software library 'pandas' for this? Anyways, this is my starting code:

with open("file.data") as fIn, open("file.csv", "w") as fOut:
    for r, line in enumerate(fIn):
        if not line:
            break

Thanks

2
  • Have you tried using pd.read_csv, you can look at the docs its basic pandas.pydata.org/pandas-docs/stable/generated/… Commented Dec 15, 2017 at 13:00
  • I dont know i am pretty new to this I have tried this: data = pd.read_csv("file.csv"). but i dont know what to do with it then Commented Dec 15, 2017 at 13:01

1 Answer 1

5

Sure you can do it with pandas. You just need to read first N lines (36 in your case) to use them as header and read rest of the file like normal csv (pandas good at it). Then you can save pandas.DataFrame object to csv.

Since your data splitted into adjacent lines, we should split DataFrame we've read on two and stack them one next to other (horizontaly).

Consider the following code:

import pandas as pd

COLUMNS_COUNT = 36
# read first `COLUMNS_COUNT` lines to serve as a header
with open('data.data', 'r') as f:
    columns = [next(f).strip() for line in range(COLUMNS_COUNT)]
# read rest of the file to temporary DataFrame
temp_df = pd.read_csv('data.data', skiprows=COLUMNS_COUNT, header=None, delimiter=';', skip_blank_lines=True)
# split temp DataFrame on even and odd rows
even_df = temp_df.iloc[::2].reset_index(drop=True)
odd_df = temp_df.iloc[1::2].reset_index(drop=True)
# stack even and odd DataFrames horizontaly
df = pd.concat([even_df, odd_df], axis=1)
# assign column names
df.columns = columns
# save result DataFrame to csv
df.to_csv('out.csv', index=False)

UPD: code updated to correctly process data splitted onto two lines

Sign up to request clarification or add additional context in comments.

6 Comments

OP might need csv itself as an output. Maybe you need to add to_csv too
good point, missed that. Also noticed one need to strip() lines before using it as col names
thanks! only 1 thing, the code doesn't take into account that each datapoint in the .data file is 2 lines long. So after the first line of the datapoint, the next value in the second line of the datapoint, value '3567' gets printed under the 'column1', but its supposed to go under 'column19', i hope you understand my question :)
Oh and the format looks like a great tabular, but the columns are just supposed to be separated by a comma and each value under them as well.
The print is supposed to look just like the format i gave in my question
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.