I want to convert a data set stored in a .dat file into a CSV file. The data format looks like this:

Each row begins with the sentiment score followed by the text associated with that rating.

Image of the .dat file

I want the sentiment value (-1 or 1) in one column and the corresponding review text in another column.

WHAT I TRIED SO FAR

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np  
import csv

# read train.dat into a list of lists
datContent = [i.strip().split() for i in open("train.dat").readlines()]

# write it as a new CSV file
with open("train.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)
def your_func(row):
    return row['Sentiments'] / row['Review']

columns_to_keep = ['Sentiments', 'Review']
dataframe = pd.read_csv("train.csv", usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)

print dataframe

Sample screenshot of the resulting train.csv: it has a comma after every word in the review.

Output of the train.csv

  • So what did you learn about pandas' read_csv? It's a one-liner. Commented Oct 9, 2017 at 1:51
  • What is separating the score from the text? A space or a tab? Commented Oct 9, 2017 at 1:52
  • @sascha That keeps giving an error, probably because it's not in .csv format. I did df = pd.read_csv("train.dat") Commented Oct 9, 2017 at 1:56
  • read_csv has parameters, and CSV is a very general format! But Evan is right: it might be easier if it's a tab. If it's a space, you can do it too, but it will be harder. Commented Oct 9, 2017 at 1:57
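For the space-separated case raised in these comments, one possible approach is to split each line only once, at the first run of whitespace, so the review text stays in one piece. The question's code calls split() with no limit, which breaks the review into separate words, and that is where the comma after every word comes from. The sketch below is only an illustration: the helper name split_rows and the sample lines are made up, and the real train.dat may be laid out differently.

```python
import csv
import io


def split_rows(lines):
    """Split each line into [score, text] at the first whitespace run only."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # maxsplit=1 keeps the review text intact
        score = int(parts[0])        # int() accepts "-1", "1" and "+1"
        text = parts[1] if len(parts) > 1 else ""
        rows.append([score, text])
    return rows


# Hypothetical sample lines standing in for the contents of train.dat
sample = ["-1 this was awful\n", "+1 really enjoyed it\n"]
rows = split_rows(sample)

# Write to an in-memory buffer; a real script would open("train.csv", "w", newline="")
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Sentiments", "Review"])
writer.writerows(rows)
csv_text = buf.getvalue()
```

The csv module quotes any review that itself contains a comma, so the two-column layout should survive punctuation in the text.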

2 Answers

If all your rows follow that consistent format, you can use pd.read_fwf. This is a little safer than using read_csv, in the event that your second column also contains the delimiter you are attempting to split on.

df = pd.read_fwf('data.txt', header=None, 
        widths=[2, int(1e5)], names=['label', 'text'])

print(df)
   label                       text
0     -1  ieafxf  rjzy xfxk ymi wuy
1      1     lqqm  ceegjnbjpxnidygr
2     -1  zss awoj anxb rfw  kgbvnl

data.txt

-1  ieafxf  rjzy xfxk ymi wuy
+1  lqqm  ceegjnbjpxnidygr
-1  zss awoj anxb rfw  kgbvnl
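To get from there to the CSV the question asks for, a self-contained sketch could look like the following. It reads the same three sample lines from an in-memory string instead of data.txt (an assumption made so the example runs without any file) and then serialises the frame with to_csv. The second width is just a deliberately large number, so that the text column covers the rest of each line.

```python
import io

import pandas as pd

# Same layout as data.txt above, held in memory for the example
sample = (
    "-1  ieafxf  rjzy xfxk ymi wuy\n"
    "+1  lqqm  ceegjnbjpxnidygr\n"
    "-1  zss awoj anxb rfw  kgbvnl\n"
)

# First field: 2 characters for the score; second field: everything after it
df = pd.read_fwf(
    io.StringIO(sample),
    header=None,
    widths=[2, int(1e5)],
    names=["label", "text"],
)

# to_csv() without a path returns the CSV as a string;
# pass a filename such as "train.csv" to write it to disk instead
csv_text = df.to_csv(index=False)
```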

3 Comments

@COLDSPEED hey the problem lies in the fact I have no headers as in label and text, do I just make them up?
@KoushikProgrammer I know you don't have them, I made them up for you. You don't have to modify your data file.
@COLDSPEED thanks. Hey what is the purpose of int(1e5)?

As mentioned in the comments, read_csv would be appropriate here, assuming the score and the text are separated by a tab.

df = pd.read_csv('train_csv.csv', sep='\t', names=['Sentiments', 'Review'])

  Sentiments     Review
0         -1    alskjdf
1          1      asdfa
2          1       afsd
3         -1        sdf
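A runnable version of the same idea, as a rough sketch: it assumes the score and the review are separated by a single tab, and it uses made-up sample rows held in memory rather than the real file. Writing the frame back out with to_csv then gives the two-column CSV.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample; the real file may differ
sample = "-1\tbad movie, would not watch again\n1\tgreat film\n"

# header=None because the file has no header row; names supplies the column labels
df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["Sentiments", "Review"],
)

# to_csv() without a path returns a string; pass a filename to write to disk
csv_text = df.to_csv(index=False)
```

Because the split happens on tabs only, commas and spaces inside a review stay in the Review column, and to_csv quotes any review that contains a comma.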
