
I have a CSV file in which each row contains an array. I would like to convert the row contents to columns, i.e. end up with a matrix (since I have multiple rows). I can do it using a for loop and csv.reader, but it's quite slow. So I thought Pandas would be faster and that I could do the conversion without a loop. I read the file and get a DataFrame of shape (200, 1), where each row contains 700 comma-separated floats, e.g. [0.4, 0.5, 0.3, ....]

If I call .values on the output, I just get an array of object dtype, which is still not usable...

I just can't figure out how to convert this data into a Matrix...

Am I looking in the wrong direction here?

ranges = pd.read_csv(name,usecols=['ranges'])

What does work is this:

import ast
import csv

X = open(name)
csv_X = csv.reader(X)
ranges = []
next(csv_X)  # skip the header row of the CSV
for row in csv_X:
    # column 14 holds the array as text, e.g. "[0.4, 0.5, ...]"
    ranges.append(ast.literal_eval(row[14]))
X.close()

But that is just really slow. So, my idea about using Pandas is to speed this up.

3
  • 2
    ranges = ranges.values Commented Feb 12, 2019 at 7:41
  • 1
    Possible duplicate of Convert pandas dataframe to NumPy array Commented Feb 12, 2019 at 7:43
  • 1
    @NihalSangeeth As written in the post, .values does not work... The two posts you refer to do not describe the same problem, so their solutions do not apply. I have a DataFrame where a single column contains a float array in each row. I need to convert these float arrays to columns in a matrix, i.e. since there are 700 values in each array and 200 rows, I would have a matrix of size (200, 700). Commented Feb 12, 2019 at 8:38

1 Answer


With a dataset that looks like this:

                            range
0  [5, 5, 7, 5, 7, 2, 0, 4, 1, 6]
1  [1, 0, 6, 1, 1, 5, 7, 8, 6, 7]
2  [2, 0, 4, 6, 6, 6, 5, 1, 6, 5]
3  [5, 5, 2, 7, 1, 8, 7, 2, 8, 4]
4  [1, 5, 6, 6, 8, 2, 6, 6, 3, 1]

You can try:

pd.DataFrame(np.vstack(df.range.values))

which yields:

   0  1  2  3  4  5  6  7  8  9
0  5  5  7  5  7  2  0  4  1  6
1  1  0  6  1  1  5  7  8  6  7
2  2  0  4  6  6  6  5  1  6  5
3  5  5  2  7  1  8  7  2  8  4
4  1  5  6  6  8  2  6  6  3  1
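
If a NumPy matrix is the goal rather than a DataFrame, the pd.DataFrame wrapper can probably be dropped, since np.vstack already returns a 2-D array. A minimal sketch, assuming the cells are real Python lists (not strings) and using a small made-up frame in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical small frame whose cells are real Python lists (an assumption;
# cells read from a CSV are usually strings and need parsing first)
df = pd.DataFrame({"range": [[5, 5, 7], [1, 0, 6], [2, 0, 4]]})

# np.vstack stacks the per-row lists into one 2-D array
matrix = np.vstack(df["range"].values)

print(type(matrix))   # a numpy.ndarray, not a DataFrame
print(matrix.shape)   # (rows, values per row)
```

This only works when every row holds the same number of values; otherwise np.vstack raises an error.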

Edited

If your rows are strings such as:

                ranges
0  8,9,7,6,3,2,4,1,8,3
1  7,9,9,2,1,6,4,1,8,2
2  9,3,0,9,7,7,0,9,9,6
3  0,7,1,0,5,5,1,2,4,2
4  3,3,8,0,8,7,3,6,6,2
5  9,3,7,6,5,7,8,3,8,7
6  1,6,7,8,5,6,7,0,7,8
7  5,5,0,9,2,1,5,4,3,4
8  3,8,9,8,6,3,8,5,9,8
9  8,5,1,7,1,4,8,1,6,4

Try:

pd.DataFrame(df.ranges.str.split(',').tolist())

which yields:

   0  1  2  3  4  5  6  7  8  9
0  8  9  7  6  3  2  4  1  8  3
1  7  9  9  2  1  6  4  1  8  2
2  9  3  0  9  7  7  0  9  9  6
3  0  7  1  0  5  5  1  2  4  2
4  3  3  8  0  8  7  3  6  6  2
5  9  3  7  6  5  7  8  3  8  7
6  1  6  7  8  5  6  7  0  7  8
7  5  5  0  9  2  1  5  4  3  4
8  3  8  9  8  6  3  8  5  9  8
9  8  5  1  7  1  4  8  1  6  4
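
Note that str.split leaves every cell as a string, so the result still needs a numeric conversion before it can be used for calculations. The following is a sketch only, assuming each cell is text of the form "[0.4, 0.5, 0.3]" as described in the question; the inline CSV and its id column are made up for illustration:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV standing in for the real file; the "ranges" field is quoted
# so that the commas inside the brackets are not treated as separators
csv_text = 'id,ranges\n0,"[0.4, 0.5, 0.3]"\n1,"[0.1, 0.2, 0.9]"\n'
df = pd.read_csv(io.StringIO(csv_text), usecols=["ranges"])

# Strip the surrounding brackets, then split on commas into separate columns
parts = df["ranges"].str.strip("[]").str.split(",", expand=True)

# Convert the string cells to floats and pull out the underlying NumPy array
matrix = parts.astype(float).to_numpy()

print(matrix.shape)  # (number of rows, values per row)
print(matrix.dtype)
```

With the real file this should give a (200, 700) float array, provided every row has the same number of values.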

2 Comments

If I do that I still end up with 200 rows and 1 column, and the result is still a DataFrame. The only thing that changed is that my column name is now 0 rather than ranges.
Maybe I'm just not getting it, but I'm left with a DataFrame that does have the expected columns, yet I am unable to convert it to anything useful for further calculations. I have added the "slow" code I am trying to replace to the original post...
