DataFrame into numpy array with values comma separated

Question

The Scenario

I've read a CSV (which is \t separated) into a DataFrame, which now needs to be converted to a numpy array for clustering, without changing the types.

The Problem

So far, following the references I tried (below), I've failed to get the required output. The two columns whose values I'm trying to fetch are int64 / float64, as below:

         uid   iid       rat
0        196   242  3.000000
1        186   302  3.000000
2         22   377  1.000000

I'm interested only in iid and rat for the moment, to pass them to the KMeans.fit() method, and without the e+ notation (EPSILON) in the values. I need the data in the following format:

Expected format

[[242, 3.000000],
[302, 3.000000],
[22, 1.000000]]

Unsuccessful Attempt

X = values[:, 1:2]
Y = values[:, 2:3]
someArray = np.array([X,Y])
print someArray

and it doesn't fare well on execution:

[[[  2.42000000e+02]
  [  3.02000000e+02]
  [  3.77000000e+02]
  ..., 
  [  1.35200000e+03]
  [  1.62600000e+03]
  [  1.65900000e+03]]
 [[  3.00000000e+00]
  [  3.00000000e+00]
  [  1.00000000e+00]
  ..., 
  [  1.00000000e+00]
  [  1.00000000e+00]
  [  1.00000000e+00]]]

References tried so far (not helpful)

EDIT 1

tried np_df = np.genfromtxt('AllData.csv', delimiter='\t', unpack=True) and got this

[[             nan   1.96000000e+02   1.86000000e+02 ...,   4.79000000e+02
    4.79000000e+02   4.79000000e+02]
 [             nan   2.42000000e+02   3.02000000e+02 ...,   1.36000000e+03
    1.39400000e+03   1.65200000e+03]
 [             nan   3.00000000e+00   3.00000000e+00 ...,   2.00000000e+00
    1.92803605e+00   1.00000000e+00]]

@ayhan please check the expected format, that just prints the columns 1 and 2
– T3J45, Commented Aug 10, 2017 at 18:29
I don't see any difference between the expected format and the output of my suggestion (except for 22 in the third row which I assumed was there as a mistake).
– user2285236, Commented Aug 10, 2017 at 18:32

juanpa.arrivillaga · Accepted Answer · 2017-08-10 18:47:10Z

3

Use label-based selection and the .values attribute of the resulting pandas objects, which will be some sort of numpy array:

>>> df
   uid  iid  rat
0  196  242  3.0
1  186  302  3.0
2   22  377  1.0
>>> df.loc[:,['iid','rat']]
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0
>>> df.loc[:,['iid','rat']].values
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])

Note, your integer column will get promoted to float.
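
To illustrate the promotion, here is a minimal sketch. The frame is rebuilt by hand from the three sample rows in the question, and to_numpy() is used as the newer spelling of .values; both are assumptions made for the example, not part of the original post.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the three sample rows shown in the question.
df = pd.DataFrame({
    "uid": [196, 186, 22],
    "iid": [242, 302, 377],
    "rat": [3.0, 3.0, 1.0],
})

# Selecting an integer and a float column together yields one 2-D array,
# and a NumPy array has a single dtype, so the integers become floats.
mixed = df.loc[:, ["iid", "rat"]].to_numpy()
print(mixed.dtype)

# Taken on its own, the integer column keeps an integer dtype.
only_int = df["iid"].to_numpy()
print(only_int.dtype)
```

For clustering this is usually harmless, since the numeric values themselves are unchanged by the promotion.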

Also note, this particular selection could be approached in different ways:

>>> df.iloc[:, 1:] # integer-position based
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0
>>> df[['iid','rat']] # plain indexing performs column-based selection
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0

I like label-based because it is more explicit.

Edit

The reason you aren't seeing commas is an artifact of how numpy arrays are printed:

>>> df[['iid','rat']].values
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])
>>> print(df[['iid','rat']].values)
[[ 242.    3.]
 [ 302.    3.]
 [ 377.    1.]]

And actually, it is the difference between the str and repr results of the numpy array:

>>> print(repr(df[['iid','rat']].values))
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])
>>> print(str(df[['iid','rat']].values))
[[ 242.    3.]
 [ 302.    3.]
 [ 377.    1.]]
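
If the goal is only to see commas and fixed-point numbers instead of e+ notation, the display can be adjusted without touching the data. The following is a rough sketch, assuming a small hand-typed array in place of the real frame. The stored values are the same either way, so KMeans.fit() receives identical numbers regardless of how the array prints.

```python
import numpy as np

# Hand-typed stand-in for df[['iid', 'rat']].values; the spread between
# 1652 and 1 is wide enough that NumPy may choose scientific notation.
arr = np.array([[242.0, 3.0], [302.0, 3.0], [1652.0, 1.0]])

# separator=", " puts commas between elements; suppress_small=True asks
# for fixed-point output rather than scientific notation.
text = np.array2string(arr, separator=", ", suppress_small=True)
print(text)

# A plain nested Python list also prints with commas.
as_list = arr.tolist()
print(as_list)
```

np.set_printoptions(suppress=True) applies the same fixed-point preference to every later print call in the session.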

edited Aug 10, 2017 at 18:47

answered Aug 10, 2017 at 18:23

juanpa.arrivillaga


13 Comments

T3J45 Over a year ago

Tried print df.loc[:, ['iid','rat']].values and got

[[  2.42000000e+02   3.00000000e+00]  [  3.02000000e+02   3.00000000e+00]  [  3.77000000e+02   1.00000000e+00]  ...,   [  1.36000000e+03   2.00000000e+00]  [  1.39400000e+03   1.92803605e+00]  [  1.65200000e+03   1.00000000e+00]]

juanpa.arrivillaga Over a year ago

@Tejas that looks correct to me... what is the issue?

juanpa.arrivillaga Over a year ago

@Tejas but in general, don't dump that in a comment, it is unreadable. Edit your actual question

T3J45 Over a year ago

brother, that is not putting commas in between two values inside lists if you see.

T3J45 Over a year ago

Looks good with the added solution. That solves 50% of my problem; I need a way out of those e+ or so-called EPSILONS.

BenT · Accepted Answer · 2017-08-10 18:25:59Z

2

Why don't you just import the 'csv' as a numpy array?

import numpy as np

def read_file(fname):
    # Split on tabs; unpack=True transposes the result (one row per file column).
    return np.genfromtxt(fname, delimiter="\t", comments="%", unpack=True)

answered Aug 10, 2017 at 18:25

BenT


3 Comments

T3J45 Over a year ago

tried np_df = np.genfromtxt('AllData.csv', delimiter='\t', unpack=True) and got this

[[             nan   1.96000000e+02   1.86000000e+02 ...,   4.79000000e+02     4.79000000e+02   4.79000000e+02]  [             nan   2.42000000e+02   3.02000000e+02 ...,   1.36000000e+03     1.39400000e+03   1.65200000e+03]  [             nan   3.00000000e+00   3.00000000e+00 ...,   2.00000000e+00     1.92803605e+00   1.00000000e+00]]

Ended up with a NaN, which I do not expect.

BenT Over a year ago

I can't actually see your data, so I do not know where that is coming from unless you have line numbers in your saved csv file that are strings. You could use indexing to remove the NaNs since the rest of the data looks correct.

DJK Over a year ago

It looks like, for some reason, the data is sideways; you should transpose it and slice away the first row.
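
A possible way to avoid both the sideways layout and the NaN entries is sketched below. It assumes the NaN values come from a text header line at the top of the file, which the thread does not confirm. The in-memory sample string is a hypothetical stand-in for AllData.csv.

```python
import io
import numpy as np

# Hypothetical stand-in for the tab-separated file, using the question's sample rows.
sample = "uid\tiid\trat\n196\t242\t3.0\n186\t302\t3.0\n22\t377\t1.0\n"

# skip_header=1 drops the text header line, usecols=(1, 2) keeps only
# iid and rat, and leaving out unpack=True keeps one row per record.
data = np.genfromtxt(
    io.StringIO(sample),
    delimiter="\t",
    skip_header=1,
    usecols=(1, 2),
)
print(data.shape)
```

With a real file, the io.StringIO(sample) argument would be replaced by the file name.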

jezrael · Accepted Answer · 2017-08-12 10:39:40Z

1

It seems you need to read the file into a DataFrame with read_csv first, keeping only the second and third columns, and then convert it to a numpy array with values:

import pandas as pd
from sklearn.cluster import KMeans
from io import StringIO  # pandas.compat.StringIO is not available in newer pandas versions

temp=u"""col,iid,rat
4,1,0
5,2,4
6,3,3
7,4,1"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), usecols = [1,2])
print (df)
   iid  rat
0    1    0
1    2    4
2    3    3
3    4    1

X = df.values 
print (X)
[[1 0]
 [2 4]
 [3 3]
 [4 1]]

kmeans = KMeans(n_clusters=2)
a = kmeans.fit(X)
print (a)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

answered Aug 12, 2017 at 10:39

jezrael


2 Comments

T3J45 Over a year ago

Just for the record, the example I followed had commas in between elements of list and the lists, so does it mean that Python has a relief of such norms? btw that helped.

jezrael Over a year ago

Comma is the default separator in read_csv; if you want to change it, use sep='\t' for tab or sep='\s+' for one or more whitespace characters.
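
Applied to a tab-separated file like the one in the question, that might look like the sketch below. The sample string is a hypothetical stand-in for the real file, and selecting columns by name through usecols is one option among several.

```python
import io
import pandas as pd

# Hypothetical stand-in for the tab-separated file from the question.
sample = "uid\tiid\trat\n196\t242\t3.0\n186\t302\t3.0\n22\t377\t1.0\n"

# sep="\t" switches the separator from the default comma to a tab;
# usecols keeps only the two columns wanted for clustering.
df = pd.read_csv(io.StringIO(sample), sep="\t", usecols=["iid", "rat"])

# 2-D array with one row per record, ready to pass to a fit() method.
X = df.to_numpy()
print(X.shape)
```

When using the whitespace pattern, writing it as a raw string, sep=r'\s+', avoids Python's invalid-escape warning.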
