The Scenario

I've read a CSV file (tab-separated) into a DataFrame, which now needs to be converted to a numpy array for clustering, without changing the types.

The Problem

The approaches from the references I tried (listed below) have not produced the required output. The two columns I'm trying to fetch are int64 and float64, as shown below:

         uid   iid       rat
0        196   242  3.000000
1        186   302  3.000000
2         22   377  1.000000

I'm interested only in iid and rat for the moment, and want to pass them to the KMeans.fit() method, without the e+ notation (what I'm calling EPSILON) in the values. I need the following format:

Expected format

[[242, 3.000000],
[302, 3.000000],
[22, 1.000000]]

Unsuccessful Attempt

X = values[:, 1:2]
Y = values[:, 2:3]
someArray = np.array([X,Y])
print someArray

This doesn't fare well on execution:

[[[  2.42000000e+02]
  [  3.02000000e+02]
  [  3.77000000e+02]
  ..., 
  [  1.35200000e+03]
  [  1.62600000e+03]
  [  1.65900000e+03]]
 [[  3.00000000e+00]
  [  3.00000000e+00]
  [  1.00000000e+00]
  ..., 
  [  1.00000000e+00]
  [  1.00000000e+00]
  [  1.00000000e+00]]]

References that haven't helped so far

  1. This one
  2. This two
  3. This three
  4. This four

EDIT 1

I tried np_df = np.genfromtxt('AllData.csv', delimiter='\t', unpack=True) and got this:

[[             nan   1.96000000e+02   1.86000000e+02 ...,   4.79000000e+02
    4.79000000e+02   4.79000000e+02]
 [             nan   2.42000000e+02   3.02000000e+02 ...,   1.36000000e+03
    1.39400000e+03   1.65200000e+03]
 [             nan   3.00000000e+00   3.00000000e+00 ...,   2.00000000e+00
    1.92803605e+00   1.00000000e+00]]
  • values.iloc[:, 1:].values? Commented Aug 10, 2017 at 18:22
  • Is the file comma or tab delimited? Commented Aug 10, 2017 at 18:28
  • @ayhan please check the expected format, that just prints the columns 1 and 2 Commented Aug 10, 2017 at 18:29
  • @BenT already mentioned: \t, i.e. tab-separated Commented Aug 10, 2017 at 18:30
  • I don't see any difference between the expected format and the output of my suggestion (except for 22 in the third row which I assumed was there as a mistake). Commented Aug 10, 2017 at 18:32

3 Answers

Use label-based selection and the .values attribute of the resulting pandas objects, which will be some sort of numpy array:

>>> df
   uid  iid  rat
0  196  242  3.0
1  186  302  3.0
2   22  377  1.0
>>> df.loc[:,['iid','rat']]
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0
>>> df.loc[:,['iid','rat']].values
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])

Note that your integer column will get promoted to float, because a numpy array holds a single dtype for all of its elements.
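To make the promotion concrete, here is a minimal sketch. The small frame is a hypothetical stand-in built from the three sample rows in the question, not the asker's actual file. It compares the dtype you get when selecting a mixed int/float pair of columns with the dtype you get when both columns are integers.

```python
import pandas as pd

# Hypothetical three-row frame mirroring the sample rows in the question.
df = pd.DataFrame({
    "uid": [196, 186, 22],
    "iid": [242, 302, 377],
    "rat": [3.0, 3.0, 1.0],
})

# int64 + float64 columns share one array, so the ints are promoted to float64.
mixed = df.loc[:, ["iid", "rat"]].values

# Two int64 columns need no promotion and stay integer.
ints_only = df.loc[:, ["uid", "iid"]].values

print(mixed.dtype)      # float64
print(ints_only.dtype)  # int64
```

The promotion only changes how the integers are stored (242 becomes 242.0), which should not matter for a distance-based method such as KMeans.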

Also note, this particular selection could be approached in different ways:

>>> df.iloc[:, 1:] # integer-position based
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0
>>> df[['iid','rat']] # plain indexing performs column-based selection
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0

I like label-based because it is more explicit.

Edit

The reason you aren't seeing commas is an artifact of how numpy arrays are printed:

>>> df[['iid','rat']].values
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])
>>> print(df[['iid','rat']].values)
[[ 242.    3.]
 [ 302.    3.]
 [ 377.    1.]]

And actually, it is the difference between the str and repr results of the numpy array:

>>> print(repr(df[['iid','rat']].values))
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])
>>> print(str(df[['iid','rat']].values))
[[ 242.    3.]
 [ 302.    3.]
 [ 377.    1.]]
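The same reasoning applies to the e+ notation raised in the comments: it appears to be a display choice numpy makes when the values in an array span a wide range, not a change to the stored numbers. As a rough sketch (the array below is made up from values quoted in the question), numpy's print options can ask for fixed-point output instead:

```python
import numpy as np

# Hypothetical values taken from the output quoted in the question.
X = np.array([[242.0, 3.0],
              [1394.0, 1.92803605],
              [1652.0, 1.0]])

# suppress=True requests fixed-point printing; precision limits the decimals shown.
# The context manager restores the previous print settings on exit.
with np.printoptions(suppress=True, precision=4):
    text = str(X)

print(text)

# The stored values are untouched; only the printed form changed.
print(X[1, 1] == 1.92803605)
```

np.set_printoptions(suppress=True) applies the same setting globally if the context manager is inconvenient.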

13 Comments

Tried print df.loc[:, ['iid','rat']].values and got [[ 2.42000000e+02 3.00000000e+00] [ 3.02000000e+02 3.00000000e+00] [ 3.77000000e+02 1.00000000e+00] ..., [ 1.36000000e+03 2.00000000e+00] [ 1.39400000e+03 1.92803605e+00] [ 1.65200000e+03 1.00000000e+00]]
@Tejas that looks correct to me... what is the issue?
@Tejas but in general, don't dump that in a comment, it is unreadable. Edit your actual question
brother, that is not putting commas in between two values inside lists if you see.
Looks good with the added solution. That solves 50% of my problem; I need a way out of those e+ or so-called EPSILONS

Why don't you just import the 'csv' as a numpy array?

import numpy as np

def read_file(fname):
    # "\t" is the tab character; "/t" would be read as a literal slash followed by t
    return np.genfromtxt(fname, delimiter="\t", comments="%", unpack=True)

3 Comments

tried np_df = np.genfromtxt('AllData.csv', delimiter='\t', unpack=True) and got this [[ nan 1.96000000e+02 1.86000000e+02 ..., 4.79000000e+02 4.79000000e+02 4.79000000e+02] [ nan 2.42000000e+02 3.02000000e+02 ..., 1.36000000e+03 1.39400000e+03 1.65200000e+03] [ nan 3.00000000e+00 3.00000000e+00 ..., 2.00000000e+00 1.92803605e+00 1.00000000e+00]] Ended up with a NaN, which I do not expect.
I can't actually see your data, so I do not know where that is coming from unless you have line numbers in your saved csv file that are strings. You could use indexing to remove the NaNs since the rest of the data looks correct.
It looks like, for some reason, the data is sideways; you should transpose it and slice away the first row.
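A possible explanation for both symptoms in this thread, sketched under the assumption that the file starts with a header row of column names: genfromtxt cannot parse the names as numbers and yields NaN for them, and unpack=True transposes the result so each column comes back as a row. The in-memory string below is a hypothetical stand-in for AllData.csv.

```python
import io
import numpy as np

# Hypothetical stand-in for the tab-separated file, header row included.
raw = "uid\tiid\trat\n196\t242\t3.0\n186\t302\t3.0\n22\t377\t1.0\n"

# skip_header=1 drops the column names that would otherwise parse as NaN.
# Leaving unpack at its default (False) keeps one row per record.
data = np.genfromtxt(io.StringIO(raw), delimiter="\t", skip_header=1)

# Keep only the iid and rat columns.
X = data[:, 1:]
print(X.shape)  # (3, 2)
```

If the real file has no header row, the NaN has another cause and skip_header should be left out.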

It seems you need to use read_csv to build the DataFrame first, keeping only the second and third columns, and then convert it to a numpy array with values:

import pandas as pd
from sklearn.cluster import KMeans
from io import StringIO  # pandas.compat.StringIO is not available in recent pandas versions

temp=u"""col,iid,rat
4,1,0
5,2,4
6,3,3
7,4,1"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), usecols = [1,2])
print (df)
   iid  rat
0    1    0
1    2    4
2    3    3
3    4    1

X = df.values 
print (X)
[[1 0]
 [2 4]
 [3 3]
 [4 1]]

kmeans = KMeans(n_clusters=2)
a = kmeans.fit(X)
print (a)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

2 Comments

Just for the record, the example I followed had commas in between elements of list and the lists, so does it mean that Python has a relief of such norms? btw that helped.
Comma is the default separator in read_csv; to change it, use sep='\t' for tabs or sep='\s+' for one or more whitespace characters.
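Putting the separator advice together with the column filter from the answer, a minimal sketch for a tab-separated file might look like this. The in-memory string stands in for the real file and reuses the column names from the question; both are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the tab-separated file from the question.
raw = "uid\tiid\trat\n196\t242\t3.0\n186\t302\t3.0\n22\t377\t1.0\n"

# sep="\t" tells read_csv the fields are tab-separated;
# usecols keeps only the two columns needed for clustering.
df = pd.read_csv(io.StringIO(raw), sep="\t", usecols=["iid", "rat"])

# 2-D array with one row per record, the layout KMeans.fit() expects.
X = df.values
print(X.shape)  # (3, 2)
```

After testing, io.StringIO(raw) would be replaced with the path to the actual file.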
