
My goal is to convert this list of strings to a NumPy array.

I want to convert the first two columns to numerical data (integers).

list1 = [['380850', '625105', 'Dota 2'],
      ['354804', '846193', "PLAYERUNKNOWN'S BATTLEGROUNDS"],
      ['204354', '467109', 'Counter-Strike: Global Offensive']
     ]

import numpy as np

dt = np.dtype('i,i,U')
cast_array = np.array([tuple(row) for row in list1], dtype=dt)
print(cast_array)

The result is ...

[OUT] [(380850, 625105, '') (354804, 846193, '') (204354, 467109, '')]

I am losing the string data. I am interested in

  1. Understanding why the string data is getting dropped
  2. Finding any solution that converts the first 2 columns to integer type in a numpy array

This answer gave me the approach but doesn't seem to work for strings

  • I agree, you can do it with Pandas. But I was thinking there may be a performance improvement from using the underlying NumPy data structures, and I am using this as a learning experience / test to see if it can be done this way. Commented Oct 17, 2018 at 14:05
  • There'll certainly be some performance issues. But, as far as I'm aware, that is also the reason you need to specify the (maximum) string size: using a dynamic string size will lower performance (I guess it makes things more complicated to iterate over under the hood, in C). Commented Oct 17, 2018 at 14:08
  • Thank you. Your first comment had the answer. I have posted a solution in case it helps others in the future. Commented Oct 17, 2018 at 14:14
  • Arrghhh!! Now I can't slice the array by columns, which was the main reason for converting in the first place. If I type cast_array[:, 0] I get an error because I have produced an array of tuples rather than a proper 2D np.array. Commented Oct 17, 2018 at 14:30
  • A proper 2D array can't mix numeric and string dtypes. What you get with dt is a structured array with fields, not columns. Commented Oct 17, 2018 at 15:12

3 Answers

Thanks to user 9769953's comment above, this is the solution.

# When specifying a string field you need to give its length
# (derived here from the longest string in the list).
dtypestr = 'int, int, U' + str(max(len(row[2]) for row in list1))

cast_array = np.array([tuple(row) for row in list1], dtype=dtypestr)

print(cast_array)
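
As a follow-up to the comment about not being able to slice by columns: a structured array is indexed by field name rather than by column position. The sketch below is only an illustration of that idea, not part of the original solution. It assumes explicit field names `f0`, `f1`, `f2` (the same names NumPy assigns by default to a comma-separated dtype string), 64-bit integers, and explicit `int()` conversion of the numeric strings.

```python
import numpy as np

list1 = [['380850', '625105', 'Dota 2'],
         ['354804', '846193', "PLAYERUNKNOWN'S BATTLEGROUNDS"],
         ['204354', '467109', 'Counter-Strike: Global Offensive']]

# Size the string field to the longest title so nothing is truncated.
width = max(len(row[2]) for row in list1)
dt = np.dtype([('f0', 'i8'), ('f1', 'i8'), ('f2', f'U{width}')])

# Convert the numeric strings explicitly while building the tuples.
cast_array = np.array([(int(a), int(b), name) for a, b, name in list1],
                      dtype=dt)

# Fields are selected by name, not with cast_array[:, 0].
first_col = cast_array['f0']

# If a regular 2-D integer array is needed, stack the numeric fields.
both_ints = np.column_stack([cast_array['f0'], cast_array['f1']])

print(first_col)
print(both_ints.shape)
```

`both_ints` is an ordinary 2-D array, so `both_ints[:, 0]` works as usual; the string field stays in the structured array.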

The simplest high-level way to do this is to use pandas, as suggested in the comments; it quietly handles the tricky parts for you:

In [64]: df=pd.DataFrame(list1)

In [65]: df2=df.apply(pd.to_numeric,errors='ignore')

In [66]: df2
Out[66]: 
        0       1                                 2
0  380850  625105                            Dota 2
1  354804  846193     PLAYERUNKNOWN'S BATTLEGROUNDS
2  204354  467109  Counter-Strike: Global Offensive

In [67]: df2.dtypes
Out[67]: 
0     int64
1     int64
2    object
dtype: object

df2.iloc[:, :2].values is then the NumPy array, so you can use all of NumPy's accelerated operations on that part.
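
The same idea can be written as a standalone script. This is a minimal sketch under a few assumptions: the column names `col0`, `col1` and `name` are made up for illustration, and `pd.to_numeric` is applied only to the two numeric columns rather than with `errors='ignore'`, which recent pandas releases flag as deprecated.

```python
import pandas as pd

list1 = [['380850', '625105', 'Dota 2'],
         ['354804', '846193', "PLAYERUNKNOWN'S BATTLEGROUNDS"],
         ['204354', '467109', 'Counter-Strike: Global Offensive']]

# Hypothetical column names, chosen only for this example.
df = pd.DataFrame(list1, columns=['col0', 'col1', 'name'])

# Convert just the two numeric columns, then take them as a NumPy array.
ints = df[['col0', 'col1']].apply(pd.to_numeric).to_numpy()

print(ints.dtype)
print(ints[:, 0])
```

Here `ints` is a plain 2-D integer array, so column slicing such as `ints[:, 0]` works directly.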


Your dtype is not what you expect it to be - you're running into https://github.com/numpy/numpy/issues/8969:

>>> dt = np.dtype('i,i,U')
>>> dt
dtype([('f0', '<i4'), ('f1', '<i4'), ('f2', '<U')])
>>> dt['f2'].itemsize
0  # 0-length strings!

You need to either specify a maximum number of characters (longer strings are truncated to that length):

>>> dt = np.dtype('i,i,16U')

Or use an object type to store variable length strings:

>>> dt = np.dtype('i,i,O')
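
To compare the two options side by side, here is a small sketch. It is an illustration only: it spells the dtypes as lists with the more common `'U16'` form, and reuses one row from the question as sample data.

```python
import numpy as np

# Fixed-width strings: each value occupies 16 characters of storage.
fixed = np.dtype([('f0', 'i4'), ('f1', 'i4'), ('f2', 'U16')])
# Object field: stores references to ordinary Python str objects.
flexible = np.dtype([('f0', 'i4'), ('f1', 'i4'), ('f2', 'O')])

row = (204354, 467109, 'Counter-Strike: Global Offensive')

a_fixed = np.array([row], dtype=fixed)
a_obj = np.array([row], dtype=flexible)

print(fixed['f2'].itemsize)     # bytes reserved per string in the fixed field
print(len(a_fixed['f2'][0]))    # at most 16: the title has been cut short
print(a_obj['f2'][0])           # the full title is preserved
```

The fixed-width field keeps the data in one contiguous buffer but cuts long strings short, while the object field keeps every string intact at the cost of storing Python objects.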
