I have a dataset in pyspark for which I create a row_num column, so my data looks like:
#data:
+-----------------+-----------------+-----+------------------+-------+
|F1_imputed |F2_imputed |label| features|row_num|
+-----------------+-----------------+-----+------------------+-------+
| -0.002353| 0.9762| 0|[-0.002353,0.9762]| 1|
| 0.1265| 0.1176| 0| [0.1265,0.1176]| 2|
| -0.08637| 0.06524| 0|[-0.08637,0.06524]| 3|
| -0.1428| 0.4705| 0| [-0.1428,0.4705]| 4|
| -0.1015| 0.6811| 0| [-0.1015,0.6811]| 5|
| -0.01146| 0.8273| 0| [-0.01146,0.8273]| 6|
| 0.0853| 0.2525| 0| [0.0853,0.2525]| 7|
| 0.2186| 0.2725| 0| [0.2186,0.2725]| 8|
| -0.145| 0.3592| 0| [-0.145,0.3592]| 9|
| -0.1176| 0.4225| 0| [-0.1176,0.4225]| 10|
+-----------------+-----------------+-----+------------------+-------+
I'm trying to filter out a random selection of rows using:
count = data.count()
sample = [np.random.choice(np.arange(count), replace=True, size=50)]
filtered = data.filter(data.row_num.isin(sample))
However, running the filter raises an error:
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
What is causing this? I use the same filtering code to split the rows by label (a binary column of ones and zeros), and that works, but reapplying it here for sampling does not.
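For what it's worth, checking the element types outside Spark shows that np.random.choice returns NumPy scalars rather than built-in ints (the count value below is a stand-in for data.count(), since this snippet doesn't need Spark):

```python
import numpy as np

count = 10  # stand-in for data.count()
sample = np.random.choice(np.arange(count), replace=True, size=5)

# The elements are numpy.int64, not Python int
print(type(sample[0]))          # <class 'numpy.int64'>
print(isinstance(sample[0], int))

# Converting to built-in ints before passing to isin() is one workaround
sample_ints = [int(x) for x in sample]
print(type(sample_ints[0]))     # <class 'int'>
```

Is the numpy.int64 type the actual cause, or is it the extra list wrapping around the array?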