I have a dataset in PySpark to which I added a row_num column, so my data looks like:

#data:
+-----------------+-----------------+-----+------------------+-------+
|F1_imputed       |F2_imputed       |label|          features|row_num|
+-----------------+-----------------+-----+------------------+-------+
|        -0.002353|           0.9762|    0|[-0.002353,0.9762]|      1|
|           0.1265|           0.1176|    0|   [0.1265,0.1176]|      2|
|         -0.08637|          0.06524|    0|[-0.08637,0.06524]|      3|
|          -0.1428|           0.4705|    0|  [-0.1428,0.4705]|      4|
|          -0.1015|           0.6811|    0|  [-0.1015,0.6811]|      5|
|         -0.01146|           0.8273|    0| [-0.01146,0.8273]|      6|
|           0.0853|           0.2525|    0|   [0.0853,0.2525]|      7|
|           0.2186|           0.2725|    0|   [0.2186,0.2725]|      8|
|           -0.145|           0.3592|    0|   [-0.145,0.3592]|      9|
|          -0.1176|           0.4225|    0|  [-0.1176,0.4225]|     10|
+-----------------+-----------------+-----+------------------+-------+

I'm trying to select a random sample of rows using:

count = data.count()
sample = [np.random.choice(np.arange(count), replace=True, size=50)]
filtered = data.filter(data.row_num.isin(sample))

However the second line gives an error:

AttributeError: 'numpy.int64' object has no attribute '_get_object_id'

What is causing this? I use the same filtering code to split the rows by label (a binary column of ones and zeros), and that works, but reusing it for sampling does not.

  • Why is sample a list? Is there more to the traceback? Commented May 29, 2021 at 21:34

1 Answer

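NumPy data types don't interact well with Spark: values such as numpy.int64 are not converted to Spark literals automatically, which is what the _get_object_id error is complaining about. You can convert them to plain Python types using .tolist() before calling .isin (note that the outer list brackets are dropped as well):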

sample = np.random.choice(np.arange(count), replace=True, size=50).tolist()
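
If NumPy is not otherwise needed here, the same idea can be sketched with the standard library, which produces plain Python ints from the start. This is an alternative to the .tolist() fix, not the method from the question; the helper name sample_row_nums and its seed parameter are made up for illustration, and it assumes row_num runs from 1 to count as in the table shown.

```python
import random


def sample_row_nums(count, size=50, seed=None):
    """Draw `size` row numbers from 1..count with replacement (hypothetical helper)."""
    rng = random.Random(seed)
    # range(1, count + 1) matches a row_num column that starts at 1;
    # random.choices returns a list of built-in ints, not numpy.int64.
    return rng.choices(range(1, count + 1), k=size)


sample = sample_row_nums(count=10, size=50, seed=0)

# With the Spark DataFrame `data` from the question, the filter would then be:
# filtered = data.filter(data.row_num.isin(sample))
```

Two things to keep in mind: np.arange(count) yields 0 to count - 1, while the row_num values shown start at 1, and isin only tests membership, so repeated values drawn with replacement do not produce repeated rows in the filtered result.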