I have a dataset in pyspark for which I create a row_num column, so my data looks like:
#data:
+-----------------+-----------------+-----+------------------+-------+
|F1_imputed |F2_imputed |label| features|row_num|
+-----------------+-----------------+-----+------------------+-------+
| -0.002353| 0.9762| 0|[-0.002353,0.9762]| 1|
| 0.1265| 0.1176| 0| [0.1265,0.1176]| 2|
| -0.08637| 0.06524| 0|[-0.08637,0.06524]| 3|
| -0.1428| 0.4705| 0| [-0.1428,0.4705]| 4|
| -0.1015| 0.6811| 0| [-0.1015,0.6811]| 5|
| -0.01146| 0.8273| 0| [-0.01146,0.8273]| 6|
| 0.0853| 0.2525| 0| [0.0853,0.2525]| 7|
| 0.2186| 0.2725| 0| [0.2186,0.2725]| 8|
| -0.145| 0.3592| 0| [-0.145,0.3592]| 9|
| -0.1176| 0.4225| 0| [-0.1176,0.4225]| 10|
+-----------------+-----------------+-----+------------------+-------+
I'm trying to filter out a random selection of rows using:
count = data.count()
sample = [np.random.choice(np.arange(count), replace=True, size=50)]
filtered = data.filter(data.row_num.isin(sample))
However, running the filter raises an error:
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
What is causing this? I use the same filtering code to split the rows by label (a binary column of ones and zeros), and that works, but reapplying it here for sampling does not.
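For what it's worth, checking the element types outside Spark shows that np.random.choice returns NumPy scalars rather than built-in ints (the count value below is a stand-in for data.count(), since this snippet doesn't need Spark):

```python
import numpy as np

count = 10  # stand-in for data.count()
sample = np.random.choice(np.arange(count), replace=True, size=5)

# The elements are numpy.int64, not Python int
print(type(sample[0]))          # <class 'numpy.int64'>
print(isinstance(sample[0], int))

# Converting to built-in ints before passing to isin() is one workaround
sample_ints = [int(x) for x in sample]
print(type(sample_ints[0]))     # <class 'int'>
```

Is the numpy.int64 type the actual cause, or is it the extra list wrapping around the array?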