My database contains a column of strings. I want to create a new column based on part of the string in another column. For example:
"content" "other column"
The father has two dogs father
One cat stay at home of my mother mother
etc. etc.
I thought I would create an array with the words that interest me, for example: people = ["mother", "father", etc.]
Then I iterate over the "content" column and extract the word to insert into the new column:
import pandas as pd

def extract_people(df):
    column = []
    people = ["mother", "father"]  # etc.
    # collect() pulls every row of the DataFrame onto the driver
    for row in df.select("content").collect():
        for word in people:
            # find() returns -1 (truthy) on a miss, so test membership instead
            if word in str(row):
                column.append(word)
                break
    return pd.Series(column)

df_pyspark = df_pyspark.withColumn('people', extract_people(df_pyspark))
This code doesn't work and gives me this error on the collect():
22/01/26 11:34:04 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 36)
java.lang.OutOfMemoryError: Java heap space
Maybe this is because my file is too large; it has 15 million rows. How can I create the new column in a different way?
Don't use .collect() unless your data is small. It fetches all the data onto the driver node and therefore does not work in any distributed way. Do you really need Spark for this? It could be done in pure Python.
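If you do stay in Spark, here is a minimal sketch of a distributed alternative that avoids collect() entirely, using only built-in pyspark.sql.functions. The sample rows and the two-word people list are assumptions taken from the question's example; extend them as needed.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data mirroring the question's example rows (assumed).
df_pyspark = spark.createDataFrame(
    [("The father has two dogs",), ("One cat stay at home of my mother",)],
    ["content"],
)

people = ["mother", "father"]  # extend with the other words of interest

# One expression per keyword: yields the keyword where "content" contains it,
# otherwise NULL. coalesce() keeps the first non-NULL match, which mirrors
# the `break` in the original loop (first matching word wins).
matches = [F.when(F.col("content").contains(w), F.lit(w)) for w in people]

df_pyspark = df_pyspark.withColumn("people", F.coalesce(*matches))
df_pyspark.show(truncate=False)

Because everything here is a column expression, Spark evaluates it on the executors row by row; nothing is ever shipped to the driver, so the heap-space error cannot occur even with 15 million rows.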