0

My database contains one column of strings. I'm going to create a new column based on part of string of other columns. For example:

         "content"                             "other column"
The father has two dogs                            father
One cat stay at home of my mother                  mother
etc.                                               etc.

I thought to create an array with words who interessed me. For example: people=[mother,father,etc.]

Then, I iterate on column "content" and extract the word to insert on new column:



def extract_people(df):
    column=[]
    people=[mother,father,etc.]
    for row in df.select("content").collect():
        for word in people:
            if str(row).find(word):
                column.append(word)
                break
    return pd.Series(column)


f_pyspark = df_pyspark.withColumn('people', extract_people(df_pyspark))

This code don't work and give me this error on the collect():

22/01/26 11:34:04 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 36)
java.lang.OutOfMemoryError: Java heap space

Maybe because my file is too large, have 15 million of row. How I may make the new column in different mode?

2
  • Do not use .collect() unless your data is small. It fetch all the data on the driver node and thus, does not work in any distributed way. Do you really needs Spark for this ? It can be done in pure Python. Commented Jan 26, 2022 at 13:46
  • If you really want to use Spark, there is other threads about this: stackoverflow.com/questions/61636254/… Commented Jan 26, 2022 at 13:54

1 Answer 1

1

Using the following dataframe as an example

+---------------------------------+
|content                          |
+---------------------------------+
|Thefatherhas two dogs            |
|The fatherhas two dogs           |
|Thefather has two dogs           |
|Thefatherhastwodogs              |
|One cat stay at home of my mother|
|One cat stay at home of mymother |
|Onecatstayathomeofmymother       |
|etc.                             |
|my feet smell                    |
+---------------------------------+

You can do the following

from pyspark.sql import functions

arr = ["father", "mother", "etc."]

expression = (
   "CASE " + 
    "".join(["WHEN content LIKE '%{}%' THEN '{}' ".format(val, val) for val in arr]) + 
     "ELSE 'None' END")

df = df.withColumn("other_column", functions.expr(expression))
df.show()
+---------------------------------+------------+
|content                          |other_column|
+---------------------------------+------------+
|Thefatherhas two dogs            |father      |
|The fatherhas two dogs           |father      |
|Thefather has two dogs           |father      |
|Thefatherhastwodogs              |father      |
|One cat stay at home of my mother|mother      |
|One cat stay at home of mymother |mother      |
|Onecatstayathomeofmymother       |mother      |
|etc.                             |etc.        |
|my feet smell                    |None        |
+---------------------------------+------------+
Sign up to request clarification or add additional context in comments.

11 Comments

Great answer, but i tell you the problem not correctly. In particular I need to search a word in text even if it is attached to another word. For example:
This should work for that case as well! I altered the example to show that it works when the word is not surrounded by spaces
Great answer, but I tell my problem not correctly. In particular I need to write the word in the new column even if it is attached to another word. FOR EXAMPLE: If the text is -> The fatherhas two dogs | The text to attached in new column should be: father | But, in this case new column will contains None. | How need modify the query?
Maybe, have to modify LIKE with another word.
Thank you very much. It working!!!
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.