My database contains a column of strings. I want to create a new column based on part of the string in another column. For example:
"content" "other column"
The father has two dogs father
One cat stay at home of my mother mother
etc. etc.
I thought I would create an array with the words that interest me, for example: people = ["mother", "father", etc.]
Then I iterate over the "content" column and extract the word to insert into the new column:
import pandas as pd

def extract_people(df):
    column = []
    people = ["mother", "father"]  # etc.
    # collect() pulls every row of the DataFrame onto the driver
    for row in df.select("content").collect():
        for word in people:
            # find() returns -1 (truthy) on a miss, so test membership instead
            if word in str(row):
                column.append(word)
                break
    return pd.Series(column)

df_pyspark = df_pyspark.withColumn('people', extract_people(df_pyspark))
This code doesn't work and gives me this error on the collect():
22/01/26 11:34:04 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 36)
java.lang.OutOfMemoryError: Java heap space
Maybe this is because my file is too large; it has 15 million rows. How can I create the new column in a different way?
Don't use .collect() unless your data is small. It fetches all the data onto the driver node and therefore does not work in any distributed way. Do you really need Spark for this? It could be done in pure Python.
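If you do stay in Spark, here is a minimal sketch of a distributed alternative that avoids collect() entirely, using only built-in pyspark.sql.functions. The sample rows and the two-word people list are assumptions taken from the question's example; extend them as needed.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data mirroring the question's example rows (assumed).
df_pyspark = spark.createDataFrame(
    [("The father has two dogs",), ("One cat stay at home of my mother",)],
    ["content"],
)

people = ["mother", "father"]  # extend with the other words of interest

# One expression per keyword: yields the keyword where "content" contains it,
# otherwise NULL. coalesce() keeps the first non-NULL match, which mirrors
# the `break` in the original loop (first matching word wins).
matches = [F.when(F.col("content").contains(w), F.lit(w)) for w in people]

df_pyspark = df_pyspark.withColumn("people", F.coalesce(*matches))
df_pyspark.show(truncate=False)

Because everything here is a column expression, Spark evaluates it on the executors row by row; nothing is ever shipped to the driver, so the heap-space error cannot occur even with 15 million rows.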