
I have a PySpark dataframe as below and need to create a new dataframe with only one column, made up of all the 7-digit numbers in the original dataframe. The values are all strings. COLUMN1 should be ignored. Ignoring non-numbers and handling values with a single 7-digit number in COLUMN2 is fairly straightforward, but for the values that contain two separate 7-digit numbers, I'm having difficulty pulling them out individually. This needs to be automated and able to run on other, similar dataframes. The numbers are always 7 digits and always begin with a '1'. Any tips?

+-----------+--------------------+
|    COLUMN1|             COLUMN2|
+-----------+--------------------+
|     Value1|           Something|
|     Value2|     1057873 1057887|
|     Value3| Something Something|
|     Value4|                null|
|     Value5|             1312039|
|     Value6|     1463451 1463485|
|     Value7|     Not In Database|
|     Value8|     1617275 1617288|
+-----------+--------------------+
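For reference, the sample dataframe can be built like this (a minimal sketch; the SparkSession setup is an assumption):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reproduces the sample dataframe shown above.
data = [('Value1', 'Something'),
        ('Value2', '1057873 1057887'),
        ('Value3', 'Something Something'),
        ('Value4', None),
        ('Value5', '1312039'),
        ('Value6', '1463451 1463485'),
        ('Value7', 'Not In Database'),
        ('Value8', '1617275 1617288')]
df = spark.createDataFrame(data, ['COLUMN1', 'COLUMN2'])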

The resulting dataframe should be as below:

+-------+
|Column1|
+-------+
|1057873|
|1057887|
|1312039|
|1463451|
|1463485|
|1617275|
|1617288|
+-------+
UPDATE:

The responses are great, but unfortunately I'm using an older version of Spark that doesn't agree. I used the code below to solve the problem; it's a bit clunky, but it works.

from pyspark.sql import functions as F

# keep only the column that holds the numbers
new_df = df.select(df.COLUMN2)

# split on spaces, then explode so each token gets its own row
new_df = new_df.withColumn('splits', F.split(new_df.COLUMN2, ' '))
new_df = new_df.select(F.explode(new_df.splits).alias('column1'))

# keep only exact 7-digit tokens (raw string for the regex; anchored
# so rlike doesn't match mere substrings)
new_df = new_df.filter(new_df.column1.rlike(r'^\d{7}$'))
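For what it's worth, the same steps can be chained into a single expression. It only uses split, explode, and rlike, all of which predate Spark 2.3, so it should behave the same (a sketch assuming the same df; the anchored pattern leans on the guarantee that the numbers are 7 digits starting with '1'):

from pyspark.sql import functions as F

# Chained version of the steps above; split/explode/rlike all exist in 2.3.
new_df = (df
          .select(F.explode(F.split(df.COLUMN2, ' ')).alias('column1'))
          .filter(F.col('column1').rlike(r'^1\d{6}$')))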

3 Answers


Here is an approach using higher-order lambda functions for Spark 2.4+: split the column by space, filter for words that start with a digit and have length n (7), then explode:

n = 7
df.selectExpr(f"""explode(filter(split(COLUMN2,' '),x-> 
            x rlike '^[0-9]+' and length(x)={n})) as COLUMN1""").show(truncate=False)

+-------+
|COLUMN1|
+-------+
|1057873|
|1057887|
|1312039|
|1463451|
|1463485|
|1617275|
|1617288|
+-------+
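On Spark 3.1+ the same filter higher-order function is also exposed through the Python API, so the lambda can be written in Python directly (a sketch under that version assumption):

from pyspark.sql import functions as F

n = 7
# F.filter with a Python lambda over an array column needs Spark 3.1+.
df.select(
    F.explode(
        F.filter(F.split('COLUMN2', ' '),
                 lambda x: x.rlike('^[0-9]+') & (F.length(x) == n))
    ).alias('COLUMN1')
).show(truncate=False)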

4 Comments

I'm running Spark 2.3.0.cloudera3 and there may be issues with versions. For example, my setup does not recognize the '->'.
@anky stated 2.4+; any reason why you may not want to upgrade? Beware of issues with legacy code.
@Dr.Data maybe try df.selectExpr("""explode(regexp_extract_all(COLUMN2,'[0-9]{7}',0))""").show() (not sure if it will run in 2.3, but you can give it a shot)
Nope, regexp_extract_all is undefined for 2.3, apparently. I'd upgrade, but I'm not in charge of that! I did find a way to do it, though not as straightforward as these one-liners (I'll update the OP). It does the trick. I'm sure the above responses work great for 2.4+, though.

I like @anky's answer and voted for it. As an alternative, you can also use PySpark's exists higher-order function in 3.0+:

from pyspark.sql import functions as F

new = df.selectExpr("explode(split(COLUMN2,' ')) as COLUMN1").where(F.expr("exists(array(COLUMN1), element -> element rlike '([0-9]{7})')"))

new.show()

+-------+
|COLUMN1|
+-------+
|1057873|
|1057887|
|1312039|
|1463451|
|1463485|
|1617275|
|1617288|
+-------+
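Since the array built here has just one element, the exists call reduces to testing that single token, so a plain rlike on the exploded column gives the same result (a minimal sketch):

from pyspark.sql import functions as F

# Equivalent without the higher-order function: explode, then filter
# with the same '[0-9]{7}' pattern.
new = (df.selectExpr("explode(split(COLUMN2,' ')) as COLUMN1")
         .where(F.col('COLUMN1').rlike('[0-9]{7}')))
new.show()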

Comments


IIUC, you could use a regex with pandas str.extractall:

df2 = (df['COLUMN2'].str.extractall(r'(\b\d{7}\b)')[0]
      .reset_index(drop=True).to_frame(name='COLUMN1')
      )

output:

   COLUMN1
0  1057873
1  1057887
2  1312039
3  1463451
4  1463485
5  1617275
6  1617288

regex:

(      start capturing
\b     word boundary
\d{7}  7 digits       # or 1\d{6} for "1" + 6 digits
\b     word boundary
)      end capture
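If you do need the same extract-all idea directly in PySpark, regexp_extract_all is available as a SQL function from Spark 3.1 (a sketch, assuming the original spark dataframe df):

# Spark 3.1+ only; '[0-9]{7}' mirrors the \d{7} pattern above.
df.selectExpr(
    "explode(regexp_extract_all(COLUMN2, '[0-9]{7}', 0)) as COLUMN1"
).show()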

1 Comment

Apologies. I should have specified a PySpark df rather than pandas. This does work great for pandas, though, and I'll use it if I can't figure this out directly in PySpark.
