concatenating multiple rows Pyspark

Question

I need to merge following data into one row:

vector_no_stopw_df.select("filtered").show(3, truncate=False)



  +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|filtered                                                                                                                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[, problem, population]                                                                                                                                                                                                           |
|[tyler, notes, global, population, increase, sharply, next, century, , almost, growth, occurring, relatively, underdeveloped, africa, south, asia, , contrast, , population, actually, decline, countries]                        |
|[many, economists, uncomfortable, population, issues, , perhaps, arent, covered, depth, standard, graduate, curriculum, , touch, topics, may, culturally, controversial, even, politically, incorrect, thats, unfortunate, future]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

so that it looks like

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|filtered                                                                                                                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[, problem, population,tyler, notes, global, population, increase, sharply, next, century, , almost, growth, occurring, relatively, underdeveloped, africa, south, asia, , contrast, , population, actually, decline, countries,many, economists, uncomfortable, population, issues, , perhaps, arent, covered, depth, standard, graduate, curriculum, , touch, topics, may, culturally, controversial, even, politically, incorrect, thats, unfortunate, future]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I know this is trivial. but cannot find solution. I tried concat_ws it did not work as well.

The concat_ws, I executed generates (vector_no_stopw_df.select(concat_ws(',', vector_no_stopw_df.filtered)).collect()) as follows:

[Row(concat_ws(,, filtered)='one,big,advantages,economist,long,time,council,economic,advisers,,years,ago,ive,gotten,know,follow,lot,people,thinking,,started,cea,august,,finished,july,,,first,academic,year,,fellow,senior,economists,paul,krugman,,lawrence,summers'),
 Row(concat_ws(,, filtered)='isnt,going,happen,anytime,soon,meantime,,tax,system,puts,place,much,higher,marginal,rates,people,acknowledge,people,keep,focusing,federal,income,taxes,alone,,marginal,rates,top,around,,percent,leaves,state'),
 Row(concat_ws(,, filtered)=',,

Here is the solution just in case if somebody else needs it:

I went ahead and used python's itertools library.

vector_no_stopw_df_count=vector_no_stopw_df.select("filtered").collect()
vector_no_stopw_df_count[0].filtered
vector_no_stopw_list=[i.filtered for i in vector_no_stopw_df_count]

flatten the list

from itertools import chain
flattenlist= list(chain.from_iterable(vector_no_stopw_list))
flattenlist[:20]

Result:

['',
 'problem',
 'population',
 'tyler',
 'notes',
 'global',
 'population',
 'increase',
 'sharply',
 'next',
 'century',
 '',
 'almost',
 'growth',
 'occurring',
 'relatively',
 'underdeveloped',
 'africa',
 'south',
 'asia']

What is the condition that groups the rows? How do you know which 3 rows need to be merged into one? — pault
– pault, Commented Jul 9, 2018 at 17:36
I added a crude pythonic way to solve this. I wanted to solve it via pyspark but could not solve it, so went ahead used itertools. See above — lpt
– lpt, Commented Jul 9, 2018 at 18:12
And in fact, i do not want to merge which with which. I just wanted to create a flattened list. — lpt
– lpt, Commented Jul 9, 2018 at 18:12

karthikr · Accepted Answer · 2018-07-08 02:05:46Z

1

In a way, you are looking for the reverse of explode.

You can use collect_list for this:

from pyspark.sql.functions as F
df.groupBy(<somecol>).agg(F.collect_list('filtered').alias('aggregated_filters'))

answered Jul 8, 2018 at 2:05

karthikr

100k26 gold badges208 silver badges191 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

concatenating multiple rows Pyspark

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related