1

I need to merge following data into one row:

vector_no_stopw_df.select("filtered").show(3, truncate=False)



  +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|filtered                                                                                                                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[, problem, population]                                                                                                                                                                                                           |
|[tyler, notes, global, population, increase, sharply, next, century, , almost, growth, occurring, relatively, underdeveloped, africa, south, asia, , contrast, , population, actually, decline, countries]                        |
|[many, economists, uncomfortable, population, issues, , perhaps, arent, covered, depth, standard, graduate, curriculum, , touch, topics, may, culturally, controversial, even, politically, incorrect, thats, unfortunate, future]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

so that it looks like

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|filtered                                                                                                                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[, problem, population,tyler, notes, global, population, increase, sharply, next, century, , almost, growth, occurring, relatively, underdeveloped, africa, south, asia, , contrast, , population, actually, decline, countries,many, economists, uncomfortable, population, issues, , perhaps, arent, covered, depth, standard, graduate, curriculum, , touch, topics, may, culturally, controversial, even, politically, incorrect, thats, unfortunate, future]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I know this is trivial. but cannot find solution. I tried concat_ws it did not work as well.

The concat_ws, I executed generates (vector_no_stopw_df.select(concat_ws(',', vector_no_stopw_df.filtered)).collect()) as follows:

[Row(concat_ws(,, filtered)='one,big,advantages,economist,long,time,council,economic,advisers,,years,ago,ive,gotten,know,follow,lot,people,thinking,,started,cea,august,,finished,july,,,first,academic,year,,fellow,senior,economists,paul,krugman,,lawrence,summers'),
 Row(concat_ws(,, filtered)='isnt,going,happen,anytime,soon,meantime,,tax,system,puts,place,much,higher,marginal,rates,people,acknowledge,people,keep,focusing,federal,income,taxes,alone,,marginal,rates,top,around,,percent,leaves,state'),
 Row(concat_ws(,, filtered)=',,

Here is the solution just in case if somebody else needs it:

I went ahead and used python's itertools library.

vector_no_stopw_df_count=vector_no_stopw_df.select("filtered").collect()
vector_no_stopw_df_count[0].filtered
vector_no_stopw_list=[i.filtered for i in vector_no_stopw_df_count]

flatten the list

from itertools import chain
flattenlist= list(chain.from_iterable(vector_no_stopw_list))
flattenlist[:20]

Result:

['',
 'problem',
 'population',
 'tyler',
 'notes',
 'global',
 'population',
 'increase',
 'sharply',
 'next',
 'century',
 '',
 'almost',
 'growth',
 'occurring',
 'relatively',
 'underdeveloped',
 'africa',
 'south',
 'asia']
3
  • What is the condition that groups the rows? How do you know which 3 rows need to be merged into one? Commented Jul 9, 2018 at 17:36
  • I added a crude pythonic way to solve this. I wanted to solve it via pyspark but could not solve it, so went ahead used itertools. See above Commented Jul 9, 2018 at 18:12
  • And in fact, i do not want to merge which with which. I just wanted to create a flattened list. Commented Jul 9, 2018 at 18:12

1 Answer 1

1

In a way, you are looking for the reverse of explode.

You can use collect_list for this:

from pyspark.sql.functions as F
df.groupBy(<somecol>).agg(F.collect_list('filtered').alias('aggregated_filters'))
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.