I have two Spark DataFrames that look like this:
> cities_df
+-------+---------------------------+
|city_id|cities                     |
+-------+---------------------------+
|22     |[Milan, Turin, Rome]       |
|15     |[Naples, Florence, Genoa]  |
|43     |[Houston, San Jose, Boston]|
|56     |[New York, Dallas, Chicago]|
+-------+---------------------------+
> countries_df
+----------+----------------------------------+
|country_id|countries                         |
+----------+----------------------------------+
|680       |{'country': [56, 43], 'add': []}  |
|11        |{'country': [22, 15], 'add': [32]}|
+----------+----------------------------------+
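For a reproducible setup, the two dataframes can be built like this (a minimal sketch; it assumes the `countries` column is a struct of two integer arrays rather than a map or a JSON string):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, LongType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# cities_df: one array of city names per city_id.
cities_df = spark.createDataFrame(
    [
        (22, ["Milan", "Turin", "Rome"]),
        (15, ["Naples", "Florence", "Genoa"]),
        (43, ["Houston", "San Jose", "Boston"]),
        (56, ["New York", "Dallas", "Chicago"]),
    ],
    ["city_id", "cities"],
)

# countries_df: `countries` modeled as a struct with two integer-array
# fields, matching the printed {'country': ..., 'add': ...} values.
countries_schema = StructType([
    StructField("country_id", LongType()),
    StructField("countries", StructType([
        StructField("country", ArrayType(LongType())),
        StructField("add", ArrayType(LongType())),
    ])),
])

countries_df = spark.createDataFrame(
    [(680, ([56, 43], [])), (11, ([22, 15], [32]))],
    countries_schema,
)
```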
The values in the `country` arrays of `countries_df` are `city_id`s from `cities_df`.
I need to join the two dataframes so that each id in the `country` array is replaced by the corresponding `cities` array from `cities_df`.
Expected output:
| country_id | countries | grouped_cities |
|---|---|---|
| 680 | {'country': [56, 43], 'add': []} | [New York, Dallas, Chicago, Houston, San Jose, Boston] |
| 11 | {'country': [22, 15], 'add': [32]} | [Milan, Turin, Rome, Naples, Florence, Genoa] |
The resulting grouped_cities value doesn't have to be an array type; a plain string is fine too.
How can I get this result using PySpark?
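The direction I've been sketching (untested; it assumes Spark 2.4+ for `array_sort`/`flatten` and the struct layout shown above) is to posexplode the `country` array, join on the id, and re-aggregate, but I'm not sure it's the idiomatic way:

```python
import pyspark.sql.functions as F

# Explode the id array, keeping each id's position so the original
# order can be restored after the aggregation.
exploded = countries_df.select(
    "country_id",
    "countries",
    F.posexplode("countries.country").alias("pos", "city_id"),
)

# Attach each id's array of city names.
joined = exploded.join(cities_df, "city_id", "left")

# Re-aggregate per country: sort the collected (pos, cities) pairs by
# position, keep the city arrays, and flatten them into a single array.
# first() just carries the untouched `countries` struct through.
result = joined.groupBy("country_id").agg(
    F.first("countries").alias("countries"),
    F.flatten(
        F.array_sort(F.collect_list(F.struct("pos", "cities")))
        .getField("cities")
    ).alias("grouped_cities"),
)

result.show(truncate=False)
```

If grouped_cities should be a string instead of an array, I suppose wrapping the flattened column in `F.concat_ws(", ", ...)` would do it.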