I have the following code, which is meant to pull the rows matching several different conditions (one per stock symbol) into separate columns of a single DataFrame.
from pyspark.sql.functions import col

# spark_frame is a pyspark.sql.dataframe.DataFrame
small_list = ["INFY", "TCS", "SBIN", "ICICIBANK"]
frame = spark_frame.where(col("symbol") == small_list[0]).select("close")
for single_stock in small_list[1:]:
    print(single_stock)
    current_stock = spark_frame.where(col("symbol") == single_stock).select("close")
    current_stock.collect()
    frame.collect()
    frame = frame.withColumn(single_stock, current_stock.close)
But when I call frame.collect(), I get:
[Row(close=736.85, TCS=736.85, SBIN=736.85, ICICIBANK=736.85),
Row(close=734.7, TCS=734.7, SBIN=734.7, ICICIBANK=734.7),
Row(close=746.0, TCS=746.0, SBIN=746.0, ICICIBANK=746.0),
Row(close=738.85, TCS=738.85, SBIN=738.85, ICICIBANK=738.85)]
This is wrong: every column just repeats the values of the first symbol. What am I doing wrong, and what is the best way to solve this?
Edit: spark_frame looks like this:
[Row(SYMBOL='LINC', SERIES=' EQ', TIMESTAMP=datetime.datetime(2021, 12, 20, 0, 0), PREVCLOSE=235.6, OPEN=233.95, HIGH=234.0, LOW=222.15, LAST=222.15, CLOSE=224.2, AVG_PRICE=226.63, TOTTRDQTY=6447, TOTTRDVAL=14.61, TOTALTRADES=206, DELIVQTY=5507, DELIVPER=85.42),
Row(SYMBOL='LINC', SERIES=' EQ', TIMESTAMP=datetime.datetime(2021, 12, 21, 0, 0), PREVCLOSE=224.2, OPEN=243.85, HIGH=243.85, LOW=222.85, LAST=226.0, CLOSE=225.6, AVG_PRICE=227.0, TOTTRDQTY=8447, TOTTRDVAL=19.17, TOTALTRADES=266, DELIVQTY=3401, DELIVPER=40.26),
Row(SYMBOL='SCHAEFFLER', SERIES=' EQ', TIMESTAMP=datetime.datetime(2020, 8, 6, 0, 0), PREVCLOSE=3593.9, OPEN=3611.85, HIGH=3618.35, LOW=3542.5, LAST=3594.95, CLOSE=3573.1, AVG_PRICE=3580.73, TOTTRDQTY=12851, TOTTRDVAL=460.16, TOTALTRADES=1886, DELIVQTY=9649, DELIVPER=75.08),
Row(SYMBOL='SCHAEFFLER', SERIES=' EQ', TIMESTAMP=datetime.datetime(2020, 8, 7, 0, 0), PREVCLOSE=3573.1, OPEN=3591.0, HIGH=3591.0, LOW=3520.0, LAST=3548.95, CLOSE=3543.85, AVG_PRICE=3554.6, TOTTRDQTY=2406, TOTTRDVAL=85.52, TOTALTRADES=688, DELIVQTY=1452, DELIVPER=60.35)]
Expected results should look like this:
[Row(LINC=224.2, SCHAEFFLER=3573.1),
Row(LINC=225.6, SCHAEFFLER=3543.85)]
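From what I can tell, something like groupBy().pivot() might be the direction. Here is a minimal sketch of what I mean, assuming each (TIMESTAMP, SYMBOL) pair has exactly one CLOSE value (the column list passed to pivot() is just from my small example):

from pyspark.sql import functions as F

# Sketch: turn each SYMBOL value into its own column of CLOSE prices,
# one row per TIMESTAMP. Passing the list of symbols to pivot() avoids
# an extra pass over the data to discover the distinct values.
pivoted = (
    spark_frame
    .groupBy("TIMESTAMP")
    .pivot("SYMBOL", ["LINC", "SCHAEFFLER"])
    .agg(F.first("CLOSE"))
)
pivoted.show()

Is this the right approach, or is there a better way?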
Note: spark_frame.show() will display only a subset of the data, while collect() pulls every row back to the driver, which could result in an OOM error with larger data.
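For illustration, the difference in practice:

# show(n) prints only the first n rows (20 by default) without pulling
# the whole DataFrame to the driver; collect() materializes every row
# in driver memory.
spark_frame.show(5)               # safe preview of a few rows
all_rows = spark_frame.collect()  # may OOM the driver on big data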