
I'm trying to manually create a Spark DataFrame with a single column DT and a single row containing the date 2020-1-1:

DT
=======
2020-01-01

However, I get a list index out of range error.

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder\
        .master(f'spark://{IP}:7077')\
        .config('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2')\
        .appName('g data')\
        .getOrCreate()

spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

dates = spark.createDataFrame([(pd.to_datetime('2020-1-1'))], ['DT'])

Traceback:

 in brand_tagging_since_until(spark, since, until)
---> 81         dates = spark.createDataFrame([(pd.to_datetime('2020-1-1'))], ['DT'])

/usr/local/bin/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    746             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    747         else:
--> 748             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    749         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    750         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/local/bin/spark/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    419             if isinstance(schema, (list, tuple)):
    420                 for i, name in enumerate(schema):
--> 421                     struct.fields[i].name = name
    422                     struct.names[i] = name
    423             schema = struct
  • What are the row and columns you're trying to create? Is DT the column name in a single column dataframe, or a value in the row? Commented Jan 5, 2021 at 3:15
  • DT is the column name with type of datetime, it should have a single row of 2020-01-01. Commented Jan 5, 2021 at 3:18
  • Thanks. I'll add an answer. There are two separate wrinkles here. Commented Jan 5, 2021 at 3:43

2 Answers


There are two issues here, though only one is surfaced by your example. The immediate problem is that (pd.to_datetime('2020-1-1')) is not a tuple: without a trailing comma, the parentheses are just grouping, so createDataFrame receives a bare Timestamp rather than a row and fails while assigning column names. But naively adding the comma will silently fail too, because the constructor doesn't know what to do with a pandas Timestamp object.
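You can verify the tuple wrinkle in plain Python, independent of Spark:

val = pd.to_datetime('2020-1-1')
print(type((val)))   # <class 'pandas._libs.tslibs.timestamps.Timestamp'> -- parentheses alone are just grouping
print(type((val,)))  # <class 'tuple'> -- the trailing comma is what makes a one-element tuple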

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("timestamp").getOrCreate()

val = pd.to_datetime('2020-1-1')
spark.createDataFrame(
    data=[(val,)],
    schema=["DT"]
).show()
+---+
| DT|
+---+
| []|
+---+

You'll want to convert this to a raw Python datetime object beforehand if you want to use the constructor like this.

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("timestamp").getOrCreate()

val = pd.to_datetime('2020-1-1')
spark.createDataFrame(
    data=[(val.to_pydatetime(),)],
    schema=["DT"]
).show()
+-------------------+
|                 DT|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+
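If you want to confirm what type was inferred, printSchema on the same call (assuming the spark session and val from above) reports a timestamp column:

spark.createDataFrame(
    data=[(val.to_pydatetime(),)],
    schema=["DT"]
).printSchema()
root
 |-- DT: timestamp (nullable = true)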

With that said, it's not clear to me where this is most cleanly documented. If you're curious, you can see this requirement in the Spark codebase, or in the source code docs.

If you pass a pandas DataFrame to the constructor, this is handled under the hood.

df = pd.DataFrame({"DT": [val]})
spark.createDataFrame(
    data=df
).show()
+-------------------+
|                 DT|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+

1 Comment

I read the code and found that eventually it does dates = dates.toPandas()['DT'].values. Is that final dates the same as the df in your answer?

A more straightforward way to create the dataframe without relying on pandas:

import pyspark.sql.functions as F

dates = spark.createDataFrame([['2020-01-01']], ['DT']) \
             .withColumn('DT', F.col('DT').cast('timestamp'))

dates.show()
+-------------------+
|                 DT|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+
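If you'd rather skip the string cast as well, the standard-library datetime type is accepted by createDataFrame directly, since Spark maps datetime.datetime to a timestamp (a minimal sketch, assuming the same spark session):

from datetime import datetime

dates = spark.createDataFrame([(datetime(2020, 1, 1),)], ['DT'])
dates.show()
+-------------------+
|                 DT|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+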
