
I have three string values and four lists that I'm trying to combine into a single DataFrame.

Lists -

>>> exp_type
[u'expect_column_to_exist', u'expect_column_values_to_not_be_null', u'expect_column_values_to_be_of_type', u'expect_column_to_exist', u'expect_table_row_count_to_equal', u'expect_column_sum_to_be_between']

>>> expectations
[{u'column': u'country'}, {u'column': u'country'}, {u'column': u'country', u'type_': u'StringType'}, {u'column': u'countray'}, {u'value': 10}, {u'column': u'active_stores', u'max_value': 1000, u'min_value': 100}]

>>> results
[{}, {u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0, u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102}, {u'observed_value': u'StringType'}, {}, {u'observed_value': 102}, {u'observed_value': 22075.0}]

>>> this_exp_success
[True, True, True, False, False, False]

Strings -

>>> is_overall_success
'False'

>>> data_source
'stores'
>>> run_time
'2022-02-24T05:43:16.678220+00:00'

I'm trying to create the DataFrame as below:

columns = ['data_source', 'run_time', 'exp_type', 'expectations', 'results', 'this_exp_success', 'is_overall_success']

dataframe = spark.createDataFrame(zip(data_source, run_time, exp_type, expectations, results, this_exp_success, is_overall_success), columns)

Error -

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 748, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 416, in _createFromLocal
    struct = self._inferSchemaFromList(data, names=schema)
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 348, in _inferSchemaFromList
    schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 1101, in _merge_type
    for f in a.fields]
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 1114, in _merge_type
    _merge_type(a.valueType, b.valueType, name='value of map %s' % name),
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 1094, in _merge_type
    raise TypeError(new_msg("Can not merge type %s and %s" % (type(a), type(b))))
TypeError: value of map field results: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StringType'>
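The error comes from schema inference: Spark tries to infer one value type for the `results` map column, but the dictionaries mix `int`, `float`, `str`, `None`, and `list` values, so no single map value type can be merged. A quick pure-Python check (just an illustration, using a subset of the dicts above) shows the conflict:

```python
# The dicts in `results` do not share one value type, so Spark cannot
# infer a single MapType(StringType(), <valueType>) for the column.
results = [{}, {'observed_value': 'StringType'}, {'observed_value': 102},
           {'observed_value': 22075.0}]

value_types = sorted({type(v).__name__ for d in results for v in d.values()})
print(value_types)  # ['float', 'int', 'str'] — no single mergeable type
```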

Expected Output


+---------------+--------------------------------+--------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+                                                                    
| data_source   |  run_time                      |exp_type                              |expectations                                                               |results                                                                                                                                                                                |this_exp_success|is_overall_success|
+---------------+--------------------------------+--------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+
|stores         |2022-02-24T05:43:16.678220+00:00|expect_column_to_exist                |{u'column': u'country'}                                                    |{}                                                                                                                                                                                     |True            |False             |
|stores         |2022-02-24T05:43:16.678220+00:00|expect_column_values_to_not_be_null   |{u'column': u'country'}                                                    |{u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0, u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102}  |True            |False             |
|stores         |2022-02-24T05:43:16.678220+00:00|expect_column_values_to_be_of_type    |{u'column': u'country', u'type_': u'StringType'}                           |{u'observed_value': u'StringType'}                                                                                                                                                     |True            |False             |
|stores         |2022-02-24T05:43:16.678220+00:00|expect_column_to_exist                |{u'column': u'countray'}                                                   |{}                                                                                                                                                                                     |False           |False             |
|stores         |2022-02-24T05:43:16.678220+00:00|expect_table_row_count_to_equal       |{u'value': 10}                                                             |{u'observed_value': 102}                                                                                                                                                               |False           |False             |
|stores         |2022-02-24T05:43:16.678220+00:00|expect_column_sum_to_be_between       |{u'column': u'active_stores', u'max_value': 1000, u'min_value': 100}       |{u'observed_value': 22075.0}                                                                                                                                                           |False           |False             |
+---------------+--------------------------------+--------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+
  • Could you explain or give an example of the input? What is the key that permits joining your two sets of data (strings and lists)? Commented Feb 28, 2022 at 8:11
  • Hi @dnej - Thank you for the response. As you can see above, I have lists exp_type, expectations, results, this_exp_success with example values, and also three strings is_overall_success, run_time, data_source. I have created another list, columns, with the column names. Now I want to create a DataFrame by zipping the lists exp_type, expectations, results, this_exp_success together with the strings is_overall_success, run_time, data_source. That's where I'm getting this error. Commented Feb 28, 2022 at 22:49
  • @dnej - I added expected output for convenience. Thanks! Commented Feb 28, 2022 at 23:10

1 Answer


A solution to your problem would be the following:

Zip the four lists into rows, and attach the three scalar strings as literal columns with lit().
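Note that the scalar strings must not go into zip() directly, which is part of what went wrong in the question: zip() iterates a string character by character and stops at the shortest iterable. A minimal illustration:

```python
# zip() walks a string one character at a time, so pairing a scalar
# string with a list truncates the result to the string's length:
rows = list(zip('stores', [True, True, False]))
print(rows)  # [('s', True), ('t', True), ('o', False)]
```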

from pyspark.sql.functions import lit
from pyspark.sql.types import MapType, StringType, StructField, StructType

exp_type = [u'expect_column_to_exist', u'expect_column_values_to_not_be_null',
            u'expect_column_values_to_be_of_type', u'expect_column_to_exist', u'expect_table_row_count_to_equal',
            u'expect_column_sum_to_be_between']

expectations = [{u'column': u'country'}, {u'column': u'country'}, {u'column': u'country', u'type_': u'StringType'},
                {u'column': u'countray'}, {u'value': 10},
                {u'column': u'active_stores', u'max_value': 1000, u'min_value': 100}]

results = [{}, {u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0,
                u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102},
           {u'observed_value': u'StringType'}, {}, {u'observed_value': 102}, {u'observed_value': 22075.0}]

df_schema = StructType([StructField("exp_type", StringType(), True),
                        StructField("expectations", MapType(StringType(), StringType()), True),
                        # StringType for `results`: its dicts mix value types,
                        # so a single map value type cannot be inferred/merged
                        StructField("results", StringType(), True),
                        StructField("this_exp_success", StringType(), True)
                        ])

this_exp_success = [True, True, True, False, False, False]

result = spark.createDataFrame(
    zip(exp_type, expectations, results, this_exp_success), df_schema)\
    .withColumn('is_overall_success', lit('False'))\
    .withColumn('data_source', lit('stores'))\
    .withColumn('run_time', lit('2022-02-24T05:43:16.678220+00:00'))

result.show(truncate=False)

And the result shows:

+-----------------------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+-----------+--------------------------------+
|exp_type                           |expectations                                                  |results                                                                                                                                                      |this_exp_success|is_overall_success|data_source|run_time                        |
+-----------------------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+-----------+--------------------------------+
|expect_column_to_exist             |[column -> country]                                           |{}                                                                                                                                                           |true            |False             |stores     |2022-02-24T05:43:16.678220+00:00|
|expect_column_values_to_not_be_null|[column -> country]                                           |{partial_unexpected_list=[], partial_unexpected_index_list=null, unexpected_count=0, element_count=102, unexpected_percent=0.0, partial_unexpected_counts=[]}|true            |False             |stores     |2022-02-24T05:43:16.678220+00:00|
|expect_column_values_to_be_of_type |[column -> country, type_ -> StringType]                      |{observed_value=StringType}                                                                                                                                  |true            |False             |stores     |2022-02-24T05:43:16.678220+00:00|
|expect_column_to_exist             |[column -> countray]                                          |{}                                                                                                                                                           |false           |False             |stores     |2022-02-24T05:43:16.678220+00:00|
|expect_table_row_count_to_equal    |[value -> 10]                                                 |{observed_value=102}                                                                                                                                         |false           |False             |stores     |2022-02-24T05:43:16.678220+00:00|
|expect_column_sum_to_be_between    |[column -> active_stores, min_value -> 100, max_value -> 1000]|{observed_value=22075.0}                                                                                                                                     |false           |False             |stores     |2022-02-24T05:43:16.678220+00:00|
+-----------------------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+-----------+--------------------------------+
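As an alternative to lit(), the scalar values could also be repeated into the zipped rows with itertools.repeat. This is only a sketch of the row-building step, using shortened example lists; the resulting tuples would then be passed to spark.createDataFrame with a seven-field schema:

```python
from itertools import repeat

data_source = 'stores'
run_time = '2022-02-24T05:43:16.678220+00:00'
is_overall_success = 'False'
exp_type = ['expect_column_to_exist', 'expect_table_row_count_to_equal']
this_exp_success = [True, False]

# repeat() yields the scalar endlessly; zip() stops at the shortest
# iterable, so every row carries the same scalar values.
rows = list(zip(repeat(data_source), repeat(run_time),
                exp_type, this_exp_success, repeat(is_overall_success)))
print(rows[0])
# ('stores', '2022-02-24T05:43:16.678220+00:00', 'expect_column_to_exist', True, 'False')
```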


1 Comment

with pleasure @bunnylorr
