I have a table in Spark SQL in Databricks, and one of its columns is a string. I converted it into a new column with an array datatype, but the values are still one string, even though the datatype is array in the table schema.

Column as String

Data1 
[2461][2639][2639][7700][7700][3953]

Converted to Array

Data_New
["[2461][2639][2639][7700][7700][3953]"]

String to array conversion

from pyspark.sql.functions import array

df_new = df.withColumn("Data_New", array(df["Data1"]))

Then I write it as Parquet and use it as a Spark SQL table in Databricks.

When I search for a string using the array_contains function, I get false:

select *
from table_name
where array_contains(Data_New,"[2461]")

When I search for the entire string, the query returns true.

Please suggest how I can split this string into an array so that I can find any element using the array_contains function.

3 Answers

Just remove the leading and trailing brackets from the string, then split on ][ to get an array of strings:

from pyspark.sql.functions import expr, split

df = df.withColumn("Data_New", split(expr("rtrim(']', ltrim('[', Data1))"), "\\]\\["))
df.show(truncate=False)

+------------------------------------+------------------------------------+
|Data1                               |Data_New                            |
+------------------------------------+------------------------------------+
|[2461][2639][2639][7700][7700][3953]|[2461, 2639, 2639, 7700, 7700, 3953]|
+------------------------------------+------------------------------------+
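
The same trim-and-split logic also works as a single SQL expression via selectExpr, if you prefer to keep it in SQL (a sketch; the output is identical to the snippet above):

# Same trim-and-split logic written as one SQL expression
df = df.selectExpr(
    "Data1",
    r"split(rtrim(']', ltrim('[', Data1)), '\\]\\[') as Data_New",
)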

Now use array_contains like this:

df.createOrReplaceTempView("table_name")

sql_query = "select * from table_name where array_contains(Data_New,'2461')"
spark.sql(sql_query).show(truncate=False)
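
Since you then write the data to Parquet and query it as a table, the array type survives that round trip (a minimal sketch; the path and table name here are hypothetical placeholders):

# Hypothetical output path and table name, for illustration only
df.write.mode("overwrite").parquet("/mnt/output/data_new_parquet")
spark.sql("""
    create table if not exists table_name_fixed
    using parquet location '/mnt/output/data_new_parquet'
""")
spark.sql("select * from table_name_fixed where array_contains(Data_New, '2461')").show(truncate=False)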

Actually this is not an array; it is one full string, so you need a regex or something similar:

pattern = "\\[2461\\]"  # escape the brackets; [2461] alone is a character class matching any single 2, 4, 6 or 1
df_new.filter(df_new["Data_New"].rlike(pattern))
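
For example, with the sample value from the question (a self-contained sketch):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_new = spark.createDataFrame(
    [("[2461][2639][2639][7700][7700][3953]",)], ["Data_New"]
)
# Escaping the brackets makes the pattern match the literal substring [2461]
df_new.filter(df_new["Data_New"].rlike("\\[2461\\]")).show(truncate=False)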

import

from pyspark.sql import functions as sf, types as st

create table

a = [["[2461][2639][2639][7700][7700][3953]"], [None]]
sdf = sc.parallelize(a).toDF(["col1"])
sdf.show()
+--------------------+
|                col1|
+--------------------+
|[2461][2639][2639...|
|                null|
+--------------------+

convert type

def splitter(x):
    # Strip the outer brackets and split on "][", returning None for missing values
    if x is not None:
        return x[1:-1].split("][")
    return None

split_udf = sf.udf(splitter, st.ArrayType(st.StringType()))
sdf.withColumn("array_col1", split_udf("col1")) \
   .withColumn("check", sf.array_contains("array_col1", "2461")) \
   .show()
+--------------------+--------------------+-----+
|                col1|          array_col1|check|
+--------------------+--------------------+-----+
|[2461][2639][2639...|[2461, 2639, 2639...| true|
|                null|                null| null|
+--------------------+--------------------+-----+
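
As noted in the comments below, a Python UDF adds serialization overhead on large data. The built-in split is null-safe (it returns null for a null input), so the same result can be had without a UDF (a sketch reusing sdf from above):

from pyspark.sql import functions as sf

# Built-in functions propagate nulls, so no explicit None handling is needed
sdf.withColumn(
    "array_col1",
    sf.split(sf.expr("rtrim(']', ltrim('[', col1))"), "\\]\\[")
).withColumn(
    "check", sf.array_contains("array_col1", "2461")
).show()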

4 Comments

Looks like this will work, but when I try to write the dataframe to a Parquet file I get an error: File "<command-1489663418629327>", line 2, in <lambda> TypeError: 'NoneType' object is not subscriptable. Please suggest.
Do you have missing data, like NaN, in your dataframe? If that is the case, I need to modify the UDF to filter out the missing values.
Yes, there are missing values in the dataframe.
I updated the answer. Please be cautious: if your df is very large, a Python UDF will hurt your performance, so blackbishop's answer is better in terms of performance.
