
I have a MySQL table with the following schema:

id   - int
path - varchar
info - json, e.g. {"name":"pat", "address":"NY, USA", ...}

I used the JDBC driver to connect PySpark to MySQL. I can retrieve data from MySQL with

df = sqlContext.sql("select * from dbTable")

This query works fine. My question is: how can I query the "info" column? For example, the query below works in the MySQL shell and retrieves data, but it is not supported in PySpark (2+).

select id, info->"$.name" from dbTable where info->"$.name"='pat'

1 Answer

from pyspark.sql.functions import get_json_object

# extract a field from the JSON column
res = df.select(get_json_object(df['info'], "$.name").alias('name'))
# filter rows on a field inside the JSON column
res = df.filter(get_json_object(df['info'], "$.name") == 'pat')

Spark already provides a function named get_json_object for this.
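
The two calls can also be chained to mirror the original MySQL query; a minimal sketch, assuming df is the DataFrame already loaded from the table:

# select id and the extracted name, keeping only rows where name = 'pat'
res = df.filter(get_json_object(df['info'], '$.name') == 'pat') \
        .select(df['id'], get_json_object(df['info'], '$.name').alias('name'))
res.show()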


For your situation:

# read the MySQL table over JDBC into a DataFrame
df = spark.read.jdbc(url='jdbc:mysql://localhost:3306', table='test.test_json',
                     properties={'user': 'hive', 'password': '123456'})
df.createOrReplaceTempView('test_json')
# query the JSON column with get_json_object in Spark SQL
res = spark.sql("""
select col_json, get_json_object(col_json, '$.name') from test_json
""")
res.show()

Spark SQL is very similar to Hive SQL; for reference, see

https://cwiki.apache.org/confluence/display/Hive/Home


1 Comment

Thanks for your reply. This method only works once the data is loaded into a DataFrame. There are hundreds of thousands of records, so it might not be efficient to load the complete table and then filter it. Is there a way to retrieve only the matching rows (via a JSON search) with a query, rather than loading the complete table?
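
One possible approach (a sketch, not from the answer above): push the JSON filter down to MySQL by passing a subquery as the JDBC table, so only matching rows are transferred. Spark's JDBC source can push simple predicates down on its own, but a filter expressed with get_json_object runs in Spark, so embedding the question's own MySQL query in the read avoids pulling the whole table. The URL and credentials are reused from the answer and are assumptions:

# MySQL evaluates the JSON filter; only rows where name = 'pat' cross the wire.
# The "(subquery) alias" form is standard JDBC-source usage.
pushdown = """(select id, info->"$.name" as name
               from dbTable
               where info->"$.name" = 'pat') as filtered"""
df = spark.read.jdbc(url='jdbc:mysql://localhost:3306',
                     table=pushdown,
                     properties={'user': 'hive', 'password': '123456'})
df.show()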
