
I have a PySpark UDF which, when I apply it to one of the df columns to get a new column, returns [Ljava.lang.Object;@7e44638d (with a different value after the @ for each row).

Please see the UDF below:

import requests
import json
from pyspark.sql.functions import udf, col

def getLocCoordinates(property_address):
    url = "https://maps.googleapis.com/maps/api/geocode/json"
    querystring = {"address": property_address, "key": "THE_API_KEY"}
    response = requests.get(url, params=querystring)
    response_json = json.loads(response.text)

    for adr in response_json['results']:
        geometry = adr['geometry']
        coor = geometry['location']
        lat = coor['lat']
        lng = coor['lng']
        coors = lat, lng
        return coors

getCoorsUDF = udf(lambda x: getLocCoordinates(x))

df = df.withColumn("AddressCoordinates", getCoorsUDF(col("FullAddress") ) )
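For what it's worth, the JSON-parsing part of the function can be checked on its own against a made-up sample of the Geocoding response, without the network or Spark involved (the payload and coordinate values below are hypothetical):

```python
import json

# Hypothetical sample of the Geocoding API response shape
sample_response = json.dumps({
    "results": [
        {"geometry": {"location": {"lat": 51.5074, "lng": -0.1278}}}
    ]
})

def parse_coordinates(response_text):
    # Same extraction logic as in the UDF: returns a (lat, lng) tuple
    # for the first result, or None when there are no results
    response_json = json.loads(response_text)
    for adr in response_json['results']:
        coor = adr['geometry']['location']
        return coor['lat'], coor['lng']
    return None

print(parse_coordinates(sample_response))  # (51.5074, -0.1278)
```

This confirms the parsing works in plain Python; the trouble starts when the tuple crosses into Spark, as discussed below.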

I tried:

  • getCoorsUDF = udf(getLocCoordinates, FloatType()) --> returns NULL for each row of the newly created "AddressCoordinates" column.

  • getCoorsUDF = udf(getLocCoordinates, StringType()) --> returns [Ljava.lang.Object;@

  • getCoorsUDF = udf(getLocCoordinates) --> returns [Ljava.lang.Object;@

The result looks like so:

Ref Num  FullAddress   AddressCoordinates
1234     Some Address  [Ljava.lang.Object;@...

This gets returned for each row in the dataframe.

Initially I was using the function in a plain Python notebook and it was working fine; lat and lng were returned for each address. However, I had to move this to PySpark and I am hitting a brick wall here.

  • always put full error message because there are other useful information. Commented Apr 22 at 12:25
  • do you have correct indentation? return is inside the for-loop, so it finishes after the first element. Commented Apr 22 at 12:27
  • Hi, yes 'return' is inside the 'for-loop' Commented Apr 22 at 12:32
  • There is no error message at the moment. Only the fact that the newly created column 'AddressCoordinates' returns [Ljava.lang.Object;@.....] for each row instead of the outcome of the function. Commented Apr 22 at 12:33

1 Answer

I think that you're seeing the [Ljava.lang.Object;@... output because your UDF is returning a Python tuple ((lat, lng)), and PySpark doesn't know how to serialize that into a DataFrame column unless you explicitly define a return schema that Spark understands.

You should return a StructType with fields for lat and lng. For example you can do something like this:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StructType, StructField, DoubleType
import requests
import json

# defining return type for the UDF
location_schema = StructType([
    StructField("lat", DoubleType(), True),
    StructField("lng", DoubleType(), True)
])

def getLocCoordinates(property_address):
    url = "https://maps.googleapis.com/maps/api/geocode/json"
    params = {
        "address": property_address,
        "key": "YOUR_API_KEY"
    }
    try:
        response = requests.get(url, params=params)
        data = response.json()
        if data['results']:
            location = data['results'][0]['geometry']['location']
            return {"lat": location['lat'], "lng": location['lng']}
    except Exception as e:
        print(f"Error: {e}")
    return None

# registering the UDF with schema
getCoorsUDF = udf(getLocCoordinates, location_schema)

# now you apply the UDF
df = df.withColumn("AddressCoordinates", getCoorsUDF(col("FullAddress")))

# an option would be to extract lat and lng as separate columns
df = df.withColumn("Latitude", col("AddressCoordinates.lat")) \
       .withColumn("Longitude", col("AddressCoordinates.lng"))
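One detail worth noting: as far as I know, when a UDF declares a StructType return, a returned dict is matched to the struct by field name, so the keys must be exactly "lat" and "lng" (mismatched keys come back as nulls). The extraction step itself can be sanity-checked without Spark or the network; the payload below is made up:

```python
def parse_geocode(data):
    # Extract the first result's coordinates as a dict whose keys
    # match the StructField names ("lat", "lng") declared in the schema
    if data.get('results'):
        location = data['results'][0]['geometry']['location']
        return {"lat": location['lat'], "lng": location['lng']}
    return None

# Hypothetical Geocoding response payload
sample = {"results": [{"geometry": {"location": {"lat": 40.7128, "lng": -74.006}}}]}
print(parse_geocode(sample))       # {'lat': 40.7128, 'lng': -74.006}
print(parse_geocode({"results": []}))  # None
```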

2 Comments

Hi, what you have suggested worked. Thank you so much for pointing out what exactly needed changing and for actually posting the code. I will explore it in detail. Thanks once again :)
You're welcome, I'm glad I could help :)
