Say I have a column filled with URLs like the following:
+------------------------------------------+
|url                                       |
+------------------------------------------+
|https://www.example1.com?param1=1&param2=a|
|https://www.example2.com?param1=2&param2=b|
|https://www.example3.com?param1=3&param2=c|
+------------------------------------------+
What would be the best way of extracting the URL parameters from this column and adding them as columns to the dataframe to produce the below?
+-------------------------------------------+-------+-------+
|url                                        | param1| param2|
+-------------------------------------------+-------+-------+
|https://www.example1.com?param1=1&param2=a |      1|      a|
|https://www.example2.com?param1=2&param2=b |      2|      b|
|https://www.example3.com?param1=3&param2=c |      3|      c|
|etc...                                     | etc...| etc...|
+-------------------------------------------+-------+-------+
My Attempts
I can think of two possible methods of doing this: using functions.regexp_extract from the pyspark library, or using urllib.parse.parse_qs and urllib.parse.urlparse from the standard library. The former relies on regex, which is a finicky way of extracting parameters from strings, while the latter would need to be wrapped in a UDF to be used.
from pyspark.sql import *
from pyspark.sql import functions as fn
df = spark.createDataFrame(
    [
        ("https://www.example1.com?param1=1&param2=a",),
        ("https://www.example2.com?param1=2&param2=b",),
        ("https://www.example3.com?param1=3&param2=c",)
    ],
    ["url"]
)
Regex solution:
df2 = df.withColumn("param1", fn.regexp_extract('url', r'param1=(\d)', 1))
df2 = df2.withColumn("param2", fn.regexp_extract('url', r'param2=([a-z])', 1))
df2.show()
>> +------------------------------------------+------+------+
>> |url                                       |param1|param2|
>> +------------------------------------------+------+------+
>> |https://www.example1.com?param1=1&param2=a|1     |a     |
>> |https://www.example2.com?param1=2&param2=b|2     |b     |
>> |https://www.example3.com?param1=3&param2=c|3     |c     |
>> +------------------------------------------+------+------+
UDF solution:
from urllib.parse import urlparse, parse_qs
from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, StringType
extract_params = udf(
    lambda x: {k: v[0] for k, v in parse_qs(urlparse(x).query).items()},
    MapType(StringType(), StringType())
)
df3 = df.withColumn(
"params", extract_params(df.url)
)
df3.withColumn(
"param1", df3.params['param1']
).withColumn(
"param2", df3.params['param2']
).drop("params").show()
>> +------------------------------------------+------+------+
>> |url                                       |param1|param2|
>> +------------------------------------------+------+------+
>> |https://www.example1.com?param1=1&param2=a|1     |a     |
>> |https://www.example2.com?param1=2&param2=b|2     |b     |
>> |https://www.example3.com?param1=3&param2=c|3     |c     |
>> +------------------------------------------+------+------+
I'd like to use the versatility of a library like urllib but would also like the optimisability of writing it in pyspark functions. Is there a better method than the two I've tried so far?
urllib in a udf might be a better approach. Also, if the URL format is consistent, you could use multiple splits to get the desired result. There is also parse_url, but it can only be used with SQL and expr.
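Building on that comment, here is a minimal sketch of the parse_url route, assuming the df created above and an active SparkSession. parse_url is a built-in Spark SQL function, so it avoids both hand-rolled regex and the serialisation cost of a Python UDF, but it has to be invoked through expr (or selectExpr) rather than as a regular pyspark.sql.functions call:
from pyspark.sql import functions as fn
# Sketch: extract each query parameter with Spark's built-in parse_url SQL
# function, called through expr because there is no direct DataFrame-API
# wrapper in older Spark versions. Assumes the `df` defined above.
df4 = df.withColumn(
    "param1", fn.expr("parse_url(url, 'QUERY', 'param1')")
).withColumn(
    "param2", fn.expr("parse_url(url, 'QUERY', 'param2')")
)
df4.show(truncate=False)
Because everything stays inside Spark SQL, Catalyst can optimise it like any other native expression, and a parameter that is missing from a URL simply comes back as null instead of raising an error.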