
I am trying to extract domains from URLs.

Input:

    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    val b = Seq(
        ("subdomain.example.com/test.php"),
        ("example.com"),
        ("example.buzz"),
        ("test.example.buzz"),
        ("subdomain.example.co.uk"),
    ).toDF("raw_url")
    var c = b.withColumn("host", callUDF("parse_url", $"raw_url", lit("HOST"))).show()

Expected results:

    +--------------------------------+---------------+
    | raw_url                        | host          |
    +--------------------------------+---------------+
    | subdomain.example.com/test.php | example.com   |
    | example.com                    | example.com   | 
    | example.buzz                   | example.buzz  |
    | test.example.buzz              | example.buzz  |
    | subdomain.example.co.uk        | example.co.uk |
    +--------------------------------+---------------+

Any advice much appreciated.
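For context on why parse_url alone cannot produce the expected table: Spark's parse_url only yields a HOST when the input carries a scheme, and even then it returns the full host rather than the registrable domain. A minimal sketch (using expr so it also works on Spark versions where parse_url is not exposed as a Scala function):

    Seq("subdomain.example.com/test.php", "https://subdomain.example.com/test.php")
        .toDF("raw_url")
        .withColumn("host", expr("parse_url(raw_url, 'HOST')"))
        .show(false)
    // host is null for the scheme-less row and "subdomain.example.com" for the
    // other, so the subdomain still has to be stripped separately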

EDIT: based on the tip from @AlexOtt I have got a few steps closer.

    import com.google.common.net.InternetDomainName
    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    val b = Seq(
        ("subdomain.example.com/test.php"),
        ("example.com"),
        ("example.buzz"),
        ("test.example.buzz"),
        ("subdomain.example.co.uk"),
    ).toDF("raw_url")
    var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()

However, I clearly have not implemented it correctly with withColumn. Here is the error:

    error: not found: value topPrivateDomain
           var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()

EDIT 2:

Got some good pointers from @sarveshseri and after cleaning up some syntax errors, the following code is able to remove the subdomains from most of the URLs.

    import org.apache.spark.sql.functions.udf
    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    import com.google.common.net.InternetDomainName
    import java.net.URL

    val b = Seq(
       ("subdomain.example.com/test.php"),
       ("example.com"),
       //("example.buzz"),
       //("test.example.buzz"),
       ("subdomain.example.co.uk"),
       ).toDF("raw_url")

    val hostExtractUdf = org.apache.spark.sql.functions.udf { (urlString: String) =>
        val url = new URL("https://" + urlString)
        val host = url.getHost
        InternetDomainName.from(host).topPrivateDomain().name()
    }

    var c = b.select("raw_url")
        .withColumn("HOST", hostExtractUdf(col("raw_url")))
        .show(false)

However, it still does not work as expected. Newer suffixes such as .buzz, .site, and .today cause the following error:

Caused by: java.lang.IllegalStateException: Not under a public suffix: example.buzz
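One way to avoid this exception in the meantime is to guard the call (a sketch; it assumes falling back to the unmodified host is acceptable when Guava's bundled public-suffix list does not know the TLD, and safeTopPrivateDomain is a name I made up):

    import com.google.common.net.InternetDomainName

    // Fall back to the raw host when the TLD is not in Guava's
    // bundled public-suffix list (typical of older Guava versions).
    def safeTopPrivateDomain(host: String): String = {
        val idn = InternetDomainName.from(host)
        if (idn.isUnderPublicSuffix) idn.topPrivateDomain().toString
        else host // e.g. "example.buzz" on a Guava version that predates .buzz
    }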
  • you need to wrap some library that supports lookup via the public suffix list (publicsuffix.org); the rules are quite complicated. Commented Dec 13, 2021 at 18:42
  • Thanks for the tip @AlexOtt, that has somewhat helped. I found this stackoverflow.com/questions/45046265/…. However, I'm still stuck on how to apply InternetDomainName.from().topPrivateDomain to withColumn. Commented Dec 14, 2021 at 14:56

2 Answers


First, you will need to add Guava to the dependencies in build.sbt.

libraryDependencies += "com.google.guava" % "guava" % "31.0.1-jre"
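For context, a minimal build.sbt sketch (the project name and the Scala and Spark versions here are assumptions; match them to your cluster):

    name := "domain-extract"
    scalaVersion := "2.12.15"
    libraryDependencies ++= Seq(
      // "provided" because the Spark runtime supplies spark-sql itself
      "org.apache.spark" %% "spark-sql" % "3.2.0" % "provided",
      "com.google.guava" % "guava" % "31.0.1-jre"
    )

If you are on spark-shell rather than sbt, the equivalent is passing the artifact directly: spark-shell --packages com.google.guava:guava:31.0.1-jre. Note that Spark ships an older Guava on its own classpath, so a version conflict is possible; shading Guava in an assembly jar is the usual workaround.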

Now you can extract the host as follows,

import com.google.common.net.InternetDomainName
import org.apache.spark.sql.functions._

import java.net.URL

import spark.implicits._

val hostExtractUdf = org.apache.spark.sql.functions.udf { (urlString: String) =>
  val url = new URL("https://" + urlString)
  val host = url.getHost
  InternetDomainName.from(host).topPrivateDomain().toString
}

val b = sc.parallelize(Seq(
  ("a.b.com/c.php"),
  ("a.b.site/c.php"),
  ("a.b.buzz/c.php"),
  ("a.b.today/c.php"),
  ("b.com"),
  ("b.site"),
  ("b.buzz"),
  ("b.today"),
  ("a.b.buzz"),
  ("a.b.co.uk"),
  ("a.b.site")
)).toDF("raw_url")

val c = b.withColumn("HOST", hostExtractUdf(col("raw_url")))

c.show()

Output of c.show():

+---------------+-------+
|        raw_url|   HOST|
+---------------+-------+
|  a.b.com/c.php|  b.com|
| a.b.site/c.php| b.site|
| a.b.buzz/c.php| b.buzz|
|a.b.today/c.php|b.today|
|          b.com|  b.com|
|         b.site| b.site|
|         b.buzz| b.buzz|
|        b.today|b.today|
|       a.b.buzz| b.buzz|
|      a.b.co.uk|b.co.uk|
|       a.b.site| b.site|
+---------------+-------+

8 Comments

  • Is there any way to avoid using sbt?
  • You can use any other build tool like Maven, Gradle, Pants, Bazel, etc. to build your project.
  • Good to know. Up until now, I have been using :load from spark-shell.
  • I cleaned up some syntax errors and got your code (mostly) working. However, it's unable to handle newer suffixes like .buzz and .today and .site.
  • Yes, I had only added the missing pieces in the code and it was not complete code. Fixed that. What do you mean by "unable to handle .buzz, .today and .site"? You can see in the code that it handles these without issue.

Maybe you can use regex with Spark regexp_extract and regexp_replace built-in functions. Here's an example:

val c = b.withColumn(
  "HOST",
  regexp_extract(col("raw_url"), raw"^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)", 1)
).withColumn(
  "sub_domain",
  regexp_extract(col("HOST"), raw"(.*?)\.(?=[^\/]*\..{2,5})/?.*", 1)
).withColumn(
  "HOST",
  expr("trim(LEADING '.' FROM regexp_replace(HOST, sub_domain, ''))")
).drop("sub_domain")

c.show(false)
//+-----------------------------------+-------------+
//|raw_url                            |HOST         |
//+-----------------------------------+-------------+
//|subdomain.example.com/test.php     |example.com  |
//|example.com                        |example.com  |
//|example.buzz                       |example.buzz |
//|test.example.buzz                  |example.buzz |
//|https://www.subdomain.example.co.uk|example.co.uk|
//|subdomain.domain.buzz              |domain.buzz  |
//|dev.example.today                  |example.today|
//+-----------------------------------+-------------+

The first regexp_extract pulls the full host name from the URL (including the subdomain). Then, using the regex taken from this answer, we find the subdomain, replace it with a blank, and trim the leading dot.

I didn't test it for all possible cases, but it works fine for the examples given in your question.
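To see what the two regexes do outside Spark, here is a plain-Scala sketch (the patterns are copied from the snippet above, with stripPrefix standing in for the regexp_replace/trim step; the sample URL is my own):

    val hostPattern = raw"^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)".r
    val subPattern  = raw"(.*?)\.(?=[^\/]*\..{2,5})/?.*".r

    val rawUrl = "https://www.subdomain.example.co.uk/test.php"
    // Step 1: extract the full host (scheme, credentials, www. stripped)
    val host = hostPattern.findFirstMatchIn(rawUrl).map(_.group(1)).getOrElse(rawUrl)
    // host == "subdomain.example.co.uk"
    // Step 2: find the subdomain prefix
    val sub = subPattern.findFirstMatchIn(host).map(_.group(1)).getOrElse("")
    // sub == "subdomain"
    // Step 3: remove it and the leading dot
    val registrable = host.stripPrefix(sub).stripPrefix(".")
    // registrable == "example.co.uk"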

1 Comment

Although I personally don't like this self-written regex approach for various reasons and would suggest avoiding such things in production code, you still have my upvote for the effort. First, it is very hard to create correctly. Second, it will be very hard to maintain the regex. Third, Guava is already doing this for you. The Guava implementation is not only maintained to follow the latest domain name specification, it also keeps track of all the ICANN public suffixes.
