
I am trying to extract domains from URLs.

Input:

    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    val b = Seq(
        ("subdomain.example.com/test.php"),
        ("example.com"),
        ("example.buzz"),
        ("test.example.buzz"),
        ("subdomain.example.co.uk"),
    ).toDF("raw_url")
    var c = b.withColumn("host", callUDF("parse_url", $"raw_url", lit("HOST"))).show()

Expected results:

    +--------------------------------+---------------+
    | raw_url                        | host          |
    +--------------------------------+---------------+
    | subdomain.example.com/test.php | example.com   |
    | example.com                    | example.com   | 
    | example.buzz                   | example.buzz  |
    | test.example.buzz              | example.buzz  |
    | subdomain.example.co.uk        | example.co.uk |
    +--------------------------------+---------------+

Any advice much appreciated.
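For context on why parse_url alone cannot produce the expected table: Spark's parse_url only yields a HOST when the input carries a scheme, and even then it returns the full host rather than the registrable domain. A minimal sketch (using expr so it also works on Spark versions where parse_url is not exposed as a Scala function):

    Seq("subdomain.example.com/test.php", "https://subdomain.example.com/test.php")
        .toDF("raw_url")
        .withColumn("host", expr("parse_url(raw_url, 'HOST')"))
        .show(false)
    // host is null for the scheme-less row and "subdomain.example.com" for the
    // other, so the subdomain still has to be stripped separately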

EDIT: based on the tip from @AlexOtt I have got a few steps closer.

    import com.google.common.net.InternetDomainName
    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    val b = Seq(
        ("subdomain.example.com/test.php"),
        ("example.com"),
        ("example.buzz"),
        ("test.example.buzz"),
        ("subdomain.example.co.uk"),
    ).toDF("raw_url")
    var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()

However, I clearly have not implemented it correctly with withColumn. Here is the error:

    error: not found: value topPrivateDomain
           var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()

EDIT 2:

Got some good pointers from @sarveshseri and after cleaning up some syntax errors, the following code is able to remove the subdomains from most of the URLs.

    import org.apache.spark.sql.functions.udf
    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    import com.google.common.net.InternetDomainName
    import java.net.URL

    val b = Seq(
       ("subdomain.example.com/test.php"),
       ("example.com"),
       //("example.buzz"),
       //("test.example.buzz"),
       ("subdomain.example.co.uk"),
       ).toDF("raw_url")

    val hostExtractUdf = org.apache.spark.sql.functions.udf { (urlString: String) =>
        val url = new URL("https://" + urlString)
        val host = url.getHost
        InternetDomainName.from(host).topPrivateDomain().name()
    }

    var c = b.select("raw_url")
        .withColumn("HOST", hostExtractUdf(col("raw_url")))
        .show(false)

However, it still does not work as expected. Newer suffixes such as .buzz, .site, and .today cause the following error:

Caused by: java.lang.IllegalStateException: Not under a public suffix: example.buzz
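One way to avoid this exception in the meantime is to guard the call (a sketch; it assumes falling back to the unmodified host is acceptable when Guava's bundled public-suffix list does not know the TLD, and safeTopPrivateDomain is a name I made up):

    import com.google.common.net.InternetDomainName

    // Fall back to the raw host when the TLD is not in Guava's
    // bundled public-suffix list (typical of older Guava versions).
    def safeTopPrivateDomain(host: String): String = {
        val idn = InternetDomainName.from(host)
        if (idn.isUnderPublicSuffix) idn.topPrivateDomain().toString
        else host // e.g. "example.buzz" on a Guava version that predates .buzz
    }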
  • you need to wrap some library that supports lookup via the public suffix list (publicsuffix.org); the rules are quite complicated. Commented Dec 13, 2021 at 18:42
  • Thanks for the tip @AlexOtt, that has somewhat helped. I found this stackoverflow.com/questions/45046265/…. However, I'm still stuck on how to apply InternetDomainName.from().topPrivateDomain to withColumn. Commented Dec 14, 2021 at 14:56

2 Answers


First, you will need to add Guava to the dependencies in build.sbt.

libraryDependencies += "com.google.guava" % "guava" % "31.0.1-jre"
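For context, a minimal build.sbt sketch (the project name and the Scala and Spark versions here are assumptions; match them to your cluster):

    name := "domain-extract"
    scalaVersion := "2.12.15"
    libraryDependencies ++= Seq(
      // "provided" because the Spark runtime supplies spark-sql itself
      "org.apache.spark" %% "spark-sql" % "3.2.0" % "provided",
      "com.google.guava" % "guava" % "31.0.1-jre"
    )

If you are on spark-shell rather than sbt, the equivalent is passing the artifact directly: spark-shell --packages com.google.guava:guava:31.0.1-jre. Note that Spark ships an older Guava on its own classpath, so a version conflict is possible; shading Guava in an assembly jar is the usual workaround.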

Now you can extract the host as follows,

import com.google.common.net.InternetDomainName
import org.apache.spark.sql.functions._

import java.net.URL

import spark.implicits._

val hostExtractUdf = org.apache.spark.sql.functions.udf { (urlString: String) =>
  val url = new URL("https://" + urlString)
  val host = url.getHost
  InternetDomainName.from(host).topPrivateDomain().toString
}

val b = sc.parallelize(Seq(
  ("a.b.com/c.php"),
  ("a.b.site/c.php"),
  ("a.b.buzz/c.php"),
  ("a.b.today/c.php"),
  ("b.com"),
  ("b.site"),
  ("b.buzz"),
  ("b.today"),
  ("a.b.buzz"),
  ("a.b.co.uk"),
  ("a.b.site")
)).toDF("raw_url")

val c = b.withColumn("HOST", hostExtractUdf(col("raw_url")))

c.show()

Output of c.show():

+---------------+-------+
|        raw_url|   HOST|
+---------------+-------+
|  a.b.com/c.php|  b.com|
| a.b.site/c.php| b.site|
| a.b.buzz/c.php| b.buzz|
|a.b.today/c.php|b.today|
|          b.com|  b.com|
|         b.site| b.site|
|         b.buzz| b.buzz|
|        b.today|b.today|
|       a.b.buzz| b.buzz|
|      a.b.co.uk|b.co.uk|
|       a.b.site| b.site|
+---------------+-------+

8 Comments

  • Is there any way to avoid using sbt?
  • You can use any other build tool like Maven, Gradle, Pants, Bazel, etc. to build your project.
  • Good to know. Up until now, I have been using :load from spark-shell.
  • I cleaned up some syntax errors and got your code (mostly) working. However, it's unable to handle newer suffixes like .buzz and .today and .site.
  • Yes, I had only added the missing pieces in the code and it was not complete code. Fixed that. What do you mean by "unable to handle .buzz, .today and .site"? You can see in the code that it handles these without issue.

Maybe you can use regex with Spark regexp_extract and regexp_replace built-in functions. Here's an example:

val c = b.withColumn(
  "HOST",
  regexp_extract(col("raw_url"), raw"^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)", 1)
).withColumn(
  "sub_domain",
  regexp_extract(col("HOST"), raw"(.*?)\.(?=[^\/]*\..{2,5})/?.*", 1)
).withColumn(
  "HOST",
  expr("trim(LEADING '.' FROM regexp_replace(HOST, sub_domain, ''))")
).drop("sub_domain")

c.show(false)
//+-----------------------------------+-------------+
//|raw_url                            |HOST         |
//+-----------------------------------+-------------+
//|subdomain.example.com/test.php     |example.com  |
//|example.com                        |example.com  |
//|example.buzz                       |example.buzz |
//|test.example.buzz                  |example.buzz |
//|https://www.subdomain.example.co.uk|example.co.uk|
//|subdomain.domain.buzz              |domain.buzz  |
//|dev.example.today                  |example.today|
//+-----------------------------------+-------------+

The first regexp_extract pulls the full host name from the URL (including the subdomain). Then, using the regex taken from this answer, we find the subdomain, replace it with a blank, and trim the leading dot.

I didn't test it for all possible cases, but it works fine for the examples given in your question.
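To see what the two regexes do outside Spark, here is a plain-Scala sketch (the patterns are copied from the snippet above, with stripPrefix standing in for the regexp_replace/trim step; the sample URL is my own):

    val hostPattern = raw"^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)".r
    val subPattern  = raw"(.*?)\.(?=[^\/]*\..{2,5})/?.*".r

    val rawUrl = "https://www.subdomain.example.co.uk/test.php"
    // Step 1: extract the full host (scheme, credentials, www. stripped)
    val host = hostPattern.findFirstMatchIn(rawUrl).map(_.group(1)).getOrElse(rawUrl)
    // host == "subdomain.example.co.uk"
    // Step 2: find the subdomain prefix
    val sub = subPattern.findFirstMatchIn(host).map(_.group(1)).getOrElse("")
    // sub == "subdomain"
    // Step 3: remove it and the leading dot
    val registrable = host.stripPrefix(sub).stripPrefix(".")
    // registrable == "example.co.uk"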

1 Comment

Although I personally don't like this self-written regex approach for various reasons and would suggest avoiding such things in production code, you still have my upvote for the effort. First, it is very hard to create correctly. Second, it will be very hard to maintain the regex. Third, Guava is already doing this for you. The Guava implementation is not only maintained to follow the latest domain name specification, it also keeps track of all the ICANN public suffixes.
