
Spark newbie here and hopefully you guys can give me some help. Thanks!

I am trying to extract a URL from a CSV file; the URL is located in the 16th column. The problem is that the URLs were written in a strange format, as you can see from the output of the code below. What is the best approach to get the URL in the correct format?

case class log(time_stamp: String, url: String)

val logText = sc.textFile("hdfs://...")
  .map(s => s.split(","))
  .map(s => log(s(0).replaceAll("\"", ""), s(15).replaceAll("\"", "")))
  .toDF()


logText.registerTempTable("log")

val results = sqlContext.sql("SELECT * FROM log")
results.map(s => "URL: " + s(1)).collect().foreach(println)

URL: /XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz
URL: /XX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz/xyz
URL: /XXX/YYY/ZZZ/http/www.domain.com/
URL: /VV/XXXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz
5 Comments

  • Let me get this right: you want to convert the extracted URLs from /XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz to a "normal" form like http://www.domain.com/xyz/xyz? Commented Jan 28, 2016 at 17:14
  • Do you always have the same "/XXX/YYY/ZZZ/" in front? Three sections of letters separated by a "/"? Commented Jan 28, 2016 at 17:20
  • No. The /XXX/YYY/ZZZ parts are pretty random, and there could be more than 3 sections. Commented Jan 28, 2016 at 18:42
  • More than 3 letters, or always 3 letters? Commented Jan 28, 2016 at 20:30
  • That could be random too... it could be /vvvvv/xx/yyyy/zzzzzz/http/........ Commented Jan 28, 2016 at 22:40

2 Answers


You can try regexp_replace:

import org.apache.spark.sql.functions.regexp_replace

val df = sc.parallelize(Seq(
  (1L, "/XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz"),
  (2L, "/XXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz/xyz")
)).toDF("id", "url")

df
  .select(regexp_replace($"url", "^(/\\w+){3}/(https?)/", "$2://").alias("url"))
  .show(2, false)

// +--------------------------------------+
// |url                                   |
// +--------------------------------------+
// |http://www.domain.com/xyz/xyz         |
// |https://sub.domain.com/xyz/xyz/xyz/xyz|
// +--------------------------------------+

In Spark 1.4 you can try the Hive UDF instead:

df.selectExpr("""regexp_replace(url, '^(/\w+){3}/(https?)/','$2://') AS url""")

If the number of sections before http(s) can vary, you can adjust the regexp by replacing {3} with a quantifier such as + or an explicit range. For example:
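Here is a minimal sketch of that variation (reusing the df defined above; the + quantifier assumes there is always at least one leading section before the scheme):

// allow one or more /section/ parts before http(s), instead of exactly three
df
  .select(regexp_replace($"url", "^(/\\w+)+/(https?)/", "$2://").alias("url"))
  .show(2, false)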


7 Comments

Thanks zero323. I am using Spark 1.4.1; does that mean regexp_replace is not available in this version?
Yes, it does. You can try the corresponding Hive UDF, but if you work with DataFrames you should upgrade to 1.5+ anyway.
Thanks, but I got the following error: java.util.NoSuchElementException: key not found: regexp_replace
Like I said, it is a Hive UDF. If it cannot be found, there is a good chance you aren't using a HiveContext.
Thanks zero323. As you said, I don't have a HiveContext, but I can get around it by using Scala's replaceAll function. However, your suggested regexp doesn't seem to match the first 3 sections of /xxx/yyy/zzz. I have been playing with it in different ways but still no luck. Do you have any idea? Thanks again.
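For reference, a minimal sketch of the replaceAll route mentioned in the last comment (same pattern as the accepted answer; note the double backslash needed inside a Scala string literal):

// plain Scala String.replaceAll, no HiveContext required
val raw = "/XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz"
raw.replaceAll("^(/\\w+){3}/(https?)/", "$2://")
// -> http://www.domain.com/xyz/xyz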

The question comes down to parsing the long strings and extracting the domain name. This solution works as long as none of the random segments (XXX, YYYYY, etc.) is itself "http" or "https":

def getUrl(data: String): Option[String] = {
  // look at consecutive path segments and take the one right after "http"/"https"
  val slidingPairs = data.split("/").sliding(2)
  slidingPairs.flatMap {
    case Array(x, y) => if (x == "http" || x == "https") Some(y) else None
    case _           => None
  }.toList.headOption
}

Here are some examples in the REPL:

scala> getUrl("/XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz")
res8: Option[String] = Some(www.domain.com)

scala> getUrl("/XXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz/xyz")
resX: Option[String] = Some(sub.domain.com)

scala> getUrl("/XXX/YYY/ZZZ/a/asdsd/asdase/123123213/https/sub.domain.com/xyz/xyz/xyz/xyz")
resX: Option[String] = Some(sub.domain.com)
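For completeness, a rough sketch of how getUrl might be plugged into the RDD pipeline from the question (reusing the log case class and column positions from there; note that this stores only the domain, since that is what getUrl returns):

// keep only rows where a scheme/domain pair was actually found
val cleanedLogs = sc.textFile("hdfs://...")
  .map(_.split(","))
  .flatMap { s =>
    getUrl(s(15).replaceAll("\"", "")).map { domain =>
      log(s(0).replaceAll("\"", ""), domain)
    }
  }
  .toDF()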

