
Spark newbie here and hopefully you guys can give me some help. Thanks!

I am trying to extract a URL from a CSV file; the URL is located in the 16th column. The problem is that the URLs were written in a strange format, as you can see from the output of the code below. What is the best approach to get the URL in the correct format?

case class log(time_stamp: String, url: String)

val logText = sc.textFile("hdfs://...")
  .map(s => s.split(","))
  .map(s => log(s(0).replaceAll("\"", ""), s(15).replaceAll("\"", "")))
  .toDF()


logText.registerTempTable("log")

val results = sqlContext.sql("SELECT * FROM log")
results.map(s => "URL: " + s(1)).collect().foreach(println)

URL: /XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz
URL: /XX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz/xyz
URL: /XXX/YYY/ZZZ/http/www.domain.com/
URL: /VV/XXXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz
5 Comments

  • Let me get this right: you want to convert the extracted URLs from /XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz to a "normal" form like http://www.domain.com/xyz/xyz? Commented Jan 28, 2016 at 17:14
  • Do you always have the same "/XXX/YYY/ZZZ/" in front? Three sections of letters separated by a "/"? Commented Jan 28, 2016 at 17:20
  • No. The /XXX/YYY/ZZZ parts are pretty random, and there could be more than 3 sections. Commented Jan 28, 2016 at 18:42
  • More than 3 letters, or always 3 letters? Commented Jan 28, 2016 at 20:30
  • That could be random too... it could be /vvvvv/xx/yyyy/zzzzzz/http/........ Commented Jan 28, 2016 at 22:40

2 Answers


You can try regexp_replace:

import org.apache.spark.sql.functions.regexp_replace

val df = sc.parallelize(Seq(
  (1L, "/XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz"),
  (2L, "/XXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz/xyz")
)).toDF("id", "url")

df
  .select(regexp_replace($"url", "^(/\\w+){3}/(https?)/", "$2://").alias("url"))
  .show(2, false)

// +--------------------------------------+
// |url                                   |
// +--------------------------------------+
// |http://www.domain.com/xyz/xyz         |
// |https://sub.domain.com/xyz/xyz/xyz/xyz|
// +--------------------------------------+

In Spark 1.4 you can try the Hive UDF instead:

df.selectExpr("""regexp_replace(url, '^(/\w+){3}/(https?)/','$2://') AS url""")

If the number of sections before http(s) can vary, you can adjust the regexp by replacing {3} with a quantifier such as + or an explicit range. For example:
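Here is a minimal sketch of that variation (reusing the df defined above; the + quantifier assumes there is always at least one leading section before the scheme):

// allow one or more /section/ parts before http(s), instead of exactly three
df
  .select(regexp_replace($"url", "^(/\\w+)+/(https?)/", "$2://").alias("url"))
  .show(2, false)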


7 Comments

Thanks zero323. I am using Spark 1.4.1; does that mean regexp_replace is not available in this version?
Yes, it does. You can try the corresponding Hive UDF, but if you work with DataFrames you should upgrade to 1.5+ anyway.
Thanks, but I got the following error: java.util.NoSuchElementException: key not found: regexp_replace
Like I said, it is a Hive UDF. If it cannot be found, there is a good chance you aren't using a HiveContext.
Thanks zero323. As you said, I don't have a HiveContext, but I can get around it by using Scala's replaceAll function. However, your suggested regexp doesn't seem to match the first 3 sections of /xxx/yyy/zzz. I have been playing with it in different ways but still no luck. Do you have any idea? Thanks again.
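For reference, a minimal sketch of the replaceAll route mentioned in the last comment (same pattern as the accepted answer; note the double backslash needed inside a Scala string literal):

// plain Scala String.replaceAll, no HiveContext required
val raw = "/XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz"
raw.replaceAll("^(/\\w+){3}/(https?)/", "$2://")
// -> http://www.domain.com/xyz/xyz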

The question comes down to parsing the long strings and extracting the domain name. This solution works as long as none of the random segments (XXX, YYYYY, etc.) is itself "http" or "https":

def getUrl(data: String): Option[String] = {
  // look at consecutive path segments and take the one right after "http"/"https"
  val slidingPairs = data.split("/").sliding(2)
  slidingPairs.flatMap {
    case Array(x, y) => if (x == "http" || x == "https") Some(y) else None
    case _           => None
  }.toList.headOption
}

Here are some examples in the REPL:

scala> getUrl("/XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz")
res8: Option[String] = Some(www.domain.com)

scala> getUrl("/XXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz/xyz")
resX: Option[String] = Some(sub.domain.com)

scala> getUrl("/XXX/YYY/ZZZ/a/asdsd/asdase/123123213/https/sub.domain.com/xyz/xyz/xyz/xyz")
resX: Option[String] = Some(sub.domain.com)
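For completeness, a rough sketch of how getUrl might be plugged into the RDD pipeline from the question (reusing the log case class and column positions from there; note that this stores only the domain, since that is what getUrl returns):

// keep only rows where a scheme/domain pair was actually found
val cleanedLogs = sc.textFile("hdfs://...")
  .map(_.split(","))
  .flatMap { s =>
    getUrl(s(15).replaceAll("\"", "")).map { domain =>
      log(s(0).replaceAll("\"", ""), domain)
    }
  }
  .toDF()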

