Spark newbie here and hopefully you guys can give me some help. Thanks!
I am trying to extract a URL from a CSV file; the URL is in the 16th column. The problem is that the URLs were written in a strange format, as you can see in the output of the code below. What is the best approach to get the URL in the correct format?
case class log(time_stamp: String, url: String )
val logText = sc.textFile("hdfs://...").map(s => s.split(",")).map( s => log(s(0).replaceAll("\"", ""),s(15).replaceAll("\"", ""))).toDF()
logText.registerTempTable("log")
val results = sqlContext.sql("SELECT * FROM log")
results.map(s => "URL: " + s(1)).collect().foreach(println)
This prints:
URL: /XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz
URL: /XX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz/xyz
URL: /XXX/YYY/ZZZ/http/www.domain.com/
URL: /VV/XXXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz
In other words, how can I convert a URL like /XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz to a "normal" form like http://www.domain.com/xyz/xyz?
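One possible approach is a regular expression that looks for the scheme as its own path segment. This is only a sketch: the names UrlNormalizer and normalizeUrl are made up for illustration, and it assumes that the first "/http/" or "/https/" segment always marks where the real URL begins and that everything before it can be dropped.

```scala
// Hypothetical helper, not part of Spark: rebuilds a URL from a path in which
// the scheme shows up as its own segment, e.g. ".../http/www.domain.com/...".
object UrlNormalizer {
  // Assumption: the first "/http/" or "/https/" segment starts the real URL.
  // Group 1 captures the scheme, group 2 captures the host and the rest of the path.
  private val Embedded = "^.*?/(https?)/(.+)$".r

  // Returns Some(url) when the pattern is found, None otherwise,
  // so unexpected rows can be handled explicitly by the caller.
  def normalizeUrl(raw: String): Option[String] = raw match {
    case Embedded(scheme, rest) => Some(s"$scheme://$rest")
    case _                      => None
  }
}
```

It could be called on s(15) inside the existing map, after the quotes are stripped, for example with getOrElse falling back to the raw value. Returning an Option rather than a String is a deliberate choice so that rows that do not match the assumed layout are not silently rewritten.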