
I have "a.txt" which is in csv format and is separated by tabs:

16777216    16777471        -33.4940    143.2104
16777472    16778239    Fuzhou  26.0614 119.3061

Then I run:

sc.textFile("path/to/a.txt").map(line => line.split("\t")).toDF("startIP", "endIP", "City", "Longitude", "Latitude")

Then I get:

java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (1): value
New column names (5): startIP, endIP, City, Longitude, Latitude
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.sql.Dataset.toDF(Dataset.scala:376)
  at org.apache.spark.sql.DatasetHolder.toDF(DatasetHolder.scala:40)
  ... 47 elided

If I just run:

res.map(line => line.split("\t")).take(2)

I get:

rdd: Array[Array[String]] = Array(Array(16777216, 16777471, "", -33.4940, 143.2104), Array(16777472, 16778239, Fuzhou, 26.0614, 119.3061))

What is wrong here?

3 Answers


As @user7881163 notes, the error occurs because your split produces a single column, which Spark names value (hence the name in the error message), whose contents are the array of tokens produced by the split.
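You can confirm this by inspecting the schema (a quick sketch; assumes import spark.implicits._ is in scope for toDF, as it is in spark-shell):

val df = sc.textFile("path/to/a.txt").map(_.split("\t")).toDF()
df.printSchema()
// root
//  |-- value: array (nullable = true)
//  |    |-- element: string (containsNull = true)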

However, per the comments from @zero323, if you are operating at scale, make sure you use the version of collect that @user7881163 uses (the one that takes a partial function), because the other, far more commonly used collect will move all your data to the driver and can overwhelm that machine. And if you aren't operating at scale, why use Spark at all?
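For reference, RDD has both methods; a minimal sketch of the difference, assuming rdd is the RDD[Array[String]] produced by the split:

val everything = rdd.collect() // action: materializes Array[Array[String]] on the driver
val startIPs = rdd.collect { case Array(ip, _*) => ip } // transformation: stays a distributed RDD[String]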

This is a slightly different approach that also allows for missing city data:

sc.textFile("path/to/a.txt")
  .map(_.split("\t"))
  .map {
      case Array(startIP, endIP, city, longitude, latitude) => (startIP, endIP, Some(city), longitude, latitude)
      case Array(startIP, endIP, longitude, latitude) => (startIP, endIP, None, longitude, latitude)
  }.toDF("startIP", "endIP", "City", "Longitude", "Latitude")

5 Comments

The collect transformation doesn't move any data to the driver. Also, making the pattern match exhaustive would be a good idea.
First, as noted in the documentation, collect is an action, not a transformation (which matters), and it does this: "Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data." That would be bad at scale, as noted.
Second, yes, pattern matches should be exhaustive. It didn't occur to me to veer outside the scope of the question into "Scala best practices"; maybe I should have. I only accounted for the data formats @derek indicated. So yeah, @derek, if you take this approach, make your pattern match exhaustive or manage exceptions properly with logging, Try, etc.
You're not talking about the same collect :) github.com/apache/spark/blob/…, which is the equivalent of Seq.collect in the Scala collections API.
Ahh, OK. I have never seen that version of collect anywhere, but yes, that one does keep things distributed. I've edited accordingly. Of course, this whole point isn't really germane to the question; it would have been more aligned with Stack Overflow guidelines to simply edit the answer to make it better than to engage in a long comment thread on a topic that isn't even essential to the question about the shape of the data being ingested.

Try:

sc
  .textFile("path/to/a.txt")
  .map(line => line.split("\t"))
  // Note the lowercase binders: capitalized names in a pattern are treated as
  // stable identifiers to match against, not as variables to bind
  .collect { case Array(startIP, endIP, city, longitude, latitude) =>
    (startIP, endIP, city, longitude, latitude)
  }.toDF("startIP", "endIP", "City", "Longitude", "Latitude")

or just use the csv source:

spark.read.option("delimiter", "\t").csv("path/to/a.txt")
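Note that without a header the csv source assigns default column names (_c0 through _c4), so you will likely want to rename them afterwards, for example:

spark.read.option("delimiter", "\t").csv("path/to/a.txt")
  .toDF("startIP", "endIP", "City", "Longitude", "Latitude")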

Your current code creates a DataFrame with a single column of type array<string>. This is why it fails when you pass 5 names.

1 Comment

This should be {case Array(...) => ... } not {case Seq(...) => ... }

You can try this example:

val dataDF = sc.textFile("filepath")
  .map(x => x.split('\t'))
  .map { case Array(a, b, c, d, e) => (a, b, c, d, e) } // tuple, so the columns become _1 ... _5
  .toDF()

val data = dataDF.selectExpr("_1 as startIP", "_2 as endIP", "_3 as City", "_4 as Longitude", "_5 as Latitude")
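The intermediate map to a tuple matters here: selectExpr("_1 as ...") relies on the default tuple column names _1 through _5, whereas calling toDF() directly on the split arrays would instead produce the single array-typed value column described in the other answers.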

