
I'm trying to open a text file, process each line and store the result in a multidimensional array.

My input file contains:

1 1 3 2  
2 2.2 3 1.8  
3 3 1.2 2.5   

and I want to create a 3x4 array like this:

(1, 1, 3, 2)  
(2, 2.2, 3, 1.8)  
etc

My code is:

for (line <- Source.fromFile(inputFile).getLines) {
  var counters = line.split("\\s+")
  sc.parallelize(counters).saveAsTextFile(outputFile)
}

I am trying to save the results to a text file, but first I got an exception while running:

org.apache.hadoop.mapred.FileAlreadyExistsException:
  Output directory file:/home/user/Desktop/output.txt already exists

I guess it is about the parallelize, but that was the only way I found to save an array.

Also, what is stored is not what I want. The output directory has two partition files that contain:

part1:

1  
1  

part2:

3  
2  

How can I create a multidimensional array from one dimension arrays and how can I save it in a text file?

1 Answer

You're creating a separate RDD (and saving it to a file) for each line, instead of one RDD for the entire file. Also, since you're using Spark to write the file (but see the disclaimers below), you'd benefit from using it to read the file as well. Here's how you can fix it:

sc.textFile(inputFile)
  .map(_.split("\\s+").mkString(",")) // if you want result to be comma-delimited
  .repartition(1) // if you want to make sure output has one partition (file)
  .saveAsTextFile(outputFile)
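To address the FileAlreadyExistsException without deleting the directory by hand each run, one option is to remove the existing output via Hadoop's FileSystem API before saving. This is a hedged sketch, not from the original answer; it assumes a live SparkContext named sc and that outputFile names the output directory:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Delete the output directory if a previous run left it behind;
// saveAsTextFile fails when its target already exists.
val fs = FileSystem.get(sc.hadoopConfiguration)
val target = new Path(outputFile)
if (fs.exists(target)) fs.delete(target, true) // true = recursive
```

Use this with care: it silently discards any previous results at that path.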

A few disclaimers though:

  • If the file is indeed relatively small (so you can load it with fromFile), why do you need Spark? Spark is usually meant for data that is too large for a single file or a single process's memory to handle
  • You'll have to make sure outputFile doesn't exist before you run this; otherwise you'll see the same exception (Spark is careful not to overwrite your data, so it fails if the output file, which is actually a folder, already exists)
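Following the first disclaimer: if the file really is small, plain Scala with no Spark at all can build the 3x4 array and write it out. A minimal sketch; the MatrixIO object, the parse helper, and the hard-coded sample rows are illustrative, not from the original post (in the real case you'd pass Source.fromFile(inputFile).getLines() to parse):

```scala
import java.io.PrintWriter

object MatrixIO {
  // Split each non-empty line on whitespace and parse the tokens as
  // doubles, producing one Array[Double] per input line.
  def parse(lines: Iterator[String]): Array[Array[Double]] =
    lines.map(_.trim)
      .filter(_.nonEmpty)
      .map(_.split("\\s+").map(_.toDouble))
      .toArray

  def main(args: Array[String]): Unit = {
    // Sample data standing in for Source.fromFile(inputFile).getLines()
    val matrix = parse(Iterator("1 1 3 2", "2 2.2 3 1.8", "3 3 1.2 2.5"))
    // Write one comma-delimited row per line
    val out = new PrintWriter("output.txt")
    try matrix.foreach(row => out.println(row.mkString(",")))
    finally out.close()
  }
}
```

Here matrix is an Array[Array[Double]], so matrix(1)(1) addresses row 1, column 1, and a single ordinary file is produced instead of a directory of partitions.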

