I have a CSV file that looks something like this:

A B C
1 2 
2 4
3 2 5
1 2 3
4 5 6

When I read this data into Spark, column C is treated as a string because of the blanks in the first few rows.

Could anybody please tell me how to load this file into a Spark SQL DataFrame so that column C remains an integer (or float)?

I'm using sc.textFile to read the data into Spark and then converting it into a SQL DataFrame.

I read this and this link, but they didn't help me much.

Here is the relevant part of my code. The error occurs on the last line.

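from pyspark.sql.types import StructType, StructField, StringType, FloatType

# sc is an existing SparkContext; the file name must be a quoted string
myFile = sc.textFile("myData.csv")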

header = myFile.first()
fields = [StructField(field_name, StringType(), True) for field_name in header.split(',')]
fields[0].dataType = FloatType()
fields[1].dataType = FloatType()
fields[2].dataType = FloatType()

schema = StructType(fields)

myFileCh = myFile.map(lambda k: k.split(",")).map(lambda p: (float(p[0]),float(p[1]),float(p[2])))

Thanks!

  • You would need to use pattern matching and cast to the desired type according to the content in C. Commented May 24, 2016 at 10:50
  • @z-star: Thanks for your comment! But I didn't understand what you mean. I'm following this (nodalpoint.com/…) method to convert my data into a SQL DataFrame. The issue arises when I try to create the "taxi_temp" part. In my dataset the last column is blank and I specified the data type as "float", so it says it can't convert "string" into "float". Commented May 24, 2016 at 11:31
  • Oh, OK. Can you please post your code? Commented May 24, 2016 at 11:45
  • I've updated the code snippet in the main question. Commented May 24, 2016 at 12:06
  • You split the data on commas, but there are no commas in what you posted as your data. Commented May 24, 2016 at 12:16

1 Answer

The issue is the unsafe cast: float() raises a ValueError when it is given a blank string. You could implement a short function that performs a "safe" cast and returns a default value when the cast to float fails.

def safe_cast(val, to_type, default=None):
    try:
        return to_type(val)
    except ValueError:
        return default

safe_cast('tst', float) # will return None
safe_cast('tst', float, 0.0) # will return 0.0

myFileCh = myFile.map(lambda k: k.split(",")).map(lambda p: (safe_cast(p[0], float),safe_cast(p[1], float),safe_cast(p[2], float)))
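
To check the behaviour without a Spark cluster, here is a minimal plain-Python sketch of the same idea. It assumes the file really is comma-separated (the sample in the question shows no commas), and the `parse_line` helper and the `lines` sample are hypothetical names introduced only for illustration. It also skips the header row and pads rows that omit the trailing field, two cases the one-liner above does not handle.

```python
def safe_cast(val, to_type, default=None):
    """Return to_type(val), or default when the conversion fails."""
    try:
        return to_type(val)
    except ValueError:
        return default


def parse_line(line, sep=",", width=3):
    """Split one line and cast each field to float (hypothetical helper)."""
    parts = [p.strip() for p in line.split(sep)]
    # Pad rows that omit trailing fields so indexing never fails.
    parts += [""] * (width - len(parts))
    return tuple(safe_cast(p, float) for p in parts[:width])


# Hypothetical sample mirroring the question's data, assumed comma-separated.
lines = ["A,B,C", "1,2,", "2,4", "3,2,5", "1,2,3", "4,5,6"]
header = lines[0]

# Drop the header row, then parse every remaining line.
rows = [parse_line(line) for line in lines if line != header]
print(rows)
```

With an RDD, the equivalent would presumably be `myFile.filter(lambda l: l != header).map(parse_line)`. The blank cells come out as None, which a nullable FloatType field should store as null.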