I have a CSV file with "" (empty value), "N/A", and "-" all in the same file. I want all of them to be read into the dataframe as nulls. I know there is an option in spark-csv, "nullValue", which lets me treat a single string as null, but that is not sufficient here since I have three different markers.
There is an open issue on spark-csv tracking this, https://github.com/databricks/spark-csv/issues/333
I was wondering about the most elegant way to work around the problem.
Use replaceAll on the raw lines to make your null markers uniform first, then parse the result as CSV with a single nullValue; see the sketch below.
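A minimal sketch of that idea, assuming Spark 2.2+ (where DataFrameReader.csv accepts a Dataset[String]) and a hypothetical path /path/to/data.csv; with the older spark-csv package you would normalize the lines the same way on an RDD and write them back out before parsing:

```scala
import org.apache.spark.sql.SparkSession

object CsvNullNormalizer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-null-normalizer")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Read the file as plain text first (path is hypothetical).
    val rawLines = spark.read.textFile("/path/to/data.csv")

    // Rewrite every standalone "N/A" or "-" field to an empty field,
    // so a single nullValue setting covers all three markers.
    // Note: this simple regex does not handle quoted fields that
    // legitimately contain "N/A" or "-".
    val normalized = rawLines.map(_.replaceAll("(^|,)(N/A|-)(?=,|$)", "$1"))

    // Parse the normalized lines as CSV; empty fields become null.
    val df = spark.read
      .option("header", "true")
      .option("nullValue", "")
      .csv(normalized)

    df.show()
    spark.stop()
  }
}
```

The advantage of normalizing before parsing is that you only deal with one null representation downstream; the alternative is to read everything as strings and replace "N/A" and "-" per column afterwards, which touches every column individually.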