
I have a CSV file that contains multiple fields. A few of the fields come in as byte literals (b'1234'). I want to remove the b'' wrapper (i.e., convert from bytes to string).

I have learned that we can convert bytes to a string in two ways:

>>> s1 = b'Hi'
>>> s2 = s1.decode('utf-8') 
>>> print(s2)
Hi


>>> s1 = b'Hi'
>>> s2 = str(s1, 'utf-8')
>>> print(s2)
Hi

As there are so many fields in the CSV, of which only a few contain byte literals, I can't directly apply a conversion function to each and every field. I have no idea which fields are byte literals and which are string or int fields.
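Since the wrapper is just text in the file (a field literally contains the characters b'1234'), one option is a small helper that unwraps a field only when it matches that pattern and leaves everything else alone. This is a plain-Python sketch, not Spark-specific, and the helper name strip_byte_literal is my own:

```python
import re

_BYTE_LITERAL = re.compile(r"^b'(.*)'$")

def strip_byte_literal(field):
    # If the field looks like a byte literal (b'...'), unwrap it;
    # otherwise return the field unchanged.
    m = _BYTE_LITERAL.match(field)
    return m.group(1) if m else field

print(strip_byte_literal("b'1234'"))   # -> 1234
print(strip_byte_literal("Hi"))        # -> Hi
print(strip_byte_literal("2018-02-19"))  # -> 2018-02-19
```

Because it checks each field individually, this works even when the schema changes and you don't know in advance which columns are affected.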

Any suggestions for converting the byte literals to strings in the CSV file? I'm trying to do this in Spark.

My code snippet:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true') \
    .option("delimiter", ",")\
    .option("multiLine", "true")\
    .load("file.csv")

Input Data:

b'1234',b'123',Hi,"Hello",b'2FB',b'272',b'4943',b'786',2018-02-19,,Out,768,"Data",b'502',351,

As the schema changes dynamically, we have no way of knowing which fields are byte literals and which are strings. I tried this solution, but it didn't work for me (it converted all the fields to NaN).

  • Bad parser usage (univocity — I think it's all rubbish)! How is the data processed without using the iterator? Why use a parser if you're going to iterate over it yourself? Deal with your own problem, instead of others' problems. Commented Feb 19, 2018 at 11:51
  • CSV_BUFF.split("\r\n").pop(0) == univocity ?? Commented Feb 19, 2018 at 11:54
  • I updated the code. I just wanted to show a sample; the actual request is converting byte literals to strings. I believe the parsing won't affect this, but I removed that parsing logic anyway. Commented Feb 19, 2018 at 16:17
  • You have a bad writing pattern (if you want to work by index, CSV should never allow empty elements; or if you want to handle blank field values, try collecting the data manually — a module only suppresses a few errors, not all of them). Commented Feb 21, 2018 at 7:32

1 Answer


As you said, you have a CSV file containing byte literals such as

b'1234',b'123',Hi,"Hello",b'2FB',b'272',b'4943',b'786',2018-02-19,,Out,768,"Data",b'502',351,

The straightforward solution I see is to replace the b' and ' substrings with empty strings, and then parse the data to form a dataframe.

rdd = sc.textFile("path to your csv file")\
    .map(lambda x: x.replace("b'", "").replace("'", ""))
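As a quick sanity check, the same replace chain can be run on a single raw line in plain Python, no Spark needed:

```python
# Demonstrates the replace chain used in the map() above
# on one raw CSV line from the question.
line = "b'1234',b'123',Hi,\"Hello\",b'2FB',b'272',b'4943',b'786',2018-02-19,,Out,768,\"Data\",b'502',351,"
cleaned = line.replace("b'", "").replace("'", "")
print(cleaned)
# -> 1234,123,Hi,"Hello",2FB,272,4943,786,2018-02-19,,Out,768,"Data",502,351,
```

One caveat: this removes every single quote on the line, so it would also mangle fields that legitimately contain apostrophes (e.g. O'Brien).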

Updated

As @ixaxaar commented

A better way is to do lambda x: x[2:-1]

So you can just do

rdd = sc.textFile("path to your csv file").map(lambda x: x[2:-1])
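Note that x[2:-1] slices the whole line, so it only fits records that consist of a single byte literal. For a line with mixed fields like the sample above, a regex that targets only the b'…' tokens is a safer sketch (plain Python shown; the same lambda would go inside map()):

```python
import re

line = "b'1234',b'123',Hi,\"Hello\",b'2FB',b'272',b'4943',b'786',2018-02-19,,Out,768,\"Data\",b'502',351,"
# Unwrap only b'...' tokens; other fields, including quoted
# strings and any apostrophes in the data, are left intact.
cleaned = re.sub(r"b'([^']*)'", r"\1", line)
print(cleaned)
# -> 1234,123,Hi,"Hello",2FB,272,4943,786,2018-02-19,,Out,768,"Data",502,351,
```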

2 Comments

A better way is to do lambda x: x[2:-1]
Thanks @ixaxaar :) Updated the answer.
