CSV header parsing in PySpark

Question

I am trying to read a CSV file as a DataFrame in Azure Databricks. When I open the file in Excel, every header name in the CSV file has the following format.

e.g.

"City_Name"ZYD_CABC2_EN:0TXTMD

Basically, I want to keep only the string within the quotes (City_Name) as my header and ignore the second part of the string (ZYD_CABC2_EN:0TXTMD).

sales_df = spark.read.format("csv").load(input_path + '/sales_2020.csv', inferSchema=True, header=True)

mck · Accepted Answer · 2021-02-10 11:58:50Z

2

You can parse the column names after reading in the csv file, using regular expressions to extract the words between the quotes, and then using toDF to reassign all column names at once:

import re

# sales_df = spark.read.format("csv")...

# Use sales_df.columns here; the name df is not defined in this snippet.
sales_df = sales_df.toDF(*[re.search('"(.*)"', c).group(1) for c in sales_df.columns])
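The one-liner above raises an AttributeError if any column name has no quotes, because re.search returns None in that case. Below is a minimal sketch of a more defensive variant in plain Python. The helper name extract_quoted and the sample column list are hypothetical, not part of the original answer, and the pattern is changed to stop at the first closing quote instead of the greedy "(.*)".

```python
import re

# Hypothetical helper, not part of the original answer.
# [^"]* stops at the first closing quote, unlike the greedy (.*).
_QUOTED = re.compile(r'"([^"]*)"')


def extract_quoted(name):
    """Return the text inside the first pair of double quotes.

    If the name contains no quoted part, return it unchanged
    instead of failing on a None match.
    """
    match = _QUOTED.search(name)
    return match.group(1) if match else name


# Sample header names (assumed for illustration only).
raw_columns = ['"City_Name"ZYD_CABC2_EN:0TXTMD', 'plain_column']
clean_columns = [extract_quoted(c) for c in raw_columns]
print(clean_columns)
```

With a real DataFrame, the same helper could be applied as sales_df.toDF(*[extract_quoted(c) for c in sales_df.columns]), assuming the cleaned names turn out to be unique.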

blackbishop · Accepted Answer · 2021-02-10 12:10:52Z

2

You can split each original name on the double-quote character (") and take the second element to get the desired column names:

sales_df = sales_df.toDF(*[c.split('"')[1] for c in sales_df.columns])
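Indexing with [1] fails with an IndexError when a name contains no quote character. A rough sketch of the same split idea with a guard follows. The helper name split_quoted and the sample names are assumptions made for illustration.

```python
def split_quoted(name):
    """Return the segment between the first two double quotes.

    Hypothetical helper: falls back to the original name when
    there are fewer than two quote characters.
    """
    parts = name.split('"')
    # Two quote characters produce at least three parts:
    # text before, text between, text after.
    return parts[1] if len(parts) >= 3 else name


# Sample header names (assumed for illustration only).
raw_columns = ['"City_Name"ZYD_CABC2_EN:0TXTMD', 'plain_column']
clean_columns = [split_quoted(c) for c in raw_columns]
print(clean_columns)
```

The result can then be passed to toDF in the same way as in the answer, for example sales_df.toDF(*[split_quoted(c) for c in sales_df.columns]).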
