
I am trying to read a CSV file into a DataFrame in Azure Databricks. When I open the file in Excel, every header column follows the format shown below.

e.g.

"City_Name"ZYD_CABC2_EN:0TXTMD

I want to keep only the string within the quotes as my header (City_Name) and drop the second part of the string (ZYD_CABC2_EN:0TXTMD). This is how I currently read the file:

sales_df = spark.read.format("csv").load(input_path + '/sales_2020.csv', inferSchema = True, header=True)

2 Answers


You can parse the column names after reading in the csv file, using regular expressions to extract the words between the quotes, and then using toDF to reassign all column names at once:

import re

# sales_df = spark.read.format("csv")...

sales_df = sales_df.toDF(*[re.search('"(.*)"', c).group(1) for c in sales_df.columns])
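
One caveat with the one-liner is that re.search returns None for a header that contains no quotes, so .group(1) would raise an AttributeError. A minimal sketch of a more defensive variant follows, written in plain Python so it runs without a Spark session. The helper name clean_column_name and the sample header 'plain_header' are hypothetical and used only for illustration.

```python
import re

# Matches the first double-quoted run in a raw header string.
QUOTED = re.compile(r'"([^"]*)"')

def clean_column_name(raw):
    """Return the text inside the first pair of double quotes.

    Falls back to the raw name when the header has no quoted part.
    """
    match = QUOTED.search(raw)
    return match.group(1) if match else raw

# 'plain_header' is a made-up example of a header without quotes.
raw_columns = ['"City_Name"ZYD_CABC2_EN:0TXTMD', 'plain_header']
clean_columns = [clean_column_name(c) for c in raw_columns]
print(clean_columns)
```

With a Spark DataFrame, the same helper could be applied as sales_df.toDF(*[clean_column_name(c) for c in sales_df.columns]).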

You can split the actual names using " to get the desired column names:

sales_df = sales_df.toDF(*[c.split('"')[1] for c in sales_df.columns])
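
To see why index 1 is the right element, it may help to look at what split returns for the sample header from the question. This is a small illustration in plain Python, assuming the header string reaches Python exactly as shown in the question:

```python
header = '"City_Name"ZYD_CABC2_EN:0TXTMD'

# Splitting on the double quote yields: the empty text before the first
# quote, the quoted name, and the trailing suffix.
parts = header.split('"')
print(parts)  # ['', 'City_Name', 'ZYD_CABC2_EN:0TXTMD']

column_name = parts[1]
print(column_name)  # City_Name
```

Note that a header without any quotes would split into a single element, so indexing with [1] would raise an IndexError in that case.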
