
I have a CSV data file containing commas within a column value. For example,

value_1,value_2,value_3  
AAA_A,BBB,B,CCC_C  

Here, the values are "AAA_A", "BBB,B", and "CCC_C". But when I split the line by comma, I get 4 values instead: "AAA_A", "BBB", "B", "CCC_C".

How do I get the right values when splitting the line by commas in PySpark?
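
For reference, a minimal sketch of the plain split that produces this (the file name is hypothetical):

lines = sc.textFile('data.csv')                # hypothetical file containing the rows above
fields = lines.map(lambda l: l.split(','))
# The data row comes back as ['AAA_A', 'BBB', 'B', 'CCC_C'], i.e. 4 values instead of 3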

1 Comment

How will you know which side the "B" in "BBB,B" should go?

3 Answers


Use the spark-csv package from Databricks.

Delimiters inside quoted values (double quotes, ("), by default) are ignored.

Example:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")

For more info, see https://github.com/databricks/spark-csv

If your quote character is (') instead of ("), you can configure it with the quote option.

EDIT:

For the Python API:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('cars.csv')
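
Following the note above about a (') quote character, here is a minimal sketch with the same Python API (the quote option is documented in the spark-csv README linked above):

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true', quote="'") \
    .load('cars.csv')

In Spark 2.0+, the built-in CSV reader offers equivalent options (for example, spark.read.option('header', 'true').csv('cars.csv')), so the external package is not needed there.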

Best regards.


3 Comments

The field values are not actually within quotes, so the delimiters are not within quotes either. That's the main reason I am not getting the output in the correct format. Whenever I try to split it using .map(lambda l: l.split(",")), it splits wherever it finds a delimiter.
I don't understand your question. How do you tell one value from the other? Is (BBB,B) all one value, or is it two values (BBB and B)?
BBB,B is all one value.

I'm (really) new to PySpark, but have been using Pandas for the past few years. What I'm going to put here might not ultimately be the best solution, but it works for me, so I think it's worth posting.

I encountered the same issue loading a CSV file with an extra comma embedded in one particular field, which triggered an error in PySpark but caused no problem in Pandas. So I looked around for a way to deal with this extra delimiter, and the following piece of code solved my issue:

df = sqlContext.read.format('csv').option('header','true').option('maxColumns','3').option('escape','"').load('cars.csv')

I personally like to force the 'maxColumns' parameter to allow only a specific number of columns. So if "BBB,B" somehow gets parsed into two strings, Spark will raise an error and print the whole offending line for you. The 'escape' option is the one that really fixed my issue. I don't know if this helps, but hopefully it's something to run experiments with.
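
A quick, illustrative way to check the result, using standard DataFrame methods:

df.printSchema()           # expect exactly 3 columns
df.show(truncate=False)    # "BBB,B" should come back as a single field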

1 Comment

Yes, same for me: the 'escape' option was the solution, whereas it was not working with the 'quote' option alone.

If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.

Dependencies:

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

Read the whole file at once into a Spark DataFrame:

sc = SparkContext('local','example')  # if using locally
sql_sc = SQLContext(sc)

pandas_df = pd.read_csv('file.csv')  # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2']) 
s_df = sql_sc.createDataFrame(pandas_df)

Or, to be more memory-conscious, you can chunk the data into Spark RDDs and then build the DataFrame:

chunk_100k = pd.read_csv('file.csv', chunksize=100000)  # iterate over the file in 100k-row chunks

for chunky in chunk_100k:
    Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
    try:
        # RDD addition is a union, so this appends the chunk to the running RDD
        Spark_full_rdd += Spark_temp_rdd
    except NameError:
        # first chunk: Spark_full_rdd does not exist yet
        Spark_full_rdd = Spark_temp_rdd
    del Spark_temp_rdd

Spark_DF = Spark_full_rdd.toDF(['column 1','column 2'])
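
A quick check of the result (illustrative only):

Spark_DF.printSchema()   # two columns: 'column 1' and 'column 2'
Spark_DF.show(5)         # inspect the first few rows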

1 Comment

Extra package dependency, and extra memory usage, since Pandas loads all the data onto the driver node...
