
I have a CSV data file containing commas within a column value. For example,

value_1,value_2,value_3  
AAA_A,BBB,B,CCC_C  

Here, the values are "AAA_A", "BBB,B", and "CCC_C". But when I split the line by comma, I get 4 values instead: "AAA_A", "BBB", "B", "CCC_C".

How do I get the right values when splitting the line by commas in PySpark?
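
For reference, a minimal sketch of the plain split that produces this (the file name is hypothetical):

lines = sc.textFile('data.csv')                # hypothetical file containing the rows above
fields = lines.map(lambda l: l.split(','))
# The data row comes back as ['AAA_A', 'BBB', 'B', 'CCC_C'], i.e. 4 values instead of 3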

1 Comment

How will you know which side the "B" in "BBB,B" should go?

3 Answers


Use the spark-csv package from Databricks.

Delimiters inside quoted values (double quotes, ("), by default) are ignored.

Example:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")

For more info, see https://github.com/databricks/spark-csv

If your quote character is (') instead of ("), you can configure it with the quote option.

EDIT:

For the Python API:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('cars.csv')
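
Following the note above about a (') quote character, here is a minimal sketch with the same Python API (the quote option is documented in the spark-csv README linked above):

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true', quote="'") \
    .load('cars.csv')

In Spark 2.0+, the built-in CSV reader offers equivalent options (for example, spark.read.option('header', 'true').csv('cars.csv')), so the external package is not needed there.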

Best regards.


3 Comments

The field values are not actually within quotes, so the delimiters are not within quotes either. That's the main reason I am not getting the output in the correct format. Whenever I try to split it using .map(lambda l: l.split(",")), it splits wherever it finds a delimiter.
I don't understand your question. How do you tell one value from the other? Is (BBB,B) all one value, or is it two values (BBB and B)?
BBB,B is all one value.

I'm (really) new to PySpark, but have been using Pandas for the past few years. What I'm going to put here might not ultimately be the best solution, but it works for me, so I think it's worth posting.

I encountered the same issue loading a CSV file with an extra comma embedded in one particular field, which triggered an error in PySpark but caused no problem in Pandas. So I looked around for a way to deal with this extra delimiter, and the following piece of code solved my issue:

df = sqlContext.read.format('csv').option('header','true').option('maxColumns','3').option('escape','"').load('cars.csv')

I personally like to force the 'maxColumns' parameter to allow only a specific number of columns. So if "BBB,B" somehow gets parsed into two strings, Spark will raise an error and print the whole offending line for you. The 'escape' option is the one that really fixed my issue. I don't know if this helps, but hopefully it's something to run experiments with.
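
A quick, illustrative way to check the result, using standard DataFrame methods:

df.printSchema()           # expect exactly 3 columns
df.show(truncate=False)    # "BBB,B" should come back as a single field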

1 Comment

Yes, same for me: the 'escape' option was the solution, whereas it was not working with the 'quote' option alone.

If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.

Dependencies:

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

Read the whole file at once into a Spark DataFrame:

sc = SparkContext('local','example')  # if using locally
sql_sc = SQLContext(sc)

pandas_df = pd.read_csv('file.csv')  # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2']) 
s_df = sql_sc.createDataFrame(pandas_df)

Or, to be more memory-conscious, you can chunk the data into Spark RDDs and then build the DataFrame:

chunk_100k = pd.read_csv('file.csv', chunksize=100000)  # iterate over the file in 100k-row chunks

for chunky in chunk_100k:
    Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
    try:
        # RDD addition is a union, so this appends the chunk to the running RDD
        Spark_full_rdd += Spark_temp_rdd
    except NameError:
        # first chunk: Spark_full_rdd does not exist yet
        Spark_full_rdd = Spark_temp_rdd
    del Spark_temp_rdd

Spark_DF = Spark_full_rdd.toDF(['column 1','column 2'])
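
A quick check of the result (illustrative only):

Spark_DF.printSchema()   # two columns: 'column 1' and 'column 2'
Spark_DF.show(5)         # inspect the first few rows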

1 Comment

Extra package dependency, and extra memory usage, since Pandas loads all the data onto the driver node...
