12

I have a 10GB csv file in hadoop cluster with duplicate columns. I try to analyse it in SparkR so I use spark-csv package to parse it as DataFrame:

  df <- read.df(
    sqlContext,
    FILE_PATH,
    source = "com.databricks.spark.csv",
    header = "true",
    mode = "DROPMALFORMED"
  )

But since df have duplicate Email columns, if I want to select this column, it would error out:

select(df, 'Email')

15/11/19 15:41:58 ERROR RBackendHandler: select on 1422 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
  org.apache.spark.sql.AnalysisException: Reference 'Email' is ambiguous, could be: Email#350, Email#361.;
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
...

I want to keep the first occurrence of Email column and delete the latter, how can I do that?

4 Answers 4

16

The best way would be to change the column name upstream ;)

However, it seems that is not possible, so there are a couple of options:

  1. If the case of the columns are different("email" vs "Email") you can turn on case sensitivity:

         sql(sqlContext, "set spark.sql.caseSensitive=true")
    
  2. If the column names are exactly the same, you will need to manually specify the schema and skip the first row to avoid the headers:

    customSchema <- structType(
    structField("year", "integer"), 
    structField("make", "string"),
    structField("model", "string"),
    structField("comment", "string"),
    structField("blank", "string"))
    
    df <- read.df(sqlContext, "cars.csv", source = "com.databricks.spark.csv", header="true", schema = customSchema)
    
Sign up to request clarification or add additional context in comments.

Comments

6

You can add a simple line when you start a spark session, after successfully create spark session add this line to set spark config...

spark.conf.set("spark.sql.caseSensitive", "true")

Comments

1

Try renaming the column.

You can select it by position instead of the select call.

colnames(df)[column number of interest] <- 'deleteme'

Alternatively you could just drop the column directly

 newdf <- df[,-x]

Where x is the column number you don't want.

Update:

If the above don't work, you could set header to false and then use the first row to rename columns:

  df <- read.df(
    sqlContext,
    FILE_PATH,
    source = "com.databricks.spark.csv",
    header = "FALSE",
    mode = "DROPMALFORMED"
  )

#get first row to use as column names
mycolnames <- df[1,]

#edit the dup column *in situ*
mycolnames[x] <- 'IamNotADup'
colnames(df) <- df[1,]

# drop the first row:
df <- df[-1,]

1 Comment

I tried both, but they all lead to the same "Reference 'Email' is ambiguous" error I mentioned in the question.
1

You can also create a new dataframe using toDF.

Here's the same thing, for pyspark: Selecting or removing duplicate columns from spark dataframe

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.