Duplicate columns in Spark Dataframe

Question

I have a 10GB csv file in hadoop cluster with duplicate columns. I try to analyse it in SparkR so I use spark-csv package to parse it as DataFrame:

  df <- read.df(
    sqlContext,
    FILE_PATH,
    source = "com.databricks.spark.csv",
    header = "true",
    mode = "DROPMALFORMED"
  )

But since df have duplicate Email columns, if I want to select this column, it would error out:

select(df, 'Email')

15/11/19 15:41:58 ERROR RBackendHandler: select on 1422 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
  org.apache.spark.sql.AnalysisException: Reference 'Email' is ambiguous, could be: Email#350, Email#361.;
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
...

I want to keep the first occurrence of Email column and delete the latter, how can I do that?

shoeboxer · Accepted Answer · 2015-11-20 01:52:41Z

16

The best way would be to change the column name upstream ;)

However, it seems that is not possible, so there are a couple of options:

If the case of the columns are different("email" vs "Email") you can turn on case sensitivity:
```
     sql(sqlContext, "set spark.sql.caseSensitive=true")
```

If the column names are exactly the same, you will need to manually specify the schema and skip the first row to avoid the headers:

customSchema <- structType(
structField("year", "integer"), 
structField("make", "string"),
structField("model", "string"),
structField("comment", "string"),
structField("blank", "string"))

df <- read.df(sqlContext, "cars.csv", source = "com.databricks.spark.csv", header="true", schema = customSchema)

answered Nov 20, 2015 at 1:52

shoeboxer

2581 silver badge8 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Hrushi · Accepted Answer · 2022-05-27 07:05:30Z

6

You can add a simple line when you start a spark session, after successfully create spark session add this line to set spark config...

spark.conf.set("spark.sql.caseSensitive", "true")

answered May 27, 2022 at 7:05

Hrushi

3351 gold badge5 silver badges16 bronze badges

Comments

Minnow · Accepted Answer · 2015-11-20 13:39:23Z

1

Try renaming the column.

You can select it by position instead of the select call.

colnames(df)[column number of interest] <- 'deleteme'

Alternatively you could just drop the column directly

 newdf <- df[,-x]

Where x is the column number you don't want.

Update:

If the above don't work, you could set header to false and then use the first row to rename columns:

  df <- read.df(
    sqlContext,
    FILE_PATH,
    source = "com.databricks.spark.csv",
    header = "FALSE",
    mode = "DROPMALFORMED"
  )

#get first row to use as column names
mycolnames <- df[1,]

#edit the dup column *in situ*
mycolnames[x] <- 'IamNotADup'
colnames(df) <- df[1,]

# drop the first row:
df <- df[-1,]

edited Nov 20, 2015 at 13:39

answered Nov 20, 2015 at 0:53

Minnow

1,8192 gold badges27 silver badges52 bronze badges

1 Comment

Bamqf Over a year ago

I tried both, but they all lead to the same "Reference 'Email' is ambiguous" error I mentioned in the question.

forumulator · Accepted Answer · 2018-09-06 13:25:52Z

1

You can also create a new dataframe using toDF.

Here's the same thing, for pyspark: Selecting or removing duplicate columns from spark dataframe

answered Sep 6, 2018 at 13:25

forumulator

88512 silver badges13 bronze badges

Collectives™ on Stack Overflow

Duplicate columns in Spark Dataframe

4 Answers 4

Comments

Comments

1 Comment

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Comments

Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related