  • Platform: IntelliJ IDEA 2018.2.4 (Community Edition)
  • SDK: 1.8.0_144
  • OS: Windows 7

As a recent graduate, I am on my first big data assignment and I am facing a problem:

Code

//Loading my csv file here
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter",";")
  .load("/user/sfrtech/dilan/yesterdaycsv.csv")
  .toDF()


//Select required columns
val formatedDf = df.select("`TcRun.ID`", "`Td.Name`", "`TcRun.Startdate`", "`TcRun.EndDate`", "`O.Sim.MsisdnVoice`", "`T.Sim.MsisdnVoice`", "`ErrorCause`")

//Sql on DF in order to get useful data
formatedDf.createOrReplaceTempView("yesterday")
val sqlDF = spark.sql("" +
  " SELECT TcRun.Id, Td.Name, TcRun.Startdate, TcRun.EndDate," +
  " SUBSTR(O.Sim.MsisdnVoice,7,14) as MsisdnO," +
  " SUBSTR(T.Sim.MsisdnVoice,7,14) as MsisdnT, ErrorCause" +
  " FROM yesterday" +
  " WHERE Td.Name like '%RING'" +
  " AND MsisdnO is not null" +
  " AND MsisdnT is not null" +
  " AND ErrorCause = 'NoError'")

Getting error

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'Td.Name' given input columns: [TcRun.EndDate, TcRun.Startdate, O.Sim.MsisdnVoice, TcRun.ID, Td.Name, T.Sim.MsisdnVoice, ErrorCause]; line 1 pos 177;

I guess the problem comes from my column names containing ".", but I don't know how to solve this, even though I'm using backticks.

Solution

val newColumns = Seq("id", "name", "startDate", "endDate", "msisdnO", "msisdnT", "error")
val dfRenamed = df.toDF(newColumns: _*)

dfRenamed.printSchema
// root
// |-- id: string (nullable = false)
// |-- name: string (nullable = false)
// |-- startDate: string (nullable = false)
// |-- endDate: string (nullable = false)
// |-- msisdnO: string (nullable = false)
// |-- msisdnT: string (nullable = false)
// |-- error: string (nullable = false)
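
If the column list changes often, the replacement names do not have to be hard-coded. A minimal sketch (assuming the same seven headers as in the question) that simply replaces the dots with underscores:

```scala
// Hypothetical helper: derive dot-free names from the original headers,
// so they can be fed to df.toDF(...) without manual renaming
val original = Seq("TcRun.ID", "Td.Name", "TcRun.Startdate", "TcRun.EndDate",
  "O.Sim.MsisdnVoice", "T.Sim.MsisdnVoice", "ErrorCause")

val sanitized = original.map(_.replace(".", "_"))
// Then rename in one go: val dfRenamed = df.toDF(sanitized: _*)

println(sanitized.mkString(", "))
// TcRun_ID, Td_Name, TcRun_Startdate, TcRun_EndDate, O_Sim_MsisdnVoice, T_Sim_MsisdnVoice, ErrorCause
```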
    If you solve your own problem, it's best to post it as an answer and mark that as accepted. Commented Nov 8, 2018 at 11:34

3 Answers


This worked,

val sqlDF = spark.sql("" +
  " SELECT `TcRun.Id`, `Td.Name`, `TcRun.Startdate`, `TcRun.EndDate`, ErrorCause" +
  " FROM yesterday" +
  " WHERE `Td.Name` like '%RING'" +
  " AND `O.Sim.MsisdnVoice` is not null" +
  " AND `T.Sim.MsisdnVoice` is not null" +
  " AND ErrorCause = 'NoError'")

When a field name contains a `.` character, wrap it in backticks in the SQL statement so Spark treats the dot as part of the name rather than as a struct-field accessor.
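
To avoid quoting by hand in every query, a small helper (an assumption, not part of the original answer) that backtick-quotes a name only when it contains a dot:

```scala
// Wrap a column name in backticks when it contains a dot, so Spark SQL
// treats the dot as part of the name instead of as a struct-field access
def quoteIfDotted(name: String): String =
  if (name.contains(".")) s"`$name`" else name

println(quoteIfDotted("Td.Name"))    // `Td.Name`
println(quoteIfDotted("ErrorCause")) // ErrorCause
```

This keeps the SQL strings readable while still handling the dotted headers from the question.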




For a column name that contains a .(dot), you can use the ` (backtick) character to enclose the column name:

df.select("`Td.Name`")

I faced similar issue and this solution worked for me.

Reference: DataFrame columns names conflict with .(dot)


// Define column names for the csv without "."
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Array(
  StructField("id", StringType, true),
  StructField("name", StringType, true)
  // etc.
))

// Load csv file without headers and specify your schema
// (note: with header = "false", the header line itself is read as a data row)
val df = spark.read
  .format("csv")
  .option("header", "false")
  .option("delimiter",";")
  .schema(schema)
  .load("/user/sfrtech/dilan/yesterdaycsv.csv")
  .toDF()

Then select your columns as you wish

// $-syntax requires: import spark.implicits._
df.select($"id", $"name" /* etc. */)


Indeed, the best solution was to rename the columns.
