  • Platform: IntelliJ IDEA 2018.2.4 (Community Edition)
  • SDK: 1.8.0_144
  • OS: Windows 7

As a recent graduate, I am on my first big data assignment and I am facing a problem:

Code

//Loading my csv file here
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter",";")
  .load("/user/sfrtech/dilan/yesterdaycsv.csv")
  .toDF()


//Select required columns
val formatedDf = df.select("`TcRun.ID`", "`Td.Name`", "`TcRun.Startdate`", "`TcRun.EndDate`", "`O.Sim.MsisdnVoice`", "`T.Sim.MsisdnVoice`", "`ErrorCause`")

//Sql on DF in order to get useful data
formatedDf.createOrReplaceTempView("yesterday")
val sqlDF = spark.sql("" +
  " SELECT TcRun.Id, Td.Name, TcRun.Startdate, TcRun.EndDate," +
  " SUBSTR(O.Sim.MsisdnVoice,7,14) as MsisdnO," +
  " SUBSTR(T.Sim.MsisdnVoice,7,14) as MsisdnT, ErrorCause" +
  " FROM yesterday" +
  " WHERE Td.Name like '%RING'" +
  " AND MsisdnO is not null" +
  " AND MsisdnT is not null" +
  " AND ErrorCause = 'NoError'")

Getting error

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'Td.Name' given input columns: [TcRun.EndDate, TcRun.Startdate, O.Sim.MsisdnVoice, TcRun.ID, Td.Name, T.Sim.MsisdnVoice, ErrorCause]; line 1 pos 177;

I guess the problem comes from my column names containing ".", but I don't know how to solve this, even though I'm using backticks.

Solution

val newColumns = Seq("id", "name", "startDate", "endDate", "msisdnO", "msisdnT", "error")
val dfRenamed = df.toDF(newColumns: _*)

dfRenamed.printSchema
// root
// |-- id: string (nullable = false)
// |-- name: string (nullable = false)
// |-- startDate: string (nullable = false)
// |-- endDate: string (nullable = false)
// |-- msisdnO: string (nullable = false)
// |-- msisdnT: string (nullable = false)
// |-- error: string (nullable = false)
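
If the column list changes often, the replacement names do not have to be hard-coded. A minimal sketch (assuming the same seven headers as in the question) that simply replaces the dots with underscores:

```scala
// Hypothetical helper: derive dot-free names from the original headers,
// so they can be fed to df.toDF(...) without manual renaming
val original = Seq("TcRun.ID", "Td.Name", "TcRun.Startdate", "TcRun.EndDate",
  "O.Sim.MsisdnVoice", "T.Sim.MsisdnVoice", "ErrorCause")

val sanitized = original.map(_.replace(".", "_"))
// Then rename in one go: val dfRenamed = df.toDF(sanitized: _*)

println(sanitized.mkString(", "))
// TcRun_ID, Td_Name, TcRun_Startdate, TcRun_EndDate, O_Sim_MsisdnVoice, T_Sim_MsisdnVoice, ErrorCause
```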
    If you solve your own problem, it's best to post it as an answer and mark that as accepted. Commented Nov 8, 2018 at 11:34

3 Answers


This worked,

val sqlDF = spark.sql("" +
  " SELECT `TcRun.Id`, `Td.Name`, `TcRun.Startdate`, `TcRun.EndDate`, ErrorCause" +
  " FROM yesterday" +
  " WHERE `Td.Name` like '%RING'" +
  " AND `O.Sim.MsisdnVoice` is not null" +
  " AND `T.Sim.MsisdnVoice` is not null" +
  " AND ErrorCause = 'NoError'")

When a field name contains a `.` character, wrap it in backticks in the SQL statement so Spark treats the dot as part of the name rather than as a struct-field accessor.
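
To avoid quoting by hand in every query, a small helper (an assumption, not part of the original answer) that backtick-quotes a name only when it contains a dot:

```scala
// Wrap a column name in backticks when it contains a dot, so Spark SQL
// treats the dot as part of the name instead of as a struct-field access
def quoteIfDotted(name: String): String =
  if (name.contains(".")) s"`$name`" else name

println(quoteIfDotted("Td.Name"))    // `Td.Name`
println(quoteIfDotted("ErrorCause")) // ErrorCause
```

This keeps the SQL strings readable while still handling the dotted headers from the question.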




For a column name that contains a .(dot), you can use the ` (backtick) character to enclose the column name:

df.select("`Td.Name`")

I faced similar issue and this solution worked for me.

Reference: DataFrame columns names conflict with .(dot)


// Define column names for the csv without "."
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Array(
  StructField("id", StringType, true),
  StructField("name", StringType, true)
  // etc.
))

// Load csv file without headers and specify your schema
// (note: with header = "false", the header line itself is read as a data row)
val df = spark.read
  .format("csv")
  .option("header", "false")
  .option("delimiter",";")
  .schema(schema)
  .load("/user/sfrtech/dilan/yesterdaycsv.csv")
  .toDF()

Then select your columns as you wish

// $-syntax requires: import spark.implicits._
df.select($"id", $"name" /* etc. */)


Indeed, the best solution was to rename the columns.
