I’m working on a data ingestion pipeline using Apache Spark (triggered via a Cloud Function on Dataproc). The input CSV contains column names that include special characters such as parentheses and a decimal point, for example:

Collection_time,Nems,Operator,Technology,Vendor,Site,Device No,Device Name,Subunit No,Subunit Name,Online Status,Actual Tilt(0.1degree),Actual Sector ID etc

While processing these columns, I define transformation formulas (stored in a BigQuery config table) such as:

COALESCE(`Actual Tilt(0.1degree)`, NULL)

However, Spark throws a parsing error during job execution:

Exception in thread "main" org.apache.spark.sql.AnalysisException:
[UNRESOLVED_COLUMN.WITH_SUGGESTION]
A column or function parameter with name `Actual Tilt(0`.`1degree)` cannot be resolved.

I also tried various escaping strategies like:

COALESCE([Actual Tilt(0.1degree)], NULL)
COALESCE("Actual Tilt(0.1degree)", NULL)
COALESCE(col("Actual Tilt(0.1degree)"), NULL)

but they fail with errors such as:

java.lang.IllegalArgumentException: Lexer Error: '')' expected but '[' found'
java.util.NoSuchElementException: key not found: col

I can only modify the SQL formula (string) stored in BigQuery — I cannot rename the source CSV column or modify the ingestion code directly.

How can I correctly reference the column Actual Tilt(0.1degree) inside a Spark SQL expression (e.g., COALESCE or SELECT) when I can only change the SQL formula string?

3 Comments

  • This question is similar to: How to escape column names with hyphen in Spark SQL. The error message also gives a hint: it uses backticks when referring to the "missing" column. Commented Nov 3 at 21:59
  • @Suhani Bhatia: What is the output of df.columns after you read the dataframe? Commented Nov 4 at 12:06
  • Double quotes with backticks: col("`Actual Tilt(0.1degree)`") Commented Nov 4 at 18:57
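
A note on why the backticks in the comment above matter: the error reports the name as `Actual Tilt(0`.`1degree)`, which is what one would expect if the name reached Spark without backticks and was split on the dot into a two-part (qualifier.column) name. The sketch below is a simplified Python model of that splitting, written for illustration only. `split_name_parts` is a hypothetical helper, not a Spark API, and Spark's real parser handles more cases.

```python
from typing import List


def split_name_parts(name: str) -> List[str]:
    """Simplified model (an assumption, not Spark code):
    split a column reference on dots that sit outside backticks."""
    parts: List[str] = []
    current: List[str] = []
    in_backticks = False
    i = 0
    while i < len(name):
        ch = name[i]
        if ch == "`":
            # Inside a quoted part, a doubled backtick is a literal backtick.
            if in_backticks and i + 1 < len(name) and name[i + 1] == "`":
                current.append("`")
                i += 1
            else:
                in_backticks = not in_backticks
        elif ch == "." and not in_backticks:
            # An unquoted dot ends the current name part.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


unquoted = split_name_parts("Actual Tilt(0.1degree)")
quoted = split_name_parts("`Actual Tilt(0.1degree)`")
print(unquoted)  # two parts, matching the shape of the error message
print(quoted)    # one part, the full column name
```

If the formula string is evaluated with something like expr or selectExpr, a backtick-quoted name should resolve as a single column. Whether the backticks survive the trip from the BigQuery config table through the ingestion code is an assumption worth checking, for example by logging the formula string just before it is evaluated.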

1 Answer

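Rename the column first:

# The source name must match the CSV header exactly; otherwise withColumnRenamed is a no-op.
df = df.withColumnRenamed("Actual Tilt(0.1degree)", "Tiltdegree")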

3 Comments

I cannot rename the column due to some constraints. Is there any other way to handle this?
Use Polars or DuckDB instead?
cast(case when Actual Tilt(0.1degree) = 'Invalid' then 0 else Actual Tilt(0.1degree) end as int): will this work?
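
As written, that last expression is unlikely to parse, because the unquoted name contains a space, parentheses, and a dot. A minimal sketch of the same formula with the column wrapped in backticks follows. `quote_identifier` is a hypothetical helper (doubling an embedded backtick is Spark SQL's escape rule for delimited identifiers), and the code only builds the string that would be stored in the config table; it does not run Spark.

```python
def quote_identifier(name: str) -> str:
    """Wrap a column name in backticks for Spark SQL,
    doubling any backtick that appears inside the name."""
    return "`" + name.replace("`", "``") + "`"


# Column name taken from the CSV header in the question.
column = quote_identifier("Actual Tilt(0.1degree)")

# Same CASE/CAST logic as the comment above, with the name quoted.
formula = (
    f"CAST(CASE WHEN {column} = 'Invalid' THEN 0 "
    f"ELSE {column} END AS INT)"
)
print(formula)
```

The printed string is what would go into the BigQuery formula column. This assumes the ingestion code passes the formula to Spark unchanged; if it strips or re-wraps backticks, the quoting would need to be adapted to whatever that code does.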
