I wrote the following regex:

val reg = ".+([A-Z_].+).(\\d{4})_(\\d{2})_(\\d{2})_(\\d{2})\\.orc".r 

which is supposed to parse strings such as "S3//bucket//TS11_YREDED.2018_09_28_02.orc". The parse method is:

val dataExtraction: String => Map[String, String] = {
  string: String => {
    string match {
      case reg(filename, year, month, day) =>
                 Map(FILE_NAME-> filename, YEAR -> year, MONTH -> month, DAY -> day)
      case _  => Map(FILE_NAME-> filename,YEAR -> "", MONTH -> "", DAY -> "")
    }
  }
}
val YEAR = "YEAR"
val MONTH = "MONTH"
val DAY = "DAY"
val FILE_NAME = "FILE_NAME"

But it doesn't work properly: it is supposed to omit the bucket name and parse the file name and the date.

So the expected output should be: Map(FILE_NAME-> TS11_YREDED, YEAR -> , MONTH -> 09, DAY -> 28). Any idea how to fix it, please?

2 Answers

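The leading .+ is greedy: it first consumes the whole string and then backtracks only as far as needed for the rest of the pattern to match. As a result, ([A-Z_].+) captures only the few characters left just before the date, not the whole file name.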
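
To make the backtracking concrete, here is a minimal sketch that runs the question's pattern unchanged against the sample string. The object and method names (GreedyDemo, firstGroup, fourVars) are made up for illustration. It also shows a second problem mentioned in the comments: the pattern has five capturing groups, so an extractor with four variables never matches.

```scala
object GreedyDemo {
  // The pattern from the question, unchanged (five capturing groups).
  val original = ".+([A-Z_].+).(\\d{4})_(\\d{2})_(\\d{2})_(\\d{2})\\.orc".r

  val sample = "S3//bucket//TS11_YREDED.2018_09_28_02.orc"

  // What group 1 captures, or None when the pattern does not match.
  def firstGroup(s: String): Option[String] = s match {
    case original(name, _, _, _, _) => Some(name)
    case _                          => None
  }

  // Four variables against a five-group regex: the case is never taken.
  def fourVars(s: String): Boolean = s match {
    case original(_, _, _, _) => true
    case _                    => false
  }

  def main(args: Array[String]): Unit = {
    println(firstGroup(sample)) // Some(ED)
    println(fourVars(sample))   // false
  }
}
```

The greedy .+ leaves only "ED" for group 1: one character for [A-Z_] and one for the inner .+, which is the minimum the rest of the pattern requires.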

You may use

"""(?:.*/)?(.*)\.(\d{4})_(\d{2})_(\d{2})_\d{2}\.orc""".r

Note that the dot must be escaped to match a literal dot.

Details

  • (?:.*/)? - an optional non-capturing group: any 0+ chars other than linebreak chars, as many as possible, up to and including the last /
  • (.*) - Capturing group 1: any 0+ chars, other than linebreak chars, as many as possible
  • \. - a dot
  • (\d{4}) - Capturing group 2: four digits
  • _ - an underscore
  • (\d{2}) - Capturing group 3: two digits
  • _ - an underscore
  • (\d{2}) - Capturing group 4: two digits
  • _\d{2}\.orc - _, 2 digits, . and orc at the end of the string.
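
The breakdown above can be checked with a small sketch. The helper name groups and the object GroupsDemo are hypothetical; the pattern is the one given in this answer. Because the path prefix is optional, a bare file name should match as well, while a name missing the last two-digit field should not.

```scala
object GroupsDemo {
  // Pattern from this answer: optional path prefix, then name and date parts.
  val reg = """(?:.*/)?(.*)\.(\d{4})_(\d{2})_(\d{2})_\d{2}\.orc""".r

  // Returns the four captured groups, or None when the input does not match.
  def groups(s: String): Option[List[String]] = s match {
    case reg(name, year, month, day) => Some(List(name, year, month, day))
    case _                           => None
  }

  def main(args: Array[String]): Unit = {
    // With a bucket prefix: everything up to the last / is skipped.
    println(groups("S3//bucket//TS11_YREDED.2018_09_28_02.orc"))
    // Without any prefix: the optional group matches nothing.
    println(groups("TS11_YREDED.2018_09_28_02.orc"))
    // Missing the trailing two-digit field: no match.
    println(groups("TS11_YREDED.2018_09_28.orc"))
  }
}
```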

Scala demo:

val text = "S3//bucket//TS11_YREDED.2018_09_28_02.orc"
val reg = """(?:.*/)?(.*)\.(\d{4})_(\d{2})_(\d{2})_\d{2}\.orc""".r

// Map keys; these never change, so val is enough.
val YEAR = "YEAR"
val MONTH = "MONTH"
val DAY = "DAY"
val FILE_NAME = "FILE_NAME"

val dataExtraction: String => Map[String, String] = {
  string: String => {
    string match {
      case reg(filename, year, month, day) =>
                 Map(FILE_NAME-> filename, YEAR -> year, MONTH -> month, DAY -> day)
      case _  => Map(FILE_NAME-> FILE_NAME,YEAR -> YEAR, MONTH -> MONTH, DAY -> DAY)
    }
  }
}

println(dataExtraction(text))
// => Map(FILE_NAME -> TS11_YREDED, YEAR -> 2018, MONTH -> 09, DAY -> 28)

Since you are not using the last capturing group, it can be omitted from the pattern.

5 Comments

I'd use """^([A-Z0-9_]+)\.(\d{4})_(\d{2})_(\d{2})_(\d{2})\.orc$""".r so that it matches the entire input (rather than just a portion of it) and considers FILE_NAME fields consisting of uppercase, numeric and underscore characters. Also, if the last number is to be ignored, it shouldn't be grouped in the regex (use """^([A-Z0-9_]+)\.(\d{4})_(\d{2})_(\d{2})_\d{2}\.orc$""".r instead), and don't add the s argument to reg in the case statement.
@MikeAllen val reg = """(.*?)\.(\d{4})_(\d{2})_(\d{2})_(\d{2})\.orc""".r matches the whole string, it cannot match any portions of text since it is used in a match block. .unanchored would make it find partial matches.
Agreed, but there's no harm in being explicit in the regex definition. :-) BTW, [A-Z_].+ is wrong for reasons other than those you gave: it doesn't include numeric characters (present in the example), and the period following the ] ensures that only ONE character in the range is matched, followed by one or more other characters - which could be anything. The period should be removed. Also, the OP's original code has 5 regex groups, but only matches 4 of them in the case statement.
@MikeAllen Let OP ask what they need to clarify. I explained how to make it work. The rest is details.
@Wiktor, I updated the question; your regex won't work in this case.
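
The anchoring point raised in the comments can be checked with a short sketch. The names AnchorDemo, matchesWhole and matchesPart are made up for illustration, and the ".bak" input is an invented example. A Regex used in a match block must match the whole input, while .unanchored accepts a partial match.

```scala
object AnchorDemo {
  // Pattern quoted in the comments above.
  val reg = """(.*?)\.(\d{4})_(\d{2})_(\d{2})_(\d{2})\.orc""".r

  // Same pattern, but allowed to match a portion of the input.
  val loose = reg.unanchored

  // True only when the entire input matches the pattern.
  def matchesWhole(s: String): Boolean = s match {
    case reg(_*) => true
    case _       => false
  }

  // True when the pattern matches anywhere inside the input.
  def matchesPart(s: String): Boolean = s match {
    case loose(_*) => true
    case _         => false
  }

  def main(args: Array[String]): Unit = {
    val exact = "TS11_YREDED.2018_09_28_02.orc"
    val extra = exact + ".bak" // hypothetical input with a trailing suffix
    println(matchesWhole(exact)) // true
    println(matchesWhole(extra)) // false
    println(matchesPart(extra))  // true
  }
}
```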

Check this out:

val file_name = "TS11_YREDED.2018_09_28_02.orc"
val reg = """(.*?)\.(\d{4})_(\d{2})_(\d{2})_(\d{2})\.orc""".r
var file_details = scala.collection.mutable.ArrayBuffer[String]()
reg.findAllIn(file_name).matchData.foreach( m => file_details.appendAll(m.subgroups))
val names=Array("FILE_NAME","YEAR","MONTH","DAY","DUMMY")
for( (x,y) <- names.zip(file_details).toMap)
  println(x + "->" + y)

//DUMMY->02
//DAY->28
//FILE_NAME->TS11_YREDED
//MONTH->09
//YEAR->2018
