2

I have a free-form text file (not XML) from which I would like to extract the lines between two patterns. Here is the sample data:

<Hi>
col1 col2 col3
1 2 3 
4 5 6
helo how are 

<How>
col1 col2
1 2 
helo hi'

I want to capture the data between each tag and the following blank line as a single string: the lines between <Hi> and the blank line as one string, and the lines between <How> and the blank line as another.

The regex patterns I have tried so far did not work:

val pattern = "^<Hi>(.*)\\n"
val pattern = "^<Hi>(.*)\\s*$"
val pattern = "^<Hi>(.*)"
val pattern = "^<Network>(.*)((\\r\\n|\\n|\\r)$)|(^(\\r\\n|\\n|\\r))|^\\s*$"

Is there a way to specify a pattern for the blank line? Any help is appreciated.

4
  • It is a freeform text. Not an XML Commented May 20, 2019 at 7:43
  • I did mention. Thanks. Commented May 20, 2019 at 7:46
  • Try searching for the "DOTALL" modifier for regular expressions. Have a go with that and if you get stuck, update the question. Also, can you post the code that you are using to execute the regular expression and print the results. Commented May 20, 2019 at 7:48
  • You can try something like this <Hi>([\s\S]+)(?=^$) Demo Commented May 20, 2019 at 7:52
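To illustrate the DOTALL hint from the comments: by default `.` does not match line terminators, which is why a pattern like `<Hi>(.*)` captures nothing beyond the tag line. Below is a minimal Scala sketch, assuming `\n` line endings. The object name `DotAllDemo`, the helper `firstGroup`, and the shortened sample are made up for illustration.

```scala
object DotAllDemo {
  // Shortened sample input, assumed to use "\n" line endings.
  val sample: String = "<Hi>\ncol1 col2 col3\n1 2 3\n\n<How>\ncol1 col2\n"

  // Without DOTALL, '.' stops at the first line terminator.
  val withoutDotAll = "<Hi>(.*)".r

  // With (?s), '.' also matches newlines; the lazy quantifier plus the
  // lookahead stop the capture at the first blank line (or end of input).
  val withDotAll = "(?s)<Hi>\\n(.*?)(?=\\n\\n|$)".r

  // Hypothetical helper: returns capture group 1 of the first match, if any.
  def firstGroup(re: scala.util.matching.Regex, text: String): Option[String] =
    re.findFirstMatchIn(text).map(_.group(1))

  def main(args: Array[String]): Unit = {
    println(firstGroup(withoutDotAll, sample))
    println(firstGroup(withDotAll, sample))
  }
}
```

With this sample, the first pattern matches but its capture is empty, while the second captures both data lines of the <Hi> block.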

3 Answers

4

Use this instead: [^\>]+(?=\n{2,}|$|\<). Remember to use the global flag to find all matches. You can take a look at the explanation here:

https://regexr.com/4e9c1
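Scala uses java.util.regex, which has no global flag; `findAllIn` iterates over every match instead. Here is a rough sketch of how the pattern above could be applied, assuming `\n` line endings. `GlobalMatchDemo`, `blocks`, and the sample (the question's data with trailing spaces dropped) are illustrative only. With this pattern each match includes the newlines around the block, so the sketch trims them.

```scala
object GlobalMatchDemo {
  // Sample from the question, joined with "\n"; trailing spaces omitted.
  val sample: String = List(
    "<Hi>", "col1 col2 col3", "1 2 3", "4 5 6", "helo how are",
    "",
    "<How>", "col1 col2", "1 2", "helo hi'"
  ).mkString("\n")

  // The pattern from this answer, unchanged.
  private val block = """[^\>]+(?=\n{2,}|$|\<)""".r

  // findAllIn plays the role of the global flag: it yields every match in turn.
  // Each match carries surrounding newlines, so trim and drop empty results.
  def blocks(text: String): List[String] =
    block.findAllIn(text).map(_.trim).filter(_.nonEmpty).toList

  def main(args: Array[String]): Unit =
    blocks(sample).foreach(b => println(s"[$b]"))
}
```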


4

You can use this regex and read your data from group 1:

<[^>]+>\s*([\w\W]*?(?=\n\n|$))

Regex Demo

Explanation:

  • <[^>]+>\s* - Match the tag with <[^>]+>, followed by optional whitespace with \s*
  • ([\w\W]*? - Capture any characters, including newlines, in a non-greedy manner
  • (?=\n\n|$)) - Positive lookahead that stops the match as soon as it sees two consecutive newlines or the end of the string
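In Scala this might look as follows. This is only a sketch, assuming `\n` line endings: it uses a small variation of the regex above that adds a capture group for the tag name, so each block comes back as a (tag, body) pair. `TaggedBlocks`, `parse`, and the sample (the question's data with trailing spaces dropped) are hypothetical names and data.

```scala
object TaggedBlocks {
  // Sample from the question, joined with "\n"; trailing spaces omitted.
  val sample: String = List(
    "<Hi>", "col1 col2 col3", "1 2 3", "4 5 6", "helo how are",
    "",
    "<How>", "col1 col2", "1 2", "helo hi'"
  ).mkString("\n")

  // Variation of the regex above: group 1 = tag name, group 2 = body.
  // The lookahead ends the body at the first blank line or at end of input.
  private val tagged = """<([^>]+)>\s*([\w\W]*?)(?=\n\n|$)""".r

  // Returns one (tag, body) pair per block, in document order.
  def parse(text: String): List[(String, String)] =
    tagged.findAllMatchIn(text).map(m => (m.group(1), m.group(2))).toList

  def main(args: Array[String]): Unit =
    parse(sample).foreach { case (tag, body) => println(s"$tag:\n$body\n") }
}
```

If the file uses Windows line endings, the `\n\n` in the lookahead would need to account for `\r\n` as well.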


3

Solution in code.

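import scala.io.Source

// Read the whole file up front, then release the file handle.
val src = Source.fromFile("so.txt")
val text = try src.mkString finally src.close()

// (?s) lets '.' match newlines; the lazy .+? stops at the
// first blank line or at the end of the input.
"(?s)>\\s*(.+?)(?=\n\n|$)".r
  .findAllMatchIn(text)
  .map(_.group(1))
  .mkString("->", "<-\n->", "<-")
//res0: String =
//->col1 col2 col3
//1 2 3
//4 5 6
//helo how are <-
//->col1 col2
//1 2
//helo hi'<-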
