Regex expression for multiple patterns in 1 line

Question

I am scraping a log from which I need 3 elements per entry. An added difficulty is that I am parsing the log via readLine() in my Java program, i.e. one line at a time. (If there is a way to read multiple lines when parsing, let me know :) ) NOTE: I have no control over the log output format.

There are 2 possibilities of what I must extract. Either the log is nice and gives the following

NICE FORMAT

.text.rank     0x0000000000400b8f      0x351 is_x86.o

where I must grab .text.rank , 0x0000000000400b8f , and 0x351

Now the not-so-nice case: if the name is too long, it bumps everything else to the next line, as shown below. The only thing left after the first element is then one blank space followed by a newline (\n), which gets clobbered by readLine() anyway.

EVIL FORMAT : Note each line is in a separate arraylist entry.

.text.__sfmoreglue 
            0x0000000000401d00       0x55 /mnt/drv2homelibc_popcorn.a(lib_a-findfp.o)

Therefore what the regex actually sees is:

.text.__sfmoreglue

CORNER CASE FORMAT that also occurs within the log but I DO NOT want

 *(.text.unlikely)

Finally, below is the Pattern I am currently using for the first line; pline2 is what is used on the next line when group 2 of the first line is empty.

UPDATE: The pattern below works for the NICE FORMAT and the EVIL FORMAT, but now pattern pline2 has no matches, even though on regex101.com it is correct. Link: https://regex101.com/r/vS7vZ3/9

UPDATE2: I fixed it, I forgot to add m2.find() once I compiled the second line with Pattern pline2. Corrected code is below.

Pattern p = Pattern.compile("^[ \\s](\\.[tex]*\\.[\\._\\-\\@a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*).*");

Pattern pline2 = Pattern.compile("^\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*)\\s*[\\w\\(\\)\\.\\-]*");

To give a little background: I first match the name .text.whatever into m.group(1), then the address 0x000012345 into m.group(2), and finally the size 0xa48 into m.group(3). This all assumes the log is in the NICE format. If it is in the EVIL format, I see that group(2) is empty, so I read the next line of the log into a temp buffer and apply the second pattern pline2 to that new line.

Can someone help me with the regex? Is there a way I can make sure my current line (or even better, just the second grouping) is either the NICE FORMAT or is empty?

As requested my java code:

//1st line pattern
Pattern p = Pattern.compile("^[ \\s](\\.[tex]*\\.[\\._\\-\\@a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*).*");
//conditional 2nd line pattern
Pattern pline2 = Pattern.compile("^\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*)\\s*[\\w\\(\\)\\.\\-]*");
while((temp = br1.readLine()) != null){
        Matcher m = p.matcher(temp);
        while(m.find()){
            System.out.println("What regex finds: m1:"+m.group(1)+"#    m2:"+m.group(2)+"#    m3:"+m.group(3));
            if(!m.group(1).isEmpty() && m.group(2).isEmpty() && m.group(3).isEmpty()){
                //means we probably hit a long symbol name and important stuff is on the next line
                //save the name at least
                name = m.group(1);
                //read and utilize the next line
                if((temp = br1.readLine()) == null){
                    return;
                }
                System.out.println("EVILline2:"+temp); //sanity check the input 
                System.out.println(pline2.toString()); //sanity check the regex
                Matcher m2= pline2.matcher(temp);
                while(m2.find()){
                       System.out.println("regex line2 finds: m1:"+m2.group(1));//+"#    m2:"+m2.group(2));
                       if(m2.group(2).isEmpty()){
                             size = 0;
                       }else{
                             size = Long.parseLong(m2.group(2).replaceFirst("0x", ""),16);
                       }

                       addr = Long.parseLong(m2.group(1).replaceFirst("0x", ""),16);
                       System.out.println("#########LONG NAME: "+name+"    addr:"+addr+"    size:"+size);
                  }
            }//end if
            else{ // assume in NICE FORMAT
                //do nice format stuff.
            }//end else
        }//end while
}//end outerwhile

As an aside, the output I currently get:

line: .text.c_print_results
What regex finds: m1:.text.c_print_results#    m2:#    m3:
EVIL FORMATline2:                0x00000000004001e6      0x231 c_print_results_x86.o
^\s*([x0-9a-f]*)[ \s]*([x0-9a-f]*)\s*[\w\(\)\.\-]*
Exception in thread "main" java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Matcher.java:536)
at java.util.regex.Matcher.group(Matcher.java:496)
at regexTest.regex.grabSymbolsInRange(regex.java:143)
at regexTest.regex.main(regex.java:489)

Separate your concerns: create one regex for the nice form and a different regex for the evil form. When you've got them both working, add in a conditional to choose between them, based on something like empty group 2 or whitespace at start of line.
– Paul Hicks, Commented Apr 30, 2015 at 23:41
@PaulHicks I am doing what you say to some degree, and the question is how do I accommodate 2 patterns (NICE FORMAT or just empty after the name) in 1 pattern.
– kPalladyn, Commented May 1, 2015 at 15:47
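
A minimal sketch of the approach suggested in the first comment, for illustration only: one pattern per form, and a conditional that picks between them. The class and pattern names are hypothetical, the sketch assumes every wanted name starts with .text. (the question mentions other prefixes exist), and it assumes the lines have already been read into a list rather than pulled from readLine() one at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; none of these names come from the original post.
final class TwoPatternParser {
    // NICE form: name, address and size all on one line.
    static final Pattern NICE =
        Pattern.compile("^\\s*(\\.text\\.\\S+)\\s+(0x[0-9a-f]+)\\s+(0x[0-9a-f]+)");
    // EVIL form, first line: only the name, optionally followed by whitespace.
    static final Pattern NAME_ONLY = Pattern.compile("^\\s*(\\.text\\.\\S+)\\s*$");
    // EVIL form, second line: leading whitespace, then address and size.
    static final Pattern ADDR_SIZE =
        Pattern.compile("^\\s+(0x[0-9a-f]+)\\s+(0x[0-9a-f]+)");

    /** Returns one "name addr size" string per symbol (addr and size in decimal). */
    static List<String> parse(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            Matcher nice = NICE.matcher(lines.get(i));
            if (nice.find()) {
                out.add(format(nice.group(1), nice.group(2), nice.group(3)));
                continue;
            }
            Matcher nameOnly = NAME_ONLY.matcher(lines.get(i));
            if (nameOnly.find() && i + 1 < lines.size()) {
                Matcher rest = ADDR_SIZE.matcher(lines.get(i + 1));
                if (rest.find()) {
                    out.add(format(nameOnly.group(1), rest.group(1), rest.group(2)));
                    i++; // the continuation line has been consumed
                }
            }
            // anything else, e.g. " *(.text.unlikely)", is skipped
        }
        return out;
    }

    private static String format(String name, String addr, String size) {
        // substring(2) drops the "0x" prefix that both patterns require
        return name + " " + Long.parseLong(addr.substring(2), 16)
                    + " " + Long.parseLong(size.substring(2), 16);
    }
}
```

Because each pattern is anchored with ^ and requires the literal .text. prefix, the corner-case line " *(.text.unlikely)" matches neither form and is ignored.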

Rodrigo López · Accepted Answer · 2015-05-01 00:39:38Z


You have a few issues with your pattern.

1st is the separation of the first and second groups (that's why group 2 is returning null).
You have 4 groups and you need 3.
After capturing your 3 values you can stop matching, so the pattern after the last group isn't necessary.
On regex101 you need the global modifier (g) so that it returns all matches. Java has no such flag: call Matcher.find() in a loop instead, and keep /g out of the pattern string, where it would be matched literally.

So, instead of your posted regex, you can try:

(\\.[tex]*\\.[\\._\\-\\@a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]+([x0-9a-f]*)

Tested on Regex101.com:

https://regex101.com/r/lM4bQ9/1

Other than that, a few suggestions:

If you know your text is going to start with text, just put it in the pattern; don't use [tex]*, which requires extra work from the engine and also accepts things like "tt" or an empty string.
[ \s] is the same thing as \s.
[\._\-\@a-zA-Z0-9]*, from what I understood, is basically everything but whitespace, so why not just use [^\s]*?

With these in mind, I would suggest this pattern instead (written as a Java string literal, so without a trailing /g):

(\\.text\\.[^\\s]*)\\s*([x0-9a-f]*)\\s+([x0-9a-f]*)
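
As a rough illustration of how the suggested pattern behaves in Java, here is a small sketch. The class and method names are made up for the example, and the pattern is compiled without any regex101-style /g suffix, since Java has no such flag and would try to match those two characters literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern itself comes from the answer.
final class SuggestedPatternDemo {
    static final Pattern P =
        Pattern.compile("(\\.text\\.[^\\s]*)\\s*([x0-9a-f]*)\\s+([x0-9a-f]*)");

    /** Returns {name, address, size} for the first match, or null if nothing matches. */
    static String[] firstMatch(String line) {
        Matcher m = P.matcher(line);
        if (!m.find()) {   // find() must succeed before group() may be called
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] g = firstMatch(".text.rank     0x0000000000400b8f      0x351 is_x86.o");
        System.out.println(String.join(" | ", g));
    }
}
```

One thing this makes visible: because of the \s+ between groups 2 and 3, a name-only line matches (with groups 2 and 3 empty) only if the trailing space from the log is still present. If the line has been trimmed, find() returns false, so the caller needs to handle that case as well.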

edited May 1, 2015 at 0:39

answered May 1, 2015 at 0:28


5 Comments

kPalladyn Over a year ago

First, Thanks again for your timely response! If I use your first input pattern, nothing matches to my log, but I agree that your solution looks correct. Also I simplified the [tex]* of the problem because there are other starting words I am also looking for. I'm being explicit in my coding and not simplifying as a sanity check.

Rodrigo López Over a year ago

Post your java code, maybe the regex is not the problem, it should have matched because it matches in the test tool

kPalladyn Over a year ago

I have updated the question to include my Java Code, and also updated to have my latest patterns. FYI I did tweak your pattern a little to get rid of a corner case I was having. Just to reiterate Pattern p now works but Pattern pline2 does not.

kPalladyn Over a year ago

Found the error! I forgot to call find() again on the matcher for the 2nd line Pattern! As soon as I put it in, it was scraping the info. Thanks again for taking a look at my question!

Rodrigo López Over a year ago

I'm glad I could help you somehow buddy, Cheers
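
The fix described in the comments above (calling find() on the second matcher before reading its groups) can be sketched as follows. The pattern is pline2 as posted in the question; the class and method names are hypothetical. A freshly created Matcher has not attempted any match yet, so group() throws IllegalStateException until find() or matches() has succeeded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class built around pline2 from the question.
final class FindBeforeGroupDemo {
    static final Pattern PLINE2 = Pattern.compile(
        "^\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*)\\s*[\\w\\(\\)\\.\\-]*");

    /** Returns true if calling group() without find() throws IllegalStateException. */
    static boolean groupWithoutFindThrows(String line) {
        Matcher m = PLINE2.matcher(line);
        try {
            m.group(1); // no match attempted yet, so there is nothing to read
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    /** Returns {address, size} parsed as hex, or null if the line is unusable. */
    static long[] parseSecondLine(String line) {
        Matcher m = PLINE2.matcher(line);
        if (!m.find() || m.group(1).isEmpty()) {
            return null; // every part of pline2 is optional, so guard empty groups
        }
        try {
            long addr = Long.parseLong(m.group(1).replaceFirst("0x", ""), 16);
            long size = m.group(2).isEmpty()
                ? 0
                : Long.parseLong(m.group(2).replaceFirst("0x", ""), 16);
            return new long[] { addr, size };
        } catch (NumberFormatException e) {
            return null; // the loose [x0-9a-f]* class can capture non-hex text
        }
    }
}
```

Using if (m.find()) rather than while (m.find()) is a choice made for this sketch: the pattern is anchored with ^, so at most one match per line is expected.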
