I am scraping a log from which I need three elements. An added difficulty is that I parse the log with readLine() in my Java program, i.e. one line at a time. (If there is a way to read multiple lines when parsing, let me know :) ) NOTE: I have no control over the log output format.

There are two possible forms of what I must extract. In the nice case, the log gives the following:

NICE FORMAT

.text.rank     0x0000000000400b8f      0x351 is_x86.o

where I must grab .text.rank, 0x0000000000400b8f, and 0x351.

Now the not-so-nice case: if the name is too long, it bumps everything else to the next line, as shown below. The only thing after the first element is then a single space followed by a newline (\n), which readLine() strips anyway.

EVIL FORMAT: Note that each line is a separate ArrayList entry.

.text.__sfmoreglue 
            0x0000000000401d00       0x55 /mnt/drv2homelibc_popcorn.a(lib_a-findfp.o)

Therefore what the regex actually sees is:

.text.__sfmoreglue

CORNER CASE FORMAT that also occurs within the log but that I DO NOT want

 *(.text.unlikely)

Finally, below is the Pattern I currently use for the first line; pline2 is used on the next line when group 2 of the first line is empty.

UPDATE: The pattern below works for the NICE FORMAT and the EVIL FORMAT, but now pattern pline2 has no matches, even though it is correct on regex101.com. Link: https://regex101.com/r/vS7vZ3/9

UPDATE2: I fixed it. I forgot to call m2.find() after creating the matcher for the second line from Pattern pline2. Corrected code is below.

Pattern p = Pattern.compile("^[ \\s](\\.[tex]*\\.[\\._\\-\\@a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*).*");

Pattern pline2 = Pattern.compile("^\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*)\\s*[\\w\\(\\)\\.\\-]*");

To give a little background: I first match the name .text.whatever into m.group(1), then the address 0x000012345 into m.group(2), and finally the size 0xa48 into m.group(3). This all assumes the log is in the NICE format. If it is in the EVIL format, I see that group(2) is empty, so I read the next line of the log into a temp buffer and apply the second pattern pline2 to that new line.

Can someone help me with the regex? Is there a way I can make sure my current line (or even better, just the second grouping) is either the NICE FORMAT or is empty?

As requested, my Java code:

//1st line pattern
Pattern p = Pattern.compile("^[ \\s](\\.[tex]*\\.[\\._\\-\\@a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*).*");
//conditional 2nd line pattern
Pattern pline2 = Pattern.compile("^\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*)\\s*[\\w\\(\\)\\.\\-]*");
while ((temp = br1.readLine()) != null) {
    Matcher m = p.matcher(temp);
    while (m.find()) {
        System.out.println("What regex finds: m1:"+m.group(1)+"#    m2:"+m.group(2)+"#    m3:"+m.group(3));
        if (!m.group(1).isEmpty() && m.group(2).isEmpty() && m.group(3).isEmpty()) {
            //means we probably hit a long symbol name and the important stuff is on the next line
            //save the name at least
            name = m.group(1);
            //read and use the next line
            if ((temp = br1.readLine()) == null) {
                return;
            }
            System.out.println("EVILline2:"+temp); //sanity check the input
            System.out.println(pline2.toString()); //sanity check the regex
            Matcher m2 = pline2.matcher(temp);
            while (m2.find()) { //find() must succeed before group() may be called
                System.out.println("regex line2 finds: m1:"+m2.group(1));//+"#    m2:"+m2.group(2));
                if (m2.group(2).isEmpty()) {
                    size = 0;
                } else {
                    size = Long.parseLong(m2.group(2).replaceFirst("0x", ""), 16);
                }

                addr = Long.parseLong(m2.group(1).replaceFirst("0x", ""), 16);
                System.out.println("#########LONG NAME: "+name+"    addr:"+addr+"    size:"+size);
            }
        }//end if
        else { // assume in NICE FORMAT
            //do nice format stuff.
        }//end else (this closing brace was missing)
    }//end while
}//end outerwhile

As an aside, the output I currently get:

line: .text.c_print_results
What regex finds: m1:.text.c_print_results#    m2:#    m3:
EVIL FORMATline2:                0x00000000004001e6      0x231 c_print_results_x86.o
^\s*([x0-9a-f]*)[ \s]*([x0-9a-f]*)\s*[\w\(\)\.\-]*
Exception in thread "main" java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Matcher.java:536)
at java.util.regex.Matcher.group(Matcher.java:496)
at regexTest.regex.grabSymbolsInRange(regex.java:143)
at regexTest.regex.main(regex.java:489)
  • Separate your concerns: create one regex for the nice form and a different regex for the evil form. When you've got them both working, add in a conditional to choose between them, based on something like empty group 2 or whitespace at start of line. Commented Apr 30, 2015 at 23:41
  • @PaulHicks I am doing what you say to some degree; the question is how to accommodate 2 patterns (NICE FORMAT or just empty after the name) in 1 pattern. Commented May 1, 2015 at 15:47

1 Answer

You have a few issues with your pattern.

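  • First, the separation between the first and second groups is wrong (that's why group 2 is returning null).
  • You have 4 groups and you need only 3.
  • After capturing your 3 values you can stop matching, so the part of the pattern after the last group isn't necessary.
  • On regex101 you need the global modifier (g) so that it returns all matches. Java has no such flag; repeated calls to Matcher.find() do the same job.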

So, instead of your posted Regex, you can try:

(\\.[tex]*\\.[\\._\\-\\@a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]+([x0-9a-f]*)

Tested on Regex101.com:

https://regex101.com/r/lM4bQ9/1

Other than that, a few suggestions:

  • If you know your text is going to start with text, just put it in the pattern literally. Don't use [tex]*, which makes the engine do extra work and also accepts any other run of the letters t, e and x.
  • [ \s] is the same as \s.
  • [\._\-\@a-zA-Z0-9]* is, from what I understood, basically everything but whitespace, so why not just use [^\s]*?
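
The three suggestions above can be checked with a few calls to Pattern.matches. This is only an illustrative sketch; the class name and the helper full() are made up here. Note that [^\s]* is somewhat broader than the original character class, since it also accepts characters such as ( and ).

```java
import java.util.regex.Pattern;

class CharClassDemo {
    /** True if the whole input matches the regex. */
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // [tex]* accepts any run of the letters t, e, x, not only "text".
        System.out.println(full("\\.[tex]*\\.", ".text."));
        System.out.println(full("\\.[tex]*\\.", ".xxe."));
        System.out.println(full("\\.text\\.", ".xxe."));

        // [ \s] and \s accept the same characters.
        System.out.println(full("[ \\s]+", " \t ") == full("\\s+", " \t "));

        // [^\s]* is broader than the explicit class: it also accepts parentheses.
        System.out.println(full("[^\\s]*", "a(b)"));
        System.out.println(full("[\\._\\-\\@a-zA-Z0-9]*", "a(b)"));
    }
}
```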

With these in mind, I would suggest using this pattern instead:

(\\.text\\.[^\\s]*)\\s*([x0-9a-f]*)\\s+([x0-9a-f]*)
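
As a rough sketch of how this could be wired into a readLine() loop, the code below handles both the one-line and the two-line form. It is not the exact pattern above but a variant under a few assumptions: the address and size always carry a lowercase 0x prefix, the continuation line directly follows the name, and the line is anchored with ^ so that entries like *(.text.unlikely) are skipped. The class name, method name and output format are invented for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MapSymbolSketch {
    // Name, optionally followed on the same line by address and size (assumed variant).
    static final Pattern FIRST =
        Pattern.compile("^\\s*(\\.text\\.\\S*)(?:\\s+(0x[0-9a-f]+)\\s+(0x[0-9a-f]+))?");
    // Continuation line of a long name: address and size only.
    static final Pattern SECOND =
        Pattern.compile("^\\s*(0x[0-9a-f]+)\\s+(0x[0-9a-f]+)");

    static List<String> parse(BufferedReader in) throws IOException {
        List<String> out = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = FIRST.matcher(line);
            if (!m.find()) continue;           // e.g. " *(.text.unlikely)" is skipped
            String name = m.group(1), addr = m.group(2), size = m.group(3);
            if (addr == null) {                // long name: values are on the next line
                String next = in.readLine();
                if (next == null) break;
                Matcher m2 = SECOND.matcher(next);
                if (!m2.find()) continue;      // simplification: the consumed line is dropped
                addr = m2.group(1);
                size = m2.group(2);
            }
            long a = Long.parseLong(addr.substring(2), 16); // strip "0x", parse as hex
            long s = Long.parseLong(size.substring(2), 16);
            out.add(name + " addr=" + a + " size=" + s);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String log = " .text.rank     0x0000000000400b8f      0x351 is_x86.o\n"
                   + " *(.text.unlikely)\n"
                   + " .text.__sfmoreglue \n"
                   + "            0x0000000000401d00       0x55 /mnt/drv2homelibc_popcorn.a(lib_a-findfp.o)\n";
        for (String s : parse(new BufferedReader(new StringReader(log)))) {
            System.out.println(s);
        }
    }
}
```

Making the address and size an optional non-capturing group means group(2) is null, rather than empty, when they are missing, which gives a clear signal to read the next line.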

5 Comments

First, thanks again for your timely response! If I use your first pattern, nothing in my log matches, but I agree that your solution looks correct. Also, I simplified the [tex]* part for this question because there are other starting words I am also looking for. I'm being explicit in my code rather than simplifying, as a sanity check.
Post your Java code; maybe the regex is not the problem. It should have matched, because it matches in the test tool.
I have updated the question to include my Java code and my latest patterns. FYI, I tweaked your pattern a little to get rid of a corner case I was having. Just to reiterate: Pattern p now works but Pattern pline2 does not.
Found the error! I forgot to call m.find() again for the 2nd line Pattern! As soon as I put it in, it was scraping the info. Thanks again for taking a look at my question!
I'm glad I could help you somehow buddy, Cheers
