
I need to extract the groups that match my regex in C, in order to process the logs of a Java program.

I have tested the Regex:

(Client:\s[a-zA-Z\s]+)|(Wallet:\s[a-zA-Z0-9]+)|(ID\s*:\s*[0-9]{3}.{0,1}[0-9]{3}.{0,1}[0-9]{3}-{0,1}[0-9]{2})

here and it works.

But in my C program, it does not behave the same way.

#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
  const char *source =
      "[com.example.app.JavaClass.JavaMethod(JavaClass.java:1)] (Thread-1) - "
      "Client: FirstName MiddleName AnotherName LastName, Wallet: WL01, "
      "Agency: 9999, ID: 06611486123, Ticket: TKR211";
  const char *regexString =
      "(Client:\\s[a-zA-Z[:space:]]+)|(Wallet:\\s[a-zA-Z0-9]+)|(ID\\s*:\\s*[0-"
      "9]{3}.{0,1}[0-9]{3}.{0,1}[0-9]{3}-{0,1}[0-9]{2})";

  regex_t regexCompiled;

  regcomp(&regexCompiled, regexString, REG_ICASE | REG_EXTENDED);

  size_t ngroups = regexCompiled.re_nsub + 1;
  regmatch_t *groups = malloc(ngroups * sizeof(regmatch_t));

  regexec(&regexCompiled, source, ngroups, groups, 0);

  size_t nmatched;
  for (nmatched = 0; nmatched < ngroups; nmatched++) {
    /* regoff_t is signed; a group that did not participate has rm_so == -1. */
    if (groups[nmatched].rm_so == -1) {
      break;
    }

    size_t length = (size_t)(groups[nmatched].rm_eo - groups[nmatched].rm_so);
    /* One extra zeroed byte keeps the copy NUL-terminated for printf. */
    char *match = calloc(length + 1, sizeof(char));
    if (match == NULL) {
      break;
    }
    memcpy(match, &source[groups[nmatched].rm_so], length);
    printf("Match: [%2d-%2d]: \"%s\"\n", (int)groups[nmatched].rm_so,
           (int)groups[nmatched].rm_eo, match);
    free(match);
  }
  free(groups);
  regfree(&regexCompiled);

  return 0;
}

Executing:

$ gcc -Wall -Wextra -Wwrite-strings reg.c && ./a.out

Generates the output:

Match: [70-119]: "Client: FirstName MiddleName AnotherName LastName"
Match: [70-119]: "Client: FirstName MiddleName AnotherName LastName"

But what I want is:

Match: [xx-xx]: "Client: FirstName MiddleName AnotherName LastName"
Match: [xx-xx]: "Wallet: WL01"
Match: [xx-xx]: "ID: 06611486123"

Can someone tell me whether this is possible in C, or whether I need another approach?

edit:

In my case, some of the fields ("Client", "Wallet" or "ID") may be missing from the log.

  • The first difference I notice is the inclusion of [:space:] in the C source. Read up on group 0. Commented Jul 25, 2021 at 5:38
  • (What is the intention with .{0,1}?) Commented Jul 25, 2021 at 5:43
  • ([possible?] C is considered a universal programming language.) Commented Jul 25, 2021 at 5:50
  • What did you try in way of "debugging"/closely observing program execution? Commented Jul 25, 2021 at 16:26

1 Answer

Your regex is composed like this: (a)|(b)|(c) where a, b, and c correspond to the Client regex, the Wallet regex, and the ID regex.

This is not what you want. You can see in your own RegExr test that you are not getting one match with three groups, but three separate matches, one per alternative. In C, a single call to regexec finds only the first of those matches.

What you're really trying to accomplish is matching your source string only once, and having each of the groups contain their string. In other words, we want to change your regex:

(a)|(b)|(c) -> (a),(b),(c) - one single match matching the entirety of the string.

This does the trick:

const char *regexString =
    "(Client:\\s[a-zA-Z[:space:]]+), (Wallet:\\s[a-zA-Z0-9]+).*(ID\\s*:\\s*[0-"
    "9]{3}.{0,1}[0-9]{3}.{0,1}[0-9]{3}-{0,1}[0-9]{2})";

I changed the first | to ", " (a comma and a space), which is what separates the Client and Wallet substrings, and I changed the second | to .*, which consumes everything between the Wallet and ID substrings.

Running this now gives:

Match: [70-164]: "Client: FirstName MiddleName AnotherName LastName, Wallet: WL01, Agency: 9999, ID: 06611486123"
Match: [70-119]: "Client: FirstName MiddleName AnotherName LastName"
Match: [121-133]: "Wallet: WL01"
Match: [149-164]: "ID: 06611486123"

The first line gives you the entire match, while the following lines give you the contents of each separate group.



2 Comments

My problem is that some of the fields ("Client", "Wallet" or "ID") may be missing. For example, if the log line is "Client: FirstName MiddleName AnotherName LastName, Agency: 9999, ID: 06611486123", the program crashes with a segmentation fault.
@MatheusRodriguesGuimaraes In that case, you need to restructure your program: call regexec in a loop and keep track of rm_eo on each iteration, then begin the next search at source + rm_eo (and make sure that offset is maintained correctly throughout the loop). You won't be able to get what you want with a single call to regexec.
