I want to extract strings from an HTTP header such as GET http://www.example.com HTTP/1.1 using a regex. I use this pattern: ^([A-Za-z]+)(\s+)(http?):\/\/(.*)(\s+)(HTTP\/)([0-9].[0-9]) and it works well, splitting the input into GET, http://www.example.com and HTTP/1.1. But when I use this pattern in C, the escaped slashes do not work (i.e., \/\/ is not recognized in C). How can I do this? Or is there a better pattern for extracting strings from an HTTP header?

  • You probably don't need any of those backslashes at all. Please clarify your question, and include a short example of actual code that illustrates the problem. You should also state which regex library you're using. We aren't mind readers, you know. Commented Aug 1, 2016 at 20:22
  • What is your exact regex string? const char *str_regex = "([A-Za-z]*) *(http?://.*) *(HTTP/[0-9][.][0-9])" seems to work well. Commented Aug 1, 2016 at 20:49
  • If I use const char *str_regex = "^([A-Za-z]+)(\\s+)(http?):\/\/(.*)(\\s+)(HTTP\/)([0-9].[0-9])", I get "GET", " ", "http", "www.example.com", " ", "HTTP/", "1.1", "ET example.com HTTP/1.1" captures. The \/ can be replaced with /s, but the most important is to use \\s. Commented Aug 1, 2016 at 20:55
  • GET http://www.example.com HTTP/1.1 - this is actually a request line (not a header). An HTTP header is a name followed by a colon (':') and then its value. wiki Commented Jun 21, 2017 at 14:15

1 Answer

Note that you do not need to escape a forward slash when using the POSIX regex API in C: regcomp takes the pattern as a plain string with no regex delimiters, so / has no special meaning.

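All you need is to size the regmatch_t array and the size_t match count correctly, double the backslash inside the C string literal if you keep the \s shorthand character class (write "\\s"), and pass the REG_EXTENDED flag to regcomp. Keep in mind that \s is an extension that some implementations such as glibc accept; [[:space:]] is the portable POSIX spelling.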
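To illustrate those points, here is a minimal sketch of a validity check that uses no backslashes at all. The helper name is_request_line is hypothetical, and the pattern is my own variation: it uses [[:space:]] instead of \s, and https? (an assumption, so that https URLs are accepted too) rather than the http? from the question.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of the original answer): returns 1 if
   text looks like an HTTP request line with an absolute URL, else 0. */
int is_request_line(const char *text)
{
  /* '/' is an ordinary character for regcomp, so it is not escaped.
     [[:space:]] is the POSIX whitespace class, so no "\\s" is needed. */
  static const char *pattern =
    "^[A-Za-z]+[[:space:]]+https?://[^[:space:]]+[[:space:]]+HTTP/[0-9][.][0-9]$";
  regex_t preg;
  int found;

  /* REG_NOSUB: we only want a yes/no answer, no capture offsets. */
  if (regcomp(&preg, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return 0;
  found = (regexec(&preg, text, 0, NULL, 0) == 0);
  regfree(&preg);
  return found;
}
```

Calling is_request_line("GET http://www.example.com HTTP/1.1") should report a match, while a string without the URL scheme or the HTTP version should not.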

I also suggest reducing the pattern to just 3 capture groups:

const char *str_regex = "([A-Za-z]+) +(http?://.*) +(HTTP/[0-9][.][0-9])";

Note the dot is "escaped" by putting it into a bracket expression.
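The difference between a bare dot and a bracketed dot can be checked with a small sketch like the one below. The helper pattern_matches is hypothetical and exists only for this illustration; the "\\." form shown in the usage note is the alternative spelling, where the backslash must be doubled in the C string literal.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if text matches the extended regex
   pattern, 0 if it does not, and -1 if the pattern fails to compile. */
int pattern_matches(const char *pattern, const char *text)
{
  regex_t preg;
  int rc;

  if (regcomp(&preg, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  rc = (regexec(&preg, text, 0, NULL, 0) == 0) ? 1 : 0;
  regfree(&preg);
  return rc;
}
```

With this helper, "^[0-9].[0-9]$" accepts both "1.1" and "1x1" because an unescaped dot matches any character, whereas "^[0-9][.][0-9]$" and "^[0-9]\\.[0-9]$" accept only "1.1".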

Full C demo extracting GET, http://www.example.com and HTTP/1.1:

#include <stdio.h>
#include <regex.h>

int main (void)
{
  int match;
  int err;
  regex_t preg;
  regmatch_t pmatch[4]; // We have 3 capturing groups + the whole match group
  size_t nmatch = 4; // Same as above
  const char *str_request = "GET http://www.example.com HTTP/1.1";

  const char *str_regex = "([A-Za-z]+) +(http?://.*) +(HTTP/[0-9][.][0-9])";
  err = regcomp(&preg, str_regex, REG_EXTENDED);
  if (err != 0)
    {
      // Report why the pattern failed to compile
      char errbuf[128];
      regerror(err, &preg, errbuf, sizeof errbuf);
      fprintf(stderr, "regcomp failed: %s\n", errbuf);
      return 1;
    }
  match = regexec(&preg, str_request, nmatch, pmatch, 0);
  regfree(&preg); // pmatch holds plain offsets, so it stays valid after this
  if (match == 0)
    {
      // %.*s expects an int precision, so cast the regoff_t difference
      printf("\"%.*s\"\n", (int)(pmatch[1].rm_eo - pmatch[1].rm_so), &str_request[pmatch[1].rm_so]);
      printf("\"%.*s\"\n", (int)(pmatch[2].rm_eo - pmatch[2].rm_so), &str_request[pmatch[2].rm_so]);
      printf("\"%.*s\"\n", (int)(pmatch[3].rm_eo - pmatch[3].rm_so), &str_request[pmatch[3].rm_so]);
    }
  else if (match == REG_NOMATCH)
    {
      printf("unmatch\n");
    }
  return 0;
}
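If the captured pieces are needed as separate NUL-terminated strings rather than printed in place, the same pattern can be wrapped in a function. This is only a sketch under my own assumptions: the names copy_group and parse_request_line, the caller-supplied buffers, and the 0 / -1 return convention are all hypothetical and not part of the answer above.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: copy one capture group of line into out.
   Returns 0 on success, -1 if the group is unset or does not fit. */
static int copy_group(const char *line, const regmatch_t *m,
                      char *out, size_t out_len)
{
  size_t len;

  if (m->rm_so < 0)
    return -1;                      /* group did not participate */
  len = (size_t)(m->rm_eo - m->rm_so);
  if (len >= out_len)
    return -1;                      /* no room for the terminating NUL */
  memcpy(out, line + m->rm_so, len);
  out[len] = '\0';
  return 0;
}

/* Hypothetical helper: split a request line into method, URL and
   version using the answer's pattern. Returns 0 on success, -1 on
   compile failure, no match, or a buffer that is too small. */
int parse_request_line(const char *line,
                       char *method, size_t method_len,
                       char *url, size_t url_len,
                       char *version, size_t version_len)
{
  regex_t preg;
  regmatch_t pmatch[4];             /* whole match + 3 groups */
  int rc = -1;

  if (regcomp(&preg, "([A-Za-z]+) +(http?://.*) +(HTTP/[0-9][.][0-9])",
              REG_EXTENDED) != 0)
    return -1;
  if (regexec(&preg, line, 4, pmatch, 0) == 0
      && copy_group(line, &pmatch[1], method, method_len) == 0
      && copy_group(line, &pmatch[2], url, url_len) == 0
      && copy_group(line, &pmatch[3], version, version_len) == 0)
    rc = 0;
  regfree(&preg);
  return rc;
}
```

A caller would declare three char arrays, pass them with their sizes, and check for a 0 return before using the strings. Compiling the pattern on every call keeps the sketch short; code that parses many lines would probably compile it once and reuse the regex_t.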