I want to extract strings from an HTTP header like: GET http://www.example.com HTTP/1.1 using a regex. I use this pattern: ^([A-Za-z]+)(\s+)(http?):\/\/(.*)(\s+)(HTTP\/)([0-9].[0-9]) and it works well, splitting the line into GET, http://www.example.com and HTTP/1.1. But when I use this pattern in C, the escaped slashes do not work (i.e., \/\/ is not recognized in C). How can I do this? Or is there a better pattern for extracting strings from an HTTP header?
1 Answer
Note that you do not need to escape a forward slash in a C regex library, because regcomp does not use regex delimiters.
All you need is to properly initialize the regmatch_t array and the size_t match count, write the \s shorthand character class with a doubled backslash (\\s) inside the C string literal, and pass the REG_EXTENDED flag to the regex compiler.
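As a side note to the \\s advice: \s is not defined by POSIX extended regular expressions (glibc accepts it as an extension), so a more portable spelling is the bracket class [[:space:]]. Below is a minimal sketch of that variant. The function name is_request_line is hypothetical, and the https? and [^[:space:]]+ parts are my assumptions about the intent, not the pattern from the question.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not from the original answer).
   Returns 1 if `line` looks like "<method> <http url> HTTP/<d>.<d>",
   0 if it does not, and -1 if the pattern fails to compile. */
int is_request_line(const char *line)
{
    assert(line != NULL);

    /* [[:space:]] is the POSIX bracket-class spelling of \s.
       Assumption: "https?" (optional s) is what "http?" was meant to be. */
    const char *pattern =
        "^[A-Za-z]+[[:space:]]+https?://[^[:space:]]+"
        "[[:space:]]+HTTP/[0-9][.][0-9]$";

    regex_t re;
    /* REG_NOSUB: only a yes/no answer is needed, no capture offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Because no backslash appears in this pattern, there is nothing to double-escape in the C string literal.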
I also suggest reducing the pattern to just 3 capture groups:
const char *str_regex = "([A-Za-z]+) +(http?://.*) +(HTTP/[0-9][.][0-9])";
Note the dot is "escaped" by putting it into a bracket expression.
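To see why the bracket expression matters, here is a small sketch that compares an unescaped dot with [.]. The helper name ere_matches is hypothetical and exists only for this illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (illustration only).
   Returns 1 if `text` matches the POSIX extended regex `pattern`,
   0 if it does not, and -1 if the pattern fails to compile. */
int ere_matches(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this helper, ere_matches("^HTTP/[0-9].[0-9]$", "HTTP/1x1") returns 1 because a bare dot matches any character, while ere_matches("^HTTP/[0-9][.][0-9]$", "HTTP/1x1") returns 0. Writing the dot as \\. in the C string literal has the same effect as [.].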
Full C demo extracting GET, http://www.example.com and HTTP/1.1:
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>

int main (void)
{
    int match;
    int err;
    regex_t preg;
    regmatch_t pmatch[4]; // We have 3 capturing groups + the whole match group
    size_t nmatch = 4;    // Same as above: number of elements in pmatch
    const char *str_request = "GET http://www.example.com HTTP/1.1";
    const char *str_regex = "([A-Za-z]+) +(http?://.*) +(HTTP/[0-9][.][0-9])";

    err = regcomp(&preg, str_regex, REG_EXTENDED);
    if (err == 0)
    {
        match = regexec(&preg, str_request, nmatch, pmatch, 0);
        regfree(&preg); // pmatch only holds offsets, so preg can be freed here
        if (match == 0)
        {
            // rm_so/rm_eo are byte offsets into str_request.
            // The %.*s precision must be an int, hence the casts.
            printf("\"%.*s\"\n", (int)(pmatch[1].rm_eo - pmatch[1].rm_so), &str_request[pmatch[1].rm_so]);
            printf("\"%.*s\"\n", (int)(pmatch[2].rm_eo - pmatch[2].rm_so), &str_request[pmatch[2].rm_so]);
            printf("\"%.*s\"\n", (int)(pmatch[3].rm_eo - pmatch[3].rm_so), &str_request[pmatch[3].rm_so]);
        }
        else if (match == REG_NOMATCH)
        {
            printf("unmatch\n");
        }
    }
    return 0;
}
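The demo prints the captures directly and stays silent if regcomp fails. If the pieces are needed as separate strings, something along these lines might work. This is only a sketch: split_request_line and copy_group are hypothetical names, the buffer-based interface is my assumption, and the pattern is the three-group one from above.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Copy one capture group of `src` into `dst`, truncating if needed.
   `dst` is always NUL-terminated when dst_size > 0. */
static void copy_group(const char *src, const regmatch_t *m,
                       char *dst, size_t dst_size)
{
    size_t len = (size_t)(m->rm_eo - m->rm_so);
    if (dst_size == 0)
        return;
    if (len >= dst_size)
        len = dst_size - 1;          /* truncate rather than overflow */
    memcpy(dst, src + m->rm_so, len);
    dst[len] = '\0';
}

/* Hypothetical helper: returns 0 on success, REG_NOMATCH if the line does
   not match, or the regcomp error code if the pattern is invalid. */
int split_request_line(const char *line,
                       char *method, size_t method_size,
                       char *url, size_t url_size,
                       char *version, size_t version_size)
{
    assert(line != NULL);

    const char *pattern = "([A-Za-z]+) +(http?://.*) +(HTTP/[0-9][.][0-9])";
    regex_t preg;
    regmatch_t pmatch[4];            /* whole match + 3 groups */

    int rc = regcomp(&preg, pattern, REG_EXTENDED);
    if (rc != 0) {
        char msg[128];
        regerror(rc, &preg, msg, sizeof msg);   /* human-readable reason */
        fprintf(stderr, "regcomp failed: %s\n", msg);
        return rc;
    }

    rc = regexec(&preg, line, 4, pmatch, 0);
    regfree(&preg);
    if (rc != 0)
        return rc;

    copy_group(line, &pmatch[1], method, method_size);
    copy_group(line, &pmatch[2], url, url_size);
    copy_group(line, &pmatch[3], version, version_size);
    return 0;
}
```

Called with the request line from the question and three sufficiently large char buffers, this fills them with GET, http://www.example.com and HTTP/1.1.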
const char *str_regex = "([A-Za-z]*) *(http?://.*) *(HTTP/[0-9][.][0-9])" seems to work well.
With const char *str_regex = "^([A-Za-z]+)(\\s+)(http?):\/\/(.*)(\\s+)(HTTP\/)([0-9].[0-9])", I get "GET", " ", "http", "www.example.com", " ", "HTTP/", "1.1", "ET example.com HTTP/1.1" captures. The \/ can be replaced with plain / characters, but the most important thing is to use \\s.
GET http://www.example.com HTTP/1.1 - this is actually a request line (not a header). An HTTP header is a name followed by a colon (:), then by its value. (wiki)