I have a Rails server log file whose format is as follows.

Started <REQUEST_TYPE_1> <URL_1> for <IP_1> at <TIMESTAMP_1>
  Processing by <controller#action_1> as <REQUEST_FORMAT_1>
  Parameters: <parameters_1>
<Some logs from code>
Rendered <some_template_1> (<timetaken_1>)
Completed <RESPONSE_CODE_1> in <TIME_1>


Started <REQUEST_TYPE_2> <URL_2> for <IP_2> at <TIMESTAMP_2>
  Processing by <controller#action_2> as <REQUEST_FORMAT_2>
  Parameters: <parameters_2>
<Some logs from code>
Completed <RESPONSE_CODE_2> in <TIME_2>

Now I need to parse this log and extract every REQUEST_TYPE, URL, IP, TIMESTAMP, REQUEST_FORMAT and RESPONSE_CODE from it. I'm struggling to write a good regex for this in Java/Ruby. The <> characters are not present in the actual input; I added them for readability and to mask the actual data.

Example request:

Started GET "/google.com/2" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015
  Processing by MyController#method as JS
  Parameters: {"abc" => "xyz"}
[LOG] 3 : User text log
Completed 200 OK in 26ms (Views: 3.3ms | ActiveRecord: 2.9ms)


Started POST "/google.com/543" for 127.0.1.1 at Tue Dec 01 13:13:16 +0530 2015
  Processing by MyController#method_2 as JSON
  Parameters: {"efg" => "uvw"}
Completed 404 Not Authorized in 65ms (Views: 1.5ms | ActiveRecord: 1.0ms)

Expected Output:

request_types = ['GET', 'POST']
urls = ['/google.com/2','/google.com/543']
ips = ['127.0.0.1','127.0.1.1']
timestamps = ['Tue Dec 01 12:01:13 +0530 2015','Tue Dec 01 13:13:16 +0530 2015']
request_formats = ['JS','JSON']
response_codes = ['200 OK','404 Not Authorized']

I wrote the following regexes, but they don't work as expected.

request_types = /Started \w+/  //Expected array of all request types
urls = /"\/.*\/"/ //Expected array of all urls types
ips = /"d{1,3}.d{1,3}.d{1,3}.d{1,3}"/ //Expected array of all ips types
timestamps =  /at \w+/
request_formats =/as \w+/
response_codes = /Completed \w+/

I'm hoping for help writing a regex that extracts these values from the given input in Java or Ruby. I would prefer Java, if possible.

  • Does your original log file have these brackets as well (<>) ? Commented Mar 7, 2016 at 14:54
  • Nope. It was just to mask actual data Commented Mar 7, 2016 at 14:56
  • Something like regex101.com/r/uI6oV1/3 ? Commented Mar 7, 2016 at 14:58
  • Yup, but the Parameters line and the log lines from code aren't always present. Also, how would I use that in Java/Ruby? Commented Mar 7, 2016 at 15:03
  • @Jan Added sample input with masked data and expected output. Commented Mar 7, 2016 at 15:14

1 Answer 1

Here is a Java snippet that collects the details from the log into separate lists (it needs java.util.ArrayList, java.util.List, java.util.regex.Matcher and java.util.regex.Pattern imported):

String re = "(?sm)^Started\\s+(?<requesttype>\\S+)\\s+\"(?<url>\\S+)\"\\s+for\\s+(?<ip>\\d+(?:\\.\\d+)+)\\s+at\\s+(?<tsp>[a-zA-Z]+\\s+[a-zA-Z]+\\s+\\d+\\s+\\d+:\\d+:\\d+\\s+\\+\\d+\\s\\d{4})\\s+(?:Processing\\s+by\\s+\\S+)\\s+as\\s+(?<requestformat>\\S+)(?:\\s+Parameters:\\s+\\S+)?(?:(?:(?:(?!\nStarted ).)*Completed\\s)(?<responsecode>\\d+(?:(?!\\sin\\s).)*))?";
String str = "Started GET \"/google.com/2\" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015\n  Processing by MyController#method as JS\n  Parameters: {\"abc\" => \"xyz\"}\n[LOG] 3 : User text log\nCompleted 200 OK in 26ms (Views: 3.3ms | ActiveRecord: 2.9ms)\n\n\nStarted POST \"/google.com/543\" for 127.0.1.1 at Tue Dec 01 13:13:16 +0530 2015\n  Processing by MyController#method_2 as JSON\n  Parameters: {\"efg\" => \"uvw\"}\nCompleted 404 Not Authorized in 65ms (Views: 1.5ms | ActiveRecord: 1.0ms)";
Pattern pattern = Pattern.compile(re);
Matcher matcher = pattern.matcher(str);
List<String> requesttypes = new ArrayList<String>();
List<String> urls = new ArrayList<String>();
List<String> ips = new ArrayList<String>();
List<String> timestamps = new ArrayList<String>(); 
List<String> requestformats = new ArrayList<String>(); 
List<String> responsecodes = new ArrayList<String>();
while (matcher.find()){
    requesttypes.add(matcher.group("requesttype"));
    urls.add(matcher.group("url"));
    ips.add(matcher.group("ip"));
    timestamps.add(matcher.group("tsp"));
    requestformats.add(matcher.group("requestformat"));
    responsecodes.add(matcher.group("responsecode"));
    System.out.println("-----------------------");
    System.out.println(matcher.group("requesttype"));
    System.out.println(matcher.group("url")); 
    System.out.println(matcher.group("ip")); 
    System.out.println(matcher.group("tsp")); 
    System.out.println(matcher.group("requestformat")); 
    System.out.println(matcher.group("responsecode")); 
} 

See the IDEONE demo. Once the matching is done, you can also print the whole lists, e.g. with System.out.println(urls):

System.out.println(requesttypes);
System.out.println(urls);
System.out.println(ips);
System.out.println(timestamps);
System.out.println(requestformats);
System.out.println(responsecodes);

See this demo. The output is:

[GET, POST]
[/google.com/2, /google.com/543]
[127.0.0.1, 127.0.1.1]
[Tue Dec 01 12:01:13 +0530 2015, Tue Dec 01 13:13:16 +0530 2015]
[JS, JSON]
[200 OK, 404 Not Authorized]

The regex matches:

  • (?sm)^ - start of a line (^ matches at every line start because of the m option; the s option lets . match line breaks as well)
  • Started\\s+ - the literal string Started and 1+ whitespace characters
  • (?<requesttype>\\S+) - Group "requesttype" holding 1+ non-whitespace characters
  • \\s+\" - 1+ whitespace followed by "
  • (?<url>\\S+) - Group "url" holding 1+ non-whitespace characters
  • \"\\s+for\\s+ - " followed by 1+ whitespace, the word for, and 1+ whitespace
  • (?<ip>\\d+(?:\\.\\d+)+) - Group "ip" holding digits followed by 1+ repetitions of . plus digits
  • \\s+at\\s+ - the word at surrounded by whitespace
  • (?<tsp>[a-zA-Z]+\\s+[a-zA-Z]+\\s+\\d+\\s+\\d+:\\d+:\\d+\\s+\\+\\d+\\s\\d{4}) - Group "tsp" holding the timestamp: letters and digits separated by whitespace, in the order used in the input examples
  • \\s+ - 1+ whitespace
  • (?:Processing\\s+by\\s+\\S+)\\s+as\\s+ - Processing by followed by a word (1+ non-whitespace characters), then the word as surrounded by whitespace
  • (?<requestformat>\\S+) - Group "requestformat" holding 1+ non-whitespace characters
  • (?:\\s+Parameters:\\s+\\S+)? - an optional group: Parameters: followed by whitespace and 1+ non-whitespace characters
  • (?:(?:(?:(?!\nStarted ).)*Completed\\s)(?<responsecode>\\d+(?:(?!\\sin\\s).)*))? - an optional group (since it is enclosed in (?:...)?). The tempered greedy token (?:(?!\nStarted ).)* matches any characters up to Completed without crossing into the next Started block. Completed\\s then matches Completed followed by a whitespace, and (?<responsecode>\\d+(?:(?!\\sin\\s).)*) captures into Group "responsecode" the digits and any following characters up to the word in surrounded by whitespace.
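The last bullet is the hardest part to picture, so here is a minimal sketch that isolates it. It assumes a reduced pattern (only Started plus the optional Completed part) and a made-up three-block log; the class name ResponseCodeDemo and the method responseCodes are hypothetical names used for illustration. It also shows that a block without a Completed line still matches, with the "responsecode" group coming back as null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResponseCodeDemo {
    // Reduced version of the answer's regex: only the block start and the
    // optional "Completed" part are kept (an assumption made for brevity).
    private static final Pattern BLOCK = Pattern.compile(
        "(?sm)^Started\\b"                               // a block starts at the beginning of a line
        + "(?:(?:(?!\nStarted ).)*Completed\\s"          // any text that does not run into the next block
        + "(?<responsecode>\\d+(?:(?!\\sin\\s).)*))?");  // digits plus text up to " in "

    // Returns one entry per "Started" block; null marks a block without "Completed".
    static List<String> responseCodes(String log) {
        List<String> codes = new ArrayList<>();
        Matcher m = BLOCK.matcher(log);
        while (m.find()) {
            codes.add(m.group("responsecode"));
        }
        return codes;
    }

    public static void main(String[] args) {
        String log = "Started GET \"/a\"\nsome log\nCompleted 200 OK in 5ms\n\n"
                   + "Started GET \"/b\"\nno completion here\n\n"
                   + "Started POST \"/c\"\nCompleted 404 Not Found in 7ms";
        System.out.println(responseCodes(log)); // prints [200 OK, null, 404 Not Found]
    }
}
```

Because the optional group can be skipped, code that consumes these lists should be ready for null entries.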

4 Comments

Nice. Can you share a bit more about how you constructed this regex? I mean, how should I read it?
Let's say I don't want the timestamp. So I removed (?<tsp>[a-zA-Z]+\\s+[a-zA-Z]+\\s+\\d+\\s+\\d+:\\d+:\\d+\\s+\\+\\d+\\s\\d{4}) from the regex, and now it says no match :|. regex101.com/r/iN7yO3/3 . How is that possible? What am I doing wrong?
You cannot just remove it. Make it optional by enclosing with (?: and )?.
Awesome!! I'll have to learn more about regex, so that I can do this next time by myself! Thanks :)
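The exchange above can be checked with a small sketch. Deleting the timestamp group fails because the timestamp text is still in the input and nothing in the pattern consumes it any more. The constants HEAD, TS and TAIL below are hypothetical fragments cut from the answer's regex for illustration, and the class name TimestampGroupDemo is made up. The sketch compares the deleted group, a plain non-capturing group (enough if you only want to stop capturing), and the optional group suggested in the comments.

```java
import java.util.regex.Pattern;

public class TimestampGroupDemo {
    static final String LINE =
        "Started GET \"/google.com/2\" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015\n"
        + "  Processing by MyController#method as JS";

    // Fragments of the answer's regex (split here only for the comparison).
    static final String HEAD =
        "^Started\\s+\\S+\\s+\"\\S+\"\\s+for\\s+\\d+(?:\\.\\d+)+\\s+at\\s+";
    static final String TS =
        "[a-zA-Z]+\\s+[a-zA-Z]+\\s+\\d+\\s+\\d+:\\d+:\\d+\\s+\\+\\d+\\s\\d{4}";
    static final String TAIL =
        "\\s+Processing\\s+by\\s+\\S+\\s+as\\s+(?<requestformat>\\S+)";

    // True when the regex matches somewhere in LINE.
    static boolean found(String regex) {
        return Pattern.compile(regex).matcher(LINE).find();
    }

    public static void main(String[] args) {
        // Group deleted: "Tue Dec 01 ..." is left unconsumed, so nothing matches.
        System.out.println(found(HEAD + TAIL));                     // false
        // Non-capturing group: the timestamp is consumed but not stored.
        System.out.println(found(HEAD + "(?:" + TS + ")" + TAIL));  // true
        // Optional group: also matches, and tolerates a missing timestamp.
        System.out.println(found(HEAD + "(?:" + TS + ")?" + TAIL)); // true
    }
}
```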
