I have a log file with the following contents:

$ more audit.log
2018-01-31 15:34:08 GMT:10.34.160.60(63788):agent3@pem:[31884]00000:LOG:  statement: DROP TABLE tmp_zombies
2018-01-31 15:58:52 GMT:127.0.0.1(45050):agent1@pem:[13182]00000:LOG:  statement: CREATE TEMP TABLE tmp_zombies(jagpid int4)
2018-01-31 15:58:52 GMT:127.0.0.1(45050):agent1@pem:[13182]00000:LOG:  statement: DROP TABLE tmp_zombies
2018-01-31 16:24:00 GMT:10.34.160.55(57199):agent8@pem:[27888]00000:LOG:  statement: CREATE TEMP TABLE tmp_zombies(jagpid int4)
2018-01-31 16:24:00 GMT:10.34.160.55(57199):agent8@pem:[27888]00000:LOG:  statement: DROP TABLE tmp_zombies
2018-01-31 21:08:47 GMT:[local]:pgsql@p106:[26349]00000:LOG:  statement: create table global_pg_audit
        (
           rolename         text not null,
           stmt_timestamp   timestamp not null,
           source_ip        text,
           target_ip        text,
           dbname           text,
           pid              text,
           statement_type   text,
           statement        text
        );
2018-01-31 15:34:08 GMT:10.34.160.60(63788):agent3@pem:[31884]00000:LOG:  statement: DROP TABLE tmp_zombies

When I run this Python code:

    import re

    fullpathname = './audit.log'
    regex_pattern = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(.*?)$', re.MULTILINE | re.DOTALL)
    with open(fullpathname, 'r') as f:
        log_entries = regex_pattern.findall(f.read())
    for counter, entry in enumerate(log_entries):
        print('%d=>[' % counter, entry, ']')

The output is as follows:

0=>[ ('2018-01-31 15:34:08', ' GMT:10.34.160.60(63788):agent3@pem:[31884]00000:LOG:  statement: DROP TABLE tmp_zombies') ]
1=>[ ('2018-01-31 15:58:52', ' GMT:127.0.0.1(45050):agent1@pem:[13182]00000:LOG:  statement: CREATE TEMP TABLE tmp_zombies(jagpid int4)') ]
2=>[ ('2018-01-31 15:58:52', ' GMT:127.0.0.1(45050):agent1@pem:[13182]00000:LOG:  statement: DROP TABLE tmp_zombies') ]
3=>[ ('2018-01-31 16:24:00', ' GMT:10.34.160.55(57199):agent8@pem:[27888]00000:LOG:  statement: CREATE TEMP TABLE tmp_zombies(jagpid int4)') ]
4=>[ ('2018-01-31 16:24:00', ' GMT:10.34.160.55(57199):agent8@pem:[27888]00000:LOG:  statement: DROP TABLE tmp_zombies') ]
5=>[ ('2018-01-31 21:08:47', ' GMT:[local]:pgsql@p106:[26349]00000:LOG:  statement: create table global_pg_audit ') ]
6=>[ ('2018-01-31 15:34:08', ' GMT:10.34.160.60(63788):agent3@pem:[31884]00000:LOG:  statement: DROP TABLE tmp_zombies') ]
7=>[ ('2018-01-31 15:58:52', ' GMT:127.0.0.1(45050):agent1@pem:[13182]00000:LOG:  statement: CREATE TEMP TABLE tmp_zombies(jagpid int4)') ]

Notice that in entry 5 of the output, the code did not capture the entire statement, which should be:

    create table global_pg_audit
        (
           rolename         text not null,
           stmt_timestamp   timestamp not null,
           source_ip        text,
           target_ip        text,
           dbname           text,
           pid              text,
           statement_type   text,
           statement        text
        );

What is wrong with the code?

Thanks very much!

1 Answer

Your regex is anchored to the end of the line:

^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(.*?)$

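Since you've enabled multi-line mode (re.MULTILINE), $ matches at every line break, not only at the end of the input. The lazy (.*?) stops at the first position where $ can match, which is the end of the first line. That's why the match ends after global_pg_audit.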
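To see this effect in isolation, here is a minimal sketch. The two-line string is made up for illustration and is not taken from audit.log. The same lazy pattern is run with and without re.MULTILINE:

```python
import re

# Hypothetical two-line input, used only to illustrate the behaviour of $.
text = "first line\nsecond line"

# With MULTILINE, $ can match right before the "\n", so the lazy group stops there.
with_multiline = re.search(r'^(.*?)$', text, re.MULTILINE | re.DOTALL).group(1)

# Without MULTILINE, $ only matches at the end of the string, so the lazy
# group has to keep expanding across the newline (DOTALL lets . match "\n").
without_multiline = re.search(r'^(.*?)$', text, re.DOTALL).group(1)

print(repr(with_multiline))
print(repr(without_multiline))
```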


You want to match until the next line that starts with a date. You can use a lookahead to do this:

^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(.*?)(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z)

The alternation |\Z lets the regex match the last entry, which is followed by the end of the input rather than by a date. Keep re.DOTALL enabled so that . can match across line breaks.
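As a rough sketch of how the lookahead version could be used from Python, the snippet below applies it to a shortened excerpt of the log from the question. The names SAMPLE_LOG, TIMESTAMP, ENTRY_RE and parse_entries are made up for this example, and the trailing-whitespace stripping is an added convenience rather than part of the regex itself:

```python
import re

# Shortened excerpt of the audit.log shown in the question.
SAMPLE_LOG = """\
2018-01-31 16:24:00 GMT:10.34.160.55(57199):agent8@pem:[27888]00000:LOG:  statement: DROP TABLE tmp_zombies
2018-01-31 21:08:47 GMT:[local]:pgsql@p106:[26349]00000:LOG:  statement: create table global_pg_audit
        (
           rolename         text not null,
           statement        text
        );
2018-01-31 15:34:08 GMT:10.34.160.60(63788):agent3@pem:[31884]00000:LOG:  statement: DROP TABLE tmp_zombies
"""

TIMESTAMP = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'

# Group 1: the timestamp. Group 2: everything up to (but not including)
# the next timestamp, or up to the end of the input for the last entry.
ENTRY_RE = re.compile(
    r'^(' + TIMESTAMP + r')(.*?)(?=' + TIMESTAMP + r'|\Z)',
    re.MULTILINE | re.DOTALL,
)

def parse_entries(text):
    """Return (timestamp, rest) tuples; the trailing newline of each entry is stripped."""
    return [(ts, rest.rstrip()) for ts, rest in ENTRY_RE.findall(text)]

entries = parse_entries(SAMPLE_LOG)
for counter, entry in enumerate(entries):
    print('%d=>[' % counter, entry, ']')
```

One caveat: the lookahead as written is not tied to the start of a line, so a statement whose text happens to contain something shaped like a timestamp would be split there. If that matters for your logs, one possible variation is to put ^ inside the lookahead, i.e. (?=^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z), which relies on re.MULTILINE being set.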



1 Comment

Thank you. Works very well.
