3

I have parsed a file and need to split the data by LogType. Below is my data:

===================================================================================
LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
Log Contents:

LogType:stderr
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:3000
Log Contents:
20/06/25 12:19:33 INFO datasources.FileScanRDD
20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.
20/06/21 12:19:40 INFO eas
20/06/25 12:20:41 WARN Warning as the node is accessed without started

===================================================================================
LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0

The code I applied does not split the data correctly. Here it is:

import re

def parse_container(text,full_text_lines,filter_log_types=None,filter_content_types=None):
    results={}

    first, rest  = text.split('\n', 1)
    #print(rest)      #rest is the block of data mentioned above
    results['id'] = first
    all_log_types = re.compile('^(?=LogType:)',flags=re.MULTILINE).split(rest)
    print(all_log_types)

The output I got:

['========================================================================\nLogType:container-localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\nLog Contents:\n\nLogType:stderr\nLog Upload Time :Thu Jun 25 12:24:52 +0100 2020\nLogLength:3000\nLog Contents:\n20/06/25 12:19:33 INFO datasources.FileScanRDD \n20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.\n20/06/21 12:19:40 INFO eas\n20/06/25 12:20:41 WARN Warning as the node is accessed without started\n \n']
['========================================================================\nLogType:container-localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\nLog Contents:\n\n']

The output I need:

['========================================================================\n','LogType:container-localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\nLog Contents:\n\n', 'LogType:stderr\nLog Upload Time :Thu Jun 25 12:24:52 +0100 2020\nLogLength:3000\nLog Contents:\n20/06/25 12:19:33 INFO datasources.FileScanRDD \n20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.\n20/06/21 12:19:40 INFO eas\n20/06/25 12:20:41 WARN Warning as the node is accessed without started\n \n']

['========================================================================\n','LogType:container-localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\nLog Contents:\n\n']

In my output, the text is not split at LogType: each LogType still sits after a \n inside one long string. I need every LogType section as its own list element.

In the expected output, the data is split at each LogType, so the sections appear as separate, comma-separated list elements.

I am using Python 2.6.6. Please help me solve this issue. Thanks a lot!

3 Answers

1

The logs can be split with a regular expression in Python. The following code splits on either of two conditions:

Condition 1: a run of = characters followed by \n

Condition 2: two consecutive \n characters (a blank line)

The text is split wherever either condition matches. filter(None, ...) removes the empty strings that the split produces. In Python 3 it returns an iterator, which list() then converts to a list.

import re

text = """===================================================================================
LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
Log Contents:

LogType:stderr
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:3000
Log Contents:
20/06/25 12:19:33 INFO datasources.FileScanRDD
20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.
20/06/21 12:19:40 INFO eas
20/06/25 12:20:41 WARN Warning as the node is accessed without started

===================================================================================
LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
"""


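# Split on a separator line (a run of '=' and its newline) or on a blank line,
# then drop the empty strings left over by the split.
output = list(filter(None, re.compile(r'=+\n|\n\n').split(text)))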

print(output)

OUTPUT:

['LogType:container-localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\nLog Contents:', 'LogType:stderr\nLog Upload Time :Thu Jun 25 12:24:52 +0100 2020\nLogLength:3000\nLog Contents:\n20/06/25 12:19:33 INFO datasources.FileScanRDD\n20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.\n20/06/21 12:19:40 INFO eas\n20/06/25 12:20:41 WARN Warning as the node is accessed without started', 'LogType:container-localizer-syslog\nLog Upload Time :Thu Jun 25 12:24:45 +0100 2020\nLogLength:0\n']
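A note on why the pattern from the question returns the text unsplit: before Python 3.7, re.split() does not split on zero-length matches, and '^(?=LogType:)' only ever matches an empty string. The following is a minimal sketch that sidesteps this by first splitting on the separator lines and then collecting each LogType section with findall(). The helper names split_blocks and split_log_types are hypothetical, and the patterns assume the layout shown in the question (separator lines made only of '=' and sections that start with 'LogType:' at the beginning of a line).

```python
import re

# Assumption: a separator is a line consisting only of '=' characters.
SEPARATOR = re.compile(r'^=+[ \t]*\n', re.MULTILINE)
# Assumption: a section runs from 'LogType:' at a line start up to the next
# 'LogType:' line or the end of the block.
SECTION = re.compile(r'^LogType:.*?(?=^LogType:|\Z)', re.MULTILINE | re.DOTALL)


def split_blocks(text):
    """Return the chunks between separator lines, skipping empty ones."""
    return [chunk for chunk in SEPARATOR.split(text) if chunk.strip()]


def split_log_types(block):
    """Return one string per 'LogType:' section, trailing newlines kept."""
    return SECTION.findall(block)


# Shortened sample in the same shape as the data in the question.
sample = (
    "=====\n"
    "LogType:container-localizer-syslog\n"
    "LogLength:0\n"
    "Log Contents:\n"
    "\n"
    "LogType:stderr\n"
    "LogLength:3000\n"
    "Log Contents:\n"
    "20/06/25 12:19:33 INFO datasources.FileScanRDD\n"
    "\n"
    "=====\n"
    "LogType:container-localizer-syslog\n"
    "LogLength:0\n"
)

sections_per_block = [split_log_types(block) for block in split_blocks(sample)]
for sections in sections_per_block:
    print(sections)
```

Each inner list holds the LogType sections of one block, which keeps the blocks apart when the file contains several of them. The separator lines themselves are dropped in this sketch; keep the match from SEPARATOR if they are needed in the result.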

2 Comments

It gives the correct output when I have only one block, but it fails when I have several blocks like the ones in the question.
I have run into a similar problem in another area. Can you help me solve it? This is the link: stackoverflow.com/questions/62901200/…
1

If you have multiple logs in one file, try this:

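import re

# Compile with re.MULTILINE so '^' matches at the start of every line.
# A compiled pattern also works on Python 2.6, where re.split() does not
# accept a flags argument.
logs = re.compile('^=', re.MULTILINE).split(text)

for log in logs:
    if log:
        # Split only once, at the end of the separator line.
        first, rest = log.split('=\n', 1)
        print('first', first)
        print('rest', rest)
        print("\n\n")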

Output:

first =================================================================================
rest LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0
Log Contents:

LogType:stderr
Log Upload Time :Thu Jun 25 12:24:52 +0100 2020
LogLength:3000
Log Contents:
20/06/25 12:19:33 INFO datasources.FileScanRDD
20/06/25 12:19:40 INFO executor.EXECUTOR: Finished task 18.0 in stage 0.0 (TID 18),18994 bytes result sent to driver.
20/06/21 12:19:40 INFO eas
20/06/25 12:20:41 WARN Warning as the node is accessed without started



first =================================================================================
rest LogType:container-localizer-syslog
Log Upload Time :Thu Jun 25 12:24:45 +0100 2020
LogLength:0

8 Comments

Thanks Donald! But I am still getting the same output.
Is there anything wrong with this : all_log_types = re.compile('^(?=LogType:)',flags=re.MULTILINE).split(rest)
@Lekshmi, the 2nd line looks correct too. I tried it and got the results you are looking for. I added something in my answer for you to try to help narrow down the problem
As you said, I think my text source is the problem. Is there a way to solve this? I must perform this operation on the given file. I have added some details to the question for better understanding. Or can you suggest some way to fix this error?
I added another string to match in my example above, "=$", which should find the last "=" at the end of the first line and split on that, see if that helps.
0

You can use this for your case:

text=text.replace('=','')
all_log_types=text.split('\n\n') # splitting on an empty line
print(all_log_types)
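One caveat with replace('=','') is that it also deletes any '=' inside the log contents, and the emptied separator line still leaves a stray newline in front of the first section. A minimal sketch of a variant that removes only whole separator lines before splitting on blank lines follows. The helper name split_sections is hypothetical, and the sketch assumes that separators are lines made only of '=' and that log contents contain no blank lines of their own.

```python
import re

# Assumption: a separator is a line consisting only of '=' characters.
SEPARATOR_LINE = re.compile(r'^=+[ \t]*$\n?', re.MULTILINE)


def split_sections(text):
    """Drop separator lines, then split the rest on blank lines."""
    without_separators = SEPARATOR_LINE.sub('', text)
    parts = re.split(r'\n[ \t]*\n', without_separators)
    # Trim leftover newlines and skip chunks that are empty.
    return [part.strip('\n') for part in parts if part.strip()]


# Shortened sample; 'key=value' stands in for log content containing '='.
sample = (
    "=====\n"
    "LogType:stdout\n"
    "LogLength:12\n"
    "Log Contents:\n"
    "key=value\n"
    "\n"
    "LogType:stderr\n"
    "LogLength:0\n"
    "Log Contents:\n"
    "\n"
    "=====\n"
    "LogType:syslog\n"
    "LogLength:0\n"
)

sections = split_sections(sample)
print(sections)
```

The '=' in 'key=value' survives because only complete separator lines are removed. If the log contents can contain blank lines, splitting on blank lines is not reliable, and splitting at the 'LogType:' line starts is the safer choice.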

2 Comments

Thank you so much Syed. I have got my output partially. But I have that first line '===============' which is still joined to my first LogType . How to remove that ?
If you don't want '===', you can use this: text=text.replace('=','') # removing '=', then all_log_types=text.split('\n\n') and print(all_log_types)
