
I need to extract the lines of a data file that begin with the letter "U" or "L", and exclude comment lines that begin with the character "/".

Example:

/data file FLG.dat
UAB-AB      LRD1503     / reminder latches

I used a regex pattern in my Python program, but it captures only the comment lines. I'm not getting the data lines that begin with "U" or "L".

  • if file_path != "":
        #pattern to search comment lines in the text file
        #pattern = "[^A-Za-z0-9-]/.+"
        data = read_file(file_path)
        find_str = re.findall(pattern , data)
        for x in find_str:
            print(x)
    else:
        print("no file selected")
        sys.exit()
    Commented Aug 31, 2019 at 19:29
  • Please add your code into the question and make sure it's well-formatted. Commented Aug 31, 2019 at 19:30

2 Answers

1

You can use ^([UL].+?)(?:/.*|)$ with the re.MULTILINE flag, so that ^ and $ match at the start and end of every line. Code:

import re

s = """/data file FLG.dat
UAB-AB      LRD1503     / reminder latches
LAB-AB      LRD1503     / reminder latches
SAB-AB      LRD1503     / reminder latches"""
lines = re.findall(r"^([UL].+?)(?:/.*|)$", s, re.MULTILINE)

If you want to strip the spaces at the end of each match, you can use a list comprehension with the same regular expression:

lines = [match.group(1).strip() for match in re.finditer(r"^([UL].+?)(?:/.*|)$", s, re.MULTILINE)]

Or you can change the regular expression so that it does not capture the spaces before the slash, ^([UL].+?)(?:\s*/.*|)$:

lines = re.findall(r"^([UL].+?)(?:\s*/.*|)$", s, re.MULTILINE)
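
To make the difference between these patterns concrete, here is a small sketch that runs all three against the same text. The sample string is an assumption made up for illustration: it adds a data line without a comment and the 'Uabc/xyz/def' line from the comments below. The variable names are arbitrary.

```python
import re

# Hypothetical sample: one commented data line, one data line without a
# comment, one line with two slashes, and one line that must be skipped.
sample = """/data file FLG.dat
UAB-AB      LRD1503     / reminder latches
LAB-AC      LRD1600
Uabc/xyz/def
SAB-AB      LRD1503     / reminder latches"""

# Greedy version: requires a slash and cuts at the LAST slash on the line.
greedy = re.compile(r"^([UL].+)/.*$", re.MULTILINE)
# Lazy version: the comment is optional and the cut is at the FIRST slash.
lazy = re.compile(r"^([UL].+?)(?:/.*|)$", re.MULTILINE)
# Same as lazy, but whitespace before the slash is left out of the group.
trimmed = re.compile(r"^([UL].+?)(?:\s*/.*|)$", re.MULTILINE)

greedy_result = [m.strip() for m in greedy.findall(sample)]
lazy_result = [m.strip() for m in lazy.findall(sample)]
trimmed_result = trimmed.findall(sample)  # no strip() needed for this input

print(greedy_result)   # misses the line without a comment, keeps 'Uabc/xyz'
print(lazy_result)     # all three U/L lines, 'Uabc' cut at the first slash
print(trimmed_result)  # same entries as lazy_result
```

Note that the trimmed pattern only drops whitespace that sits directly before a comment; a data line without a comment but with trailing spaces would still need strip().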

2 Comments

^([UL].+)/.*$ doesn't look right. First, try it against 'Uabc/xyz/def'. Second, it only matches lines with comments.
@RonaldAaronson, I agree with the second point; it was a logic mistake and it's fixed. About the first, I prefer to follow the way comments work in code: the first occurrence of the comment marker starts the comment.
1

In case the comments in your data lines are optional, here's a regular expression that covers both types: lines with or without a comment.

The regular expression for that is R"^([UL][^/]*)" (edited, original RE was R"^([UL][^/]*)(/.*)?$"). The first group is the data you want to extract. In the original version, the optional 2nd group would catch the comment, if any.

This example code prints only the 2 valid data lines.

import re

lines=["/data file FLG.dat",
       "UAB-AB      LRD1503     / reminder latches",
       "UAB-AC      LRD1600",
       "MAB-AD      LRD1700     / does not start with U or L"
       ]

datare=re.compile(R"^([UL][^/]*)")

matches = ( match.group(1).strip() for match in ( datare.match(line) for line in lines) if match)

for match in matches:
    print(match)

Note how match.group(1).strip() extracts the first group of your RE, and strip() removes any trailing spaces from the match.

Also note that you can replace lines in this example with a file handle and it would work the same way.

If the matches = line looks too complicated, it's an efficient way of writing this:

for line in lines:
    match = datare.match(line)
    if match:
        print(match.group(1).strip())
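
As a rough sketch of the file-handle remark above: the same loop can be wrapped in a generator that accepts any iterable of lines. The function name extract_data_lines is hypothetical, and io.StringIO is used here only as a stand-in for an open file so the example runs without touching the disk.

```python
import io
import re

DATA_RE = re.compile(r"^([UL][^/]*)")


def extract_data_lines(lines):
    """Yield the data part of every line that starts with U or L.

    `lines` can be a list of strings or an open file object; a file
    yields lines with a trailing newline, which strip() removes.
    """
    for line in lines:
        match = DATA_RE.match(line)
        if match:
            yield match.group(1).strip()


# io.StringIO stands in for a real file here (an assumption for the demo).
# With an actual file you would use `with open(file_path) as handle:`.
handle = io.StringIO(
    "/data file FLG.dat\n"
    "UAB-AB      LRD1503     / reminder latches\n"
    "UAB-AC      LRD1600\n"
    "MAB-AD      LRD1700     / does not start with U or L\n"
)

records = list(extract_data_lines(handle))
for record in records:
    print(record)
```

Reading line by line this way avoids loading the whole file into one string, so re.MULTILINE is not needed.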

3 Comments

And what does the (/.*)?$ portion of your regex contribute to the final result unless you wanted to know what the comment was?
I wasn't clear: (/.*)?$ isn't necessary unless you want to know if there is a comment and what it is.
Yes, you are correct. The 2nd match group is not needed (unless you need to know whether there is a comment or what it is). I updated my answer and the example still works.
