I have a large number of strings in the format YYYYYYYYXXXXXXXXZZZZZZZZ, where X, Y, and Z are numbers of fixed length, eight digits each. The problem is that I need to parse out the middle sequence of integers and remove any leading zeros. Unfortunately, the only way to determine where each of the three sequences begins and ends is to count the number of digits.

I am currently doing it in two steps, i.e.:

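import re

# Split the 24-digit string into three named groups of eight digits each.
m = re.match(
    r"(?P<first_sequence>\d{8})"
    r"(?P<second_sequence>\d{8})"
    r"(?P<third_sequence>\d{8})",
    string)
second_sequence = m.group("second_sequence")
# lstrip takes a string of characters to strip and returns a new string.
second_sequence = second_sequence.lstrip("0")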

This works and gives me the right results, e.g.:

112233441234567855667788 --> 12345678
112233440012345655667788 --> 123456
112233001234567855667788 --> 12345678
112233000012345655667788 --> 123456

But is there a better method? Is it possible to write a single regular expression that matches the second sequence, sans the leading zeros?

I guess I am looking for a regex which does the following:

  1. Skips over the first eight digits.
  2. Skips any leading zeros.
  3. Captures anything after that, up to the point where there are sixteen characters behind and eight in front.

The above solution does work, as mentioned, so the purpose of this problem is more to improve my knowledge of regex. I appreciate any pointers.

2
  • 2
    Do you need regexes here? string[8:16].lstrip('0'). Commented Dec 7, 2016 at 13:53
  • \d{8}0*(\d*)\d{8} regex101.com/r/1HjS5m/1 Commented Dec 7, 2016 at 13:57

4 Answers

4

This is a typical case of "useless use of regular expressions".

Your strings are fixed-length. Just cut them at the appropriate positions.

s = "112233440012345655667788"
int(s[8:16])
# -> 123456
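
One thing worth checking before choosing between int() and lstrip('0') is what you want for a middle field that is all zeros. The sketch below compares the two; the helper names middle_as_int and middle_as_str and the third sample string are made up for illustration and are not from the question.

```python
def middle_as_int(s):
    # Hypothetical helper: the middle eight digits converted to an integer.
    return int(s[8:16])


def middle_as_str(s):
    # Hypothetical helper: the middle eight digits as a string,
    # with leading zeros stripped.
    return s[8:16].lstrip("0")


samples = [
    "112233441234567855667788",
    "112233440012345655667788",
    "112233440000000055667788",  # assumed edge case: middle field all zeros
]

for s in samples:
    print(middle_as_int(s), repr(middle_as_str(s)))
```

For the all-zero field, int() gives 0 while lstrip('0') gives an empty string, so the choice depends on whether you need a number or text downstream.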

3

I think it's simpler not to use regex.

result = my_str[8:16].lstrip('0')

2

Agree with the other answers here that regex isn't really required. If you really want to use regex, then \d{8}0*(\d*)\d{8} should do it.
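
A minimal sketch of how that pattern could be used on the strings from the question. The function name middle_sequence is hypothetical, and the code assumes each input is exactly 24 digits; fullmatch only requires the pattern to span the whole string, it does not enforce that length by itself.

```python
import re

# \d{8}  skips the first eight digits
# 0*     skips any leading zeros of the middle field
# (\d*)  captures what is left of the middle field
# \d{8}  forces the last eight digits to stay outside the capture
PATTERN = re.compile(r"\d{8}0*(\d*)\d{8}")


def middle_sequence(s):
    # Hypothetical helper; assumes s is a 24-digit string.
    m = PATTERN.fullmatch(s)
    if m is None:
        raise ValueError("not a string of digits in the expected format")
    return m.group(1)


for s in (
    "112233441234567855667788",
    "112233440012345655667788",
    "112233001234567855667788",
    "112233000012345655667788",
):
    print(s, "-->", middle_sequence(s))
```

The greedy (\d*) first grabs everything and then backtracks until eight digits remain for the trailing \d{8}, which is what keeps the third sequence out of the capture.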

1

Just to show that it is possible with regex:

https://regex101.com/r/8RSxaH/2

# CODE AUTO GENERATED BY REGEX101.COM (SEE LINK ABOVE)
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(?<=\d{8})((?:0*)(\d{,8}))(?=\d{8})"

test_str = ("112233441234567855667788\n"
    "112233440012345655667788\n"
    "112233001234567855667788\n"
    "112233000012345655667788")

matches = re.finditer(regex, test_str)

for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Although you don't really need a regex to do what you're asking.
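
If only the stripped number is wanted, a shorter sketch is to take group 2 of the first match. The helper name stripped_middle is made up here. The second half shows why the zeros end up in a group rather than in the lookbehind: as far as I know, Python's re module only accepts fixed-width lookbehinds, so a lookbehind containing 0* is rejected at compile time.

```python
import re

# Pattern from the answer above; group 2 holds the middle field
# without its leading zeros.
regex = re.compile(r"(?<=\d{8})((?:0*)(\d{,8}))(?=\d{8})")


def stripped_middle(s):
    # Hypothetical helper: group 2 of the first match, or None if no match.
    m = regex.search(s)
    return m.group(2) if m else None


print(stripped_middle("112233440012345655667788"))

# Trying to move the zeros into the lookbehind makes its width variable,
# which the re module refuses to compile.
try:
    re.compile(r"(?<=\d{8}0*)\d+(?=\d{8})")
    variable_lookbehind_ok = True
except re.error:
    variable_lookbehind_ok = False

print(variable_lookbehind_ok)
```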

1 Comment

Thank you. That's exactly the kind of expression I was looking for. Excellent website, thanks for bringing it to my attention as well. Cheers!
