I'm trying to extract address details from very ugly free text:

import regex

pat_addr_verbose = r"""(?ix)      # case-insensitive and verbose flags
(?:(?:BND|BY|CNR|OF)\W+)*         # skip leading filler words (non-capturing)
(?:(?!RD|HWY|TRAIL|St)            # negative lookahead: not at a street type
(?:                               # either
(?P<n_start>\d+)-(?P<n_end>\d+)   # a number range, e.g. 337-343
|(?<!-)(?P<n>\d+)                 # or a single number
)\W+)?                            # the number part is optional; separators follow it
(?P<name>
(?:
(?!RD|HWY|TRAIL|St)\w+\W*)+)\W+   # name: words that do not start with a street type
(?P<type>RD|HWY|TRAIL|St)*        # street type (named capturing group)
"""

pat_addr = regex.compile(pat_addr_verbose, regex.IGNORECASE | regex.VERBOSE)

text = """BND BY THOMAS RAIL TRAIL, 7 SNOW WHITE HWY & MICKEY RD,
337-343 BOGEYMAN RD, 4, 8, 9-13, 16-18 Fictional Rd & 17 Elm St"""

regex.findall(pat_addr, text)

I get the right results for simple addresses, but I fail to capture the several street numbers that precede Fictional Rd:

[m.groupdict() for m in pat_addr.finditer(text)]

[{'n': None,
'n_end': None,
'n_start': None,
'name': 'THOMAS RAIL',
'type': 'TRAIL'},
{'n': '7',
'n_end': None,
'n_start': None,
'name': 'SNOW WHITE',
'type': 'HWY'},
{'n': None, 'n_end': None, 'n_start': None, 'name': 'MICKEY', 'type': 'RD'},
{'n': None,
'n_end': '343',
'n_start': '337',
'name': 'BOGEYMAN',
'type': 'RD'},
{'n': '4',
'n_end': None,
'n_start': None,
'name': '8, 9-13, 16-18 Fictional',
'type': 'Rd'},
{'n': '17', 'n_end': None, 'n_start': None, 'name': 'Elm', 'type': 'St'}]

Is it possible to get either a list of the numbers (it doesn't matter if they are unnamed) or a dict for each of them using a regex?

EDIT: This is what I expect to get:

Option 1:

{'numbers': 
    [
        {
            'n': '4',
            'n_end': None,
            'n_start': None,
        },
        {
            'n': '8',
            'n_end': None,
            'n_start': None,
        },
        {
            'n': None,
            'n_end': '13',
            'n_start': '9',
        },
        {
            'n': None,
            'n_end': '18',
            'n_start': '16',
        }
    ],
'name': 'Fictional',
'type': 'Rd'},

Option 2:

{'numbers':
    [
        '4',
        '8',
        '9-13',
        '16-18'
    ],
'name': '8, 9-13, 16-18 Fictional',
'type': 'Rd'},
  • Can you post results that you'd expect to get? Commented Oct 6, 2017 at 1:31
  • @Colin, here you go. Commented Oct 6, 2017 at 1:51
  • You are essentially asking to capture an arbitrary number of groups, which is something regex is not capable of doing. Commented Oct 6, 2017 at 1:53
  • @RNar, maybe not in all flavours, but the answer you refer to says it is possible in .NET and not in JavaScript. It doesn't mention Python. Commented Oct 6, 2017 at 1:56
  • Python is among the ones that take only the last capture Commented Oct 6, 2017 at 14:39
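
The point made in the last comment can be checked with the standard-library re module: when a capturing group sits inside a repetition, only its final capture is kept. A minimal sketch (the group name num and the sample string are made up for illustration):

```python
import re

# The group 'num' is matched three times, once per number,
# but each repetition overwrites the previous capture.
m = re.match(r"(?:(?P<num>\d+)\W+)+", "4, 8, 16 ")

print(m.group("num"))  # 16 (only the last repetition survives)
```

This is why the question's pattern cannot return all of 4, 8, 9-13 and 16-18 through a single named group when it is run with re.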

1 Answer

(?ix)                             # case-insensitive and verbose flags
(?:(?:BND|BY|CNR|OF)\W+)*         # skip leading filler words (non-capturing)

(?:                               # start of the non-capturing number group
(?!RD|HWY|TRAIL|St)               # negative lookahead: not at a street type
(?P<numbers>\d+-\d+|\d+)          # a number range OR a single number
\W+                               # separators after the number
)                                 # end of the non-capturing number group
*?                                # the number group repeats, once per number

(?P<name>
(?:
(?!RD|HWY|TRAIL|St)[A-Z]+\W*)+)\W+   # name: words that do not start with a street type
(?P<type>RD|HWY|TRAIL|St)*        # street type

UPDATED WITH NEW REGEX MODULE

The third-party regex module keeps every capture of a repeated group; the full list is available through the match object's captures() method.

import regex

text = 'BND BY THOMAS RAIL TRAIL, 7 SNOW WHITE HWY & MICKEY RD, 337-343 BOGEYMAN RD, 4, 8, 9-13, 16-18 Fictional Rd & 17 Elm St'
reg = r'(?ix)(?:(?:BND|BY|CNR|OF)\W+)*(?:(?!RD|HWY|TRAIL|St)(?P<numbers>\d+-\d+|\d+)\W+)*?(?P<name>(?:(?!RD|HWY|TRAIL|St)[A-Z]+\W*)+)\W+(?P<type>RD|HWY|TRAIL|St)*'


def updateD(m):
    # groupdict() holds only the last capture of each group,
    # so replace 'numbers' with the full list of its captures
    d = m.groupdict()
    d['numbers'] = m.captures('numbers')
    return d

[updateD(m) for m in regex.finditer(reg, text)]

OUTPUT

[
  {
   'numbers': [],
   'name': 'THOMAS RAIL',
   'type': 'TRAIL'
  }, 
  {
   'numbers': ['7'],
   'name': 'SNOW WHITE',
   'type': 'HWY'
  }, 
  {
   'numbers': [],
   'name': 'MICKEY',
   'type': 'RD'
  }, 
  {
   'numbers': ['337-343'],
   'name': 'BOGEYMAN',
   'type': 'RD'
  }, 
  {
   'numbers': ['4', '8', '9-13', '16-18'],
   'name': 'Fictional',
   'type': 'Rd'
  }, 
  {
   'numbers': ['17'],
   'name': 'Elm',
   'type': 'St'
  }
]
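
To get from the lists above to the shape of Option 1 in the question, each number string can be split in plain Python after matching. A minimal sketch; split_number and to_option_1 are hypothetical helper names, not part of the regex module:

```python
def split_number(token):
    """Turn '9-13' or '4' into the n / n_start / n_end dict from Option 1."""
    start, sep, end = token.partition("-")
    if sep:  # a hyphen was found, so this is a range
        return {"n": None, "n_start": start, "n_end": end}
    return {"n": token, "n_start": None, "n_end": None}


def to_option_1(entry):
    """Return a copy of one result dict with 'numbers' expanded into dicts."""
    return {**entry, "numbers": [split_number(t) for t in entry["numbers"]]}


entry = {"numbers": ["4", "8", "9-13", "16-18"], "name": "Fictional", "type": "Rd"}
option_1 = to_option_1(entry)
```

The split happens outside the regex, so the pattern itself stays unchanged.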

7 Comments

I made an edit specifying the expected result, please have a look. Thank you, however. I hadn't thought of getting the whole sequence. That would allow a second pass (but I would rather avoid it if possible).
@dmvianna I'm a bit confused, as you introduced a new field, numbers. Does that mean you want it to be a mainstay in all entries? I have a slightly different version; see if it works for you. You can extend it from there. Mind you, I'm using the new regex module for the first time.
Thanks for your answer. Yes, that’s the desired result. It would be great to get it using a single regex, but I’ll use more steps if necessary. Your answer provides a good first pass.
@dmvianna I just updated the answer; I'm not happy with the way it looks, though :-(
@dmvianna see the latest as per your OP.
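
The second pass mentioned in these comments can also be done with the standard-library re module alone. A minimal sketch, assuming a simplified pattern (not the answer's) that captures the whole run of numbers as one string and then splits it; TYPES, ADDRESS, NUMBER and parse_addresses are hypothetical names, and the \b after each street type is an addition the original patterns do not have:

```python
import re

# Street types; the trailing \b stops "ST" from matching inside a longer word.
TYPES = r"(?:RD|HWY|TRAIL|ST)\b"

# First pass: one match per address, with all numbers in a single group.
ADDRESS = re.compile(
    rf"""
    (?:(?:BND|BY|CNR|OF)\W+)*              # leading filler words
    (?P<numbers>(?:\d+(?:-\d+)?\W+)*)      # the whole run of numbers as one string
    (?P<name>(?!{TYPES})[A-Z]+             # first word of the name
        (?:\W+(?!{TYPES})[A-Z]+)*)         # further name words
    \W+(?P<type>{TYPES})                   # street type
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Second pass: pull single numbers and ranges out of the captured run.
NUMBER = re.compile(r"\d+(?:-\d+)?")


def parse_addresses(text):
    return [
        {
            "numbers": NUMBER.findall(m.group("numbers")),
            "name": m.group("name"),
            "type": m.group("type"),
        }
        for m in ADDRESS.finditer(text)
    ]


text = (
    "BND BY THOMAS RAIL TRAIL, 7 SNOW WHITE HWY & MICKEY RD, "
    "337-343 BOGEYMAN RD, 4, 8, 9-13, 16-18 Fictional Rd & 17 Elm St"
)
result = parse_addresses(text)
```

Unlike the patterns in the question and answer, this sketch requires a street type for every match, so names without one are skipped rather than returned with an empty type.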