Dissecting a string with python regex, using named groups and substitution

Question

I have a special use case which I do not yet know how to cover. I want to dissect a string based on field_name/field_length. For that I define a regex like this:

'(?P<%s>.{%d})' % (field_name, field_length)

And this is repeated for all fields.

I also have a regex to remove spaces to the right of each field:

self.re_remove_spaces = re.compile(' *$')

This way I can get each field like this:

def dissect(self, str):
    data = {}
    m = self.compiled.search(str)
    for field_name in self.fields:
        # Match objects expose named groups through group(name)
        value = m.group(field_name)
        value = self.re_remove_spaces.sub('', value)
        data[field_name] = value
    return data
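
For reference, here is a minimal self-contained sketch of the approach described above. The class name `Dissector` and the field layout in `FIELDS` are hypothetical, chosen only for illustration, and `rstrip(' ')` stands in for the `' *$'` substitution (the two should be equivalent for values that contain no newlines):

```python
import re

# Hypothetical layout; the real field names and lengths are not part of the question.
FIELDS = [('lang', 9), ('desc', 19), ('license', 12)]


class Dissector(object):
    def __init__(self, fields):
        self.fields = [name for name, _ in fields]
        # One fixed-width named group per field, concatenated in order.
        pattern = ''.join('(?P<%s>.{%d})' % (name, length)
                          for name, length in fields)
        self.compiled = re.compile(pattern)

    def dissect(self, line):
        m = self.compiled.search(line)
        if m is None:
            raise ValueError('line does not fit the record layout')
        # Strip the padding spaces on the right of each captured field.
        return {name: m.group(name).rstrip(' ') for name in self.fields}
```

With this layout, each line has to be at least 40 characters long (9 + 19 + 12), otherwise `search` finds nothing and the sketch raises `ValueError`.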

I have to perform this processing for millions of strings, so it must be efficient.

What annoys me is that I would prefer to perform the dissection + space removal in a single step, using compiled.sub instead of compiled.search, but I do not know how to do this.

Specifically, my question is:

How do I perform regex substitution combining it with named groups in Python regexes?

The regex you showed ('(?<%s>.%d)' % ..., which would become something like '(?<name>.12)') is not a valid Python regex.
– interjay, Commented Mar 12, 2012 at 10:10
Sorry, corrected: P and {} were missing. I am still testing this, so it could be that more bugs are present.
– blueFast, Commented Mar 12, 2012 at 10:14

aquavitae · Accepted Answer · 2012-03-12 11:45:22Z

4

I take it each field sits next to each other in the string, like in a table, e.g.:

name     description        license
python   language           opensource
windows  operating system   proprietary

So assuming you know in advance the length of each field, you can do it much more simply, without using a regex at all. (btw, str is not a good name for a variable since it clashes with the builtin str type)

def dissect(text):
    data = {}
    for name, length in fields:
        data[name] = text[:length].rstrip()
        text = text[length:]
    return data

Then, if fields = [('lang', 9), ('desc', 19), ('license', 12)]:

>>> dissect('python   language           opensource')
{'lang': 'python', 'license': 'opensource', 'desc': 'language'}

Is this what you're trying to do, though?

answered Mar 12, 2012 at 11:45

aquavitae


6 Comments

blueFast Over a year ago

Yes, this is exactly what I am trying to do, and actually my first implementation used string slices. Somehow I convinced myself that regex would perform better in this scenario, and thus changed the implementation to regex with named groups. My process function has now been running for several hours, and to tell you the truth I have the feeling that slices were faster. I will probably change it back, or maybe keep both implementations to run performance tests.

aquavitae Over a year ago

I think you'll find that a slice is significantly faster, but if it's been running that long then you should certainly profile it and see where the bottleneck is, and how to speed it up further.

blueFast Over a year ago

It is taking a long time because there are lots of records to process.

blueFast Over a year ago

I know too little about Python internals to be able to recognize performance issues, but let me ask you something: isn't the step text = text[length:] a performance pitfall? I mean, you are copying the whole string, except the part that you have just extracted, again and again for every field. There could be lots of fields (my format currently has 13 fields). In my original implementation I was keeping track of positions with start/end, and moving those markers to extract the slice that I wanted for each field. I have the feeling this must be faster than copying the string, but I have no data.

aquavitae Over a year ago

You really need to run a profiler on it to find where the performance problems are, and I doubt copying the string is one of them; it's pretty fast in Python. If this is a class method you're calling for each record, the overhead of calling the function for each record is likely to be a far greater problem. If you post the full code that's doing this then maybe we can suggest where you can improve performance.
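
One way to get data on the question raised in this thread is to time both variants directly. The sketch below is only an illustration: the names `dissect_reslice`, `dissect_offsets`, `make_spans` and the layout in `FIELDS` are assumptions, and the timings will depend on the record length, field count and Python version, so no result is claimed here.

```python
import timeit

# Hypothetical layout, taken from the example in the answer above.
FIELDS = [('lang', 9), ('desc', 19), ('license', 12)]


def dissect_reslice(text, fields=FIELDS):
    """Variant from the answer: cut the consumed prefix off after each field."""
    data = {}
    for name, length in fields:
        data[name] = text[:length].rstrip()
        text = text[length:]
    return data


def make_spans(fields):
    """Precompute (name, start, end) once, so no per-record bookkeeping is needed."""
    spans, start = [], 0
    for name, length in fields:
        spans.append((name, start, start + length))
        start += length
    return spans


SPANS = make_spans(FIELDS)


def dissect_offsets(text, spans=SPANS):
    """Variant with start/end markers: slice the original string directly."""
    return {name: text[start:end].rstrip() for name, start, end in spans}


if __name__ == '__main__':
    record = 'python   language           opensource'
    for func in (dissect_reslice, dissect_offsets):
        seconds = timeit.timeit(lambda: func(record), number=100000)
        print('%s: %.3f s' % (func.__name__, seconds))
```

Both functions return the same dictionary for the same record, so whichever turns out faster on the real data can be swapped in without touching the callers.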


Qtax · Accepted Answer · 2012-03-12 11:32:38Z

0

Why even use sub when you could match the part you want directly?

You could use something like:

(?P<name>.{0,N}(?<! ))

But if the matches have to be exactly N long, you could use a lookahead, like:

(?=(?P<name>.{0,N}(?<! ))).{N}

Whether this performs better than an additional trim is questionable. You can try it and let us know.

These expressions will not work if the field consists only of spaces and the character before it is also a space. If you need that case to work you could add a | at the end of the group:

(?P<name>.{0,N}(?<! )|)
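
As a rough illustration, the lookahead form combined with the trailing | can be assembled per field and matched once per record. This is a sketch under assumptions: the layout in `FIELDS` and the helper names `build_pattern` and `dissect` are hypothetical, and every record is assumed to be padded to the full width of the layout.

```python
import re

# Hypothetical layout; N in the expressions above becomes each field's length.
FIELDS = [('lang', 9), ('desc', 19), ('license', 12)]


def build_pattern(fields):
    parts = []
    for name, length in fields:
        # The lookahead captures the field without its trailing spaces
        # (or the empty string, via the trailing "|"), and .{N} then
        # consumes the full fixed width so the next field starts in place.
        parts.append('(?=(?P<%s>.{0,%d}(?<! )|)).{%d}' % (name, length, length))
    return re.compile(''.join(parts))


RECORD_RE = build_pattern(FIELDS)


def dissect(text):
    m = RECORD_RE.match(text)
    if m is None:
        raise ValueError('record is shorter than the layout width')
    return m.groupdict()
```

For a record such as `'python   language           opensource  '` (padded to 40 characters), `dissect` returns the three values already trimmed, and a field made up entirely of spaces comes back as an empty string. Whether this is faster than slicing plus `rstrip()` would still have to be measured.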

answered Mar 12, 2012 at 11:32

Qtax


6 Comments

blueFast Over a year ago

Yes, fields can be all spaces, and the preceding field can also be all spaces. And fields have an exact length.

Qtax Over a year ago

@gonvaled, then you can use the 2nd expression with the last suggestion: (?=(?P<name>.{0,N}(?<! )|)).{N}

blueFast Over a year ago

I cannot modify the input data. Adding a | is not possible.

Qtax Over a year ago

@gonvaled, what data? I'm talking about the regex. Use the fitting expression as it's written. Or are you asking us how to solve your problem without modifying your code? ~.~

blueFast Over a year ago

mmmm. ok, sorry, I misunderstood your answer. What is the | matching against, then?
