4

I have a special use case which I do not yet know how to cover. I want to dissect a string based on field_name/field_length. For that I define a regex like this:

'(?P<%s>.{%d})' % (field_name, field_length)

And this is repeated for all fields.

I have also a regex to remove spaces to the right of each field:

self.re_remove_spaces = re.compile(' *$')

This way I can get each field like this:

def dissect(self, str):
    data = { }
    m = self.compiled.search(str)
    for field_name in self.fields:
        value = m.group_name(field_name)
        value = re.sub(self.re_remove_spaces, '', value)
        data[field_name] = value
    return data

I have to perform this processing for millions of strings, so it must be efficient.

What annoys me is that I would prefer to perform the dissection + space removal in a single step, using compiled.sub instead of compiled.search, but I do not know how to do this.

Specifically, my question is:

How do I perform regex substitution combining it with named groups in Python regexes?

2
  • The regex you showed ('(?<%s>.%d)' % ..., which would become something like '(?<name>.12)') is not a valid Python regex. Commented Mar 12, 2012 at 10:10
  • Sorry, corrected: P and {} were missing. I am still testing this, so it could be that more bugs are present. Commented Mar 12, 2012 at 10:14

2 Answers 2

4

I take it each field sits next to each other in the string, like in a table, e.g.:

name     description        license
python   language           opensource
windows  operating system   proprietry

So assuming you know in advance the length of each field, you can do it much more simply, without using a regex at all. (btw, str is not a good name for a variable since it clashes with the builtin str type)

def dissect(text):
    data = {}
    for name, length in fields:
        data[name] = text[:length].rstrip()
        text = text[length:]
    return data

Then, if fields = [('lang', 9), ('desc', 19), ('license', 12)]:

>>> self.dissect('python   language           opensource')
{'lang': 'python', 'license': 'opensource', 'desc': 'language'}

Is this what you're trying do though?

Sign up to request clarification or add additional context in comments.

6 Comments

Yes, this is exactly what I am trying to to, and actually my first implementation was done using string slices. Somehow I convinced myself that regex would perform better in this scenario, and thus changed the implementation to regex with named groups. My process function is now running for several hours, and to tell you the truth I have the feeling that slices were faster. Probably I change it back, or maybe I keep both implementations to do performance tests.
I think you'll find that a slice is significantly faster, but if its been running that long then you should certainly profile it and see where the bottleneck is, and how to speed it up further.
It is taking long time because there are lots of records to process.
I know too little about python internals to be able to recognize performance issues, but let me ask you something: isn't the step text=text[length:] a performance pit? I mean, you are copying the whole string, except the part that you have just extracted, again and again for every field. There could be lots of fields (my format has currently 13 fields). In my original implementation I was keeping track of positions with start/end, and moving those markers to extract the slice that I wanted for each field. I have the feeling this must be faster than copying the string, but I have no data.
You really need to run a profiler on it to find where the performance problems are, and I doubt copying the string is one of them - its pretty fast in python. If this is a class method you're calling for each record, the overhead of calling the function for each record is likely to be a far greater problem. If you post the full code that's doing this then maybe we can suggest where you can improve performance.
|
0

Why even use sub when you could match the part you want directly?

You could use something like:

(?P<name>.{0,N}(?<! ))

But if the matches have to be exactly N long, you could use a lookahead, like:

(?=(?P<name>.{0,N}(?<! ))).{N}

If this is better performing than using an additional trim is questionable. You can try it and let us know.

These expressions will not work if the match is only spaces while also the character before it is a space as well. If you need that case to work you could add a | at the end of the group:

(?P<name>.{0,N}(?<! )|)

6 Comments

Yes, fields can be all spaces, and the preceeding field can also be spaces. And fields have an exact length.
@gonvaled, then you can use the 2nd expression with the last suggestion: (?=(?P<name>.{0,N}(?<! )|)).{N}
I can not modify the input data. Adding a | is not possible.
@gonvaled, what data? I'm talking about the regex. Use the fitting expression as it's written. Or are you asking us for how to solve your problem without modifying your code? ~.~
mmmm. ok, sorry, I misunderstood your answer. What is the | matching against, then?
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.