String matching in Python

Question

I am having trouble with the following problem. Say I have pairs of strings in two lists, left and right, stored in a dictionary:

 left                                right
british                             7
cuneate nucleus                     Medulla oblongata
Motoneurons                         anterior

And I have some test lines in a file, like these:

<s id="69-7">British Meanwhile is the studio 7 album by british pop band 10cc 7.</s>
<s id="5239778-2">Medulla oblongata,the name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="21120-99">Terior horn cells, motoneurons located in the spinal.</s>

I want to get output like this:

<s id="69-7"><w2>British</w2> Meanwhile is the studio <w2>7</w2> album by <w1>british</w1> pop band 10cc <w2>7</w2>.</s>
<s id="5239778-2"><w2>Medulla oblongata</w2>,the name refers collectively to the <w1>cuneate nucleus</w1> and gracile nucleus, which are present at the junction between the spinal cord and the <w2>medulla oblongata</w2>.</s>

I tried with the following code:

import re

def textReturn(left, right):
    text = ""
    filetext = open("text.xml", "r").read()
    linelist = re.split(u'[\n|\r\n]+',filetext)

    for i in linelist:
        left = left.strip()
        right = right.strip()

        if left in i and right in i:
            i1 = re.sub('(?i)(\s+)(%s)(\s+)'%left, '\\1<w1>\\2</w1>\\3', i)
            i2 = re.sub('(?i)(\s+)(%s)(\s+)'%right, '\\1<w2>\\2</w2>\\3', i1)
            text = text + i2 + "\n"         
    return text

But it gives me:

<s id="69-7">British meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc 7.</s>
<s id="5239778-2">Medulla oblongata,the name refers collectively to the <w1>cuneate nucleus</w1> and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="21120-99">Terior horn cells, <w1>motoneurons</w2> located in the spinal.</s>

i.e. it can't tag a string that appears at the beginning or end of the sentence.

Also, I only want to return the lines that match both the left and right strings, not the other lines.

Any solution please! Thanks a lot!!!

That input looks like XML. Are you sure you don't need to pull the strings out with an XML parser? Also, REs really should use raw strings (r'...'), since those don't treat backslashes specially.
– Keith, Commented Jul 31, 2011 at 18:22
Keith has a good point. It is probably not a good idea to rely on the entire s element being on a single line. You can only get away with finding elements yourself if you take into account literal strings, CDATA sections, processing instructions, etc., but why would you want to when XML parsers do that for you already? There is a learning curve to using them, as well as XSLT (for modifying the docs the way you want to), but it is sooooooo worth it!
– Ray Toal, Commented Jul 31, 2011 at 18:27

Ray Toal · Accepted Answer · 2011-07-31 19:34:02Z

3

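It doesn't tag at the beginning and the end because your pattern requires one or more whitespace characters before and after each keyword. At the start of the sentence there is nothing before the keyword, and at the end it is followed by punctuation or a tag rather than whitespace.

Instead of \s+, use \b (word boundary). It matches the position between a word character and a non-word character without consuming anything.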
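
To make the difference concrete, here is a minimal sketch that runs both patterns over the content of the first test sentence. The variable names are illustrative and not taken from the question.

```python
import re

content = "British Meanwhile is the studio 7 album by british pop band 10cc 7."

# \s+ on both sides requires whitespace around the keyword, so the
# occurrence at the very start of the string is skipped.
with_spaces = re.sub(r'(?i)(\s+)(british)(\s+)', r'\1<w1>\2</w1>\3', content)

# \b only asserts a word boundary and consumes nothing, so it also
# matches at the start and end of the string and next to punctuation.
with_boundaries = re.sub(r'(?i)\b(british)\b', r'<w1>\1</w1>', content)

print(with_spaces)
print(with_boundaries)
```

The first call tags only the lowercase "british" in the middle of the sentence; the second also tags the "British" that opens it.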

ADDENDUM

Actual code:

import re

# (left, right) keyword pairs; pair i is applied to line i of the input.
pairs = [('british', '7'), ('cuneate nucleus', 'Medulla oblongata'), ('Motoneurons', 'anterior')]

filetext = """<s id="69-7">British Meanwhile is the studio 7 album by british pop band 10cc 7.</s>
<s id="5239778-2">Medulla oblongata,the name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="21120-99">Terior horn cells, motoneurons located in the spinal.</s>
"""

linelist = filetext.splitlines()

# Split each line into the opening <s ...> tag, its content, and the closing </s>,
# so that keywords are never tagged inside the tag's attributes.
s_tag = re.compile(r"(<s[^>]+>)(.*?)(</s>)")

for (left, right), line in zip(pairs, linelist):
    line_parts = s_tag.search(line)
    start, content, end = line_parts.groups()

    # \b matches at word boundaries, including the start and end of the content.
    # re.escape keeps any regex metacharacters in a keyword literal.
    left_match = re.compile(r"\b(%s)\b" % re.escape(left), re.IGNORECASE)
    right_match = re.compile(r"\b(%s)\b" % re.escape(right), re.IGNORECASE)

    # Only print lines that contain both keywords.
    if left_match.search(content) and right_match.search(content):
        line1 = left_match.sub(r'<w1>\1</w1>', content)
        line2 = right_match.sub(r'<w2>\1</w2>', line1)
        print(start + line2 + end)

This is the basis for a short-term solution, but long-term you should try out the XML parser approach.
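
If you want this in the shape of the function from the question (one pair of keywords, many lines), a sketch along the following lines might work. The names tag_line and tag_lines are hypothetical. Unlike the code above, it substitutes both keywords in a single pass, so the second keyword can never match inside a tag inserted for the first. It still assumes each s element sits on one line. If the two keywords overlap, the leftmost alternative wins.

```python
import re

S_TAG = re.compile(r"(<s[^>]+>)(.*?)(</s>)")

def tag_line(line, left, right):
    """Return the tagged line, or None if it lacks either keyword."""
    parts = S_TAG.search(line)
    if parts is None:
        return None
    start, content, end = parts.groups()

    # Both keywords as named alternatives; the group name doubles as the tag name.
    both = re.compile(
        r"\b(?:(?P<w1>%s)|(?P<w2>%s))\b" % (re.escape(left), re.escape(right)),
        re.IGNORECASE,
    )

    # Keep the line only if each keyword occurs at least once.
    found = {m.lastgroup for m in both.finditer(content)}
    if found != {"w1", "w2"}:
        return None

    def wrap(match):
        name = match.lastgroup
        return "<%s>%s</%s>" % (name, match.group(0), name)

    return start + both.sub(wrap, content) + end

def tag_lines(lines, left, right):
    """Return only the lines that contain both keywords, tagged."""
    tagged = (tag_line(line, left, right) for line in lines)
    return [t for t in tagged if t is not None]

lines = [
    '<s id="69-7">British Meanwhile is the studio 7 album by british pop band 10cc 7.</s>',
    '<s id="21120-99">Terior horn cells, motoneurons located in the spinal.</s>',
]
print("\n".join(tag_lines(lines, "british", "7")))
```

Only the first line is printed, with the 7 inside id="69-7" left untouched. The second line is dropped because it contains neither keyword.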

edited Jul 31, 2011 at 19:34

answered Jul 31, 2011 at 18:21

5 Comments

Ray Toal Over a year ago

Okay I'll work on it for you.... Also I'll add in the r for making the strings raw. Give me 10 min or so.

Ray Toal Over a year ago

Okay the answer is edited. Works on your example but is probably not the most efficient approach. I hardcoded the file so the example would be self-contained.

Liza Over a year ago

But the problem still remains that it also tags matches inside the opening tag, e.g. in <s id="69-7">:

<s id="69-<w2>7</w2>"><w1>British</w1> Meanwhile is the studio <w2>7</w2> album by <w1>british</w1> pop band 10cc <w2>7</w2>.</s>

Ray Toal Over a year ago

Oops. I will fix, sorry. This is why the XML parser will work a little better.... :)

Ray Toal Over a year ago

Fixed. It's not the greatest code, since I manually pulled out the inside of the s tags and put them back at the end.... It at least gives the correct two lines of output.

yasouser · Accepted Answer · 2011-07-31 18:22:29Z

2

If your input file is going to be an xml file, why not use an xml parser? See here: 19.5. xml.parsers.expat — Fast XML parsing using Expat
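
As a rough illustration of the parser approach, here is a sketch that uses xml.etree.ElementTree from the standard library instead of expat, since it is shorter to show. Several things here are assumptions: the wrapping doc root element (the sample lines have no single root of their own), the helper name tag_element, and that each s element contains only text and no child elements.

```python
import re
import xml.etree.ElementTree as ET

def tag_element(elem, left, right):
    """Wrap keyword matches in elem's text in <w1>/<w2> children.

    Returns False and leaves elem unchanged unless both keywords occur.
    """
    text = elem.text or ""
    pattern = re.compile(
        r"\b(?:(?P<w1>%s)|(?P<w2>%s))\b" % (re.escape(left), re.escape(right)),
        re.IGNORECASE,
    )
    matches = list(pattern.finditer(text))
    if {m.lastgroup for m in matches} != {"w1", "w2"}:
        return False

    # Text before the first match stays as the element's own text.
    elem.text = text[:matches[0].start()]
    for m, nxt in zip(matches, matches[1:] + [None]):
        child = ET.SubElement(elem, m.lastgroup)
        child.text = m.group(0)
        # Text between this match and the next one becomes the child's tail.
        child.tail = text[m.end():nxt.start() if nxt else len(text)]
    return True

# Hypothetical <doc> wrapper so the parser sees a single root element.
xml_text = """<doc>
<s id="69-7">British Meanwhile is the studio 7 album by british pop band 10cc 7.</s>
<s id="21120-99">Terior horn cells, motoneurons located in the spinal.</s>
</doc>"""

root = ET.fromstring(xml_text)
first, last = root.findall("s")

first_ok = tag_element(first, "british", "7")
last_ok = tag_element(last, "Motoneurons", "anterior")

# tostring() also emits the newline that follows the element, hence strip().
tagged = ET.tostring(first, encoding="unicode").strip()
print(tagged)
```

Because the parser hands you the attribute and the text separately, the id value cannot be tagged by accident, and an s element that spans several lines is handled the same way as one on a single line.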

answered Jul 31, 2011 at 18:22
