I am having trouble with the following. Say I have pairs of strings stored as two lists (left and right) in a dictionary:

 left                                right
british                             7
cuneate nucleus                     Medulla oblongata
Motoneurons                         anterior

And I have some test lines in a file, like the ones below:

<s id="69-7">British Meanwhile is the studio 7 album by british pop band 10cc 7.</s>
<s id="5239778-2">Medulla oblongata,the name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="21120-99">Terior horn cells, motoneurons located in the spinal.</s>

I want to get output like the following:

<s id="69-7"><w1>British</w1> Meanwhile is the studio <w2>7</w2> album by <w1>british</w1> pop band 10cc <w2>7</w2>.</s>
<s id="5239778-2"><w2>Medulla oblongata</w2>,the name refers collectively to the <w1>cuneate nucleus</w1> and gracile nucleus, which are present at the junction between the spinal cord and the <w2>medulla oblongata</w2>.</s>

I tried with the following code:

import re

def textReturn(left, right):
    text = ""
    filetext = open("text.xml", "r").read()
    linelist = re.split(u'[\n|\r\n]+',filetext)

    for i in linelist:
        left = left.strip()
        right = right.strip()

        if left in i and right in i:
            i1 = re.sub('(?i)(\s+)(%s)(\s+)'%left, '\\1<w1>\\2</w1>\\3', i)
            i2 = re.sub('(?i)(\s+)(%s)(\s+)'%right, '\\1<w2>\\2</w2>\\3', i1)
            text = text + i2 + "\n"         
    return text   

But it gives me:

'<s id="69-7">British meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc 7.</s>'.
<s id="5239778-2">Medulla oblongata,the name refers collectively to the <w1>cuneate nucleus</w1> and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="21120-99">Terior horn cells, <w1>motoneurons</w2> located in the spinal.</s>

i.e., it doesn't tag a string that occurs at the beginning or end of the line.

Also, I only want to return the lines that match both the left and right strings, not the other lines.

Any solution please! Thanks a lot!!!

  • That input looks like XML. Are you sure you don't need to pull the strings out with an XML parser? Also, REs really should use raw strings (r'...'), since raw strings don't treat backslashes specially. Commented Jul 31, 2011 at 18:22
  • Keith has a good point. It is probably not a good idea to rely on the entire s element being on a single line. You can only get away with finding elements yourself if you take into account literal strings, CDATA sections, processing directives, etc., but why would you want to when XML parsers do that for you already? There is a learning curve to using them, as well as XSLT (for modifying the docs the way you want to), but it is sooooooo worth it! Commented Jul 31, 2011 at 18:27

2 Answers

It doesn't tag at the beginning and the end because your pattern requires one or more whitespace characters before and after each keyword.

Instead of \s+, use \b (word boundary). It is zero-width and also matches at the start and end of the string.
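
To illustrate the difference, here is a minimal sketch on a shortened version of the first sample line (the `line`, `spaced` and `bounded` names are just for this example):

```python
import re

line = "British pop band 10cc 7"

# \s+ on both sides requires whitespace around the keyword,
# so a keyword at the very start or end of the string is missed.
spaced = re.sub(r"(?i)(\s+)(british)(\s+)", r"\1<w1>\2</w1>\3", line)

# \b is zero-width and also matches at the start and end of the string.
bounded = re.sub(r"(?i)\b(british)\b", r"<w1>\1</w1>", line)
bounded = re.sub(r"(?i)\b(7)\b", r"<w2>\1</w2>", bounded)

print(spaced)   # unchanged: nothing was tagged
print(bounded)  # both keywords tagged
```

Because \b consumes no characters, there is also no need to capture the surrounding whitespace and put it back in the replacement.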

ADDENDUM

Actual code:

import re

# (left, right) keyword pairs; pair i is checked against line i of the sample text.
pairs = [('british', '7'), ('cuneate nucleus', 'Medulla oblongata'), ('Motoneurons', 'anterior')]

filetext = """<s id="69-7">British Meanwhile is the studio 7 album by british pop band 10cc 7.</s>
<s id="5239778-2">Medulla oblongata,the name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="21120-99">Terior horn cells, motoneurons located in the spinal.</s>
"""

linelist = filetext.splitlines()

# Split each line into the opening <s ...> tag, the content, and the closing </s>,
# so that keywords are never tagged inside the id attribute.
s_tag = re.compile(r"(<s[^>]+>)(.*?)(</s>)")

for (left, right), line in zip(pairs, linelist):
    line_parts = s_tag.search(line)
    start, content, end = line_parts.groups()

    # \b matches at word boundaries, including the start and end of the content.
    # re.escape keeps any regex metacharacters in a keyword literal.
    left_match = r"(?i)\b(%s)\b" % re.escape(left)
    right_match = r"(?i)\b(%s)\b" % re.escape(right)

    # Only print lines that contain both keywords.
    if re.search(left_match, content) and re.search(right_match, content):
        line1 = re.sub(left_match, r'<w1>\1</w1>', content)
        line2 = re.sub(right_match, r'<w2>\1</w2>', line1)
        print(start + line2 + end)

This is the basis for a short-term solution, but long-term you should try out the XML parser approach.
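
The loop above checks keyword pair i only against line i. If you would rather check one pair against every line, as the textReturn function in the question does, a rough sketch could look like this. The `tag_lines` name and the `sample` list are made up for illustration, and it still assumes each s element sits on a single line:

```python
import re

S_TAG = re.compile(r"(<s[^>]+>)(.*?)(</s>)")

def tag_lines(lines, left, right):
    """Return only the lines that contain both keywords, with both tagged."""
    left_re = re.compile(r"\b(%s)\b" % re.escape(left.strip()), re.IGNORECASE)
    right_re = re.compile(r"\b(%s)\b" % re.escape(right.strip()), re.IGNORECASE)
    tagged = []
    for line in lines:
        parts = S_TAG.search(line)
        if parts is None:
            continue  # skip anything that is not an <s>...</s> line
        start, content, end = parts.groups()
        # Keep the line only when both keywords occur in the content.
        if left_re.search(content) and right_re.search(content):
            content = left_re.sub(r"<w1>\1</w1>", content)
            content = right_re.sub(r"<w2>\1</w2>", content)
            tagged.append(start + content + end)
    return tagged

sample = [
    '<s id="69-7">British Meanwhile is the studio 7 album by british pop band 10cc 7.</s>',
    '<s id="21120-99">Terior horn cells, motoneurons located in the spinal.</s>',
]
result = tag_lines(sample, "british", "7")
print("\n".join(result))
```

Lines that lack either keyword are simply left out of the returned list, which covers the "only lines matching both" requirement.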


5 Comments

Okay I'll work on it for you.... Also I'll add in the r for making the strings raw. Give me 10 min or so.
Okay the answer is edited. Works on your example but is probably not the most efficient approach. I hardcoded the file so the example would be self-contained.
But the problem remains that it also tags matches inside the <s id="69-7"> attribute, e.g. <s id="69-<w2>7</w2>"><w1>British</w1> Meanwhile is the studio <w2>7</w2> album by <w1>british</w1> pop band 10cc <w2>7</w2>.</s>
Oops. I will fix, sorry. This is why the XML parser will work a little better.... :)
Fixed. It's not the greatest code, since I manually pulled out the inside of the s tags and put them back at the end.... It at least gives the correct two lines of output.
2

If your input file is going to be an xml file, why not use an xml parser? See here: 19.5. xml.parsers.expat — Fast XML parsing using Expat
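
As a rough sketch of the parser route, the following uses xml.etree.ElementTree from the standard library (swapped in here for the lower-level expat module linked above, because it needs less code). The `<doc>` wrapper element, the `sentences` dict and the `contains_both` helper are assumptions made for this example, not part of the original file:

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the <s> lines are wrapped in a single root element so that
# they parse as one document. The second <s> deliberately spans two lines.
xml_text = """<doc>
<s id="69-7">British Meanwhile is the studio 7 album by british pop band 10cc 7.</s>
<s id="21120-99">Terior horn cells,
motoneurons located in the spinal.</s>
</doc>"""

root = ET.fromstring(xml_text)

# The parser separates attributes from text, so the id is never searched.
sentences = {s.get("id"): s.text for s in root.iter("s")}

def contains_both(text, left, right):
    """True if both keywords occur in text as whole words, ignoring case."""
    return all(
        re.search(r"\b%s\b" % re.escape(keyword), text, re.IGNORECASE)
        for keyword in (left, right)
    )

matches = [sid for sid, text in sentences.items()
           if contains_both(text, "british", "7")]
print(matches)
```

This only finds the matching sentences. Writing the w1 and w2 tags back into the tree means building child elements and is left out of the sketch.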
