Text extraction using Python lxml: looping issue

Question

Here is part of my XML file:

<a:p>
    <a:pPr lvl="2">
        <a:spcBef>
            <a:spcPts val="200" />
        </a:spcBef>
    </a:pPr>
    <a:r>
        <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" />
        <a:t>The</a:t>
    </a:r>
    <a:r>
        <a:rPr lang="en-US" sz="1400" dirty="0" />
        <a:t>world</a:t>
    </a:r>
    <a:r>
        <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" />
        <a:t>is small</a:t>
    </a:r>
</a:p>
<a:p>
    <a:pPr lvl="2">
        <a:spcBef>
            <a:spcPts val="200" />
        </a:spcBef>
    </a:pPr>
    <a:r>
        <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" b="0" />
        <a:t>The</a:t>
    </a:r>
    <a:r>
        <a:rPr lang="en-US" sz="1400" dirty="0" b="0" />
        <a:t>world</a:t>
    </a:r>
    <a:r>
        <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" b="0" />
        <a:t>is too big</a:t>
    </a:r>
</a:p>

I have written code using lxml to extract the text. Because each sentence is split across several a:t elements, I want to join the pieces into a single sentence such as "The world is small...". Here is the code I wrote:

ns = {'p': 'http://schemas.openxmlformats.org/presentationml/2006/main',
      'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'}

# Every a:rPr (run properties) element on the slide
path4 = file.xpath('/p:sld/p:cSld/p:spTree/p:sp/p:txBody/a:p/a:r/a:rPr', namespaces=ns)
if path4:
    for a in path4:
        if a.get('sz') == '1400' and a.xpath('node()') == [] and a.get('b') != '0':
            b = a.getparent()  # the a:r run
            c = b.getparent()  # the a:p paragraph
            d = c.xpath('./a:r/a:t/text()', namespaces=ns)
            print(''.join(d))
        elif a.get('sz') == '1400' and a.xpath('node()') == [] and a.get('b') == '0':
            b = a.getparent()
            c = b.getparent()
            d = c.xpath('./a:r/a:t/text()', namespaces=ns)
            print(''.join(d))

I get the output :

The world is small...
The world is small...
The world is small...

Expected output:

The world is small...

Any suggestions?

Community · Accepted Answer · 2020-06-20 09:12:55Z

You are building and printing the whole sentence once for every a:rPr the loop visits, so a paragraph with three runs is printed three times.

Here's an example of what you should do instead:

test.xml:

<body xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
      xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">
    <a:p>
        <a:pPr lvl="2">
            <a:spcBef>
                <a:spcPts val="200"/>
            </a:spcBef>
        </a:pPr>
        <a:r>
            <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0"/>
            <a:t>The</a:t>
        </a:r>
        <a:r>
            <a:rPr lang="en-US" sz="1400" dirty="0"/>
            <a:t>world</a:t>
        </a:r>
        <a:r>
            <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0"/>
            <a:t>is small</a:t>
        </a:r>
    </a:p>
    <a:p>
        <a:pPr lvl="2">
            <a:spcBef>
                <a:spcPts val="200"/>
            </a:spcBef>
        </a:pPr>
        <a:r>
            <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" b="0"/>
            <a:t>The</a:t>
        </a:r>
        <a:r>
            <a:rPr lang="en-US" sz="1400" dirty="0" b="0"/>
            <a:t>world</a:t>
        </a:r>
        <a:r>
            <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" b="0"/>
            <a:t>is too big</a:t>
        </a:r>
    </a:p>
</body>

test.py:

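from lxml import etree


tree = etree.parse('test.xml')
NAMESPACES = {'p': 'http://schemas.openxmlformats.org/presentationml/2006/main',
              'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'}

# One a:p element per paragraph; <body> has no namespace in this test file.
path = tree.xpath('/body/a:p', namespaces=NAMESPACES)

for outer_item in path:
    parts = []  # start a fresh list for every paragraph
    for item in outer_item.xpath('./a:r/a:rPr', namespaces=NAMESPACES):
        # item is an a:rPr; its parent a:r holds the a:t text
        parts.append(item.getparent().xpath('./a:t/text()', namespaces=NAMESPACES)[0])

    # Print once per paragraph, after all of its runs have been collected.
    print(" ".join(parts))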

output:

The world is small
The world is too big

So the fix is to loop over the a:p items, collect the text of each run into parts, and print the joined result once after each a:p has been processed. I've removed the if statement for clarity.
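If the attribute checks from the question are still needed, they can be applied once per paragraph instead of once per a:rPr. The following is only a rough sketch under stated assumptions: the helper name paragraph_sentences and the inline SAMPLE document are made up for illustration, a paragraph counts as a match only when every run has the requested sz, and the import falls back to the standard library's ElementTree (which shares find/findall/findtext with lxml) when lxml is not installed.

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: the standard library is an acceptable stand-in here,
    # because only find/findall/findtext/get are used below.
    import xml.etree.ElementTree as etree

NAMESPACES = {'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'}

# Hypothetical, trimmed-down version of test.xml
SAMPLE = """<body xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
  <a:p>
    <a:r><a:rPr sz="1400"/><a:t>The</a:t></a:r>
    <a:r><a:rPr sz="1400"/><a:t>world</a:t></a:r>
    <a:r><a:rPr sz="1400"/><a:t>is small</a:t></a:r>
  </a:p>
  <a:p>
    <a:r><a:rPr sz="1400" b="0"/><a:t>The</a:t></a:r>
    <a:r><a:rPr sz="1400" b="0"/><a:t>world</a:t></a:r>
    <a:r><a:rPr sz="1400" b="0"/><a:t>is too big</a:t></a:r>
  </a:p>
</body>"""


def paragraph_sentences(root, size='1400'):
    """Return one (sentence, all_b_zero) tuple per matching a:p element."""
    results = []
    for para in root.findall('.//a:p', NAMESPACES):
        runs = para.findall('a:r', NAMESPACES)
        props = [run.find('a:rPr', NAMESPACES) for run in runs]
        # Skip paragraphs with no runs, or with any run of a different size.
        if not runs or any(p is None or p.get('sz') != size for p in props):
            continue
        words = [run.findtext('a:t', default='', namespaces=NAMESPACES)
                 for run in runs]
        all_b_zero = all(p.get('b') == '0' for p in props)
        results.append((' '.join(words), all_b_zero))
    return results


for sentence, all_b_zero in paragraph_sentences(etree.fromstring(SAMPLE)):
    print(sentence)
```

Because the check and the join both happen at the a:p level, each sentence is produced exactly once no matter how many runs the paragraph contains.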

Hope that helps.

edited Jun 20, 2020 at 9:12

CommunityBot

answered Jun 5, 2013 at 11:42

alecxe


8 Comments

Sangamesh Over a year ago

Many thanks for taking the time to write this answer. I have tried this before. It works, but not when there are two or three elif conditions after the if condition, because the parts list has to be printed outside the loop.

alecxe Over a year ago

You're welcome. You can keep track of several conditions by making parts a list of lists or a dictionary. It's hard to say more without a real example. Could you please update your question so I can see what you mean?
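One possible reading of the dictionary suggestion, as a hedged sketch rather than the answerer's actual code: classify each paragraph with the if/elif chain, store its sentence under the matching label, and print only after the loop has finished. The classify rules, the labels, the italic check on the i attribute, and the SAMPLE document are all assumptions made for illustration. The import falls back to the standard library if lxml is missing.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: ElementTree is close enough for find/findall/findtext/get.
    import xml.etree.ElementTree as etree

NAMESPACES = {'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'}

# Hypothetical sample with one paragraph per condition plus one to skip
SAMPLE = """<body xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
  <a:p>
    <a:r><a:rPr sz="1400"/><a:t>The world</a:t></a:r>
    <a:r><a:rPr sz="1400"/><a:t>is small</a:t></a:r>
  </a:p>
  <a:p>
    <a:r><a:rPr sz="1400" b="0"/><a:t>The world</a:t></a:r>
    <a:r><a:rPr sz="1400" b="0"/><a:t>is too big</a:t></a:r>
  </a:p>
  <a:p>
    <a:r><a:rPr sz="1400" i="1"/><a:t>So</a:t></a:r>
    <a:r><a:rPr sz="1400" i="1"/><a:t>it seems</a:t></a:r>
  </a:p>
  <a:p>
    <a:r><a:rPr sz="1800"/><a:t>Title</a:t></a:r>
  </a:p>
</body>"""


def classify(rpr):
    """Hypothetical stand-in for the if/elif chain; returns a label or None."""
    if rpr is None or rpr.get('sz') != '1400':
        return None
    if rpr.get('i') == '1':
        return 'italic'
    elif rpr.get('b') == '0':
        return 'b_zero'
    return 'b_unset'


def group_sentences(root):
    """Map each label to the list of sentences whose first run matched it."""
    groups = {}
    for para in root.findall('.//a:p', NAMESPACES):
        runs = para.findall('a:r', NAMESPACES)
        if not runs:
            continue
        # Assumption: the first run's properties decide the whole paragraph.
        label = classify(runs[0].find('a:rPr', NAMESPACES))
        if label is None:
            continue
        sentence = ' '.join(
            run.findtext('a:t', default='', namespaces=NAMESPACES)
            for run in runs)
        groups.setdefault(label, []).append(sentence)
    return groups


groups = group_sentences(etree.fromstring(SAMPLE))
# Printing happens after the loop, once per label.
for label, sentences in groups.items():
    print(label, sentences)
```

Adding another elif to classify only adds another key to the dictionary; the collecting loop and the final print stay the same.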

Sangamesh Over a year ago

Well, I will post the whole example on pastecode.org and give the link here for your reference. Is that OK?

alecxe Over a year ago

Well, you can try to extend your example to contain one more elif and more xml. If it's not possible, let's go with pastecode.org, thanks.

Sangamesh Over a year ago

Oops, excuse me, I edited your answer by mistake. I am so sorry about that!
