0

I have an HTML file with an embedded XML snippet; the source is pasted below:

<html>
  <head>
    <title> test֤</title>
  </head>
  <body>
    <form name="acsForm" action="" method="post" >
      <textarea rows=10 cols=80 name="xmlText"><?xml version="1.0" encoding="UTF-8"?>
        <samlp:Response xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol">
        </samlp:Response> 
      </textarea>
      <textarea name="2nd"> text2....</textarea>             
    </form>
  </body>
</html>

My task is to extract the text enclosed in the first textarea, which is an XML snippet, from the HTML without any change to the original snippet. I'm able to get it using BeautifulSoup, but it changes all the tag names to lower case.

5 Answers

1

Try using the BeautifulStoneSoup part of the BeautifulSoup library, which is designed for XML.


0

Perhaps lxml would work, although I've never used it myself so I don't know how easy/complicated it would be to do what you want.


0

(Ugh! Why do so many authors seem to think <textarea> content doesn't need HTML-escaping? Fools!)

Unfortunately BeautifulSoup 3.1 is not applying the (incorrect but common) browser-fixup of treating < and & characters inside <textarea> as text, and is instead creating real XML elements.

BeautifulSoup 3.0 copes with it OK though. Why there's a difference.
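
To illustrate the escaping point: here is a minimal sketch, assuming Python 3's standard `html` module (not something used in the original page), of how the author could have embedded the snippet so that any HTML parser sees plain text inside the textarea. The variable names are made up for the example.

```python
import html

xml_snippet = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<samlp:Response xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol">\n'
    '</samlp:Response>'
)

# Escape &, <, > and quotes so the snippet is plain text to an HTML parser.
escaped = html.escape(xml_snippet)
page = '<textarea name="xmlText">' + escaped + '</textarea>'

# Whoever reads the textarea text back only has to unescape it;
# the tag names keep their original case because they were never parsed.
recovered = html.unescape(escaped)
```

With the content escaped like this, the parser never builds elements out of the XML, so there is nothing for it to lower-case.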


0

Well, I just tried BeautifulSoup 3.0, and it doesn't work for me:

import BeautifulSoup  # BeautifulSoup 3.x on Python 2

xml = '<samlp:Response xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol"></samlp:Response>'
print BeautifulSoup.BeautifulStoneSoup(xml)
<samlp:response xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol"

You will notice that the soup has changed Response to response.


0

Finally, I found that pyparsing is the best tool for the task:

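from pyparsing import makeHTMLTags, SkipTo

# Match the opening and closing <textarea> tags and capture the raw text between them.
aStart, aEnd = makeHTMLTags("textarea")
search = aStart + SkipTo(aEnd)("body") + aEnd

# doc is the HTML source string; the first match is the XML, the second is the other textarea.
saml_resp_str = search.searchString(doc)[0].body
relay_state_str = search.searchString(doc)[1].body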
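
For comparison, the same idea (grab the raw text between the textarea tags without parsing it as markup) can be sketched with only the standard `re` module. This is not the pyparsing approach above, just a rough equivalent under some assumptions: `extract_textareas` and `TEXTAREA_RE` are hypothetical names, and the pattern assumes no attribute value contains a `>` character.

```python
import re

# Hypothetical helper: capture everything between <textarea ...> and
# </textarea> verbatim, so tag names inside keep their original case.
TEXTAREA_RE = re.compile(
    r"<textarea\b[^>]*>(.*?)</textarea\s*>",
    re.IGNORECASE | re.DOTALL,
)

def extract_textareas(html_source):
    """Return the raw contents of every textarea, in document order."""
    return TEXTAREA_RE.findall(html_source)

# Trimmed copy of the form from the question.
doc = """<form name="acsForm" action="" method="post">
<textarea rows=10 cols=80 name="xmlText"><?xml version="1.0" encoding="UTF-8"?>
<samlp:Response xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol">
</samlp:Response>
</textarea>
<textarea name="2nd"> text2....</textarea>
</form>"""

bodies = extract_textareas(doc)
saml_resp_str = bodies[0]    # the XML snippet, untouched
relay_state_str = bodies[1]  # contents of the second textarea
```

A regex like this is fragile on arbitrary HTML, so it is only reasonable when the page layout is known in advance, as it is here.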
