
I am fetching a URL whose content is formatted something like the following:

<DOCUMENT>
 <TYPE>A
 <SEQUENCE>1
 <TEXT>
  <HTML>
   <BODY BGCOLOR="#FFFFFF" LINK=BLUE  VLINK=PURPLE>
   </BODY>
  </HTML>
 </TEXT>
</DOCUMENT>

<DOCUMENT>
 <TYPE>B
 <SEQUENCE>2
 ...

As you can see, a document starts (the one with sequence number 1) and ends, then the document with sequence number 2 starts, and so on.

What I want to do is write an XPath expression in Python that selects only the document with sequence value 1 (or, equivalently, TYPE A).

I assumed that something like this would work:

import lxml
from lxml import html
page = html.fromstring(pagehtml)
type_a = page.xpath("//document[sequence=1]/descendant::*/text()")

However, it just gives me an empty list for type_a.

Could someone please tell me what my mistake is in this code? I am really new to XML.

  • What you have as input is neither HTML nor XML of any kind. Did you write it yourself? Commented Nov 12, 2014 at 19:48
  • @MathiasMüller: No, I did not! Here is an example of the URLs I am trying to crawl: sec.gov/Archives/edgar/data/21344/000104746909001875/… It is a pretty long one, which is why I did not include it in my question. Commented Nov 12, 2014 at 21:09

3 Answers


It might be because that's highly dubious HTML. The <SEQUENCE> tag is unclosed, so it could well be interpreted by lxml as containing all of the code until the next </DOCUMENT>, so it does not end up just containing the 1. When your XPath code then looks for a <SEQUENCE> containing 1, there isn't one.

Additionally, XML is case-sensitive, but HTML isn't. XPath is designed for XML, so it is also case sensitive, which would also stop your document matching <DOCUMENT>.

Try //DOCUMENT[starts-with(SEQUENCE,'1')]. That's based on Xpath using starts-with function.
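One caveat when an expression like the one above is run through lxml.html.fromstring, as in the question: the HTML parser behind it lowercases tag names, so the tree holds document and sequence elements rather than DOCUMENT and SEQUENCE. A minimal sketch, assuming lxml is installed and using a made-up, well-formed one-line input purely for illustration:

```python
from lxml import html

# Illustrative input only: upper-case tags, properly closed.
root = html.fromstring("<DOCUMENT><TYPE>A</TYPE></DOCUMENT>")

# The HTML parser normalizes tag names to lower case.
print(root.tag)

# XPath itself is case-sensitive, so only the lower-case name matches.
upper = root.xpath("//DOCUMENT")
lower = root.xpath("//document")
print(len(upper), len(lower))
```

This is consistent with the output reported further down the page, where the upper-case expression returns an empty list.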

Ideally, if the input is under your control, you should instead just close the type and sequence tags (with </TYPE> and </SEQUENCE>) to make the input valid.
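If the input is not under your control, the same repair could be applied to the downloaded string before it is parsed. This is only a sketch: it assumes each TYPE and SEQUENCE value occupies the rest of its line, as in the sample in the question, and close_value_tags is a hypothetical helper name, not part of lxml.

```python
import re

# Matches an opening <TYPE> or <SEQUENCE> tag plus the value that follows it
# on the same line (assumption: the value never spans several lines).
_VALUE_TAG = re.compile(r"<(TYPE|SEQUENCE)>([^<\r\n]*)")


def _add_closer(match):
    tag, value = match.group(1), match.group(2).rstrip()
    return "<%s>%s</%s>" % (tag, value, tag)


def close_value_tags(raw):
    """Hypothetical helper: append the missing </TYPE> and </SEQUENCE>."""
    return _VALUE_TAG.sub(_add_closer, raw)


sample = "<DOCUMENT>\n <TYPE>A\n <SEQUENCE>1\n <TEXT>\n </TEXT>\n</DOCUMENT>\n"
repaired = close_value_tags(sample)
print(repaired)
```

After this step TYPE and SEQUENCE become siblings that contain only their own value, which is the shape the original XPath expected.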


3 Comments

This is the first time I've actually used Python or XPath, so this is based on experience of HTML/XML only.
+1 "highly dubious HTML" - exactly. And yes, XPath is case-sensitive.
Thanks a lot @GKFX. I do agree that it is an odd one. Unfortunately, I don't have control over the code. [I provided a link in the comments of the question.] Unfortunately, your code did not work either. It is odd, because XPath recognizes sequence as a node: when I put //sequence/descendant::*/text() it finds the correct place. But as you mentioned, it does not detect where the node ends.

In addition to the great answer provided by @GKFX, I'd like to point out that the lxml.html module is capable of parsing broken HTML and HTML fragments. In fact, it will parse your string just fine.

fromstring(string): Returns document_fromstring or fragment_fromstring, based on whether the string looks like a full document, or just a fragment.

The problem, which may also stem from whatever code generates the string, is that you haven't given the actual path to the SEQUENCE node.

type_a = page.xpath("//document[sequence=1]/descendant::*/text()")

The XPath above looks for every document node that has a direct child node named sequence with the value 1. However, the first child of your document node is type, and because <TYPE> is never closed, sequence ends up nested inside type rather than directly under document, so you will never get what you want.

Consider rewriting it like this to get what you need:

page.xpath('//document[type/sequence=1]/descendant::*/text()')
['A\n ', '1\n ']

However, since your HTML string is missing the closing tag for sequence, you cannot get the correct result with another XPath like this:

page.xpath('//document[type/sequence=1]/../..//text()')
['A\n ', '1\n ', 'B\n ', '2']

That is because the sequence node with value 1 has no closing tag, so the sequence node with value 2 becomes a child node of it.

I should stress that your HTML string is still invalid, but lxml's parser is tolerant enough to handle your case just fine.
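To see how the unclosed tags end up nested, it can help to parse a small stand-in and inspect it. The sketch below makes several assumptions: lxml is installed, the sample string is a simplified imitation of the real filing (the "first body" and "second body" text is made up), and first_text is a hypothetical helper. It also illustrates one possible reason a numeric test such as sequence=1 can fail on real data: the string value of sequence includes all the text nested beneath it, so it is no longer just the number.

```python
from lxml import html

# Simplified stand-in for the real input; the body text is invented.
sample = (
    "<DOCUMENT>\n<TYPE>A\n<SEQUENCE>1\n<TEXT>\nfirst body\n</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>B\n<SEQUENCE>2\n<TEXT>\nsecond body\n</TEXT>\n</DOCUMENT>\n"
)
page = html.fromstring(sample)


def first_text(node, path):
    """Hypothetical helper: first text node matched by path, stripped."""
    found = node.xpath(path)
    return found[0].strip() if found else None


# sequence sits inside type, so it is addressed as type/sequence.
for doc in page.xpath("//document"):
    print(first_text(doc, "type/text()"), first_text(doc, "type/sequence/text()"))

# Numeric comparison uses the whole string value of sequence, nested text
# included, so it does not match once <TEXT> has content.
by_number = page.xpath("//document[type/sequence=1]")

# Comparing only the first text node of sequence avoids that.
by_first_text = page.xpath(
    "//document[normalize-space(type/sequence/text()[1]) = '1']"
)
print(len(by_number), len(by_first_text))
```

Comparing the first text node rather than the whole element keeps the test independent of whatever follows the sequence number inside the same document.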

2 Comments

Thanks a lot, @Anzel, but still no luck. Your code gives me [] as well.
@novice_007, well it works fine on my machine, that leads me to believe your html parser may be broken somehow. Are you sure you have the libxml2 installed? You can check with python -c "import libxml2" and see if it throws an error

Try using a relative path that explicitly specifies the full path to your element (without skipping type):

page.xpath("//document[./type/sequence = 1]")

See: http://pastebin.com/ezQXtKcr

Output:

Trying original post (novice_007): //document[sequence=1]/descendant::*/text()
[]
Using GKFX's answer: //DOCUMENT[starts-with(SEQUENCE,'1')]
[]
My answer: //document[./type/sequence = 1]
[<Element document at 0x1bfcb30>]

Currently, the xpath I provided is the only one that ... to just get the document with sequence value 1

1 Comment

How is that path expression more relative than the one in the OP?
