Navigating HTML by XPath in Python

Question

I am fetching a URL whose content is formatted roughly like this:

<DOCUMENT>
 <TYPE>A
 <SEQUENCE>1
 <TEXT>
  <HTML>
   <BODY BGCOLOR="#FFFFFF" LINK=BLUE  VLINK=PURPLE>
   </BODY>
  </HTML>
 </TEXT>
</DOCUMENT>

<DOCUMENT>
 <TYPE>B
 <SEQUENCE>2
 ...

As you can see, it opens a document (the one with sequence number 1), closes it, and then the document with sequence number 2 starts, and so on.

What I want is an XPath expression, used from Python, that returns only the document with SEQUENCE value 1 (or, equivalently, TYPE A).

I assumed something like this would work:

from lxml import html

# pagehtml holds the downloaded page source as a string
page = html.fromstring(pagehtml)
type_a = page.xpath("//document[sequence=1]/descendant::*/text()")

However, type_a ends up as an empty list.

Could someone please tell me what the mistake in this code is? I am really new to XML.

What you have as input is neither HTML nor XML of any kind. Did you write it yourself?
– Mathias Müller, Commented Nov 12, 2014 at 19:48
@MathiasMüller: No, I did not write it myself. Here is an example of the kind of URL I am trying to crawl: sec.gov/Archives/edgar/data/21344/000104746909001875/… It is a pretty long page, which is why I did not include it in my question.
– MarcusAerlius, Commented Nov 12, 2014 at 21:09

Community · Accepted Answer · 2017-05-23 10:33:40Z

3

It might be because that's highly dubious HTML. The <SEQUENCE> tag is unclosed, so it could well be interpreted by lxml as containing all of the code until the next </DOCUMENT>, so it does not end up just containing the 1. When your XPath code then looks for a <SEQUENCE> containing 1, there isn't one.

Additionally, XML is case-sensitive, but HTML isn't. XPath is designed for XML, so it is also case sensitive, which would also stop your document matching <DOCUMENT>.
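
To see the case issue in isolation, here is a minimal sketch, assuming lxml is installed. The markup is a made-up, well-formed stand-in, not the EDGAR page. As far as I know, lxml.html stores tag names in lowercase while parsing, so in this sketch it is the lowercase expression that matches:

```python
from lxml import html

# Hypothetical, well-formed stand-in for the real page.
snippet = "<DIV><P>hello</P></DIV>"
root = html.fromstring(snippet)

# The HTML parser stores tag names in lowercase.
tag_name = root.tag

# XPath name tests are case-sensitive, so only one spelling can match.
upper_hits = root.xpath("//P")
lower_hits = root.xpath("//p")

print(tag_name, len(upper_hits), len(lower_hits))
```

Whichever case the parser ends up storing is the case the XPath has to use.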

Try //DOCUMENT[starts-with(SEQUENCE,'1')]. That is based on the question "XPath using starts-with function".

Ideally, if the input is under your control, you should instead just close the type and sequence tags (with </TYPE> and </SEQUENCE>) to make the input valid.
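
If the input is not under your control, one possible workaround is to close those tags yourself before parsing. This is only a sketch: close_header_tags is a hypothetical helper, and it assumes each TYPE and SEQUENCE value sits alone on its own line, as in the excerpt from the question.

```python
import re

def close_header_tags(text, names=("TYPE", "SEQUENCE")):
    """Add a closing tag to each one-line header such as '<TYPE>A'."""
    pattern = re.compile(
        r"^([ \t]*)<(%s)>([^<\n]*)$" % "|".join(names),
        re.MULTILINE,
    )

    def add_close(match):
        indent, name, value = match.groups()
        return "%s<%s>%s</%s>" % (indent, name, value.rstrip(), name)

    return pattern.sub(add_close, text)

# Shortened copy of the markup from the question.
raw = """<DOCUMENT>
 <TYPE>A
 <SEQUENCE>1
 <TEXT>
 </TEXT>
</DOCUMENT>"""

fixed = close_header_tags(raw)
print(fixed)
```

The result can then be handed to the parser in place of the original string. Whether that is enough for the full page depends on what else is in it.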

edited May 23, 2017 at 10:33

CommunityBot

answered Nov 12, 2014 at 18:47

GKFX

3 Comments

GKFX Over a year ago

This is the first time I've actually used Python or XPath, so this is based on experience of HTML/XML only.

Mathias Müller Over a year ago

+1 "highly dubious HTML" - exactly. And yes, XPath is case-sensitive.

MarcusAerlius Over a year ago

Thanks a lot, @GKFX. I agree that it is an odd one. Unfortunately, I don't have control over the markup (I provided a link in the comments on the question). Unfortunately, your code did not work either. It is odd, because XPath does recognize sequence as a node: when I use //sequence/descendant::*/text() it finds the correct place. But, as you mentioned, it does not detect where the element ends.

Anzel · Accepted Answer · 2014-11-12 21:18:07Z

2

Apart from the great answer provided by @GKFX, I'd like to point out that the lxml.html module is capable of parsing broken HTML or an HTML fragment. In fact, it will parse your string and handle it just fine.

fromstring(string): Returns document_fromstring or fragment_fromstring, based on whether the string looks like a full document, or just a fragment.

The problem you have (which may also come from the other code that generates the string) is that you haven't given the true path to the SEQUENCE node.

type_a = page.xpath("//document[sequence=1]/descendant::*/text()")

The XPath above looks for document nodes that have a direct child named sequence whose value is 1. In the parsed tree, however, the first child of document is type, not sequence (sequence is nested inside type), so the expression never matches.

Consider rewriting it like this, which returns what you need:

page.xpath('//document[type/sequence=1]/descendant::*/text()')
['A\n ', '1\n ']

Because your HTML string is missing the closing tag for sequence, however, a different XPath like this one does not give the correct result:

page.xpath('//document[type/sequence=1]/../..//text()')
['A\n ', '1\n ', 'B\n ', '2']

That is because the first sequence element (value 1) has no closing tag, so the second one (value 2) ends up nested inside it.

I should stress that your HTML string is still invalid, but lxml's parser is tolerant enough to handle your case just fine.
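
To check the nesting on your own input, here is a small sketch, assuming lxml is installed. outline is a hypothetical helper name, and the sample markup is a well-formed stand-in rather than the real page:

```python
from lxml import html

def outline(element, depth=0):
    """Return the element tree as a list of indented tag names."""
    lines = []
    # Comments and processing instructions have a non-string tag; skip them.
    if isinstance(element.tag, str):
        lines.append("  " * depth + element.tag)
        for child in element:
            lines.extend(outline(child, depth + 1))
    return lines

# Well-formed stand-in; swap in your own page source to inspect it.
sample = "<div><p>one</p><p>two</p></div>"
tree_lines = outline(html.fromstring(sample))
print("\n".join(tree_lines))
```

Calling outline(html.fromstring(pagehtml)) on the real page should show where sequence actually sits and whether the second document ended up inside the first.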

answered Nov 12, 2014 at 21:18

Anzel

2 Comments

MarcusAerlius Over a year ago

Thanks a lot, @Anzel, but still no luck. Your code gives me [] as well.

Anzel Over a year ago

@novice_007, it works fine on my machine, which leads me to believe your HTML parser may be broken somehow. Are you sure you have libxml2 installed? You can check with python -c "import libxml2" and see whether it throws an error.
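
A related check, offered with some caution: lxml links against the libxml2 C library directly and does not rely on the separate libxml2 Python bindings, so that import can fail on a setup where lxml works. A minimal sketch that asks lxml itself which versions it uses, assuming lxml imports at all:

```python
from lxml import etree

# Version tuples exposed by lxml.etree.
lxml_version = etree.LXML_VERSION              # lxml itself
libxml_runtime = etree.LIBXML_VERSION          # libxml2 library loaded at runtime
libxml_built = etree.LIBXML_COMPILED_VERSION   # libxml2 that lxml was compiled against

print("lxml:", lxml_version)
print("libxml2 (runtime):", libxml_runtime)
print("libxml2 (compiled):", libxml_built)
```

If two machines give different XPath results for the same string, comparing these numbers is one place to start.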

Curtis Mattoon · Accepted Answer · 2014-11-12 20:15:59Z

1

Try ~~using a relative path:~~ explicitly specifying the correct path to your element (that is, not skipping type):

page.xpath("//document[./type/sequence = 1]")

See: http://pastebin.com/ezQXtKcr

Output:

Trying original post (novice_007): //document[sequence=1]/descendant::*/text()
[]
Using GKFX's answer: //DOCUMENT[starts-with(SEQUENCE,'1')]
[]
My answer: //document[./type/sequence = 1]
[<Element document at 0x1bfcb30>]

Currently, the xpath I provided is the only one that ... to just get the document with sequence value 1

edited Nov 12, 2014 at 20:15

answered Nov 12, 2014 at 18:42

Curtis Mattoon

1 Comment

Mathias Müller Over a year ago

How is that path expression more relative than the one in the OP?
