So, I am accessing some url that is formatted something like the following:
<DOCUMENT>
<TYPE>A
<SEQUENCE>1
<TEXT>
<HTML>
<BODY BGCOLOR="#FFFFFF" LINK=BLUE VLINK=PURPLE>
</BODY>
</HTML>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>B
<SEQUENCE>2
...
As you can see, it starts a document, (which is the sequence number 1), and then finishes the document, and then document with sequence 2 starts and so on.
So, what I want to do, is to write an xpath address in python such that to just get the document with sequence value 1, (or, equivalently, TYPE A).
I supposed that such a thing would work:
import lxml
from lxml import html
page = html.fromstring(pagehtml)
type_a = page.xpath("//document[sequence=1]/descendant::*/text()")
however, it just gives me an empty list as type_a variable.
Could someone please let me know what is my mistake in this code? I am really new to this xml stuff.