python lxml xpath returning escape characters in list with text

Question

Until last week, my experience with Python was limited to working with large database files on our network. Now I suddenly find myself trying to extract information from HTML tables.

After a lot of reading, I chose to use lxml and xpath with Python 2.7 to retrieve the data in question. I have retrieved one field using the following code:

xpath = "//table[@id='resultsTbl1']/tr[position()>1]/td[@id='row_0_partNumber']/child::text()"

which produced the following list:

['\r\n\t\tBAR18FILM/BKN', '\r\n\t\t\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t\r\n\t\t']

I recognize the CR/LF and tab escape characters. How can I avoid them?

larsks · Accepted Answer · 2015-05-08 13:04:08Z

Those characters are part of the document itself (the line breaks and indentation inside each cell are text too), which is why they are being returned. You can't avoid them, but you can strip them out by calling the .strip() method on each item returned:

results = [x.strip() for x in results]

This would strip leading and trailing whitespace. Without seeing your actual code and data it's harder to give a good answer.
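Applied to the list shown in the question, stripping alone would leave several empty strings behind, because most of those items are nothing but whitespace. A minimal sketch of one way to handle that, assuming the empty entries are not wanted (the names `stripped` and `cleaned` are only illustrative):

```python
# The list the question's XPath returned (copied from the question).
results = ['\r\n\t\tBAR18FILM/BKN', '\r\n\t\t\r\n\t\t\t', '\r\n\t\t\t',
           '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t\r\n\t\t']

# Strip leading and trailing whitespace from every item.
stripped = [x.strip() for x in results]

# Keep only the items that still contain something after stripping.
cleaned = [x for x in stripped if x]

print(repr(cleaned))  # ['BAR18FILM/BKN']
```

If the position of each item matters (for example, one entry per cell), it may be better to keep `stripped` and leave the empty strings in place.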

For example, given this script:

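#!/usr/bin/python

from lxml import etree

# Parse the sample document shown below.
with open('data.xml') as fd:
    doc = etree.parse(fd)

# Select the text nodes of every cell, skipping the header row.
results = doc.xpath(
    "//table[@id='results']/tr[position()>1]/td/child::text()")

# print() with a single argument behaves the same on Python 2.7 and 3.
print('Before stripping')
print(repr(results))

print('After stripping')
# Remove leading and trailing whitespace from each text node.
results = [x.strip() for x in results]
print(repr(results))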

And this data:

<doc>
  <table id="results">
    <tr>
      <th>ID</th><th>Name</th><th>Description</th>
    </tr>

    <tr>
      <td>
      1
      </td>
      <td>
      Bob
      </td>
      <td>
      A person
      </td>
    </tr>
    <tr>
      <td>
      2
      </td>
      <td>
      Alice
      </td>
      <td>
      Another person
      </td>
    </tr>
  </table>
</doc>

We get these results:

Before stripping
['\n\t\t\t1\n\t\t\t', '\n\t\t\tBob\n\t\t\t', '\n\t\t\tA person\n\t\t\t', '\n\t\t\t2\n\t\t\t', '\n\t\t\tAlice\n\t\t\t', '\n\t\t\tAnother person\n\t\t\t']
After stripping
['1', 'Bob', 'A person', '2', 'Alice', 'Another person']
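Whitespace-only text nodes, like most of the items in the question's list, can also be filtered inside the XPath expression itself. The following is a sketch rather than a definitive approach: it assumes a small inline sample (hypothetical, shaped like the data above but with an empty `<br/>` element added so that one cell has a whitespace-only text node) and uses the standard XPath 1.0 `normalize-space()` function as a predicate.

```python
from lxml import etree

# Hypothetical sample: the <br/> leaves a whitespace-only text node behind it.
xml = b"""<doc>
  <table id="results">
    <tr><th>ID</th><th>Name</th></tr>
    <tr>
      <td>
      1
      </td>
      <td>
      Bob <br/>
      </td>
    </tr>
  </table>
</doc>"""

doc = etree.fromstring(xml)

# [normalize-space()] is true only for text nodes that contain something
# other than whitespace, so whitespace-only nodes are dropped by the query.
nodes = doc.xpath(
    "//table[@id='results']/tr[position()>1]/td/text()[normalize-space()]")

# The surviving nodes still carry their surrounding whitespace, so strip them.
values = [n.strip() for n in nodes]

print(repr(values))  # ['1', 'Bob']
```

The predicate only decides which text nodes are returned; it does not trim them, so the `.strip()` call is still needed.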
