
I am trying to use lxml to read HTML from a string, find all img tags, update each image's src attribute, and add a hyperlink around each image found.

So this:

<img src="old-value" />

should become this:

<a href=""><img src="new-value" /></a>

I am facing two problems. First, I am using etree.HTML to load the HTML string, which for some reason adds html and body tags to the markup. Is there a way to load it without this happening automatically?

The other problem I have not been able to solve: how do I add the hyperlink element around the img tag? I tried the code below, but it adds the hyperlink element inside the img tag:

tree = etree.HTML(self.content)
imgs = tree.xpath('.//img')
thm = "new-value"
for img in imgs:
     img.set('src', thm)
     a = etree.Element('a', href="#")
     img.insert(0, a)

Can anyone advise, please?

Update:

I just tried the approach provided by @Alko and it works well, but it has a problem with the type of content I am using.

The img tag is located inside p tags, as in the example below:

<html><body><p><img src="/public_media/cache/66/ed/66edd1c01e3027ba18bef9244ca8e8b4.jpg?id=31"/>jshjksh skjhs jksh skjhsj ksh jkshs kjhs kjsh sjkhs khs ksh skh skh skjh skjh skjh ksjh ksh skhs kjsh skjh skhs khs kjsh skjh skjhs kshk sjh skjhs kjsh skjh skjh ksj ksjh jsk hskjh s</p><p>jshjksh skjhs jksh skjhsj ksh jkshs kjhs kjsh sjkhs khs ksh skh skh skjh&#13;
 skjh skjh ksjh ksh skhs kjsh skjh skhs khs kjsh skjh skjhs kshk sjh &#13;
skjhs kjsh skjh skjh ksj ksjh jsk hskjh s</p></body></html>

What happens when I run the given solution is that the closing a tag is added after the end of the paragraph.

Comments

  • Great that you are starting to use lxml now. Could you please accept the answer that gave you this idea, since it solved your problem of replacing the src value: stackoverflow.com/questions/20595735/… Commented Dec 17, 2013 at 16:10
  • I just did, thanks jon :) I appreciate your input. Commented Dec 17, 2013 at 16:14

2 Answers


You can call addprevious to place the new a element directly before the img, then insert the img into it:

imgs = tree.xpath('.//img')
thm = "new-value"
for img in imgs:
    img.set('src', thm)
    a = etree.Element('a', href="#")
    img.addprevious(a)
    a.insert(0, img)

That will result in

>>> etree.tostring(tree)
'<html><body><a href="#"><img src="new-value"/></a></body></html>'

Also, lxml.html.fragment_fromstring can be useful here, but you would need to provide a more varied example: with a lone img element, the fragment root is the img itself, so your xpath './/img' will not find it.

See following demo:

>>> import lxml.html
>>> from lxml import etree
>>> img = lxml.html.fragment_fromstring('<img src="old-value" />')
>>> thm = "new-value"
>>> img.set('src', thm)
>>> a = etree.Element('a', href="#")
>>> a.insert(0, img)
>>> lxml.html.etree.tostring(a)
'<a href="#"><img src="new-value"/></a>'
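For content with several top-level elements, a rough sketch of parsing without the added html and body tags might look like the following. It assumes the create_parent argument of lxml.html.fragment_fromstring is acceptable for your content; the helper names parse_fragment and inner_html are made up for illustration and are not part of lxml.

```python
import lxml.html


def parse_fragment(html_fragment):
    # Hypothetical helper: create_parent="div" wraps the fragment in a
    # temporary <div>, so several top-level nodes parse without <html>/<body>.
    return lxml.html.fragment_fromstring(html_fragment, create_parent="div")


def inner_html(root):
    # Hypothetical helper: serialize only the children of the temporary
    # wrapper, so the wrapper itself does not appear in the output.
    parts = [root.text or ""]
    for child in root:
        # tostring includes each child's tail text by default.
        parts.append(lxml.html.tostring(child, encoding="unicode"))
    return "".join(parts)


root = parse_fragment('<p><img src="a.jpg"/>some text</p><p>more text</p>')
result = inner_html(root)
```

The wrapper div only exists so that xpath queries such as './/img' have a common root to search from; it is dropped again on serialization.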

Update

When the img element has tail text, you can reassign it to the newly created a element:

>>> s = '<html><body><p><img src="old_value"/>some text</p></body></html>'
>>> tree = etree.HTML(s)
>>> imgs = tree.xpath('.//img')
>>> thm = "new-value"
>>> for img in imgs:
...     img.set('src', thm)
...     a = etree.Element('a', href="#")
...     img.addprevious(a)
...     a.insert(0, img)
...     a.tail = img.tail
...     img.tail = ''
...
>>> etree.tostring(tree)
'<html><body><p><a href="#"><img src="new-value"/></a>some text</p></body></html>'
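The steps above can be gathered into one reusable function. This is only a sketch under a few assumptions: the name wrap_images and its parameters are invented here, it targets Python 3 (hence encoding="unicode" to get a str back), and it serializes with method="html", so void elements such as img are written without a self-closing slash.

```python
from lxml import etree


def wrap_images(html, new_src, href="#"):
    # Hypothetical helper: wrap every <img> in <a href=...> and update src.
    tree = etree.HTML(html)
    # xpath returns a plain list, so modifying the tree while looping is safe.
    for img in tree.xpath(".//img"):
        img.set("src", new_src)
        a = etree.Element("a", href=href)
        img.addprevious(a)   # place <a> as a sibling directly before <img>
        a.append(img)        # move <img> into <a>; its tail text moves with it
        a.tail = img.tail    # hand the trailing text over to <a>
        img.tail = None      # so the text ends up after </a>, not inside it
    return etree.tostring(tree, encoding="unicode", method="html")


result = wrap_images(
    '<html><body><p><img src="old_value"/>some text</p></body></html>',
    "new-value",
)
```

Because etree.HTML is used, the result still contains the html and body wrappers; only the placement of the closing a tag is addressed here.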

4 Comments

Thanks for the quick response, addprevious did work, but it doesn't add a closing tag after the image. I am getting <a href="#" /><img src..> and that's it; there is no closing </a> after the img.
@MoJ.Mughrabi you have to call addprevious and then insert the img into the new (<a>) element; see the samples in my answer.
I just updated the question. The problem is in the HTML itself: the image tag is located inside a p tag, which causes the closing a tag to be placed after the paragraph when using insert.
@MoJ.Mughrabi this text is in fact part of the img element, i.e. img.tail. You can use a.tail = img.tail; img.tail = ''
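from lxml import etree

# Build a new <div id="links"> holding one <a><img/></a> pair per image found.
holder = etree.Element('div', {'id': 'links'})
for img in imgs:
    # SubElement takes the tag name as its second argument.
    a_tag = etree.SubElement(holder, 'a', {'href': '#'})
    img_tag = etree.SubElement(a_tag, 'img', {'src': 'new_value'})

etree.tostring(holder)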
