Hi, I would like to write a small helper tool in Python. It should process the following content:

<tr>
 <td><p>L1</p></td>
 <td><p>(4.000x2.300x500;   4,6m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.221 kg</p></td>
 </tr>
 <tr>
 <td><p>L2</p></td>
 <td><p>(4.250x2.300x500;   4,9m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.279 kg</p></td>
 </tr>
 <tr>
 <td><p>L3</p></td>
 <td><p>(4.500x2.300x500;   5,2m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.321 kg</p></td>
 </tr>
 <tr>
 <td><p>L4</p></td>
 <td><p>(4.750x2.300x500;   5,5m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.364 kg</p></td>
 </tr>

It should replace the &nbsp; of each table row with the volume, in this case everything between the ; and the ) in the second table data field of each row.

I started to code it in Python like this, and I can already scrape the volume with a regex, but I am stuck on how to put the values in the right place. Any ideas? Here is my code:

import BeautifulSoup
import re

with open('3mmcontainer.html') as f:
    content = f.read()
f.close()

#print content

contentsoup = BeautifulSoup.BeautifulSoup(content)

for tablerow in contentsoup.findAll('tr'):
    inhalt = str(tablerow.contents[3])
    print inhalt


    match = re.findall('\;(.*?)\)', inhalt)


    print match
# for x in match:
#    volumen = x.lstrip()
#    print volumen

   #f = open('3mmcontainer.html', 'w')
   #newdata = f.replace("&nbsp;", volumen)
   #f.write(newdata)
   #f.close()


#m = re.search('\;(.*?)\)', inhalt)
# print m

# volumen = re.compile(r'\;(.*?)\)')
# volumen.match(tablerow.contents[3])

3 Answers

3

NB: you don't need to call close() because the with statement will do it for you.
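
As a small illustration of that point, the sketch below reads a file inside a with block and then checks the file object's closed attribute. The temporary file is only a stand-in for 3mmcontainer.html so that the snippet runs anywhere; the names tmp and path are made up for this example.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on 3mmcontainer.html.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<tr><td><p>&nbsp;</p></td></tr>")
    path = tmp.name

with open(path, encoding="utf-8") as f:
    content = f.read()

# The with block has already closed the file; no explicit f.close() is needed.
print(f.closed)

# Clean up the throwaway file.
os.remove(path)
```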

You can use a simple function to transform each row (<tr>):

import re


def parse_inhalt(content):
    td_list = re.findall(r"<td>(?:(?!</td>).)+</td>", content)
    vol_content = td_list[1]
    vol = re.findall(r";([^)]+)", vol_content)[0]
    return content.replace("&nbsp;", vol)

The code is straightforward:

  • Extract each cell into td_list
  • Get the content of the second cell, which contains the volume
  • Find the volume contained between ";" and ")" (excluding those characters)
  • Replace the &nbsp; with the volume

For instance:

inhalt = u"""\
<tr>
<td><p>L4</p></td>
<td><p>(4.750x2.300x500;   5,5m³)</p></td>
<td><p>&nbsp;</p></td>
<td><p> 1.364 kg</p></td>
</tr>"""

print(parse_inhalt(inhalt))

You get:

<tr>
<td><p>L4</p></td>
<td><p>(4.750x2.300x500;   5,5m³)</p></td>
<td><p>   5,5m³</p></td>
<td><p> 1.364 kg</p></td>
</tr>

You can drop the spaces by using:

vol = re.findall(r";\s*([^)]+)", vol_content)[0]
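
To run this over a whole document rather than a single row, one possible approach is to match each <tr>...</tr> block and pass it through parse_inhalt. The following is only a sketch: ROW_RE and fill_all_rows are names invented here, and it assumes rows are not nested and that every row has the volume in its second cell.

```python
import re

# Hypothetical helper pattern: one non-nested <tr>...</tr> block.
ROW_RE = re.compile(r"<tr>.*?</tr>", re.DOTALL)


def parse_inhalt(content):
    """Fill the &nbsp; cell of a single row with that row's volume."""
    td_list = re.findall(r"<td>(?:(?!</td>).)+</td>", content)
    # The volume sits between ";" and ")" in the second cell.
    vol = re.findall(r";\s*([^)]+)", td_list[1])[0]
    return content.replace("&nbsp;", vol)


def fill_all_rows(html):
    """Apply parse_inhalt to every row found in the document."""
    return ROW_RE.sub(lambda m: parse_inhalt(m.group(0)), html)


sample = """\
<tr>
 <td><p>L1</p></td>
 <td><p>(4.000x2.300x500;   4,6m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.221 kg</p></td>
 </tr>
 <tr>
 <td><p>L2</p></td>
 <td><p>(4.250x2.300x500;   4,9m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.279 kg</p></td>
 </tr>
"""

result = fill_all_rows(sample)
print(result)
```

The result string can then be written back to the file with a second with open(..., 'w') block.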

1

An alternative approach.

First, find all of the table cells and the p elements within them. The p elements that hold the volume are characterised by the presence of 'm³' in their text, so watch for that; the p element that follows immediately is the one you must change. When you encounter a volume, capture it and note the ordinal number of its p element. Then, when you reach the next p element, change its text by assigning the captured value to its string attribute. (The code below calls this value area, although it is really a volume.)

If you prefer a regex, you could use this to extract the value (after import re):

area = re.search(r';\s+([^\)]+)', p.text).group(1)

>>> import bs4
>>> soup = bs4.BeautifulSoup(open('temp.htm').read(), 'lxml')
>>> k = None
>>> for i, p in enumerate(soup.select('td > p')):
...     if 'm³' in p.text:
...         area = p.text[1+p.text.rfind(';'):-1].strip()
...         k = i
...     if k is not None and i == k + 1:
...         p.string = area
... 
>>> soup
<html><body><tr>
<td><p>L1</p></td>
<td><p>(4.000x2.300x500;   4,6m³)</p></td>
<td><p>4,6m³</p></td>
<td><p> 1.221 kg</p></td>
</tr>
<tr>
<td><p>L2</p></td>
<td><p>(4.250x2.300x500;   4,9m³)</p></td>
<td><p>4,9m³</p></td>
<td><p> 1.279 kg</p></td>
</tr>
<tr>
<td><p>L3</p></td>
<td><p>(4.500x2.300x500;   5,2m³)</p></td>
<td><p>5,2m³</p></td>
<td><p> 1.321 kg</p></td>
</tr>
<tr>
<td><p>L4</p></td>
<td><p>(4.750x2.300x500;   5,5m³)</p></td>
<td><p>5,5m³</p></td>
<td><p> 1.364 kg</p></td>
</tr></body></html>
>>> 

1

If a brute-force regex is acceptable:

s='''
<tr>
 <td><p>L1</p></td>
 <td><p>(4.000x2.300x500;   4,6m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.221 kg</p></td>
 </tr>
 <tr>
 <td><p>L2</p></td>
 <td><p>(4.250x2.300x500;   4,9m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.279 kg</p></td>
 </tr>
 <tr>
 <td><p>L3</p></td>
 <td><p>(4.500x2.300x500;   5,2m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.321 kg</p></td>
 </tr>
 <tr>
 <td><p>L4</p></td>
 <td><p>(4.750x2.300x500;   5,5m³)</p></td>
 <td><p>&nbsp;</p></td>
 <td><p> 1.364 kg</p></td>
 </tr>
'''

import re

p=r'(\([0-9x.]+)(; +)([0-9,m³]+)(\)</p></td>\n <td><p>)(&nbsp;)'

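# not sure which output is preferred
# keep the second cell as it is and copy the volume into the &nbsp; cell
x = re.sub(p, r'\g<1>\g<2>\g<3>\g<4>\g<3>', s)
print(x)

# move the volume out of the second cell into the &nbsp; cell
y = re.sub(p, r'\g<1>\g<4>\g<3>', s)
print(y)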
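
The pattern above hard-codes the newline and single space between the two cells, so it only matches input indented exactly like the sample. A slightly more tolerant variant might look like the sketch below; PATTERN and fill_volume are names made up for this example, and it still assumes the &nbsp; cell directly follows the volume cell.

```python
import re

# Hypothetical variant: \s* accepts any whitespace between the two cells,
# and [^)<]+ keeps the captured volume from running across tags.
PATTERN = re.compile(r"(;\s*)([^)<]+)(\)</p></td>\s*<td><p>)&nbsp;")


def fill_volume(html):
    """Copy the volume after ';' into the following &nbsp; cell."""
    return PATTERN.sub(r"\g<1>\g<2>\g<3>\g<2>", html)


row = (
    "<tr>\n"
    "\t<td><p>L1</p></td>\n"
    "\t<td><p>(4.000x2.300x500;   4,6m³)</p></td>\n"
    "\t<td><p>&nbsp;</p></td>\n"
    "\t<td><p> 1.221 kg</p></td>\n"
    "</tr>\n"
)

filled = fill_volume(row)
print(filled)
```

Here the row is indented with tabs rather than a single space, which the original pattern would not match.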
