I'm using REGEX to compile a list of strings from an HTML document in Python. The strings are either found inside a td tag (<td>SOME OF THE STRINGS COULD BE HERE</td>) or inside a div tag (<div style="line-height: 100%;margin:0;padding:0;">SOME STRINGS COULD ALSO BE HERE</div>).

Since the order of the strings inside the final list should correspond to the order in which they appear inside the HTML document, I am looking for a REGEX that will allow me to compile all of these strings considering both possible cases.

I know how to do it individually with something that looks like:

FindStrings = re.compile(r'(?<=<td>)(.*?)(?=</td>)')
MyList = FindStrings.findall(str(mydocument))

for the first case, but I would like to know the most efficient way to combine both cases into a single regex.

Why don't you use BeautifulSoup? Commented Oct 11, 2014 at 3:18

2 Answers

You can combine the two patterns with regex alternation (|). You also don't need lookarounds: a capturing group inside each alternative is enough.

You can use this regex:

<td>(.+?)</td>|<div.*?>(.+?)</div>

Match information

MATCH 1
1.  [4-37]  `SOME OF THE STRINGS COULD BE HERE`
MATCH 2
2.  [94-125]    `SOME STRINGS COULD ALSO BE HERE`

Code:

>>> import re
>>> s = """<td>SOME OF THE STRINGS COULD BE HERE</td>
... <div style="line-height: 100%;margin:0;padding:0;">SOME STRINGS COULD ALSO BE HERE</div>
... """
>>> m = re.findall(r'<td>(.+?)</td>|<div.*?>(.+?)</div>', s)
>>> m
[('SOME OF THE STRINGS COULD BE HERE', ''), ('', 'SOME STRINGS COULD ALSO BE HERE')]
>>> [g for groups in m for g in groups if g]
['SOME OF THE STRINGS COULD BE HERE', 'SOME STRINGS COULD ALSO BE HERE']
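If the tuples with empty strings returned by re.findall are inconvenient, one possible variation is to iterate with re.finditer and keep whichever group took part in each match. This is only a sketch built on the same sample input; the names html, pattern and strings are made up for illustration.

```python
import re

# Same sample input as in the session above.
html = """<td>SOME OF THE STRINGS COULD BE HERE</td>
<div style="line-height: 100%;margin:0;padding:0;">SOME STRINGS COULD ALSO BE HERE</div>
"""

pattern = re.compile(r'<td>(.+?)</td>|<div.*?>(.+?)</div>')

# Only one alternative matches at a time, so exactly one group is set
# and the other is None. Matches are yielded in document order.
strings = [m.group(1) or m.group(2) for m in pattern.finditer(html)]
print(strings)
```

This produces a flat list in the order the strings appear in the document, with no post-filtering step.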

2 Comments

This seems to do only half the job... I get a list of string pairs, each containing an empty string: [('', 'One of the string'), ('', 'Another one'), ...]
Hi @LaGuille, as you can see in the match information above, it captures both of the needed strings.
<td[^>]*>((?:(?!<\/td>).)*)<\/td>|<div[^>]*>((?:(?!<\/div>).)*)<\/div>

You can try this. See the demo:

http://regex101.com/r/mD7gK4/11
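For reference, here is a rough sketch of how that pattern might be used from Python. The sample html string and the variable names are invented for illustration, the \/ escapes are dropped because Python does not need them, and re.DOTALL is an added assumption so that content spanning several lines is also captured.

```python
import re

# The pattern from above, without the "\/" escapes. Each alternative
# accepts attributes on the opening tag and captures everything up to
# the first matching closing tag.
pattern = re.compile(
    r'<td[^>]*>((?:(?!</td>).)*)</td>|<div[^>]*>((?:(?!</div>).)*)</div>',
    re.DOTALL,  # assumption: let "." match newlines inside a cell
)

# Hypothetical input, not taken from the original document.
html = '<td class="a">first\nline</td><div style="margin:0;">second</div><td></td>'

# The captured text may be empty here (the groups use "*"), so test
# against None instead of relying on truthiness.
strings = [
    m.group(1) if m.group(1) is not None else m.group(2)
    for m in pattern.finditer(html)
]
print(strings)
```

Unlike the first answer's pattern, this one also matches td tags that carry attributes and cells that are empty.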
