Python regular expression match in html file

Question

I am trying to match text inside an HTML file. This is the HTML:

<td>
<b>BBcode</b><br />
<textarea onclick='this.select();' style='width:300px;     height:200px;' />
[URL=http://someimage.com/LwraZS1]          [IMG]http://t1.someimage.com/LwraZS1.jpg[/IMG][    [/URL] [URL=http://someimage.com/CDnuiST]   [IMG]http://t1.someimage.com/CDnuiST.jpg[/IMG]   [/URL] [URL=http://someimage.com/Y0oZKPb][IMG]http://t1.someimage.com/Y0oZKPb.jpg[/IMG][/URL] [URL=http://someimage.com/W2RMAOR][IMG]http://t1.someimage.com/W2RMAOR.jpg[/IMG][/URL] [URL=http://someimage.com/5e5AYUz][IMG]http://t1.someimage.com/5e5AYUz.jpg[/IMG][/URL] [URL=http://someimage.com/EWDQErN][IMG]http://t1.someimage.com/EWDQErN.jpg[/IMG][/URL]
</textarea>
</td>

I want to extract all the BBCode, from [ to ] inclusive.

And this is my code:

import re
x = open('/xxx/xxx/file.html', 'r').read
y = re.compile(r"""<td> <b>BBcode</b><br /><textarea onclick='this.select();' style='width:300px; height:200px;' />. (. *) </textarea> </td>""") 
z  = y.search(str(x())
print z

But when I run this I get a None object. Where is the mistake?

You forgot the parentheses on read().

– C Panda, Commented Apr 16, 2016 at 7:05
No change, I still get None. Maybe the regex is wrong.

– Andrew Stef, Commented Apr 16, 2016 at 7:27
Yeah, I posted an answer. Check it.

– C Panda, Commented Apr 16, 2016 at 7:51

C Panda · Accepted Answer · 2016-04-16 11:34:41Z

1

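import re

# Read the whole file into one string.
with open('/xxx/xxx/file.html', 'rt') as f:
    s = f.read()

# Capture the body of every <textarea>; DOTALL lets '.' span newlines.
r1 = r'<textarea.*?>(.*?)</textarea>'
s1 = re.findall(r1, s, re.DOTALL)[1]  # second textarea, chosen by inspecting the page

# Capture the text between each pair of square brackets.
r2 = r'\[(.*?)\]'
s2 = re.findall(r2, s1)
for u in s2:
    print(u)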
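The question asks for the brackets to be included, while a capture group such as `\[(.*?)\]` returns only the text between them. A minimal sketch of one way to keep the brackets follows, assuming the textarea body looks like the snippet in the question. The `sample` string is a shortened, hypothetical stand-in for the real file, and this is only one possible variation, not the answer author's method.

```python
import re

# Hypothetical, shortened stand-in for the HTML in the question.
sample = (
    "<textarea onclick='this.select();'>\n"
    "[URL=http://someimage.com/LwraZS1][IMG]http://t1.someimage.com/LwraZS1.jpg[/IMG][/URL] "
    "[URL=http://someimage.com/CDnuiST][IMG]http://t1.someimage.com/CDnuiST.jpg[/IMG][/URL]\n"
    "</textarea>"
)

# Step 1: isolate the textarea body; DOTALL lets '.' cross line breaks.
body = re.search(r"<textarea[^>]*>(.*?)</textarea>", sample, re.DOTALL).group(1).strip()

# Step 2: the pattern has no capture group, so findall returns the
# whole match for each tag, brackets included.
tags = re.findall(r"\[[^\]]*\]", body)

print(body)   # the full BBCode block as one string
print(tags)   # each bracketed tag on its own
```

If the whole BBCode block is what is needed, `body` alone may be enough; `tags` is only useful when each bracketed piece has to be handled separately.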

edited Apr 16, 2016 at 11:34

answered Apr 16, 2016 at 7:51


12 Comments

Andrew Stef Over a year ago

Thanks, it works, but it picks up another part of the HTML because it is all inside <textarea>. I updated the question with the full HTML file. Thank you for the help, by the way!

C Panda Over a year ago

@AndrewStef Can you show some expected output? In regex problems it's always helpful.

Andrew Stef Over a year ago

The expected output should be [URL=someimage.com/LwraZS1] [IMG]t1.someimage.com/LwraZS1.jpg[/IMG][ [/URL] [URL=someimage.com/CDnuiST] [IMG]t1.someimage.com/CDnuiST.jpg [/IMG][/URL]... Exactly this. The page is what someimage.com outputs for uploaded files. I am trying to capture the BBCode text between [ and ] from it.

C Panda Over a year ago

@AndrewStef Try y = re.compile(r'<textarea.*?>(?:\[(.*?)\])*?</textarea>', re.DOTALL) and z = y.findall(x)

Andrew Stef Over a year ago

No match found, unfortunately.


James Doepp - pihentagyu · Accepted Answer · 2016-04-16 14:59:14Z

0

I would use a parser for this:

from html.parser import HTMLParser

class MyHtmlParser(HTMLParser):
    def __init__(self):
        # Let the base class set up its parser state.
        super().__init__(convert_charrefs=True)
        self.dat = []
    def handle_data(self, d):
        # Called for each run of text between tags.
        self.dat.append(d.strip())
    def return_data(self):
        return self.dat
>>> with open('sample.html') as htmltext:
        htmldata = htmltext.read()
>>> parser = MyHtmlParser()
>>> parser.feed(htmldata)
>>> res = parser.return_data()
>>> res = [item for item in filter(None, res)]
>>> res[0]
'BBcode'
>>>
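Collecting every text node means the BBCode has to be found by index. As a rough sketch of a variation, the parser can record only the text inside `<textarea>` elements. The class name `TextareaCollector` and the `sample` string are hypothetical, and the sample assumes a normally opened `<textarea ...>` tag rather than the self-closed `<textarea ... />` shown in the question, which would need separate handling.

```python
from html.parser import HTMLParser

class TextareaCollector(HTMLParser):
    """Collect the text found inside each <textarea> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.inside = False   # True while between <textarea> and </textarea>
        self.chunks = []      # text pieces of the current textarea
        self.areas = []       # finished textarea bodies

    def handle_starttag(self, tag, attrs):
        if tag == "textarea":
            self.inside = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "textarea" and self.inside:
            self.areas.append("".join(self.chunks).strip())
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)

# Hypothetical, shortened stand-in for the page in the question.
sample = (
    "<td><b>BBcode</b><br /><textarea style='width:300px;'>\n"
    "[URL=http://someimage.com/a][IMG]http://t1.someimage.com/a.jpg[/IMG][/URL]\n"
    "</textarea></td>"
)

collector = TextareaCollector()
collector.feed(sample)
collector.close()
print(collector.areas[0])
```

With several textareas on the page, `collector.areas` holds one string per element, in document order.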

answered Apr 16, 2016 at 14:59


4 Comments

Andrew Stef Over a year ago

Thanks for your answer! Actually, when I run this script and try to print res[0], I get this part of the HTML: box-shadow { -moz-box-shadow: 3px 3px 5px #000000; -webkit-box-shadow: 3px 3px 5px #000000; box-shadow: 3px 3px 5px #000000; }

Andrew Stef Over a year ago

Oh, never mind, I had to print the 4th item. Exactly what I needed. Thanks a lot! One last thing: how can I write the output to a file?

James Doepp - pihentagyu Over a year ago

As a simple text file: with open('filename.txt', 'w') as newfile: newfile.write(res[0])
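Spelled out over several lines, the same idea might look like the sketch below. The `res` list is a hypothetical stand-in for the parser output, and the temporary directory is only there so the example does not overwrite a real file; in practice any path could be used.

```python
import os
import tempfile

# Hypothetical stand-in for the filtered list returned by the parser.
res = ["BBcode", "[URL=http://someimage.com/LwraZS1][/URL]"]

# A temporary directory keeps this sketch away from real files.
out_path = os.path.join(tempfile.mkdtemp(), "filename.txt")

# Write one item per line.
with open(out_path, "w", encoding="utf-8") as newfile:
    for item in res:
        newfile.write(item + "\n")

# Read the file back to confirm what was written.
with open(out_path, encoding="utf-8") as check:
    written = check.read()
print(written)
```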

James Doepp - pihentagyu Over a year ago

Glad to be of help. I noticed you're new to StackOverflow -- welcome! If this or another answer solved your problem, you might want to accept it as the answer using the big checkmark on the left.

coralvanda · Accepted Answer · 2016-04-16 07:44:33Z

0

I think you need to add something like z.group() in order to pull the matched text out of the match object, right? So, just changing your last line to

print z.group()

might do it.
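One caveat worth checking: `search` returns None when nothing matches, and calling `.group()` on None raises an AttributeError. The sketch below, written for Python 3, tests for None first and also illustrates one likely reason for the missing match, namely that literal spaces in a pattern do not match the newlines in the file. The `html` string and both patterns are shortened, hypothetical examples, not the full pattern from the question.

```python
import re

# Hypothetical, shortened stand-in for the file contents.
html = "<td>\n<b>BBcode</b><br />\n<textarea>\n[URL=http://someimage.com/LwraZS1][/URL]\n</textarea>\n</td>"

# A literal space in the pattern does not match the newline in the file.
strict = re.compile(r"<td> <b>BBcode</b>")
# \s* accepts any run of whitespace, including newlines.
loose = re.compile(r"<td>\s*<b>BBcode</b>")

for name, pattern in (("strict", strict), ("loose", loose)):
    m = pattern.search(html)
    if m is None:
        # Guard before calling .group(); None has no such method.
        print(name, "-> no match")
    else:
        print(name, "->", repr(m.group()))
```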

answered Apr 16, 2016 at 7:44

