2

I am trying to match text inside an HTML file. This is the HTML:

<td>
<b>BBcode</b><br />
<textarea onclick='this.select();' style='width:300px;     height:200px;' />
[URL=http://someimage.com/LwraZS1]          [IMG]http://t1.someimage.com/LwraZS1.jpg[/IMG][    [/URL] [URL=http://someimage.com/CDnuiST]   [IMG]http://t1.someimage.com/CDnuiST.jpg[/IMG]   [/URL] [URL=http://someimage.com/Y0oZKPb][IMG]http://t1.someimage.com/Y0oZKPb.jpg[/IMG][/URL] [URL=http://someimage.com/W2RMAOR][IMG]http://t1.someimage.com/W2RMAOR.jpg[/IMG][/URL] [URL=http://someimage.com/5e5AYUz][IMG]http://t1.someimage.com/5e5AYUz.jpg[/IMG][/URL] [URL=http://someimage.com/EWDQErN][IMG]http://t1.someimage.com/EWDQErN.jpg[/IMG][/URL]
</textarea>
</td>

I want to extract all of the BBCode, from each [ to its closing ], brackets included.

And this is my code:

import re
x = open('/xxx/xxx/file.html', 'r').read
y = re.compile(r"""<td> <b>BBcode</b><br /><textarea onclick='this.select();' style='width:300px; height:200px;' />. (. *) </textarea> </td>""") 
z  = y.search(str(x())
print z          

But when I run this I get None. Where is the mistake?

3
  • You forgot the parentheses: read(). Commented Apr 16, 2016 at 7:05
  • No change, I still get None. Maybe the regex is wrong. Commented Apr 16, 2016 at 7:27
  • Yeah, I posted an answer. Check it. Commented Apr 16, 2016 at 7:51

3 Answers

1
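import re
# Read the whole file as text
x = open('/xxx/xxx/file.html', 'rt').read()
# Capture the body of every <textarea>; DOTALL lets . span newlines
r1 = r'<textarea.*?>(.*?)</textarea>'
s1 = re.findall(r1, x, re.DOTALL)[1] # just by inspection
# Capture the text between each [ and ]
r2 = r'\[(.*?)\]'
s2 = re.findall(r2, s1)
for u in s2:
    print(u)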
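
If the brackets should be kept, as the question asks, one option is to match each whole [URL=...]...[/URL] block without a capture group, so that findall returns the full match. This is only a sketch: `sample` is a shortened, hypothetical stand-in for the textarea content, and `extract_bbcode` is an illustrative name, not part of the answer above.

```python
import re

# Hypothetical sample: a shortened version of the textarea content in the question.
sample = (
    "[URL=http://someimage.com/LwraZS1]"
    "[IMG]http://t1.someimage.com/LwraZS1.jpg[/IMG][/URL] "
    "[URL=http://someimage.com/CDnuiST]"
    "[IMG]http://t1.someimage.com/CDnuiST.jpg[/IMG][/URL]"
)

def extract_bbcode(text):
    """Return every [URL=...]...[/URL] block in text, brackets included."""
    # No capture group, so findall returns the whole match.
    # The non-greedy .*? stops at the first closing [/URL].
    return re.findall(r'\[URL=.*?\[/URL\]', text, re.DOTALL)

blocks = extract_bbcode(sample)
for block in blocks:
    print(block)
```

The same function could be applied to s1 from the code above instead of `sample`.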

12 Comments

Thanks, it works, but it picks up another part of the HTML, because that part is also inside a <textarea>. I updated the question with the full HTML file. Thank you for the help, by the way!
@AndrewStef Can you show some expected output? For regex problems it's always helpful.
The expected output is [URL=someimage.com/LwraZS1] [IMG]t1.someimage.com/LwraZS1.jpg[/IMG][ [/URL] [URL=someimage.com/CDnuiST] [IMG]t1.someimage.com/CDnuiST.jpg [/IMG][/URL]... Exactly this. The page is the output someimage.com shows for uploaded files. I am trying to capture the BBCode text between [ and ] from it.
@AndrewStef Try y = re.compile(r'<textarea.*?>(?:\[(.*?)\])*?</textarea>', re.DOTALL) and z = y.findall(x)
No match found, unfortunately.
0

I would use a parser for this:

from html.parser import HTMLParser

class MyHtmlParser(HTMLParser):
    def __init__(self):
        # Let the base class initialise its own state
        super().__init__(convert_charrefs=True)
        self.dat = []
    def handle_data(self, d):
        # Called for each run of text between tags
        self.dat.append(d.strip())
    def return_data(self):
        return self.dat
>>> with open('sample.html') as htmltext:
        htmldata = htmltext.read()
>>> parser = MyHtmlParser()
>>> parser.feed(htmldata)
>>> res = parser.return_data()
>>> res = [item for item in filter(None, res)]
>>> res[0]
'BBcode'
>>> 
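
Picking the result by index depends on what else is in the file. A variation that may be more robust is to keep only the text chunks that look like BBCode. The sketch below assumes the BBCode always contains "[URL="; `TextCollector`, `html_text`, and `bbcode_chunks` are illustrative names, and the HTML is a shortened copy of the sample in the question.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the non-empty text chunks found between tags."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only chunks
            self.chunks.append(text)

# Shortened, hypothetical copy of the HTML from the question.
html_text = """<td>
<b>BBcode</b><br />
<textarea onclick='this.select();' style='width:300px; height:200px;' />
[URL=http://someimage.com/LwraZS1][IMG]http://t1.someimage.com/LwraZS1.jpg[/IMG][/URL]
</textarea>
</td>"""

collector = TextCollector()
collector.feed(html_text)
collector.close()  # flush any buffered text

# Assumption: a chunk is BBCode if it contains an opening [URL= tag.
bbcode_chunks = [c for c in collector.chunks if '[URL=' in c]
for chunk in bbcode_chunks:
    print(chunk)
```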

4 Comments

Thanks for your answer! Actually, when I run this script and print res[0], I get this part of the HTML: box-shadow { -moz-box-shadow: 3px 3px 5px #000000; -webkit-box-shadow: 3px 3px 5px #000000; box-shadow: 3px 3px 5px #000000; }
Oh, never mind, I had to print the 4th item. Exactly what I needed. Thanks a lot! One last thing: how can I write the output to a file?
As a simple text file: with open('filename.txt', 'w') as newfile: newfile.write(res[0])
Glad to be of help. I noticed you're new to StackOverflow -- welcome! If this or another answer solved your problem, you might want to accept it as the answer using the big checkmark on the left.
0

I think you need to add something like z.group() in order to pull the matched text out of the match object, right? So, just changing your last line to

print z.group()

might do it.
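
Note that group() only helps when search() actually found something. When search() returns None, as in the question, calling z.group() raises AttributeError. The pattern in the question also contains unescaped parentheses in this.select(); (an empty group, so the literal "()" is never matched), fixed literal spaces, and no re.DOTALL, so . cannot cross the newlines. A minimal sketch of a looser pattern with a None check follows; `html_text` is a shortened copy of the question's HTML and `first_textarea` is an illustrative name.

```python
import re

# Shortened, hypothetical copy of the HTML from the question.
html_text = """<td>
<b>BBcode</b><br />
<textarea onclick='this.select();' style='width:300px; height:200px;' />
[URL=http://someimage.com/LwraZS1][IMG]http://t1.someimage.com/LwraZS1.jpg[/IMG][/URL]
</textarea>
</td>"""

# Match any <textarea ...> opening tag, then capture up to the closing tag.
# re.DOTALL lets . match newlines inside the textarea body.
pattern = re.compile(r"<textarea[^>]*>(.*?)</textarea>", re.DOTALL)

def first_textarea(text):
    """Return the stripped body of the first textarea, or None if absent."""
    match = pattern.search(text)
    if match is None:  # search() returns None when nothing matches
        return None
    return match.group(1).strip()

print(first_textarea(html_text))
```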
