Why is my regular expression not working in Python?

Question

I am trying to extract a URL from a page's HTML, but the regex does not seem to match. Can anyone spot the issue? When I run it against only a slice of the HTML from my website it works fine (that part of the code is commented out).

I know about Scrapy, BeautifulSoup, etc., but due to restrictions I don't want to use such modules.

    import re
    import urllib

    page="ANY-XYZ-WEBSITE"

    def extract_first_link():
        urlopener=urllib.urlopen(page)
        html=str(urlopener.read())
        matchObj = re.match( '<a href="(.*)/([0-9a-zA-Z-]+)"', html, re.I)
        #k = open ("file.txt",'w')
        #k.write(html)
        #print "matchObj.group() : ", matchObj.group(1)
        #matchObj = re.match( '<a href="(.*)/([0-9a-zA-Z-]+)"', html[4111:4150], re.M|re.I)
        print "matchObj.group() : ", matchObj.group()
        print "matchObj.group() : ", matchObj.group(1)
        print "matchObj.group() : ", matchObj.group(2)

    if __name__=="__main__":
        print extract_first_link()

I think you need to use re.search instead of re.match. match will only look at the beginning of the string.
– tobias_k, Commented Sep 2, 2015 at 12:22

mstuebner · Accepted Answer · 2015-09-02 21:51:15Z


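re.match only checks for a match at the beginning of the string, while re.search scans the whole string and returns the first location where the pattern matches. Since the full HTML does not begin with <a href=, re.match returns None. The slice worked because it happened to start at the link.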

Described here: https://docs.python.org/2/library/re.html
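
To illustrate the difference, here is a minimal sketch. The `html` string is a made-up stand-in for the downloaded page, and the pattern uses `[^"]*` in place of the question's `(.*)` so the match stays inside a single href value. Both are assumptions for the example, not part of the original code.

```python
import re

# Hypothetical HTML standing in for the downloaded page (not the asker's real site).
html = ('<html><body><p>intro</p>'
        '<a href="http://example.com/some-page">link</a></body></html>')

# Assumption: [^"]* replaces the original (.*) so the match cannot run past the closing quote.
pattern = r'<a href="([^"]*)/([0-9a-zA-Z-]+)"'


def extract_first_link(text):
    """Return (base_url, last_path_segment) for the first matching link, or None."""
    match_obj = re.search(pattern, text, re.I)
    if match_obj is None:
        return None
    return match_obj.group(1), match_obj.group(2)


# re.match is anchored at position 0, where the text is "<html>", so nothing matches.
match_result = re.match(pattern, html, re.I)

# re.search scans forward until the pattern matches somewhere in the string.
search_result = extract_first_link(html)

# Slicing so the string starts at the link is why the commented-out test appeared to work.
slice_result = re.match(pattern, html[html.index('<a href'):], re.I)
```

Checking the result for `None` before calling `.group()` avoids the `AttributeError` that the original code would raise when nothing matches.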

