Can't get entire link from string with regex in python

Question

I have the following string, and I want to parse out the link.

string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'

So essentially, I want to grab everything between 'href="' and '">'.

The result should be: /Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml

This is what I've tried:

test = re.search('(?<=href).*?(?=.xml)', final_link_str)

For kicks and giggles I also tried this, to grab everything after href:

test = re.search('(?<=href).*', final_link_str)

No matter what I do, the output is only a part of the entire link.

Here is the result I'm getting:

<re.Match object; span=(23, 163), match='="/Archives/edgar/data/886982/000076999319000460/>

Have you considered parsing the HTML properly instead of using a regular expression? – CertainPerformance, Commented Sep 9, 2019 at 4:14
What is the proper way? This was more of a regex learning experience for me, so I purposely wanted to use regex here. But I am curious how else you would do this. – mikelowry, Commented Sep 9, 2019 at 4:31

CertainPerformance · Accepted Answer · 2019-09-09 04:35:55Z

4

Consider parsing the HTML using BeautifulSoup instead:

from bs4 import BeautifulSoup

string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'
soup = BeautifulSoup(string, 'html.parser')
href = soup.find('a')['href']

Result:

/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml
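If installing BeautifulSoup is not an option, the standard library's html.parser module can do roughly the same job. The following is only a minimal sketch: the class name HrefCollector is made up for illustration, and it simply records the href of every <a> start tag it sees.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'

collector = HrefCollector()
collector.feed(string)
collector.close()

# The first (and here, only) link found in the fragment.
print(collector.hrefs[0])
```

Because the parser collects every link, the same sketch should also work on a fragment containing several <a> tags; collector.hrefs would then hold one entry per tag.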


Emma Marcier · Accepted Answer · 2019-09-09 04:21:37Z

In case there are unwanted spaces before or after the link inside the quotes, this expression should handle them:

href="\s*([^"\s]*)\s*"

Test

import re

string = """
<td scope="row"><a href=" /Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml ">InfoTable_2019-08-09_Final.html</a></td>None
"""

expression = r'href="\s*([^"\s]*)\s*"'
matches = re.findall(expression, string)

print(matches)

Output

['/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml']
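To make the pieces of that expression easier to follow, here is the same pattern written with re.VERBOSE so each part can carry its own comment. This is only an illustrative sketch; the sample string and its two short paths are made up, not taken from the question.

```python
import re

# Same expression as above, spread over several lines with re.VERBOSE.
pattern = re.compile(r'''
    href="      # literal attribute name, equals sign and opening quote
    \s*         # optional whitespace right after the opening quote
    ([^"\s]*)   # capture group: characters that are neither a quote nor whitespace
    \s*         # optional whitespace before the closing quote
    "           # closing quote
''', re.VERBOSE)

# Hypothetical input: one padded link and one unpadded link.
sample = '<a href=" /a/b.xml ">one</a> <a href="/c/d.xml">two</a>'

# findall returns the contents of the single capture group for every match.
links = pattern.findall(sample)
print(links)
```

One design consequence worth noting: because the capture group excludes whitespace, a value with a space in the middle of it would not match at all, which may or may not be what you want.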

If you wish to explore, simplify, or modify the expression, regex101.com explains each part of it in its top-right panel and shows how it matches against sample inputs.

dcg · Accepted Answer · 2019-09-09 04:26:59Z

0

This gets the value of the href:

>>> import re
>>> string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'
>>> re.search('href="(.*?)"', string).groups(0)
('/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml',)

EDIT: As @Jonas Berlin commented, the result can be reduced to a plain string, for example by unpacking the tuple:

>>> v, = re.search('href="(.*?)"', string).groups(0)
>>> v
'/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml'
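As a side note on the partial output shown in the question: the span=(23, 163) there covers everything from just after href to the end of the string, so the match itself was probably complete. In CPython, the repr of a Match object shows only the beginning of the matched text. The sketch below assumes the question's second attempt was meant to be '(?<=href).*' and compares the repr with what group() returns.

```python
import re

string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'

# Lookbehind pattern from the question (assumed quantifier: .*).
match = re.search(r'(?<=href).*', string)

# repr() of a Match abbreviates the matched text, so it can look cut off.
shown = repr(match)

# group(0) returns the complete matched text.
full = match.group(0)

# A capture group narrows the result to just the quoted value.
link = re.search(r'href="(.*?)"', string).group(1)

print(shown)
print(len(full))
print(link)
```

Calling .group(1) on the match is equivalent to the [0] suggestion applied to .groups(): both return the first captured group as a plain string.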

edited Sep 9, 2019 at 4:26

answered Sep 9, 2019 at 4:17


Perhaps add [0] to the end of the second command? Then you get a plain string. – Jonas Berlin
