
I have the following string, and I want to parse out the link.

string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'

So essentially I want to grab everything between 'href="' and '">'.

The result should be: /Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml

This is what I've tried:

test = re.search('(?<=href).*?(?=.xml)', final_link_str)

And for kicks and giggles I tried this as well, to grab everything after href:

test = re.search('(?<=href).*', final_link_str)

No matter what I do, the output is only a part of the entire link.

Here is the result I'm getting:

<re.Match object; span=(23, 163), match='="/Archives/edgar/data/886982/000076999319000460/>
  • Have you considered trying to parse the HTML properly instead of using a regular expression? Commented Sep 9, 2019 at 4:14
  • What is the proper way? This was more of a regex learning experience for me, so I did want to use regex here on purpose. But I am curious how else you would do this. Commented Sep 9, 2019 at 4:31

3 Answers

4

Consider parsing the HTML using BeautifulSoup instead:

from bs4 import BeautifulSoup

string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'
soup = BeautifulSoup(string, 'html.parser')
href = soup.find('a')['href']  # first <a> tag in the fragment
print(href)

Result:

/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml
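If installing bs4 is not an option, roughly the same thing can be sketched with the standard library's html.parser module. This is only an illustration, not part of the original answer: the class name HrefCollector and its hrefs list are made up for the example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'

collector = HrefCollector()
collector.feed(string)
collector.close()
print(collector.hrefs[0])
```

Collecting into a list means the same helper also copes with a fragment that has several links in it.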

0

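In case there is undesired whitespace just inside the quotes, before or after the link, an expression like this might be fine:

href="\s*([^"\s]*)\s*"

The capture group takes the run of characters that are neither quotes nor whitespace, and the \s* on either side absorbs the padding.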

Test

import re

string = """
<td scope="row"><a href=" /Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml ">InfoTable_2019-08-09_Final.html</a></td>None
"""

expression = r'href="\s*([^"\s]*)\s*"'
matches = re.findall(expression, string)

print(matches)

Output

['/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml']
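To make the pieces of that expression easier to read, here is a sketch of the same pattern written with re.VERBOSE so each part can carry a comment. The name HREF_PATTERN and the short sample strings are made up for the illustration.

```python
import re

# Same expression as above, spelled out piece by piece.
HREF_PATTERN = re.compile(
    r'''
    href="      # literal attribute name, equals sign, and opening quote
    \s*         # optional whitespace just inside the quotes
    ([^"\s]*)   # capture: characters that are neither a quote nor whitespace
    \s*         # optional whitespace before the closing quote
    "           # closing quote
    ''',
    re.VERBOSE,
)

# Hypothetical inputs, shortened for readability.
padded = '<a href="  /some/path/file.xml  ">file</a>'
tight = '<a href="/some/path/file.xml">file</a>'
spaced = '<a href="a b">x</a>'

print(HREF_PATTERN.findall(padded))
print(HREF_PATTERN.findall(tight))
print(HREF_PATTERN.findall(spaced))
```

One thing to be aware of: because the capture group excludes whitespace, a value with a space in the middle (the third sample) does not match at all, so this is probably only suitable when the links themselves contain no spaces.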

If you wish to explore, simplify, or modify the expression, regex101.com explains it in its top-right panel, and you can also see there how it matches against some sample inputs.



0

This gets the value of the href:

>>> import re
>>> string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'
>>> re.search('href="(.*?)"', string).groups(0)
('/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml',)

EDIT: As commented by @Jonas Berlin, the output can be reduced to a plain string by unpacking the one-element tuple:

>>> v, = re.search('href="(.*?)"', string).groups(0)
>>> v
'/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml'

1 Comment

Add [0] to the end of the second command, perhaps? Then you get a plain string.
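Along the same lines, .group(1) also returns the captured text as a plain string. The sketch below additionally checks the "everything after href" attempt from the question, assuming the intended pattern there was (?<=href).* and that final_link_str held the same string. As far as I can tell, that search does match through to the end of the string, and it is only the printed representation of the match object that shortens the matched text, which may be why the output looked partial.

```python
import re

string = '<td scope="row"><a href="/Archives/edgar/data/886982/000076999319000460/xslForm13F_X01/InfoTable_2019-08-09_Final.xml">InfoTable_2019-08-09_Final.html</a></td>None'

# Capture-group approach: group(1) is already a plain string, no tuple involved.
link = re.search(r'href="(.*?)"', string).group(1)
print(link)

# Lookbehind attempt (pattern assumed, see above): everything after "href".
m = re.search(r'(?<=href).*', string)
print(m)               # the repr abbreviates the matched text
print(m.span())        # start and end offsets of the full match
print(len(m.group()))  # length of the text that was actually matched
```

So when inspecting a match, printing m.group() rather than the match object itself shows exactly what was matched.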
