0

I want to use regular expressions to match a pattern and extract part of the match.

I have scraped HTML data; an illustrative snippet looks like this:

</script>
</li>
<li itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
<span class="hide" itemprop="position">1</span>
<div class="result-heading">
<a class="project-icon show-outline" href="/projects/quickfixj/" title="Find out more about QuickFIX/J - Open Source Java FIX Engine">
<img alt="QuickFIX/J - Open Source Java FIX Engine Icon" src="//a.fsdn.com/allura/p/quickfixj/icon?1533295730"/></a>
<div class="result-heading-texts">
<a href="/projects/quickfixj/" itemprop="url" title="Find out more 
<a href="/projects/desmoj/" itemprop="url" title="Find out more about DESMO-J"><h2>DESMO-J</h2></a>
<div class="description">
<p class="description-inner">DESMO-<em>J</em> is a framework for 
<a href="/projects/desmoj/files/stats/timeline" title="Downloads This Week">29 This Week</a>
</strong>
<strong>

More representative subset highlighting issue with find_all('a'):

<!-- Menu -->
<ul class="header-nav-menulist">
<li class="highlight social row">
<span class="social-label">Connect</span>
<span class="social-icons">
<span></span>
<a class="twitter" href="https://twitter.com/sourceforge" rel="nofollow" target="_blank">
<svg viewbox="0 0 1792 1792" xmlns="http://www.w3.org/2000/svg"><path d="M1684 408q-67 98-162 167 1 14 1 42 0 130-38 259.5t-115.5 248.5-184.5 210.5-258 146-323 54.5q-271 0-496-145 35 4 78 4 225 0 401-138-105-2-188-64.5t-114-159.5q33 5 61 5 43 0 85-11-112-23-185.5-111.5t-73.5-205.5v-4q68 38 146 41-66-44-105-115t-39-154q0-88 44-163 121 149 294.5 238.5t371.5 99.5q-8-38-8-74 0-134 94.5-228.5t228.5-94.5q140 0 236 102 109-21 205-78-37 115-142 178 93-10 186-50z"></path></svg></a>
<a class="facebook" href="https://www.facebook.com/sourceforgenet/" rel="nofollow" target="_blank">

The HTML is currently stored as a BeautifulSoup object, i.e. it has been passed through:

html_soup = BeautifulSoup(response.text, 'html.parser')

I would like to search this entire object for all instances of /projects/ and extract the string between the subsequent slashes. For example:

from "/projects/quickfixj/" I would like to store "quickfixj".

My initial idea is to use re.findall() and try to match (/projects/./)* but this does not work.

Any help is greatly appreciated.

3 Answers

1

You are already halfway there:

a='''</script>
</li>
<li itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
<span class="hide" itemprop="position">1</span>
<div class="result-heading">
<a class="project-icon show-outline" href="/projects/quickfixj/" title="Find out more about QuickFIX/J - Open Source Java FIX Engine">
<img alt="QuickFIX/J - Open Source Java FIX Engine Icon" src="//a.fsdn.com/allura/p/quickfixj/icon?1533295730"/></a>
<div class="result-heading-texts">
<a href="/projects/quickfixj/" itemprop="url" title="Find out more 
<a href="/projects/desmoj/" itemprop="url" title="Find out more about DESMO-J"><h2>DESMO-J</h2></a>
<div class="description">
<p class="description-inner">DESMO-<em>J</em> is a framework for 
<a href="/projects/desmoj/files/stats/timeline" title="Downloads This Week">29 This Week</a>
</strong>
<strong>'''
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(a, "html.parser")
# href=True skips <a> tags that have no href attribute
for i in soup.find_all('a', href=True):
    print(re.findall(r'/projects/(\w+)/', i['href']))

If you need unique project names, change the last few lines to:

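import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(a, "html.parser")
project_set = set()
for i in soup.find_all('a', href=True):
    # update() adds every match and is a no-op for an empty list,
    # whereas add(*matches) raises TypeError when nothing matched
    project_set.update(re.findall(r'/projects/(\w+)/', i['href']))

print(project_set)  # {'desmoj', 'quickfixj'} (set order may vary)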
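
One caveat with \w: it matches only letters, digits, and underscores, so a project name containing a hyphen would not match at all. A minimal sketch of the difference, assuming such names can occur (the href below is a made-up example, not taken from the scraped page):

```python
import re

# Hypothetical href whose project name contains a hyphen
href = "/projects/my-project/files/"

# \w+ stops at the hyphen, then fails to find the closing slash
word_only = re.findall(r"/projects/(\w+)/", href)

# [^/]+ accepts any run of characters up to the next slash
non_slash = re.findall(r"/projects/([^/]+)/", href)

print(word_only)  # []
print(non_slash)  # ['my-project']
```

If the names on the site never contain anything but word characters, the \w version is fine as it stands.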

2 Comments

Thank you for your answer. I am having one issue using it, largely because my example data wasn't fully representative; I have updated the question to show this. In the new data, some <a> tags have an href that doesn't include /projects/, so set.add() throws an error because there is nothing to add. I am new to Python and struggling to work around this; could you help? Thanks!
Well, that's fine; just check the length of the result before adding it to the set.
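
A minimal sketch of that length check, under the assumption that the hrefs have already been pulled out of the soup. The hrefs list is a hand-written stand-in for what find_all('a', href=True) might give on the more representative data, so no BeautifulSoup is needed to run it:

```python
import re

# Hypothetical href values standing in for
# [a['href'] for a in soup.find_all('a', href=True)]
hrefs = [
    "https://twitter.com/sourceforge",
    "/projects/quickfixj/",
    "/projects/desmoj/files/stats/timeline",
    "/projects/quickfixj/",
]

project_set = set()
for href in hrefs:
    found = re.findall(r"/projects/(\w+)/", href)
    if len(found) > 0:  # skip links without a /projects/<name>/ segment
        project_set.add(found[0])

print(sorted(project_set))  # ['desmoj', 'quickfixj']
```

The Twitter link produces an empty list and is skipped, so add() is only ever called with exactly one argument.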
0

You can extract all of the links and then apply a regex:

import re
from bs4 import BeautifulSoup

html = '''</script>
</li>
<li itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
<span class="hide" itemprop="position">1</span>
<div class="result-heading">
<a class="project-icon show-outline" href="/projects/quickfixj/" title="Find out more about QuickFIX/J - Open Source Java FIX Engine">
<img alt="QuickFIX/J - Open Source Java FIX Engine Icon" src="//a.fsdn.com/allura/p/quickfixj/icon?1533295730"/></a>
<div class="result-heading-texts">
<a href="/projects/quickfixj/" itemprop="url" title="Find out more 
<a href="/projects/desmoj/" itemprop="url" title="Find out more about DESMO-J"><h2>DESMO-J</h2></a>
<div class="description">
<p class="description-inner">DESMO-<em>J</em> is a framework for 
<a href="/projects/desmoj/files/stats/timeline" title="Downloads This Week">29 This Week</a>
</strong>
<strong>'''

html_soup = BeautifulSoup(html, 'html.parser')

links = [i.get('href') for i in html_soup.find_all('a', href=True)]

Yields:

['/projects/quickfixj/', '/projects/quickfixj/', '/projects/desmoj/files/stats/timeline']

Then you can apply your regex:

cleaned = [m.group(1) for m in (re.search(r'(?<=projects/)(.*?)/', i) for i in links) if m]

Yields:

['quickfixj', 'quickfixj', 'desmoj']
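
The list above still contains duplicates. If each project should appear once and the order of first appearance matters, one possible approach (a sketch, not the only way) is dict.fromkeys, which keeps insertion order in Python 3.7+:

```python
# The list produced above, repeated here so the snippet runs on its own
cleaned = ['quickfixj', 'quickfixj', 'desmoj']

# dict keys are unique and keep insertion order
unique = list(dict.fromkeys(cleaned))

print(unique)  # ['quickfixj', 'desmoj']
```

If order does not matter, set(cleaned) does the same job.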


0

A regex like this should do the trick: (?<=\/projects\/).+?(?=\/)

It would work like this:

import re
regex = r"(?<=/projects/).+?(?=/)"
# single quotes on the outside, so the double quotes in the HTML need no escaping
string = '<a href="/projects/quickfixj/" itemprop="url" title="Find out more...."'
matches = re.findall(regex, string)
print(matches)

Output: ['quickfixj']
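
Since the question asks about searching the entire object, the same idea can be applied to the whole document as text, for example str(html_soup). A rough sketch, where html_text is a hand-written stand-in for that string and the character class [^/"] is my own choice to keep a match from running past a quote:

```python
import re

# Stand-in for str(html_soup); the tags are shortened from the question
html_text = '''
<a class="twitter" href="https://twitter.com/sourceforge" rel="nofollow">
<a class="project-icon show-outline" href="/projects/quickfixj/" title="...">
<a href="/projects/desmoj/files/stats/timeline" title="Downloads This Week">
'''

# Compile once, then scan the whole text in a single pass
pattern = re.compile(r'(?<=/projects/)[^/"]+(?=/)')
names = sorted(set(pattern.findall(html_text)))

print(names)  # ['desmoj', 'quickfixj']
```

Note that scanning the raw text also picks up /projects/ wherever it appears outside an href, so going through find_all('a', href=True) first is the stricter option.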
