Python regex for removing scraping results according to substrings?

Question

I have written a scraper in Python. I have a group of strings that I want to search for on the page, and from those results I want to remove any that contain words from another group of strings.

Here is the code -

def find_jobs(self, company, soup):
        allowed = re.compile(r"Developer|Engineer|Designer|Admin|Manager|Writer|Executive|Lead|Analyst|Editor|"
                             r"Associate|Architect|Recruiter|Specialist|Scientist|Support|Expert|SSE|Head|"
                             r"Producer|Evangelist|Ninja", re.IGNORECASE)
        not_allowed = re.compile(r"^responsibilities$|^description$|^requirements$|^experience$|^empowering$|^engineering$|^"
                                 r"find$|^skills$|^recruiterbox$|^google$|^communicating$|^associated$|^internship$|^you$|^"
                                 r"proficient$|^leadsquared$|^referral$|^should$|^must$|^become$|^global$|^degree$|^good$|^"
                                 r"capabilities$|^leadership$|^services$|^expertise$|^architecture$|^hire$|^follow$|^jobs$|^"
                                 r"procedures$|^conduct$|^perk$|^missed$|^generation$|^search$|^tools$|^worldwide$|^contact$|^"
                                 r"question$|^intern$|^classes$|^trust$|^ability$|^businesses$|^join$|^industry$|^response$|^"
                                 r"using$|^work$|^based$|^grow$|^provide$|^understand$|^header$|^headline$|^masthead$|^office$", re.IGNORECASE)

        profile_list = set()
        k = soup.body.findAll(text=allowed)
        for i in k:
            if len(i) < 60 and not_allowed.search(i) is None:
                profile_list.add(i.strip().upper())
        self.update_jobs(company, profile_list)

So I am facing a problem here. With the anchors (^ and $) in not_allowed, strings such as //HEADLINE-BG and ABILITY TO LEAD & MENTOR A TEAM are getting through, although I have the strings headline and ability in not_allowed. They are removed if I remove the anchors, but then a string such as SCALABILITY ENGINEER does not get saved because of the string ability in not_allowed. Being a newbie in regex, I am not sure how I can get this to work. Earlier I was using this -

def find_jobs(self, company, soup):
        allowed = re.compile(r"Developer|Designer|Engineer|Admin|Manager|Writer|Executive|Lead|Analyst|Editor|"
                             r"Associate|Architect|Recruiter|Specialist|Scientist|Support|Expert|SSE|Head|"
                             r"Producer|Evangelist|Ninja", re.IGNORECASE)
        not_allowed = ['responsibilities', 'description', 'requirements', 'experience', 'empowering', 'engineering',
                       'find', 'skills', 'recruiterbox', 'google', 'communicating', 'associated', 'internship',
                       'proficient', 'leadsquared', 'referral', 'should', 'must', 'become', 'global', 'degree', 'good',
                       'capabilities', 'leadership', 'services', 'expertise', 'architecture', 'hire', 'follow',
                       'procedures', 'conduct', 'perk', 'missed', 'generation', 'search', 'tools', 'worldwide', 'contact',
                       'question', 'intern', 'classes', 'trust', 'ability', 'businesses', 'join', 'industry', 'response',
                       'you', 'using', 'work', 'based', 'grow', 'provide']

        profile_list = set()
        k = soup.body.findAll(text=allowed)
        for i in k:
            if len(i) < 60 and not any(x in i.lower() for x in not_allowed):
                profile_list.add(i.strip().upper())
        self.update_jobs(company, profile_list)

But this also omitted a string whenever any entry in not_allowed appeared in it as a substring. Can anyone please help with this?

Are you sure your approach is the best? Are there no class names etc. you can use?
– Padraic Cunningham, Commented Aug 13, 2016 at 23:13

engineer14 · Accepted Answer · 2016-08-13 14:55:26Z

1

It looks like you are writing your not_allowed regex incorrectly. As written, it only matches when one of those words is the entire string.

re.compile(r'^something_i_dont_like$') is going to match something_i_dont_like only if it is the whole string (or the whole line, with re.MULTILINE)

If you want to omit something, one option is a negative lookahead

re.compile(r'^((?!something_i_dont_like).)*$')
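
As a rough illustration of the difference (a sketch, not part of the original answer), the snippet below compares the anchored pattern with the lookahead pattern. The sample strings come from the question, except "SENIOR ENGINEER", which is made up for the demo, as are the variable names.

```python
import re

# Anchored pattern as in the question: matches only if the whole string is the word.
anchored = re.compile(r"^ability$", re.IGNORECASE)

# Negative-lookahead pattern from this answer: matches only strings
# that do NOT contain the word anywhere.
without_ability = re.compile(r"^((?!ability).)*$", re.IGNORECASE)

samples = [
    "ability",
    "ABILITY TO LEAD & MENTOR A TEAM",
    "SCALABILITY ENGINEER",
    "SENIOR ENGINEER",  # hypothetical sample, not from the question
]

for text in samples:
    # Print whether each pattern finds a match in the sample.
    print(text, bool(anchored.search(text)), bool(without_ability.search(text)))
```

Note that the logic is inverted compared with the question's code: with the lookahead pattern, a match means "keep this string". The lookahead is still substring-based, so SCALABILITY ENGINEER is rejected because it contains "ability".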

swstephe · Accepted Answer · 2016-08-13 14:57:32Z

0

The regex

^ability$

means "the line consists only of the word ability". If you want substrings, just change it to

ability

If you want to omit the word "ability", but not "disability", then use something like

\bability\b
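
Applied to the question's filter, the word-boundary idea might look like the sketch below. It is only an illustration: blocked_words is a shortened stand-in for the full not_allowed list, and is_allowed is a hypothetical helper name, not something from the original code.

```python
import re

# A shortened version of the question's word list, for illustration only.
blocked_words = ["headline", "ability", "engineering", "intern"]

# Wrap the alternation in \b...\b so each entry matches only as a whole word.
not_allowed = re.compile(
    r"\b(?:" + "|".join(map(re.escape, blocked_words)) + r")\b",
    re.IGNORECASE,
)

def is_allowed(text):
    """Return True when no blocked word appears as a whole word in text."""
    return not_allowed.search(text) is None

candidates = [
    "//HEADLINE-BG",
    "ABILITY TO LEAD & MENTOR A TEAM",
    "SCALABILITY ENGINEER",
]
kept = [text for text in candidates if is_allowed(text)]
print(kept)  # ['SCALABILITY ENGINEER']
```

One caveat: \b treats the underscore as a word character, so a string like headline_bg would not be caught by \bheadline\b.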
