0

I have written a scraper in Python. I have a group of strings that I want to search for on the page, and from those results I want to remove any that contain words from a second group of strings.

Here is the code:

def find_jobs(self, company, soup):
        allowed = re.compile(r"Developer|Engineer|Designer|Admin|Manager|Writer|Executive|Lead|Analyst|Editor|"
                             r"Associate|Architect|Recruiter|Specialist|Scientist|Support|Expert|SSE|Head|"
                             r"Producer|Evangelist|Ninja", re.IGNORECASE)
        not_allowed = re.compile(r"^responsibilities$|^description$|^requirements$|^experience$|^empowering$|^engineering$|^"
                                 r"find$|^skills$|^recruiterbox$|^google$|^communicating$|^associated$|^internship$|^you$|^"
                                 r"proficient$|^leadsquared$|^referral$|^should$|^must$|^become$|^global$|^degree$|^good$|^"
                                 r"capabilities$|^leadership$|^services$|^expertise$|^architecture$|^hire$|^follow$|^jobs$|^"
                                 r"procedures$|^conduct$|^perk$|^missed$|^generation$|^search$|^tools$|^worldwide$|^contact$|^"
                                 r"question$|^intern$|^classes$|^trust$|^ability$|^businesses$|^join$|^industry$|^response$|^"
                                 r"using$|^work$|^based$|^grow$|^provide$|^understand$|^header$|^headline$|^masthead$|^office$", re.IGNORECASE)

        profile_list = set()
        k = soup.body.findAll(text=allowed)
        for i in k:
            if len(i) < 60 and not_allowed.search(i) is None:
                profile_list.add(i.strip().upper())
        self.update_jobs(company, profile_list)

So I am facing a problem here. With the anchors (^ and $) in not_allowed, strings such as //HEADLINE-BG and ABILITY TO LEAD & MENTOR A TEAM get through, although headline and ability are both in not_allowed. They are filtered out if I remove the anchors, but then a string such as SCALABILITY ENGINEER is not saved, because it contains the substring ability. Being new to regex, I am not sure how to get this to work. Earlier I was using this:

def find_jobs(self, company, soup):
        allowed = re.compile(r"Developer|Designer|Engineer|Admin|Manager|Writer|Executive|Lead|Analyst|Editor|"
                             r"Associate|Architect|Recruiter|Specialist|Scientist|Support|Expert|SSE|Head|"
                             r"Producer|Evangelist|Ninja", re.IGNORECASE)
        not_allowed = ['responsibilities', 'description', 'requirements', 'experience', 'empowering', 'engineering',
                       'find', 'skills', 'recruiterbox', 'google', 'communicating', 'associated', 'internship',
                       'proficient', 'leadsquared', 'referral', 'should', 'must', 'become', 'global', 'degree', 'good',
                       'capabilities', 'leadership', 'services', 'expertise', 'architecture', 'hire', 'follow',
                       'procedures', 'conduct', 'perk', 'missed', 'generation', 'search', 'tools', 'worldwide', 'contact',
                       'question', 'intern', 'classes', 'trust', 'ability', 'businesses', 'join', 'industry', 'response',
                       'you', 'using', 'work', 'based', 'grow', 'provide']

        profile_list = set()
        k = soup.body.findAll(text=allowed)
        for i in k:
            if len(i) < 60 and not any(x in i.lower() for x in not_allowed):
                profile_list.add(i.strip().upper())
        self.update_jobs(company, profile_list)

But this also dropped a string whenever any word from not_allowed appeared in it as a substring. Can anyone help with this?

1
  • Are you sure your approach is the best? Are there no class names etc. you can use? Commented Aug 13, 2016 at 23:13

2 Answers

1

It looks like you are writing your not_allowed regex incorrectly. As written, each alternative only matches when that word is the entire string (no re.MULTILINE flag is set, so ^ and $ anchor to the whole string rather than to each line).

re.compile(r'^something_i_dont_like$') is going to match something_i_dont_like only if it is the entire string.

If you want to reject any string that contains it, you need a negative lookahead:

re.compile(r'^((?!something_i_dont_like).)*$')
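As a rough sketch of how the two patterns behave on strings like the ones in the question (the sample list and variable names here are made up for illustration): the lookahead form rejects any string that contains the word anywhere, so it also rejects SCALABILITY ENGINEER.

```python
import re

# Anchored form: matches only when the whole string is the word.
anchored = re.compile(r"^ability$", re.IGNORECASE)
# Negative-lookahead form: matches only strings that contain the word nowhere.
no_ability = re.compile(r"^((?!ability).)*$", re.IGNORECASE)

# Illustrative inputs, loosely based on the strings in the question.
samples = [
    "ABILITY",
    "ABILITY TO LEAD & MENTOR A TEAM",
    "SCALABILITY ENGINEER",
    "SOFTWARE ENGINEER",
]

anchored_hits = [s for s in samples if anchored.search(s)]
kept_by_lookahead = [s for s in samples if no_ability.search(s)]

print(anchored_hits)
print(kept_by_lookahead)
```

So the lookahead behaves like the unanchored substring check here; it does not by itself distinguish "ability" as a word from "ability" inside "scalability".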

0

The regex

^ability$

means "the string consists only of the word ability". If you want to match it as a substring, just change it to

ability

If you want to omit the word "ability", but not "disability", then use word boundaries (\b), something like

\bability\b
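A minimal sketch of how that could be applied to the whole word list, assuming the filtering loop from the question. The helper names build_blocklist and filter_titles are hypothetical, and only three of the blocked words are shown:

```python
import re

def build_blocklist(words):
    """Compile one pattern that matches any of the words as a whole word."""
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

def filter_titles(candidates, blocked):
    """Keep short strings that contain none of the blocked words."""
    kept = set()
    for text in candidates:
        if len(text) < 60 and blocked.search(text) is None:
            kept.add(text.strip().upper())
    return kept

# Hypothetical subset of the not_allowed list from the question.
blocked = build_blocklist(["headline", "ability", "engineering"])

titles = filter_titles(
    [
        "//headline-bg",
        "Ability to lead & mentor a team",
        "Scalability Engineer",
        " Lead Designer ",
    ],
    blocked,
)
print(sorted(titles))
```

Here //headline-bg and the "Ability to lead" string are rejected because the blocked word stands alone, while Scalability Engineer is kept because "ability" is preceded by a letter. One caveat: \b treats the underscore as a word character, so a string like headline_bg would not be caught by this pattern.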
