
Here is my code:

import urllib.parse
import webbrowser
from bs4 import BeautifulSoup
import requests
import re

address = 'https://google.com/search?q='
# Default Google search address start
file = open( "OCR.txt", "rt" )
# Open text document that contains the question
word = file.read()
file.close()

myList = word.split('\n')
newString = ' '.join(myList)
# The question is on multiple lines so this joins them together with proper spacing

qstr = urllib.parse.quote_plus(newString)
# Encode the string

newWord = address + qstr
# Combine the base and the encoded query

response = requests.get(newWord)

#with open('output.html', 'wb') as f:
#    f.write(response.content)
#webbrowser.open('output.html')

answers = open("ocr2.txt", "rt")

ansTable = answers.read()
answers.close()

ans = ansTable.splitlines()

ans1 = str(ans[0])
ans2 = str(ans[2])
ans3 = str(ans[4])

ans1Score = 0
ans2Score = 0
ans3Score = 0

links = []

soup = BeautifulSoup(response.text, 'lxml')

for r in soup.find_all(class_='r'):

    linkRaw = str(r)

    link = re.search(r"(?P<url>https?://[^\s]+)", linkRaw).group("url")

    if '&' in link:

        finalLink = link.split('&')
        link = str(finalLink[0])

    links.append(link)

#print(links)
#print(' ')

for g in soup.find_all(class_='g'):

    webBlock = str(g)

    ans1Tally = webBlock.count(ans1)
    ans2Tally = webBlock.count(ans2)
    ans3Tally = webBlock.count(ans3)

    # Track whether each answer appears in this result block
    ans1Found = ans1 in webBlock
    ans2Found = ans2 in webBlock
    ans3Found = ans3 in webBlock

    if ans1Found:

        ans1Score += ans1Tally

    if ans2Found:

        ans2Score += ans2Tally

    if ans3Found:

        ans3Score += ans3Tally

    # If none of the answers appeared in this snippet, follow the link
    # and search the linked page itself
    if not (ans1Found or ans2Found or ans3Found):

        searchLink = str(links[0])

        if searchLink.endswith('pdf'):
            pass

        else:

            response2 = requests.get(searchLink)
            soup2 = BeautifulSoup(response2.text, 'lxml')

            for p in soup2.find_all('p'):

                extraBlock = str(p)

                extraAns1Tally = extraBlock.count(ans1)
                extraAns2Tally = extraBlock.count(ans2)
                extraAns3Tally = extraBlock.count(ans3)

                if ans1 in extraBlock:

                    ans1Score += extraAns1Tally

                if ans2 in extraBlock:

                    ans2Score += extraAns2Tally

                if ans3 in extraBlock:

                    ans3Score += extraAns3Tally

                with open("Results.txt", "w") as results:
                    results.write(newString + '\n\n')    
                    results.write(ans1+": "+str(ans1Score)+'\n')
                    results.write(ans2+": "+str(ans2Score)+'\n')
                    results.write(ans3+": "+str(ans3Score))

    links.pop(0)

    print(' ')
    print('-----')
    print(ans1+": "+str(ans1Score))
    print(ans2+": "+str(ans2Score))
    print(ans3+": "+str(ans3Score))
    print('-----')

Basically, right now it is scraping each "g" block one at a time, but this program could benefit massively from fetching all the links at the same time instead of waiting until the request before it is done. Sorry if this is a simple kind of question, but I have little experience with asyncio, so if anyone could help that would be massively appreciated. Thanks!


1 Answer


To write an async program you need to:

  • define coroutine functions with async def
  • call them with await
  • create an event loop and run your top-level coroutine in it
  • run requests concurrently using asyncio.gather (see the abstract sketch below)
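
Putting those pieces together, a minimal abstract example (the coroutine names here are made up purely for illustration):

import asyncio

async def work(n):           # a coroutine, defined with async def
    await asyncio.sleep(1)   # await some asynchronous operation
    return n * 2

async def main():
    # run several coroutines concurrently and collect all their results
    results = await asyncio.gather(work(1), work(2), work(3))
    print(results)  # [2, 4, 6] after about 1 second total, not 3

loop = asyncio.get_event_loop()    # create the event loop
loop.run_until_complete(main())    # and run main in it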

Everything else is almost the same as usual. Instead of the blocking requests module you should use an async one, for example aiohttp:

python -m pip install aiohttp

And use it like this:

async def get(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()
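
(A side note, not required for the fix: this opens a new ClientSession for every request, which works but is wasteful; the aiohttp docs recommend creating one session and sharing it across requests. A sketch of that variant, with the session passed in as a parameter:)

async def get(session, url):
    async with session.get(url) as resp:
        return await resp.text()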

Here's the code with the changes I started. I didn't check whether it actually works, since I don't have the files you use. You should also move the logic inside for g in soup.find_all(class_='g'): to a separate function and run multiple of these functions with asyncio.gather to get the benefit of asyncio.

import asyncio
import aiohttp
import urllib.parse
import webbrowser
from bs4 import BeautifulSoup
import re


async def get(url):
    async with aiohttp.ClientSession() as session:
        async with session.get('https://api.github.com/events') as resp:
            return await resp.text()


async def main():
    address = 'https://google.com/search?q='
    # Default Google search address start
    file = open( "OCR.txt", "rt" )
    # Open text document that contains the question
    word = file.read()
    file.close()

    myList = word.split('\n')
    newString = ' '.join(myList)
    # The question is on multiple lines so this joins them together with proper spacing

    qstr = urllib.parse.quote_plus(newString)
    # Encode the string

    newWord = address + qstr
    # Combine the base and the encoded query

    text = await get(newWord)

    #with open('output.html', 'wb') as f:
    #    f.write(response.content)
    #webbrowser.open('output.html')

    answers = open("ocr2.txt", "rt")

    ansTable = answers.read()
    answers.close()

    ans = ansTable.splitlines()

    ans1 = str(ans[0])
    ans2 = str(ans[2])
    ans3 = str(ans[4])

    ans1Score = 0
    ans2Score = 0
    ans3Score = 0

    links = []

    soup = BeautifulSoup(text, 'lxml')

    for r in soup.find_all(class_='r'):

        linkRaw = str(r)

        link = re.search(r"(?P<url>https?://[^\s]+)", linkRaw).group("url")

        if '&' in link:

            finalLink = link.split('&')
            link = str(finalLink[0])

        links.append(link)

    #print(links)
    #print(' ')

    for g in soup.find_all(class_='g'):

        webBlock = str(g)

        ans1Tally = webBlock.count(ans1)
        ans2Tally = webBlock.count(ans2)
        ans3Tally = webBlock.count(ans3)

        # Track whether each answer appears in this result block
        ans1Found = ans1 in webBlock
        ans2Found = ans2 in webBlock
        ans3Found = ans3 in webBlock

        if ans1Found:

            ans1Score += ans1Tally

        if ans2Found:

            ans2Score += ans2Tally

        if ans3Found:

            ans3Score += ans3Tally

        # If none of the answers appeared in this snippet, follow the link
        # and search the linked page itself
        if not (ans1Found or ans2Found or ans3Found):

            searchLink = str(links[0])

            if searchLink.endswith('pdf'):
                pass

            else:

                text2 = await get(searchLink)
                soup2 = BeautifulSoup(text2, 'lxml')

                for p in soup2.find_all('p'):

                    extraBlock = str(p)

                    extraAns1Tally = extraBlock.count(ans1)
                    extraAns2Tally = extraBlock.count(ans2)
                    extraAns3Tally = extraBlock.count(ans3)

                    if ans1 in extraBlock:

                        ans1Score += extraAns1Tally

                    if ans2 in extraBlock:

                        ans2Score += extraAns2Tally

                    if ans3 in extraBlock:

                        ans3Score += extraAns3Tally

                    with open("Results.txt", "w") as results:
                        results.write(newString + '\n\n')    
                        results.write(ans1+": "+str(ans1Score)+'\n')
                        results.write(ans2+": "+str(ans2Score)+'\n')
                        results.write(ans3+": "+str(ans3Score))

        links.pop(0)

        print(' ')
        print('-----')
        print(ans1+": "+str(ans1Score))
        print(ans2+": "+str(ans2Score))
        print(ans3+": "+str(ans3Score))
        print('-----')


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(main())
    finally:
        loop.run_until_complete(loop.shutdown_asyncgens())
        loop.close()
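
(On Python 3.7+, the event-loop boilerplate above can be replaced with a single call that creates the loop, runs main, and closes the loop for you:)

if __name__ == '__main__':
    asyncio.run(main())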

Upd:

The main idea is to move the logic inside the loop that does the request into a separate coroutine and pass multiple of these coroutines to asyncio.gather. That will run your requests concurrently.

async def main():
    # Here, do everything that comes before the loop.

    coros = [
        process_single_g(g)
        for g
        in soup.find_all(class_='g')
    ]

    results = await asyncio.gather(*coros)  # this function will run multiple tasks concurrently
                                            # and return all results together.

    for res in results:
        ans1Score, ans2Score, ans3Score = res

        print(' ')
        print('-----')
        print(ans1+": "+str(ans1Score))
        print(ans2+": "+str(ans2Score))
        print(ans3+": "+str(ans3Score))
        print('-----')



async def process_single_g(g):
    # Here, do everything you currently do inside the loop for a single g.

    text2 = await get(searchLink)

    # ...

    return ans1Score, ans2Score, ans3Score
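
For a more complete picture, here is a self-contained sketch of the same pattern: fetch several URLs concurrently with one shared session and sum the per-page counts of each answer. The URLs and answer strings are placeholders, not your real data:

import asyncio
import aiohttp

async def fetch(session, url):
    # one GET request; errors return an empty page so a single
    # bad link does not cancel the whole gather
    try:
        async with session.get(url) as resp:
            return await resp.text()
    except aiohttp.ClientError:
        return ''

async def score_page(session, url, answers):
    # fetch one page and count each answer string in its text
    text = await fetch(session, url)
    return [text.count(ans) for ans in answers]

async def main():
    answers = ['answer one', 'answer two', 'answer three']     # placeholders
    urls = ['https://example.com/a', 'https://example.com/b']  # placeholders
    async with aiohttp.ClientSession() as session:
        per_page = await asyncio.gather(
            *(score_page(session, url, answers) for url in urls)
        )
    # sum the per-page counts into one total per answer
    totals = [sum(counts) for counts in zip(*per_page)]
    for ans, total in zip(answers, totals):
        print(ans + ': ' + str(total))

loop = asyncio.get_event_loop()
loop.run_until_complete(main())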

2 Comments

You should also move the logic inside for g in soup.find_all(class_='g'): to a separate function and run multiple of these functions with asyncio.gather to get the benefit of asyncio. How exactly do I do that? Thank you for the reply!
@DevinGP I won't rewrite all the code for you, but I updated the answer to show what usage of asyncio.gather would look like in your case. You may want to start with a simple abstract example of executing multiple coroutines in parallel to get the idea.
