I am scraping all the words from the Merriam-Webster website.

I want to scrape every page for the letters a-z, including all the numbered pages under each letter, and save the results to a text file. The problem is that I only get the first result of the table instead of all of them. I know this is a large amount of text (around 500k), but I'm doing it to learn.

CODE:

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.merriam-webster.com/browse/dictionary/a/'

page = 1
# for page in range(1, 75):

req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')
containers = soup.find('div', attrs={'class': 'entries'})
table = containers.find_all('ul')

for entries in table:
    links = entries.find_all('a')
    name = links[0].text
    print(name)

What I want is to get all the entries from this table, but instead I only get the first entry.

I'm stuck here, so any help would be appreciated. Thanks.

https://www.merriam-webster.com/browse/medical/a-z
https://www.merriam-webster.com/browse/legal/a-z
https://www.merriam-webster.com/browse/dictionary/a-z
https://www.merriam-webster.com/browse/thesaurus/a-z
    As the answer below says, you need another for loop: an outer loop over the letters a-z and an inner loop over the page numbers. To get the number of pages, find the a tag for the last page and read its data-page attribute: <a aria-label="Last" data-page="75" ... Commented Oct 21, 2020 at 19:33
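
As a rough illustration of the comment above, the page count can be read from the `data-page` attribute of the "Last" pagination link. The HTML snippet is a made-up stand-in built around the tag quoted in the comment, so the real page markup may differ, and `last_page_number` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the pagination markup; only the aria-label and
# data-page attributes of the "Last" link come from the comment above.
SAMPLE_HTML = '''
<ul class="pagination">
  <li><a aria-label="Next" data-page="2" href="#">Next</a></li>
  <li><a aria-label="Last" data-page="75" href="#">Last</a></li>
</ul>
'''


def last_page_number(html, default=1):
    """Return the page number stored on the "Last" link, or `default`."""
    soup = BeautifulSoup(html, 'html.parser')
    link = soup.select_one('a[aria-label="Last"]')
    if link is None:
        return default
    value = link.get('data-page', '')
    # Fall back to the default when the attribute is empty or not numeric
    return int(value) if value.isdigit() else default


print(last_page_number(SAMPLE_HTML))
```

With a number like this in hand, the inner loop could be written as `range(1, last_page_number(html) + 1)` instead of hard-coding the page count.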

2 Answers

To get all entries, you can use this example:

import requests
from bs4 import BeautifulSoup


url = 'https://www.merriam-webster.com/browse/dictionary/a/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for a in soup.select('.entries a'):
    print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))

Prints:

(a) heaven on earth            https://www.merriam-webster.com/dictionary/%28a%29%20heaven%20on%20earth
(a) method in/to one's madness https://www.merriam-webster.com/dictionary/%28a%29%20method%20in%2Fto%20one%27s%20madness
(a) penny for your thoughts    https://www.merriam-webster.com/dictionary/%28a%29%20penny%20for%20your%20thoughts
(a) quarter after              https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20after
(a) quarter of                 https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20of
(a) quarter past               https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20past
(a) quarter to                 https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20to
(all) by one's lonesome        https://www.merriam-webster.com/dictionary/%28all%29%20by%20one%27s%20lonesome
(all) choked up                https://www.merriam-webster.com/dictionary/%28all%29%20choked%20up
(all) for the best             https://www.merriam-webster.com/dictionary/%28all%29%20for%20the%20best
(all) in good time             https://www.merriam-webster.com/dictionary/%28all%29%20in%20good%20time

...and so on.
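
For readers unfamiliar with the format string used above, here is a small standalone sketch (no network access) of what `'{:<30} {}'` does and why the printed links contain sequences such as `%28` and `%20`. The `entry_url` and `format_row` helpers and the `BASE` constant are names invented for this illustration; in the answer, the `href` values come straight from the page.

```python
from urllib.parse import quote

BASE = 'https://www.merriam-webster.com'  # same prefix the answer prepends


def entry_url(word):
    """Hypothetical helper: build a dictionary URL by percent-encoding the word."""
    # safe='' also encodes '/', matching the %2F seen in the sample output
    return BASE + '/dictionary/' + quote(word, safe='')


def format_row(word, url):
    # '{:<30}' left-aligns the word and pads it with spaces to 30 characters
    return '{:<30} {}'.format(word, url)


word = '(a) quarter to'
print(format_row(word, entry_url(word)))
```

The padding is what makes the URLs line up in a column; words longer than 30 characters are not truncated, they just push the URL further right.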

To scrape multiple pages:

url = 'https://www.merriam-webster.com/browse/dictionary/a/{}'

for page in range(1, 76):
    soup = BeautifulSoup(requests.get(url.format(page)).content, 'html.parser')
    for a in soup.select('.entries a'):
        print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))

EDIT: To get all pages from A to Z:

import requests
from bs4 import BeautifulSoup


url = 'https://www.merriam-webster.com/browse/dictionary/{}/{}'

for char in range(ord('a'), ord('z')+1):
    page = 1
    while True:
        soup = BeautifulSoup(requests.get(url.format(chr(char), page)).content, 'html.parser')
        for a in soup.select('.entries a'):
            print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))

        # Stop when there is no "Last" link or its data-page attribute is empty
        last = soup.select_one('[aria-label="Last"]')
        if last is None or not last.get('data-page'):
            break

        page += 1

EDIT 2: To save to file:

import requests
from bs4 import BeautifulSoup


url = 'https://www.merriam-webster.com/browse/dictionary/{}/{}'


with open('data.txt', 'w', encoding='utf-8') as f_out:
    for char in range(ord('a'), ord('z')+1):
        page = 1
        while True:
            soup = BeautifulSoup(requests.get(url.format(chr(char), page)).content, 'html.parser')
            for a in soup.select('.entries a'):
                print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))

                print('{}\t{}'.format(a.text, 'https://www.merriam-webster.com' + a['href']), file=f_out)

            # Stop when there is no "Last" link or its data-page attribute is empty
            last = soup.select_one('[aria-label="Last"]')
            if last is None or not last.get('data-page'):
                break

            page += 1

5 Comments

This only covers the letter 'a', but I want it for all pages a-z. Could you show me that as well?
Thanks, but as a beginner I don't understand some of the commands, so an explanation would help others as well. I also want to save this to a text file; how do I do that?
@Mujtaba See my Edit 2, how to save to file.
Thanks, sir, it's really helpful. I'm now adding other categories as well, but I'm getting some errors. Can you please add them to the code? They are: thesaurus, medical, legal.
Please look at the comment above when you have time.
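
One possible way to cover the other categories requested in the comments is an outer loop over category names. This sketch only builds the URLs. It assumes, without having verified it, that `thesaurus`, `medical` and `legal` use the same `/browse/<category>/<letter>/<page>` pattern as `dictionary`; `CATEGORIES`, `BROWSE_URL` and `browse_urls` are illustrative names.

```python
import string

BROWSE_URL = 'https://www.merriam-webster.com/browse/{}/{}/{}'

# Category names taken from the URLs listed in the question; whether each
# one paginates exactly like 'dictionary' is an assumption.
CATEGORIES = ['dictionary', 'thesaurus', 'medical', 'legal']


def browse_urls(category, pages_per_letter=1):
    """Yield browse URLs for letters a-z and pages 1..pages_per_letter."""
    for letter in string.ascii_lowercase:
        for page in range(1, pages_per_letter + 1):
            yield BROWSE_URL.format(category, letter, page)


for category in CATEGORIES:
    urls = list(browse_urls(category))
    print(category, len(urls), urls[0])
```

Each URL could then be fetched and parsed with the same `.entries a` selector; if a category returns errors, its markup or URL layout probably differs and would need to be checked in the browser first.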

I think you need another loop:

for entries in table:
    links = entries.find_all('a')
    for name in links:
        print(name.text)
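
To see why the extra loop matters, here is a self-contained sketch that runs without network access. The HTML is made-up markup that only mimics a `div.entries` block holding one list of links, so the real page structure is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics a div.entries containing one <ul> of links.
SAMPLE_HTML = '''
<div class="entries">
  <ul>
    <li><a href="/dictionary/aardvark">aardvark</a></li>
    <li><a href="/dictionary/abacus">abacus</a></li>
    <li><a href="/dictionary/abandon">abandon</a></li>
  </ul>
</div>
'''

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
container = soup.find('div', attrs={'class': 'entries'})

first_only = []
all_names = []
for ul in container.find_all('ul'):
    links = ul.find_all('a')
    first_only.append(links[0].text)         # what the question's code does
    all_names.extend(a.text for a in links)  # inner loop over every link

print(first_only)
print(all_names)
```

Because all the links sit inside a single `<ul>`, indexing with `links[0]` yields one name per list, while iterating over `links` yields every entry.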
