
I am trying to webscrape the "Active Positions" table from the following website:

https://www.nasdaq.com/market-activity/stocks/aapl/institutional-holdings

My code is below:

from bs4 import BeautifulSoup
import requests

html_text = requests.get('https://www.nasdaq.com/market-activity/stocks/aapl/institutional-holdings')
soup = BeautifulSoup(html_text, 'lxml')
job1 = soup.find('div', classs_ = 'dialog-off-canvas-main-canvas')
job2 = job1.find('div', class_ = 'page with-primary-nav hide-more-videos')
job3 = job2.find('div', class_ = 'page__main')
job4 = job3.find('div', class_ = 'page__content')
job5 = job4.find('div', class_ = 'quote-subdetail__content quote-subdetail__content--new')
job6 = job5.findAll('div', class_ = 'layout layout--2-col-large')
job7 = job6.find('div', class_ = 'institutional-holdings institutional-holdings--paginated')
job8 = job7.find('div', class_ = 'institutional-holdings__section institutional-holdings__section--active-positions')
job9 = job8.find('div', class_ = 'institutional-holdings__table-container')
job10 = job9.find('table', class_ = 'institutional-holdings__table')
job11 = job10.find('tbody', class_ = 'institutional-holdings__body')
job12 = job11.findAll('tr', class_ = 'institutional-holdings__row').text

print(job12)

I have chosen to include nearly every class in the path to attempt to speed up the execution, as including only a couple took up to 10 minutes before I decided to interrupt. However, I still get the same long execution with no output. Is there something wrong with my code? Or can I improve this by doing something I haven't thought of? Thanks.

1 Answer

Data is being hydrated into the page via JavaScript XHR calls, so the table is not in the HTML that requests downloads. Here is a way of getting the Active Positions data by scraping the API endpoint directly:

import requests
import pandas as pd

url = 'https://api.nasdaq.com/api/company/AAPL/institutional-holdings?limit=15&type=TOTAL&sortColumn=marketValue&sortOrder=DESC'

headers = {
    'accept': 'application/json, text/plain, */*',
    'origin': 'https://www.nasdaq.com',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

r = requests.get(url, headers=headers)
df = pd.json_normalize(r.json()['data']['activePositions']['rows'])
print(df)

Result in terminal:

                    positions holders         shares
0         Increased Positions   1,780    239,170,203
1         Decreased Positions   2,339    209,017,331
2              Held Positions     283  8,965,339,255
3  Total Institutional Shares   4,402  9,413,526,789
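Note that the API returns the numbers as comma-formatted strings. If you want to do arithmetic on them, a small follow-up step converts them to integers (a sketch using the sample rows from the output above, not part of the original answer):

```python
# Sketch (assumed follow-up, not from the original answer): convert the
# comma-formatted string columns of the scraped frame to integers.
# The sample rows mirror the printed output above.
import pandas as pd

df = pd.DataFrame({
    'positions': ['Increased Positions', 'Decreased Positions',
                  'Held Positions', 'Total Institutional Shares'],
    'holders': ['1,780', '2,339', '283', '4,402'],
    'shares': ['239,170,203', '209,017,331', '8,965,339,255', '9,413,526,789'],
})

# Strip the thousands separators and cast to 64-bit integers.
for col in ('holders', 'shares'):
    df[col] = df[col].str.replace(',', '', regex=False).astype('int64')

print(df.dtypes)
```

After this, sums and comparisons work as expected (e.g. the first three `holders` rows add up to the total row).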

In case you want to scrape the big 4,402-row Institutional Holders table, there are ways to do that too.

EDIT: Here is how you can save the data to a JSON file:

df.to_json('active_positions.json')

Although it might make more sense to save it as tabular data (CSV):

df.to_csv('active_positions.csv')
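As a quick sanity check (a sketch, not part of the original answer), the frame survives a CSV round trip. An in-memory `io.StringIO` buffer stands in for `active_positions.csv` here so the example needs no filesystem access:

```python
# Sketch: verify the DataFrame round-trips through CSV unchanged.
# io.StringIO is used in place of a file on disk for illustration.
import io
import pandas as pd

df = pd.DataFrame({
    'positions': ['Increased Positions', 'Decreased Positions'],
    'holders': ['1,780', '2,339'],
})

buf = io.StringIO()
df.to_csv(buf, index=False)   # same call as above, minus the filename
buf.seek(0)

df2 = pd.read_csv(buf)
print(df2.equals(df))         # the round-tripped frame matches
```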

Pandas docs: https://pandas.pydata.org/docs/


6 Comments

Thank you! I noticed you've included JSON with the code; how would I be able to save the output data into a JSON file?
Welcome @kiestuthridge23. I edited my answer to show you how you can save the data to JSON, and also to CSV.
That's great, thanks. Also, how would I be able to scrape the larger table below, as you mentioned?
There is a different API for that one - you can find it under Dev tools, in the Network tab. If you have difficulties, post a new question (as it is really a new question, based on my suggestion :) )
I will give you a solution to your new question if you will ask it @kiestuthridge23
