I'm trying to scrape a web page at my company and write the result to a CSV file.

I am able to get at the data I want with this code:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://wiki.us.cworld.company.com/display/6TO/AWS+Accounts', auth=('tdunphy', 'secret!'))
soup = BeautifulSoup(page.text, 'html.parser')
all_rows = soup.find_all('tr')
# Skip the first row of the table and print the text of every other row
for row in all_rows[1:]:
    print(row.get_text())

But the resulting data is run together and barely decipherable:

company-govcloud-ab-mc-stage-adminkpmg-us-aws-adv-ab-mc-govcloud-admin-stageCommercial AccountAdvisory12345678901NoIslandhttps://company-govcloud-ab-mc-stage-admin.signin.aws.amazon.com/consoleKarel Somebody23452126676371Console, Access Key
company-govcloud-ab-mc-stagekpmg-us-aws-adv-ab-mc-govcloud-stageGov AccountAdvisory12324546562NoIslandhttps://company-govcloud-ab-mc-stage.signin.amazonaws-us-gov.com/consoleKarel Somebody123213123131Console, Access Key
company-cob(Decommissioned 03/28/2019)company-COB COB, Client OnboardingAdvisory21234546789812NoIslandhttps://company-cob.signin.aws.amazon.com/console/Laurence LorcaPending DecommissionConsole, Access Key

I want the resulting CSV to have the following headers:

['Company Account Name', 'AWS Account Name', 'Description', 'LOB', 'AWS Account Number', 'CIDR Block', 'Connected to Montvale', 'Peninsula or Island', 'URL', 'Owner', 'Engagement Code', 'CloudOps Access Type']

On the original web page the data is in an HTML table, and the results are legible:

company-govcloud-ab-mc-stage-admin  company-us-aws-adv-ab-mc-govcloud-admin-stage   Commercial Account  Advisory    12345667890101  No  Island  https://company-govcloud-ab-mc-stage-admin.signin.aws.amazon.com/console    Karel Somebody  123456789101    Console, Access Key

Here is some sample HTML from the data that I'm extracting:

<tr><td class="confluenceTd">company-master</td><td class="confluenceTd">us-ktawsmasacct</td><td class="confluenceTd">Master Account</td><td class="confluenceTd">BPG</td><td class="confluenceTd"><span style="text-decoration: none;">123456789101</span></td><td colspan="1" class="confluenceTd"><br/></td><td class="confluenceTd">No</td><td class="confluenceTd">N/A - no cloud resources</td><td class="confluenceTd"><a href="https://us-ktech-aws-master-acct.signin.aws.amazon.com/console" class="external-link" rel="nofollow">https://us-ktech-aws-master-acct.signin.aws.amazon.com/console</a></td><td class="confluenceTd">Rahul Arya</td><td class="confluenceTd"> 245612345678</td><td class="confluenceTd">Console,   Access Key</td></tr><tr><td class="confluenceTd">company-transit-hub1</td><td class="confluenceTd">us-ktawsth1acct</td><td class="confluenceTd">Transit Hub</td><td class="confluenceTd">BPG</td><td class="confluenceTd"><span style="text-decoration: none;">303779310401</span></td><td colspan="1" class="confluenceTd"><span style="color: rgb(0,0,0);">10.47.0.0/24</span></td><td class="confluenceTd">No</td><td class="confluenceTd">Peninsula</td><td class="confluenceTd"><a href="https://company-transit-hub1.signin.aws.amazon.com/console" class="external-link" rel="nofollow">https://company-transit-hub1.signin.aws.amazon.com/console</a></td><td class="confluenceTd">Rahul Arya</td><td class="confluenceTd"> 245612345678</td><td class="confluenceTd">Console,   Access Key</td></tr>

<tr><td colspan="1" class="confluenceTd">company-transit-hub3 (lab)</td><td colspan="1" class="confluenceTd"><span style="color: rgb(68,68,68);text-decoration: none;">us-dbawsth3acct</span></td><td colspan="1" class="confluenceTd">Transit Hub</td><td colspan="1" class="confluenceTd">BPG</td><td colspan="1" class="confluenceTd"><span style="color: rgb(68,68,68);text-decoration: none;">1098765432101</span> </td><td colspan="1" class="confluenceTd"><span style="color: rgb(0,0,0);">10.0.0.0/24</span></td><td colspan="1" class="confluenceTd">No</td><td colspan="1" class="confluenceTd">Island</td><td colspan="1" class="confluenceTd"> <a href="https://company-transithub3-lab.signin.aws.amazon.com/console" class="external-link" rel="nofollow">https://company-transithub3-lab.signin.aws.amazon.com/console</a></td><td colspan="1" class="confluenceTd">Rahul Arya </td><td colspan="1" class="confluenceTd"> </td><td colspan="1" class="confluenceTd">Console, Access Key</td></tr>

<tr><td class="confluenceTd">company-security</td><td class="confluenceTd"><span style="color: rgb(68,68,68);text-decoration: none;">us-ktawssecacct</span></td><td class="confluenceTd">Security</td><td class="confluenceTd">BPG</td><td class="confluenceTd">254312345691</td><td colspan="1" class="confluenceTd"><br/></td><td class="confluenceTd">No</td><td class="confluenceTd"><span>connected through hub1</span></td><td class="confluenceTd"><a href="https://us-ktawssecacct.signin.aws.amazon.com/console" class="external-link" rel="nofollow">https://us-ktawssecacct.signin.aws.amazon.com/console</a></td><td class="confluenceTd">Rahul Arya</td><td class="confluenceTd"> 245612345678</td><td class="confluenceTd">Console,   Access Key</td></tr><tr><td class="confluenceTd">company-shared-services</td><td class="confluenceTd">us-ktawsssacct</td><td class="confluenceTd">Shared Services</td><td class="confluenceTd">BPG</td><td class="confluenceTd">300944922012</td><td colspan="1" class="confluenceTd"><br/></td><td class="confluenceTd">No</td><td class="confluenceTd"><span>connected through hub1</span></td><td class="confluenceTd"><a href="https://company-shared-services.signin.aws.amazon.com/console" class="external-link" rel="nofollow">https://company-shared-services.signin.aws.amazon.com/console</a></td><td class="confluenceTd">Rahul Arya</td><td class="confluenceTd">245612345678</td><td class="confluenceTd">Console,   Access Key</td></tr><tr>

<tr><td class="confluenceTd">company-logging</td><td class="confluenceTd">us-ktawslogmonacct</td><td class="confluenceTd">Logging</td><td class="confluenceTd">BPG</td><td class="confluenceTd">542348765123</td><td colspan="1" class="confluenceTd"><br/></td><td class="confluenceTd">No</td><td class="confluenceTd"><span>connected through hub1</span></td><td class="confluenceTd"><a href="https://company-logging.signin.aws.amazon.com/console" class="external-link" rel="nofollow">https://company-logging.signin.aws.amazon.com/console</a></td><td class="confluenceTd">Rahul Arya</td><td class="confluenceTd">800000039768</td><td class="confluenceTd">Console,   Access Key</td></tr><tr><td class="confluenceTd">company-spoke-acct1</td><td class="confluenceTd">us-ktawsspk1acct</td><td class="confluenceTd">Spoke Account</td><td class="confluenceTd">BPG</td><td class="confluenceTd"><span style="text-decoration: none;">103440952267</span></td><td colspan="1" class="confluenceTd"><span style="color: rgb(0,0,0);text-decoration: none;">10.47.8.0/24</span></td><td class="confluenceTd">No</td><td class="confluenceTd"><span>connected through hub1</span></td><td class="confluenceTd"><a href="https://block-chain.signin.aws.amazon.com/console" class="external-link" rel="nofollow">https://block-chain.signin.aws.amazon.com/console</a></td><td class="confluenceTd">Rahul Arya</td><td class="confluenceTd"><p>123456757897</p></td><td class="confluenceTd">Console,   Access Key</td></tr>

The problem is that when I scrape the page, the text of all the cells in a row runs together, so I need a way to separate the fields and put commas between them.

How can I insert a comma between each field of the table data so that I can write it to a CSV file?

  • It is probably better to parse each row (tr), extract every element, e.g. td and create a list of lists, which can be read e.g. with pandas. Maybe this helps... if not, please post a sample of your data (page), so that your output can be reproduced. Commented Jul 1, 2019 at 22:43
  • Thanks. I've updated the OP with some sample HTML from the data that I'm trying to extract. I've also put it into a paste Commented Jul 2, 2019 at 12:21
  • Have a look at Andrej Kesely's answer below, I think that is what you need. :-) Commented Jul 2, 2019 at 13:27

1 Answer

To write the CSV file, use the built-in csv module:

data = '''
<table>
<tr>
<td>company-govcloud-ab-mc-stage-admin</td>
<td>company-us-aws-adv-ab-mc-govcloud-admin-stage</td>
<td>Commercial Account</td>
<td>Advisory</td>
<td>12345667890101</td>
<td>No</td>
<td>Island</td>
<td>https://company-govcloud-ab-mc-stage-admin.signin.aws.amazon.com/console</td>
<td>Karel Somebody</td>
<td>123456789101</td>
<td>Console, Access Key</td>
</tr>
</table>'''

headers = ['Company Account Name', 'AWS Account Name', 'Description', 'LOB', 'AWS Account Number', 'Connected to Montvale', 'Peninsula or Island', 'URL', 'Owner', 'Engagement Code', 'CloudOps Access Type']

from bs4 import BeautifulSoup
import csv

soup = BeautifulSoup(data, 'lxml')

rows = []
for tr in soup.select('tr'):
    # Select the cells of the current row (tr), not of the whole document (soup)
    rows.append([td.text for td in tr.select('td')])


with open('out.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=';',
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(headers)

    for row in rows:
        writer.writerow(row)

The file out.csv contains:

Company Account Name;AWS Account Name;Description;LOB;AWS Account Number;Connected to Montvale;Peninsula or Island;URL;Owner;Engagement Code;CloudOps Access Type
company-govcloud-ab-mc-stage-admin;company-us-aws-adv-ab-mc-govcloud-admin-stage;Commercial Account;Advisory;12345667890101;No;Island;https://company-govcloud-ab-mc-stage-admin.signin.aws.amazon.com/console;Karel Somebody;123456789101;Console, Access Key
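
The same approach can be pointed at the Confluence markup from the question. What follows is a minimal sketch under a few assumptions. The rows look like the sample HTML posted there (class attributes are dropped here for brevity). The header row uses th cells, which is a guess because the question does not show it. table_rows is a hypothetical helper name. The sketch takes the cells from each tr, collapses stray whitespace, and writes comma-delimited output, leaving it to the csv module to quote fields such as "Console, Access Key".

```python
import csv
import io

from bs4 import BeautifulSoup

# Two rows shortened from the sample HTML in the question.
# The header row is an assumption; the question does not show it.
html = """
<table>
<tr><th>Company Account Name</th><th>AWS Account Name</th></tr>
<tr><td>company-master</td><td>us-ktawsmasacct</td><td>Master Account</td><td>BPG</td>
<td><span style="text-decoration: none;">123456789101</span></td><td colspan="1"><br/></td>
<td>No</td><td>N/A - no cloud resources</td>
<td><a href="https://us-ktech-aws-master-acct.signin.aws.amazon.com/console">https://us-ktech-aws-master-acct.signin.aws.amazon.com/console</a></td>
<td>Rahul Arya</td><td> 245612345678</td><td>Console,   Access Key</td></tr>
<tr><td>company-transit-hub1</td><td>us-ktawsth1acct</td><td>Transit Hub</td><td>BPG</td>
<td><span style="text-decoration: none;">303779310401</span></td>
<td colspan="1"><span style="color: rgb(0,0,0);">10.47.0.0/24</span></td>
<td>No</td><td>Peninsula</td>
<td><a href="https://company-transit-hub1.signin.aws.amazon.com/console">https://company-transit-hub1.signin.aws.amazon.com/console</a></td>
<td>Rahul Arya</td><td> 245612345678</td><td>Console,   Access Key</td></tr>
</table>
"""

HEADERS = ['Company Account Name', 'AWS Account Name', 'Description', 'LOB',
           'AWS Account Number', 'CIDR Block', 'Connected to Montvale',
           'Peninsula or Island', 'URL', 'Owner', 'Engagement Code',
           'CloudOps Access Type']


def table_rows(markup):
    """Return one list of cell texts per table row that contains td cells."""
    soup = BeautifulSoup(markup, 'html.parser')
    rows = []
    for tr in soup.find_all('tr'):
        cells = tr.find_all('td')  # cells of this row only
        if not cells:
            continue  # a row with no td cells (e.g. a header row) is skipped
        # Collapse runs of whitespace and trim each cell; empty cells become ''
        rows.append([' '.join(td.get_text().split()) for td in cells])
    return rows


rows = table_rows(html)

# Write to an in-memory buffer here; swap in open('out.csv', 'w', newline='')
# to produce a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)  # default delimiter is a comma
writer.writerow(HEADERS)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

For the live page, pass page.text from the requests call in the question to table_rows in place of the inline string.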

4 Comments

Thank you. When I use your code almost exactly as presented, it produces this result. The only difference between my code and yours is that I am pulling the info from a web page, while I think you are using info embedded in the script. This is the code I'm using. How can I get the result you show in your post?
This is the code I am using. Sorry, there was a problem with the link I posted before.
@bluethundr the CSV seems OK, but the program you are opening it with doesn't seem to support ; as a delimiter. Try changing it to a comma, for example. I'm on vacation now, so I can't look at it closely.
Thanks, I am using the comma as the delimiter: writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL) but I am getting the same result. I am opening the CSV in Excel. The row data is duplicated 68 times going down, and the unique data spreads across the sheet instead of going down.
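
The symptom in the last comment (identical rows, with every cell of the table strung across one line) is what you get when the cells are selected from the whole document on each pass of the loop rather than from the current row. Each row then holds every td on the page. One way to confirm this without relying on Excel is to read the file back and compare field counts with the header. This is a small sketch using only the standard library. ragged_rows is a hypothetical helper name and the two sample strings are made up for illustration.

```python
import csv
import io


def ragged_rows(csv_text, delimiter=','):
    """Return (line number, field count) for rows that don't match the header."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader)
    return [(reader.line_num, len(row))
            for row in reader
            if len(row) != len(header)]


# A well-formed file: the quoted field keeps its comma and counts as one field.
good = 'Owner,CloudOps Access Type\r\nKarel Somebody,"Console, Access Key"\r\n'

# A malformed file: each data row repeats the cells of every row.
bad = 'Owner,CloudOps Access Type\r\nA,B,A,B\r\nA,B,A,B\r\n'

good_report = ragged_rows(good)
bad_report = ragged_rows(bad)
print(good_report)
print(bad_report)
```

An empty list means every data row has as many fields as the header. To check a file on disk, pass the result of open('out.csv', newline='').read().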
