Parsing a table in an HTML document into a CSV using Python 3.7

Question

I have the following table in an HTML document:

<BODY>

<TABLE cellspacing=0 class="emlhdr"><TBODY><TR style="font-size: 1px"><TD style="border: none; padding: 0px">&nbsp;</TD></TR>
</TBODY></TABLE><!-- BEGIN_EXCLUDE_MORE_DATA -->
<TABLE cellspacing=1 class="ad"><TBODY>
<TR class="even"><TH class="adlbl10"><NOBR>Title: </NOBR></TH><TD>Sample Title</TD></TR>
<TR class="odd"><TH class="adlbl10"><NOBR>Site: </NOBR></TH><TD> Sample Site </TD></TR>
<TR class="even"><TH class="adlbl10"><NOBR>URLIcon: </NOBR></TH><TD><style type="text/css">
</style>
</TD></TR>
<TR class="odd"><TH class="adlbl10"><NOBR>URL: </NOBR></TH><TD><style type="text/css">
</style>
</TD></TR>
<TR class="even"><TH class="adlbl10"><NOBR>Form: </NOBR></TH><TD>HistoryListEntry</TD></TR>
<TR class="odd"><TH class="adlbl10"><NOBR>Collaborators: </NOBR></TH><TD>1.&nbsp;&nbsp;John Doe<br>
2.&nbsp;&nbsp;Jane Doe<br>
3.&nbsp;&nbsp;Jack Doe</TD></TR>
<TR class="even"><TH class="adlbl10"><NOBR>SourceForm: </NOBR></TH><TD>Reply</TD></TR>
<TR class="odd"><TH class="adlbl10"><NOBR>$UpdatedBy: </NOBR></TH><TD>John Doe</TD></TR>
<TR class="even"><TH class="adlbl10"><NOBR>$Revisions: </NOBR></TH><TD>2/24/2020 9:37:13 AM +0000</TD></TR>
</TBODY></TABLE><!-- END_EXCLUDE_MORE_DATA -->
</BODY>

I am trying to parse the table so that the different entries go into columns in a .csv file. Here is my Python 3.7 code so far:

import os
from lxml import etree
from bs4 import BeautifulSoup
import csv

output_row = []

with open(x, 'r', encoding="ascii", errors="surrogateescape") as f:
    s = f.read()
    soup = BeautifulSoup(s, 'lxml') # Parse the HTML as a string
    table = soup.find_all('table')[1] # Grab the second table (index 1)
    for table_row in table.findAll('tr'):
        columns = table_row.findAll('td')
        for column in columns:
            output_row.append(column.text)

with open('output.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    for x in output_row:
        writer.writerows(x)

This appears to work fine and extracts the data, but I can't seem to align it to columns. What I would like is for the text value of each 'TR class' row to be in a new column in the .csv. So the .csv would have 9 columns with

'Sample Title', 'Sample Site', '', '', 'HistoryListEntry', '1. John Doe 2. Jane Doe 3. Jack Doe', 'Reply', 'John Doe', '2/24/2020 9:37:13 AM +0000'

respectively in them.

Any suggestions on the amendments I need to make to my code?

With thanks

Gerd · Accepted Answer · 2020-05-06 19:59:00Z

Your output_row seems to represent a single row, so you just need to write this row to your csv file (using writerow and without the for loop):

# newline='' lets the csv module control line endings (avoids blank lines on Windows)
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar="'", quoting=csv.QUOTE_ALL)
    writer.writerow(output_row)
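
To see why the original loop scattered the data, here is a minimal sketch that compares both calls on an in-memory buffer. The two-item `output_row` and the helper `to_csv` are made up for illustration; only `csv.writer`, `writerow` and `writerows` come from the standard library.

```python
import csv
import io

def to_csv(write):
    """Run `write` against a csv.writer on an in-memory buffer and return the text."""
    buffer = io.StringIO(newline="")
    write(csv.writer(buffer))
    return buffer.getvalue()

# Hypothetical, shortened stand-in for the list built from the <td> cells.
output_row = ["Reply", "John"]

# writerow: the whole list becomes one CSV line, one field per item.
one_line = to_csv(lambda w: w.writerow(output_row))

def original_loop(w):
    # writerows expects an iterable of rows; given a string, it treats
    # every character as a separate row.
    for x in output_row:
        w.writerows(x)

per_character = to_csv(original_loop)

print(repr(one_line))
print(per_character.splitlines())
```

So the loop in the question produces one character per line, while a single `writerow(output_row)` call produces one line with one column per cell.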

To handle multiple data records, you will need a list of lists, iterating over the outer list (see also this discussion).
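
A minimal sketch of the list-of-lists idea, assuming each inner list holds the cells extracted from one HTML document. The `records` data and the function name `rows_to_csv` are invented for illustration, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Hypothetical records: each inner list is one CSV row (one parsed document).
records = [
    ["Sample Title", "Sample Site", "HistoryListEntry"],
    ["Other Title", "Other Site", "Reply"],
]

def rows_to_csv(rows):
    """Serialize a list of rows to CSV text, one inner list per line."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
    writer.writerows(rows)  # iterates over the outer list, writing one row each
    return buffer.getvalue()

csv_text = rows_to_csv(records)
print(csv_text)
```

With a real file, the same `writer.writerows(records)` call would go inside the `with open(...)` block in place of `writerow`.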

Also beware of the line breaks in your data fields; you should probably replace them to get a proper result.
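
One possible way to replace the line breaks, sketched with plain string methods. The helper name `clean_cell` is hypothetical, and `raw_cell` is an assumed approximation of the text a parser returns for the Collaborators cell (non-breaking spaces plus newlines).

```python
def clean_cell(text):
    """Collapse line breaks, non-breaking spaces and repeated blanks into single spaces."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes '\n' and the non-breaking space '\xa0'.
    return " ".join(text.split())

# Assumed raw text of the Collaborators cell, for illustration only.
raw_cell = "1.\xa0\xa0John Doe\n2.\xa0\xa0Jane Doe\n3.\xa0\xa0Jack Doe"
cleaned = clean_cell(raw_cell)
print(cleaned)
```

In the question's loop this would amount to appending `clean_cell(column.text)` instead of `column.text`. It also trims the padding around ' Sample Site ' and turns cells that contain only a line break into empty strings.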

dabingsou · Accepted Answer · 2020-07-14 05:57:59Z

Another method, using the simplified_scrapy library:

from simplified_scrapy import SimplifiedDoc, utils
html = '''<BODY>
<TABLE cellspacing=0 class="emlhdr"><TBODY><TR style="font-size: 1px"><TD style="border: none; padding: 0px">&nbsp;</TD></TR>
</TBODY></TABLE><!-- BEGIN_EXCLUDE_MORE_DATA -->
<TABLE cellspacing=1 class="ad"><TBODY>
<TR class="even"><TH class="adlbl10"><NOBR>Title: </NOBR></TH><TD>Sample Title</TD></TR>
<TR class="odd"><TH class="adlbl10"><NOBR>Site: </NOBR></TH><TD> Sample Site </TD></TR>
<TR class="even"><TH class="adlbl10"><NOBR>URLIcon: </NOBR></TH><TD><style type="text/css">
</style>
</TD></TR>
<TR class="odd"><TH class="adlbl10"><NOBR>URL: </NOBR></TH><TD><style type="text/css">
</style>
</TD></TR>
<TR class="even"><TH class="adlbl10"><NOBR>Form: </NOBR></TH><TD>HistoryListEntry</TD></TR>
<TR class="odd"><TH class="adlbl10"><NOBR>Collaborators: </NOBR></TH><TD>1.&nbsp;&nbsp;John Doe<br>
2.&nbsp;&nbsp;Jane Doe<br>
3.&nbsp;&nbsp;Jack Doe</TD></TR>
<TR class="even"><TH class="adlbl10"><NOBR>SourceForm: </NOBR></TH><TD>Reply</TD></TR>
<TR class="odd"><TH class="adlbl10"><NOBR>$UpdatedBy: </NOBR></TH><TD>John Doe</TD></TR>
<TR class="even"><TH class="adlbl10"><NOBR>$Revisions: </NOBR></TH><TD>2/24/2020 9:37:13 AM +0000</TD></TR>
</TBODY></TABLE><!-- END_EXCLUDE_MORE_DATA -->
</BODY>'''

doc = SimplifiedDoc(html)
# header = doc.selects('TABLE.ad>TR').select("TH>text()")
data = doc.selects('TABLE.ad>TR').select("TD>text()")
utils.save2csv('output.csv', [data])
print(data)

Result:

['Sample Title', 'Sample Site', '', '', 'HistoryListEntry', '1. John Doe 2. Jane Doe 3. Jack Doe', 'Reply', 'John Doe', '2/24/2020 9:37:13 AM +0000']

Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
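
If installing a third-party package is not an option, roughly the same extraction can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not the simplified_scrapy approach above: the class `CellCollector`, the function `extract_cells` and the shortened `sample_html` are all hypothetical, and the sketch assumes the target table contains no nested tables.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td> inside tables with a given class."""

    def __init__(self, table_class):
        super().__init__(convert_charrefs=True)
        self.table_class = table_class
        self.in_table = False
        self.in_cell = False
        self.in_style = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TABLE> arrives as "table".
        if tag == "table":
            self.in_table = dict(attrs).get("class") == self.table_class
        elif tag == "td" and self.in_table:
            self.in_cell = True
            self.cells.append("")
        elif tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "td":
            self.in_cell = False
        elif tag == "style":
            self.in_style = False

    def handle_data(self, data):
        # Keep text inside a target cell, but skip <style> contents.
        if self.in_cell and not self.in_style:
            self.cells[-1] += data

def extract_cells(html, table_class):
    """Return the whitespace-normalized text of each <td> in the matching table."""
    parser = CellCollector(table_class)
    parser.feed(html)
    parser.close()
    return [" ".join(cell.split()) for cell in parser.cells]

# Shortened, hypothetical version of the document from the question.
sample_html = """
<TABLE class="emlhdr"><TR><TD>&nbsp;</TD></TR></TABLE>
<TABLE cellspacing=1 class="ad"><TBODY>
<TR><TH><NOBR>Title: </NOBR></TH><TD>Sample Title</TD></TR>
<TR><TH><NOBR>URL: </NOBR></TH><TD><style type="text/css">
</style>
</TD></TR>
<TR><TH>Collaborators: </TH><TD>1.&nbsp;&nbsp;John Doe<br>
2.&nbsp;&nbsp;Jane Doe</TD></TR>
</TBODY></TABLE>
"""

cells = extract_cells(sample_html, "ad")
print(cells)
```

The resulting list can be passed to `csv.writer(...).writerow(cells)` to produce a single CSV row.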
