
I have the table below in an HTML document:

<BODY>

<TABLE cellspacing=0 class="emlhdr"><TBODY><TR style="font-size: 1px"><TD style="border: none; padding: 0px">&nbsp;</TD></TR>
</TBODY></TABLE><!-- BEGIN_EXCLUDE_MORE_DATA -->
<TABLE cellspacing=1 class="ad"><TBODY>
<TR class="even"><TH class="adlbl10"><NOBR>Title: </NOBR></TH><TD>Sample Title</TD></TR>
<TR class="odd"><TH class="adlbl10"><NOBR>Site: </NOBR></TH><TD> Sample Site </TD></TR>
<TR class="even"><TH class="adlbl10"><NOBR>URLIcon: </NOBR></TH><TD><style type="text/css">
</style>
</TD></TR>
<TR class="odd"><TH class="adlbl10"><NOBR>URL: </NOBR></TH><TD><style type="text/css">
</style>
</TD></TR>
<TR class="even"><TH class="adlbl10"><NOBR>Form: </NOBR></TH><TD>HistoryListEntry</TD></TR>
<TR class="odd"><TH class="adlbl10"><NOBR>Collaborators: </NOBR></TH><TD>1.&nbsp;&nbsp;John Doe<br>
2.&nbsp;&nbsp;Jane Doe<br>
3.&nbsp;&nbsp;Jack Doe</TD></TR>
<TR class="even"><TH class="adlbl10"><NOBR>SourceForm: </NOBR></TH><TD>Reply</TD></TR>
<TR class="odd"><TH class="adlbl10"><NOBR>$UpdatedBy: </NOBR></TH><TD>John Doe</TD></TR>
<TR class="even"><TH class="adlbl10"><NOBR>$Revisions: </NOBR></TH><TD>2/24/2020 9:37:13 AM +0000</TD></TR>
</TBODY></TABLE><!-- END_EXCLUDE_MORE_DATA -->
</BODY>

I am trying to parse the table so that the different entries go into separate columns of a .csv file. Here is my Python 3.7 code so far:

import os
from lxml import etree
from bs4 import BeautifulSoup
import csv

output_row = []

with open(x, 'r', encoding="ascii", errors="surrogateescape") as f:
    s = f.read()
    soup = BeautifulSoup(s, 'lxml') # Parse the HTML as a string
    table = soup.find_all('table')[1] # Grab the second table (class "ad")
    for table_row in table.findAll('tr'):
        columns = table_row.findAll('td')
        for column in columns:
            output_row.append(column.text)

with open('output.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    for x in output_row:
        writer.writerows(x)

This appears to work fine and extracts the data, but I can't seem to align it into columns. What I would like is for the text value of each 'TR class' row to go into its own column in the .csv. So the .csv would have 9 columns with

'Sample Title', 'Sample Site', '', '', 'HistoryListEntry', '1. John Doe 2. Jane Doe 3. Jack Doe', 'Reply', 'John Doe', '2/24/2020 9:37:13 AM +0000'

respectively in them.

Any suggestions on the amendments I need to make to my code?

With thanks

2 Answers

Your output_row seems to represent a single row, so you just need to write this row to your csv file (using writerow and without the for loop):

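# newline='' lets the csv module control line endings (avoids blank lines on Windows)
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar="'", quoting=csv.QUOTE_ALL)
    # Write the whole list once, as a single row with one column per item
    writer.writerow(output_row)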
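
To see why the original loop did not line up into columns, here is a minimal sketch that compares the two approaches. It writes to in-memory buffers rather than output.csv, and the two-item output_row is a shortened stand-in for the real data; both are assumptions made for illustration.

```python
import csv
import io

# Shortened stand-in for the list built from the table cells
output_row = ["Sample Title", "Reply"]

# Original approach: writerows() expects a sequence of rows, so passing it
# a single string makes every character a separate one-field row.
broken = io.StringIO()
writer = csv.writer(broken)
for x in output_row:
    writer.writerows(x)

# Suggested approach: writerow() writes the whole list as one row.
fixed = io.StringIO()
csv.writer(fixed).writerow(output_row)

broken_lines = broken.getvalue().splitlines()
fixed_lines = fixed.getvalue().splitlines()

print(len(broken_lines), "lines from the original loop")
print(fixed_lines)
```

The original loop produces one line per character, while writerow produces a single line with one column per list item.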

To handle multiple data records, build a list of lists (one inner list per record) and either pass it to writerows or iterate over the outer list and call writerow for each inner list.
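
As a rough sketch of the list-of-lists idea: each inner list is one record, and writerows writes them all at once. The header names are taken from the TH labels in the question, the second record is made up for illustration, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Column names taken from a few of the TH labels in the table
header = ["Title", "Site", "Form"]

# One inner list per record; the second record is hypothetical
records = [
    ["Sample Title", "Sample Site", "HistoryListEntry"],
    ["Another Title", "Another Site", "Reply"],
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)    # a single row: one list
writer.writerows(records)  # many rows: a list of lists

lines = buffer.getvalue().splitlines()
for line in lines:
    print(line)
```

To write to disk instead, replace the buffer with a file opened via open('output.csv', 'w', newline='').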

Also beware of the line breaks inside your data fields (the Collaborators cell, for example). You should probably replace them with spaces so that each value stays on a single line in the CSV.
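
One possible way to clean the fields is to collapse every run of whitespace into a single space before writing. In the sketch below, normalize_cell is a hypothetical helper name, and raw_cells only approximates what column.text would return for the table in the question (the exact whitespace may differ depending on the parser).

```python
import csv
import io


def normalize_cell(text):
    """Collapse runs of whitespace (newlines, non-breaking spaces) into one space."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # including '\n' and '\xa0' (what &nbsp; becomes after parsing).
    return " ".join(text.split())


# Approximation of the extracted cell texts; exact whitespace is an assumption
raw_cells = [
    "Sample Title",
    " Sample Site ",
    "\n",
    "\n",
    "HistoryListEntry",
    "1.\xa0\xa0John Doe\n2.\xa0\xa0Jane Doe\n3.\xa0\xa0Jack Doe",
    "Reply",
    "John Doe",
    "2/24/2020 9:37:13 AM +0000",
]

output_row = [normalize_cell(cell) for cell in raw_cells]

buffer = io.StringIO()
csv.writer(buffer).writerow(output_row)
csv_text = buffer.getvalue()

print(output_row)
```

In the original loop, this would amount to appending normalize_cell(column.text) instead of column.text.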


Another method.

from simplified_scrapy import SimplifiedDoc, utils, req
html = '''<BODY>
<TABLE cellspacing=0 class="emlhdr"><TBODY><TR style="font-size: 1px"><TD style="border: none; padding: 0px">&nbsp;</TD></TR>
</TBODY></TABLE><!-- BEGIN_EXCLUDE_MORE_DATA -->
<TABLE cellspacing=1 class="ad"><TBODY>
<TR class="even"><TH class="adlbl10"><NOBR>Title: </NOBR></TH><TD>Sample Title</TD></TR>
<TR class="odd"><TH class="adlbl10"><NOBR>Site: </NOBR></TH><TD> Sample Site </TD></TR>
<TR class="even"><TH class="adlbl10"><NOBR>URLIcon: </NOBR></TH><TD><style type="text/css">
</style>
</TD></TR>
<TR class="odd"><TH class="adlbl10"><NOBR>URL: </NOBR></TH><TD><style type="text/css">
</style>
</TD></TR>
<TR class="even"><TH class="adlbl10"><NOBR>Form: </NOBR></TH><TD>HistoryListEntry</TD></TR>
<TR class="odd"><TH class="adlbl10"><NOBR>Collaborators: </NOBR></TH><TD>1.&nbsp;&nbsp;John Doe<br>
2.&nbsp;&nbsp;Jane Doe<br>
3.&nbsp;&nbsp;Jack Doe</TD></TR>
<TR class="even"><TH class="adlbl10"><NOBR>SourceForm: </NOBR></TH><TD>Reply</TD></TR>
<TR class="odd"><TH class="adlbl10"><NOBR>$UpdatedBy: </NOBR></TH><TD>John Doe</TD></TR>
<TR class="even"><TH class="adlbl10"><NOBR>$Revisions: </NOBR></TH><TD>2/24/2020 9:37:13 AM +0000</TD></TR>
</TBODY></TABLE><!-- END_EXCLUDE_MORE_DATA -->
</BODY>'''

doc = SimplifiedDoc(html)
# header = doc.selects('TABLE.ad>TR').select("TH>text()")
data = doc.selects('TABLE.ad>TR').select("TD>text()")
utils.save2csv('output.csv',[data])
print(data)

Result:

['Sample Title', 'Sample Site', '', '', 'HistoryListEntry', '1. John Doe 2. Jane Doe 3. Jack Doe', 'Reply', 'John Doe', '2/24/2020 9:37:13 AM +0000']

Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
