I want to use BeautifulSoup to scrape an HTML table and pull out only two columns from every row. Each "tr" row has 10 "td" cells, and I only want the "td" cells at index [1] and [8] from each row. What is the most Pythonic way to do this?
From my input below I've got one table, one body, three rows, and 10 cells per row.
Input
<table id="tblMain">
  <tbody>
    <tr>
      <td>text</td>
      <td>data1</td>
      <td>text</td>
      <td>text</td>
      <td>text</td>
      <td>text</td>
      <td>text</td>
      <td>text</td>
      <td>data2</td>
      <td>text</td>
    </tr>
    <tr>
      <td>text</td>
      <td>data1</td>
      <td>text</td>
      <td>text</td>
      <td>text</td>
      <td>text</td>
      <td>text</td>
      <td>text</td>
      <td>data2</td>
      <td>text</td>
    </tr>
    <tr>
      <td>text</td>
      <td>data1</td>
      <td>text</td>
      <td>text</td>
      <td>text</td>
      <td>text</td>
      <td>text</td>
      <td>text</td>
      <td>data2</td>
      <td>text</td>
    </tr>
  </tbody>
</table>
Things I Have Tried
I understand how to index the cells inside a loop to get the "td" at [1] and [8]. However, I can't work out how to write those two values back to the csv together as a single row.
table = soup.find('table', {'id': 'tblMain'})
table_body = table.find('tbody')
rows = table_body.findAll('tr')
data1_columns = []
data2_columns = []
# rows[1:] skips the first row; use rows instead if the table has no header row
for row in rows[1:]:
    cells = row.findAll('td')  # look up the cells once per row
    data1_columns.append(cells[1].text)
    data2_columns.append(cells[8].text)
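If the two column lists are kept, one possible way to get them onto the same csv line is to pair them with zip. This is only a sketch: the list contents below are hypothetical stand-ins for what the loop would collect, and an in-memory buffer stands in for the real file.

```python
import csv
import io

# Hypothetical values standing in for the lists built by the loop above.
data1_columns = ["a1", "b1", "c1"]
data2_columns = ["a2", "b2", "c2"]

# zip pairs the n-th item of each list: one (data1, data2) tuple per table row.
paired_rows = list(zip(data1_columns, data2_columns))

buffer = io.StringIO()  # stands in for the real csv file
csv.writer(buffer).writerows(paired_rows)
print(buffer.getvalue())
```

With a real file, the same writerows call would go inside the with open(...) block in place of the buffer.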
This is my current code which finds the table, rows, and all "td" cells and prints them correctly to a .csv. However, instead of writing all ten "td" cells per row back to the csv line, I just want to grab "td"[1] and "td"[8].
import csv
from bs4 import BeautifulSoup

# browser and reportname are defined earlier in the script
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id': 'tblMain'})
table_body = table.find('tbody')
rows = table_body.findAll('tr')
filename = '%s.csv' % reportname
with open(filename, "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        csv_row = []
        for cell in row.findAll("td"):
            csv_row.append(cell.get_text())
        # one csv line per table row, currently holding all ten cells
        writer.writerow(csv_row)
Expected Results
I want to put only "td"[1] and "td"[8] into csv_row, so that each two-item list is written to the csv with writer.writerow.
Each csv_row written to my csv file should look like this:
['data1', 'data2']
['data1', 'data2']
['data1', 'data2']
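Inside the "for row in rows:" loop, get the cells with row.findAll("td") and write only the two you need:
for row in rows:
    cells = row.findAll("td")
    writer.writerow([cells[1].get_text(), cells[8].get_text()])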
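For reference, a self-contained sketch of selecting the two cells by index might look like the following. The sample HTML, the make_row helper, the WANTED tuple, and the in-memory buffer are all assumptions made for illustration; the length check is an optional guard for header or short rows that the question does not mention.

```python
import csv
import io

from bs4 import BeautifulSoup


def make_row(n):
    # Hypothetical helper: builds one row of ten cells like the question's table.
    cells = ["text"] * 10
    cells[1], cells[8] = f"data1-{n}", f"data2-{n}"
    return "<tr>" + "".join(f"<td>{c}</td>" for c in cells) + "</tr>"


html = '<table id="tblMain"><tbody>' + make_row(1) + make_row(2) + "</tbody></table>"

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "tblMain"})

WANTED = (1, 8)  # zero-based positions of the cells to keep

buffer = io.StringIO()  # stands in for the real output file
writer = csv.writer(buffer)
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) <= max(WANTED):
        continue  # skip rows that do not have the wanted cells
    writer.writerow([cells[i].get_text(strip=True) for i in WANTED])

csv_text = buffer.getvalue()
print(csv_text)
```

Keeping the indices in one tuple means a third column could be added later by changing WANTED alone.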