Parse a table using BeautifulSoup in Python

Question

I want to traverse each row and capture the value of td.text. The problem is that the table has no class, and all the td elements share the same class name. For each record I want the following output:

1st row) "AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C" (new line)

2nd row) "AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C" (new line)

<div style="overflow:auto; border:1px #cccccc solid;">
<table cellspacing="0" cellpadding="3" align="left" border="0" width="100%">
    <tbody>
        <tr class="tblHeading">
            <td colspan="7">AMERICANS SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td> 
        </tr>
        <tr bgcolor="#FFFFFF">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Cameron Coya                                       </td>
            <td width="19%" class="tdUnderLine">
                Rozel, Max
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-11-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         
                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=228004" target="_blank">228004</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 
                09/10/16 02:15 PM   
            </td>
            <td width="30%" class="tdUnderLine">                player persistently infringes the laws of the game   </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>
        <tr class="tblHeading">
            <td colspan="7">AVIATORS SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td> 
        </tr>
        <tr bgcolor="#FBFBFB">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Saskia Reyes                                       </td>
            <td width="19%" class="tdUnderLine">
                HollaenderNardelli, Eric
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-11-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         

                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=224463" target="_blank">224463</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 
                09/11/16 06:45 PM   
            </td>
            <td width="30%" class="tdUnderLine">                player/sub guilty of unsporting behavior     </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>
        <tr class="tblHeading">
            <td colspan="7">BERGENFIELD SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">B11CW - BERGENFIELD-B11CW-NARVAEZ</td> 
        </tr>
        <tr bgcolor="#FFFFFF">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Christian Latorre                                  </td>
            <td width="19%" class="tdUnderLine">
                Coyle, Kevin
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-10-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         

                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=226294" target="_blank">226294</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 

                09/10/16 11:00 AM   

            </td>
            <td width="30%" class="tdUnderLine">                player persistently infringes the laws of the game   </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>

I tried the following code:

import requests
from bs4 import BeautifulSoup
import re
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2

url = r"G:\Freelancer\NC Soccer\Northern Counties Soccer Association ©.html"
page = open(url, encoding="utf8")
soup = BeautifulSoup(page.read(),"html.parser")

#tableList = soup.findAll("table")

for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        print(td.text.strip())

Obviously this returns the text from every td, so I cannot tell which column a value belongs to or where a new record starts. I want to know:

1) How to identify each column (the class name is the same for all of them), given that there are heading rows as well. I would appreciate code for that.

2) How to identify the start of a new record in such a structure.

Can you give an example of the output format you need?
– Sandeep, Commented Sep 14, 2016 at 6:07
Please check the question: it is given there as the 1st row and 2nd row. That is just a sample; I will need hundreds of such rows. Basically I need all fields comma-separated and enclosed in double quotes.
– Bhavesh Ghodasara, Commented Sep 14, 2016 at 6:13

Sohier Dane · Accepted Answer · 2016-09-14 05:02:18Z

1

If the data is really structured like a table, there's a good chance you can read it into pandas directly. For an HTML table the relevant reader is pd.read_html(), which returns a list of DataFrames; pd.read_table() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html) is meant for delimited text files. Both accept URLs as input.

answered Sep 14, 2016 at 5:02

Sohier Dane


Sandeep · Accepted Answer · 2016-09-14 06:47:34Z

1

count = 0
string = ""
for td in soup.find_all("td"):
    string += '"' + td.text.strip() + '",'
    count += 1
    # each record spans 9 td cells: club heading, team heading, 7 data cells
    if count % 9 == 0:
        print(string[:-1] + "\n\n")  # string[:-1] removes the trailing ","
        string = ""

Because the table is not in the required format, we can just iterate over the td elements directly rather than going into each row and then into each td within it, which complicates the work. I used a string here; you could instead append the data to a list of lists and process it later.
Hope this solves your problem.
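
The list-of-lists variant could look roughly like the sketch below. It assumes every record contributes exactly nine td cells (two heading cells plus seven data cells), as in the sample from the question. SAMPLE_HTML and chunk_cells are illustrative names, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question
# (assumption: exactly 9 <td> cells per record).
SAMPLE_HTML = """
<table>
  <tr class="tblHeading"><td colspan="7">AMERICANS SOCCER CLUB</td></tr>
  <tr bgcolor="#CCE4F1"><td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td></tr>
  <tr bgcolor="#FFFFFF">
    <td class="tdUnderLine"> &nbsp;&nbsp; Cameron Coya </td>
    <td class="tdUnderLine"> Rozel, Max </td>
    <td class="tdUnderLine"> 09-11-2016 </td>
    <td class="tdUnderLine"><a href="#">228004</a></td>
    <td class="tdUnderLine"> 09/10/16 02:15 PM </td>
    <td class="tdUnderLine"> player persistently infringes the laws of the game </td>
    <td class="tdUnderLine"> Cautioned </td>
  </tr>
</table>
"""


def chunk_cells(html, cells_per_record=9):
    """Return a list of records, each a list of stripped td texts."""
    soup = BeautifulSoup(html, "html.parser")
    texts = [td.get_text().strip() for td in soup.find_all("td")]
    # Slice the flat list of cell texts into fixed-size groups.
    return [texts[i:i + cells_per_record]
            for i in range(0, len(texts), cells_per_record)]


records = chunk_cells(SAMPLE_HTML)
for record in records:
    print(",".join('"{}"'.format(field) for field in record))
```

A fixed chunk size only holds while every record has the same number of cells; if the page contains extra cells, the groups will drift out of alignment.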

answered Sep 14, 2016 at 6:47

Sandeep


Nehal J Wani · Accepted Answer · 2016-09-14 18:50:33Z

0

from __future__ import print_function
import re
import datetime
from bs4 import BeautifulSoup

with open("/tmp/a.html") as page:
    soup = BeautifulSoup(page.read(), "html.parser")

# The table itself has no class, so locate it through its wrapping div.
table = soup.find('div', {'style': 'overflow:auto; border:1px #cccccc solid;'}).find('table')

trs = table.find_all('tr')

game = ""
section = ""

for tr in trs:
    if tr.has_attr('class'):
        # Heading row (class="tblHeading"): remember the club name.
        game = tr.text.strip()
    if tr.has_attr('bgcolor'):
        if tr['bgcolor'] == '#CCE4F1':
            # Team row: remember the team name.
            section = tr.text.strip()
        else:
            # Data row: clean up every cell.
            tds = tr.find_all('td')
            # Remove non-ASCII characters such as the &nbsp; padding.
            extracted_text = [re.sub(r'([^\x00-\x7F])+','', td.text) for td in tds]
            extracted_text = [x.strip() for x in extracted_text]
            # Drop cells with two characters or fewer.
            extracted_text = list(filter(lambda x: len(x) > 2, extracted_text))
            # Drop the second remaining cell (the referee name in the sample).
            extracted_text.pop(1)
            extracted_text[2] = "Player " + extracted_text[2]
            # Reformat the timestamp as YYYY-MM-DD.
            extracted_text[3] = datetime.datetime.strptime(extracted_text[3], '%m/%d/%y %I:%M %p').strftime("%Y-%m-%d")
            extracted_text = ['"' + x + '"' for x in [game, section] + extracted_text]
            print(','.join(extracted_text))

And when run:

$ python a.py

"AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C"
"AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C"
"BERGENFIELD SOCCER CLUB","B11CW - BERGENFIELD-B11CW-NARVAEZ","Christian Latorre","Player 226294","2016-09-10","player persistently infringes the laws of the game","C"

Based on further conversation with the OP, the input was https://paste.fedoraproject.org/428111/87928814/raw/ and the output after running the above code is: https://paste.fedoraproject.org/428110/38792211/raw/
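
If the records need to go to a file rather than to the terminal, Python's standard csv module can add both the double quotes and the line breaks. The following is a minimal sketch, assuming the rows have already been collected as lists of strings. The rows list and the to_quoted_csv helper are illustrative and not part of the code above.

```python
import csv
import io

# Two records in the shape shown in the output above (illustrative data).
rows = [
    ["AMERICANS SOCCER CLUB", "B11EB - AMERICANS-B11EB-WARZALA", "Cameron Coya",
     "Player 228004", "2016-09-10",
     "player persistently infringes the laws of the game", "C"],
    ["AVIATORS SOCCER CLUB", "G12DB - AVIATORS-G12DB-REYNGOUDT", "Saskia Reyes",
     "Player 224463", "2016-09-11",
     "player/sub guilty of unsporting behavior", "C"],
]


def to_quoted_csv(rows, stream):
    """Write one line per record, with every field in double quotes."""
    writer = csv.writer(stream, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)


# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
to_quoted_csv(rows, buffer)
print(buffer.getvalue(), end="")
```

To write to disk, the same helper can be handed a file opened with open(path, "w", newline="", encoding="utf8"), where path is whatever output file name you choose.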

edited Sep 14, 2016 at 18:50

answered Sep 14, 2016 at 6:37

Nehal J Wani


7 Comments

Bhavesh Ghodasara Over a year ago

for trs in chunks(table.find_all('tr'), 3): how did you determine the 3 here? Is it based on the number of records? The number of records is dynamic here. Is there any way to find the number of such rows in the page?

Nehal J Wani Over a year ago

@BhaveshGhodasara According to the sample given by the OP, the records have a specific format, which keeps repeating.

Bhavesh Ghodasara Over a year ago

It is not working exactly as I wanted. How do I save the output to a file? I tried saveFile.write(','.join(extracted_text)), but it puts all the values on a single row, with no line breaks. :(

Nehal J Wani Over a year ago

@BhaveshGhodasara Is it different from the output that I showed you? Share the xml file

Bhavesh Ghodasara Over a year ago

Do you know how to share or attach a file here?


Sachin · Accepted Answer · 2016-09-14 06:26:40Z

0

There seems to be a pattern. In the sample, each record spans 3 tr elements: a club heading, a team heading, and a data row with 7 td cells. So you can keep a counter starting from 0; when it reaches 3, print a new line and reset it to 0.

counter = 0
for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        pass  # place the code that handles each cell here
    counter = counter + 1
    if counter == 3:
        # three rows seen: the record is complete
        print("\n")
        counter = 0
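
A fixed counter would miscount if a team ever lists more than one player row. A variant that keys on the row attributes instead could look like the sketch below. It assumes club rows carry class="tblHeading", team rows carry bgcolor="#CCE4F1", and data rows have exactly seven cells, as in the sample from the question. The names parse_records, clean and SAMPLE_HTML are illustrative, and reducing the status to its first letter is an assumption made to match the requested output.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Trimmed-down copy of two records from the question's sample.
SAMPLE_HTML = """
<table>
  <tr class="tblHeading"><td colspan="7">AMERICANS SOCCER CLUB</td></tr>
  <tr bgcolor="#CCE4F1"><td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td></tr>
  <tr bgcolor="#FFFFFF">
    <td class="tdUnderLine"> &nbsp;&nbsp; Cameron Coya </td>
    <td class="tdUnderLine"> Rozel, Max </td>
    <td class="tdUnderLine"> 09-11-2016 </td>
    <td class="tdUnderLine"><a href="#">228004</a></td>
    <td class="tdUnderLine"> 09/10/16 02:15 PM </td>
    <td class="tdUnderLine"> player persistently infringes the laws of the game </td>
    <td class="tdUnderLine"> Cautioned </td>
  </tr>
  <tr class="tblHeading"><td colspan="7">AVIATORS SOCCER CLUB</td></tr>
  <tr bgcolor="#CCE4F1"><td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td></tr>
  <tr bgcolor="#FBFBFB">
    <td class="tdUnderLine"> &nbsp;&nbsp; Saskia Reyes </td>
    <td class="tdUnderLine"> HollaenderNardelli, Eric </td>
    <td class="tdUnderLine"> 09-11-2016 </td>
    <td class="tdUnderLine"><a href="#">224463</a></td>
    <td class="tdUnderLine"> 09/11/16 06:45 PM </td>
    <td class="tdUnderLine"> player/sub guilty of unsporting behavior </td>
    <td class="tdUnderLine"> Cautioned </td>
  </tr>
</table>
"""


def clean(text):
    """Collapse runs of whitespace (including &nbsp;) into single spaces."""
    return " ".join(text.split())


def parse_records(html):
    """Return one list of seven fields per data row."""
    soup = BeautifulSoup(html, "html.parser")
    club = team = ""
    records = []
    for tr in soup.find_all("tr"):
        cells = [clean(td.get_text()) for td in tr.find_all("td")]
        if not cells:
            continue
        if "tblHeading" in (tr.get("class") or []):
            club = cells[0]   # club heading row
        elif tr.get("bgcolor") == "#CCE4F1":
            team = cells[0]   # team heading row
        elif len(cells) == 7:
            # Columns are identified by position, since all share one class.
            player, _referee, _reported, game_id, played, reason, status = cells
            game_date = datetime.strptime(played, "%m/%d/%y %I:%M %p")
            records.append([club, team, player, "Player " + game_id,
                            game_date.strftime("%Y-%m-%d"), reason, status[:1]])
    return records


for record in parse_records(SAMPLE_HTML):
    print(",".join('"{}"'.format(field) for field in record))
```

Because the club and team are remembered between rows, several player rows under the same team heading would each produce their own output line.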

answered Sep 14, 2016 at 6:26

Sachin
