1

I want to traverse each row and capture the value of td.text. The problem is that the table does not have a class, and all the td elements share the same class name. I want the following output, one line per record:

1st row) "AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C" (new line)

2nd row) "AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C" (new line)

<div style="overflow:auto; border:1px #cccccc solid;">
<table cellspacing="0" cellpadding="3" align="left" border="0" width="100%">
    <tbody>
        <tr class="tblHeading">
            <td colspan="7">AMERICANS SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td> 
        </tr>
        <tr bgcolor="#FFFFFF">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Cameron Coya                                       </td>
            <td width="19%" class="tdUnderLine">
                Rozel, Max
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-11-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         
                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=228004" target="_blank">228004</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 
                09/10/16 02:15 PM   
            </td>
            <td width="30%" class="tdUnderLine">                player persistently infringes the laws of the game   </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>
        <tr class="tblHeading">
            <td colspan="7">AVIATORS SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td> 
        </tr>
        <tr bgcolor="#FBFBFB">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Saskia Reyes                                       </td>
            <td width="19%" class="tdUnderLine">
                HollaenderNardelli, Eric
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-11-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         

                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=224463" target="_blank">224463</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 
                09/11/16 06:45 PM   
            </td>
            <td width="30%" class="tdUnderLine">                player/sub guilty of unsporting behavior     </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>
        <tr class="tblHeading">
            <td colspan="7">BERGENFIELD SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">B11CW - BERGENFIELD-B11CW-NARVAEZ</td> 
        </tr>
        <tr bgcolor="#FFFFFF">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Christian Latorre                                  </td>
            <td width="19%" class="tdUnderLine">
                Coyle, Kevin
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-10-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         

                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=226294" target="_blank">226294</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 

                09/10/16 11:00 AM   

            </td>
            <td width="30%" class="tdUnderLine">                player persistently infringes the laws of the game   </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>

I tried the following code:

import requests
from bs4 import BeautifulSoup
import re
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2

url = r"G:\Freelancer\NC Soccer\Northern Counties Soccer Association ©.html"
page = open(url, encoding="utf8")
soup = BeautifulSoup(page.read(),"html.parser")

#tableList = soup.findAll("table")

for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        print(td.text.strip())

But obviously it returns the text from every td, so I cannot identify a particular column or determine where a new record starts. I want to know:

1) how to identify each column (the class name is the same for all of them) when there are heading rows as well (I would appreciate code for that)

2) how to identify a new record in such a structure

2
  • Can you give an example of the output format you need? Commented Sep 14, 2016 at 6:07
  • Please check, it is given in the question as 1st row and 2nd row. It is just a sample; I will need hundreds of such rows. Basically I need all fields comma-separated and enclosed in double quotes. Commented Sep 14, 2016 at 6:13

4 Answers

1

If the data is really structured like a table, there's a good chance you can read it into pandas directly. pd.read_table() accepts URLs in its filepath_or_buffer argument, but it expects delimited text; for an HTML page like this one, pd.read_html() is the function that parses table elements. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html


1
count = 0
string = ""
for td in soup.find_all("td"):
    string += "\"" + td.text.strip() + "\","
    count += 1
    if count % 9 == 0:
        print(string[:-1] + "\n\n")  # string[:-1] removes the trailing ","
        string = ""

Because the table is not in the required format, we can iterate over the td elements directly rather than going through each row and then each td within it, which complicates the work. Each record in the sample spans 9 td cells (club, team, and 7 detail cells), hence the count of 9. I just built a string here; you could instead append the data to a list of lists and process it later.
Hope this solves your problem.
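
A minimal sketch of the list-of-lists variant mentioned above, assuming the cell texts have already been collected into a flat list (for example with `[td.text.strip() for td in soup.find_all("td")]`). The helper name `group_cells` and the hardcoded sample list are illustrative only, and the fixed size of 9 holds only while every club has exactly one team row and one detail row, as in the sample.

```python
def group_cells(cells, size=9):
    """Split a flat list of cell texts into records of `size` cells each."""
    return [cells[i:i + size] for i in range(0, len(cells), size)]


# Hypothetical flat list, shaped like the sample HTML:
# 1 club cell + 1 team cell + 7 detail cells per record.
cells = [
    "AMERICANS SOCCER CLUB", "B11EB - AMERICANS-B11EB-WARZALA",
    "Cameron Coya", "Rozel, Max", "09-11-2016", "228004",
    "09/10/16 02:15 PM",
    "player persistently infringes the laws of the game", "Cautioned",
    "AVIATORS SOCCER CLUB", "G12DB - AVIATORS-G12DB-REYNGOUDT",
    "Saskia Reyes", "HollaenderNardelli, Eric", "09-11-2016", "224463",
    "09/11/16 06:45 PM",
    "player/sub guilty of unsporting behavior", "Cautioned",
]

records = group_cells(cells)
for record in records:
    # Quote every field and join with commas, one record per line.
    print(",".join('"' + cell + '"' for cell in record))
```

Keeping the records as lists makes it possible to pick or reorder fields by index afterwards instead of re-parsing a string.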


0
from __future__ import print_function
import re
import datetime
from bs4 import BeautifulSoup

with open("/tmp/a.html") as page:
    soup = BeautifulSoup(page.read(), "html.parser")

# The table has no class or id, so locate it through its wrapping div
table = soup.find('div', {'style': 'overflow:auto; border:1px #cccccc solid;'}).find('table')

trs = table.find_all('tr')

game = ""
section = ""

for tr in trs:
    if tr.has_attr('class'):
        # heading row (class="tblHeading"): remember it for the rows that follow
        game = tr.text.strip('\n')
    if tr.has_attr('bgcolor'):
        if tr['bgcolor'] == '#CCE4F1':
            # sub-heading row: remember it as well
            section = tr.text.strip('\n')
        else:
            # detail row: one output record
            tds = tr.find_all('td')
            # drop non-ASCII characters (e.g. the non-breaking spaces), then trim
            extracted_text = [re.sub(r'([^\x00-\x7F])+','', td.text) for td in tds]
            extracted_text = [x.strip() for x in extracted_text]
            # keep only fields longer than 2 characters
            extracted_text = list(filter(lambda x: len(x) > 2, extracted_text))
            extracted_text.pop(1)
            extracted_text[2] = "Player " + extracted_text[2]
            # convert e.g. "09/10/16 02:15 PM" to "2016-09-10"
            extracted_text[3] = datetime.datetime.strptime(extracted_text[3], '%m/%d/%y %I:%M %p').strftime("%Y-%m-%d")
            extracted_text = ['"' + x + '"' for x in [game, section] + extracted_text]
            print(','.join(extracted_text))

And when run:

$ python a.py

"AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C"
"AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C"
"BERGENFIELD SOCCER CLUB","B11CW - BERGENFIELD-B11CW-NARVAEZ","Christian Latorre","Player 226294","2016-09-10","player persistently infringes the laws of the game","C"

Based on further conversation with the OP, the input was https://paste.fedoraproject.org/428111/87928814/raw/ and the output after running the above code is: https://paste.fedoraproject.org/428110/38792211/raw/
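
A sketch along the same lines as the code above, under a few stated assumptions: it runs against a trimmed inline copy of the sample table from the question rather than the real page, it picks columns by position instead of filtering by length, and it lets the standard `csv` module do the quoting and line endings. The function names `parse_rows` and `to_csv`, the variable names for the unused columns, and the choice to abbreviate the last field to its first letter (to match the requested `"C"`) are illustrative guesses, not something the question confirms.

```python
import csv
import io
from datetime import datetime

from bs4 import BeautifulSoup

# Trimmed copy of the sample table from the question (two records).
HTML = """
<table>
  <tr class="tblHeading"><td colspan="7">AMERICANS SOCCER CLUB</td></tr>
  <tr bgcolor="#CCE4F1"><td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td></tr>
  <tr bgcolor="#FFFFFF">
    <td class="tdUnderLine"> &nbsp;&nbsp; Cameron Coya </td>
    <td class="tdUnderLine"> Rozel, Max </td>
    <td class="tdUnderLine"> 09-11-2016 </td>
    <td class="tdUnderLine"> <a href="#">228004</a> </td>
    <td class="tdUnderLine"> 09/10/16 02:15 PM </td>
    <td class="tdUnderLine"> player persistently infringes the laws of the game </td>
    <td class="tdUnderLine"> Cautioned </td>
  </tr>
  <tr class="tblHeading"><td colspan="7">AVIATORS SOCCER CLUB</td></tr>
  <tr bgcolor="#CCE4F1"><td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td></tr>
  <tr bgcolor="#FBFBFB">
    <td class="tdUnderLine"> &nbsp;&nbsp; Saskia Reyes </td>
    <td class="tdUnderLine"> HollaenderNardelli, Eric </td>
    <td class="tdUnderLine"> 09-11-2016 </td>
    <td class="tdUnderLine"> <a href="#">224463</a> </td>
    <td class="tdUnderLine"> 09/11/16 06:45 PM </td>
    <td class="tdUnderLine"> player/sub guilty of unsporting behavior </td>
    <td class="tdUnderLine"> Cautioned </td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return one list of fields per detail row, carrying club and team forward."""
    soup = BeautifulSoup(html, "html.parser")
    club = team = ""
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if "tblHeading" in tr.get("class", []):
            club = cells[0]  # club heading row starts a new group
        elif len(cells) == 1:
            team = cells[0]  # team row: a single colspan cell
        elif len(cells) == 7:
            # Every td shares one class, so columns are identified by position.
            player, _second, _third, game_id, played, reason, status = cells
            day = datetime.strptime(played, "%m/%d/%y %I:%M %p").strftime("%Y-%m-%d")
            # Assumption: the status is abbreviated to its first letter.
            rows.append([club, team, player, "Player " + game_id, day, reason, status[:1]])
    return rows


def to_csv(rows):
    """Quote every field and end each record with a newline."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


output = to_csv(parse_rows(HTML))
print(output, end="")
```

Because the club and team are carried forward rather than counted, the sketch does not depend on each club having exactly one team and one detail row. To write to a file instead of a string, the same `csv.writer` can be given a file opened with `open(path, "w", newline="")`, which puts each record on its own line.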

7 Comments

for trs in chunks(table.find_all('tr'), 3): how did you determine the 3 here? Is it based on the number of records? The number of records here is dynamic. Is there any way to find the number of such rows in the page?
@BhaveshGhodasara According to the sample given by the OP, the records have a specific format, which keeps repeating.
It is not working exactly as I wanted. How do I save the output to a file? I tried saveFile.write(','.join(extracted_text)), but it writes all values in just one row, with no line breaks. :(
@BhaveshGhodasara Is it different from the output that I showed you? Share the xml file
Do you know how to share or attach a file here?
0

There seems to be a pattern. In the sample, each record spans 3 tr(s): the club heading row, the team row, and the detail row with its 7 td(s). So what you can do is keep a counter starting from 0; when it reaches 3, print a new line and reset it to 0.

counter = 0
for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        print(td.text.strip())  # place per-cell code here
    counter = counter + 1
    if counter == 3:
        print("\n")
        counter = 0
