I am using the following code to scrape data from a website.

from bs4 import BeautifulSoup
import urllib2
import re
for i in xrange(1,461,10):
  try:
    page = urllib2.urlopen("http://cms.onlinedemos.in/directory.php?click=n&startline={}#lst".format(i))
  except urllib2.HTTPError:
    continue
  else:
    pass
  finally:
    soup = BeautifulSoup(page)
    td1=soup.findAll('td', {'class':'comtext'})
    td2 = soup.findAll('td',{'class':'comuser'})
    td3 = soup.findAll('td',{'class':'com'})
    for td1s, td2s, td3s in zip(td1,td2,td3):
      data = [re.sub('\s+', '', text).strip().encode('utf8') for text in td1s.find_all(text=True) + td2s.find_all(text=True) + td3s.find_all(text=True)  if text.strip()]
      print ','.join(data)

My output is:

A.T.E.EnterprisesPvt.Ltd.,,AnujBhagwati
A.T.E.Pvt.Ltd.,,AtulBhagwati
AalidhraTextileEngineersLtd.,,HansrajGondalia,Mumbai
AarBeeAssociates,Mr.Gopalsamy,022-22872245
ABCarterIndiaPvt.Ltd.,,B.B.Shetty,[email protected]
ABCCorporation,MittalPatel,Mumbai
ABCIndustrialFasteners,S.R.Sheth,022-22872245

But it is supposed to look like this:

    A.T.E. Enterprises Pvt. Ltd.,   Anuj Bhagwati   Mumbai  022-22872245    [email protected]    

    A.T.E. Pvt. Ltd.,   Atul Bhagwati   Mumbai  022-22872245    [email protected]    

    Aalidhra Textile Engineers Ltd.,    Hansraj Gondalia    Surat   0261-2279520/30/40  [email protected]    

    Aar Bee Associates  Mr. Gopalsamy   Coimbatore  0422-2236250 / 2238560  [email protected]  

So you can see that the first row's values (Mumbai, 022-22872245, [email protected]) end up spread across the third, fourth and fifth rows, and the same shift continues for every row. I do not know where I went wrong.

  • Do you need to get tab-separated columns? Commented Nov 20, 2013 at 18:16

2 Answers

Taking a look at the HTML of this page, there are 3 columns of class com for every row. Zipping a list of 10 items with another list of 10 items and a third list of 30 items pairs each row with only one com cell, and zip stops once the shorter lists run out, which produces the type of output you're getting.

>>> len(td3)
30
>>> td3[0:3]
[<td class="com" width="100"></td>, <td class="com" width="160"></td>, <td class="com" width="185"></td>]
>>> td3[3:6]
[<td class="com" width="100">Mumbai</td>, <td class="com" width="160">022-22872245</td>, <td class="com" width="185">[email protected]</td>]
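
To see the effect in isolation, here is a minimal sketch that uses plain lists instead of the scraped cells. The names `companies`, `contacts` and `com_cells` and all their values are hypothetical stand-ins, and the sketch assumes exactly three com cells per row. It shows how `zip` consumes one com cell per row, and how grouping the cells in threes lines them up again.

```python
# Hypothetical stand-ins for the three result lists on one page:
# 10 company cells, 10 contact cells, and 30 'com' cells (3 per row).
companies = ["company%d" % n for n in range(10)]
contacts = ["contact%d" % n for n in range(10)]
com_cells = []
for n in range(10):
    com_cells.extend(["city%d" % n, "phone%d" % n, "email%d" % n])

# zip() stops at the shortest input, so only the first 10 'com' cells
# are used, and they are consumed one per row instead of three per row.
misaligned = list(zip(companies, contacts, com_cells))

# Grouping the 'com' cells into chunks of three gives one chunk per row.
chunks = [com_cells[k:k + 3] for k in range(0, len(com_cells), 3)]
aligned = [(c, p) + tuple(chunk)
           for c, p, chunk in zip(companies, contacts, chunks)]

print(misaligned[1])  # second row receives the first row's phone
print(aligned[1])     # second row receives its own city, phone and email
```

If the real page contains extra com cells (for example empty ones in a header row), the chunk boundaries would need to be offset accordingly.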

@VooDooNOFX is right. To modify your code accordingly, try something like this:

from bs4 import BeautifulSoup
import urllib2
import re
for i in xrange(1,461,10):
  try:
    page = urllib2.urlopen("http://cms.onlinedemos.in/directory.php?click=n&startline={}#lst".format(i))
  except urllib2.HTTPError:
    # skip pages that fail to load; parsing in a finally block would still run here with a missing or stale page
    continue
  soup = BeautifulSoup(page)
  td1 = soup.findAll('td', {'class':'comtext'})
  td2 = soup.findAll('td', {'class':'comuser'})
  td345 = soup.findAll('td', {'class':'com'})
  # for td3, td4, and td5, use slicing: s[i:j:k] is the slice of s from i to j with step k
  td3 = td345[0::3]
  td4 = td345[1::3]
  td5 = td345[2::3]
  for td1s, td2s, td3s, td4s, td5s in zip(td1, td2, td3, td4, td5):
    texts = (td1s.find_all(text=True) + td2s.find_all(text=True) + td3s.find_all(text=True)
             + td4s.find_all(text=True) + td5s.find_all(text=True))
    # collapse whitespace runs to a single space and drop commas so they cannot clash with the separator
    data = [re.sub(r'\s+', ' ', text).strip().encode('utf8').replace(",", "") for text in texts if text.strip()]
    print ', '.join(data)

Output for the first page:

A.T.E. Enterprises Pvt. Ltd., Anuj Bhagwati, Mumbai, 022-22872245, [email protected]
A.T.E. Pvt. Ltd., Atul Bhagwati, Mumbai, 022-22872245, [email protected]
Aalidhra Textile Engineers Ltd., Hansraj Gondalia, Surat, 0261-2279520/30/40, [email protected]
Aar Bee Associates, Mr. Gopalsamy, Coimbatore, 0422-2236250 / 2238560, [email protected]
AB Carter India Pvt. Ltd., B.B. Shetty, Mumbai, 022-66662961 / 62, [email protected]
ABC Corporation, Mittal Patel, Ahmedabad, 079-40068999 / 26582333, [email protected]
ABC Industrial Fasteners, S.R. Sheth, Mumbai, 022-28470806 / 66923987, [email protected]
Abhishek Enterprises, N.C. Jain, Bhilwara, 01482-264250, [email protected]
Accurate Trans Heat Pvt. Ltd., Kedarmal Dargar, Surat, 0261-2397268, [email protected]
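
The code above strips commas from each field so that they do not clash with the ', ' separator. If the goal is a CSV file, one alternative is to let the standard `csv` module quote such fields instead. The following is only a sketch under assumptions: it is written for Python 3 (the code above is Python 2), and the `rows` values are made-up placeholders, not data from the site.

```python
import csv
import io

# Hypothetical rows standing in for the scraped cell text of two table rows.
rows = [
    ["Example Co. Ltd.,", "A. Person", "Mumbai", "000-0000000"],
    ["Other Co.", "B. Person", "Surat", "000-1111111 / 22"],
]

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
for row in rows:
    # Fields containing a comma are quoted automatically by the writer.
    writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields, commas included.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

This keeps the trailing comma in a company name intact while still producing one well-formed record per row.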
