I want to extract information from a web page:

http://www.website.com

There is a string that appears several times on the page, "STRING TO CAPTURE", but I only want to capture its FIRST occurrence. It will be inside the following structure:

<td width="10%" bgcolor="#FFFFFF"><font class="bodytext9">1-Jun-2013</font></td>
<td width="4%" bgcolor="#FFFFFF" align=center><font class="bodytext9">Sat</font></td>
<td width="4%" bgcolor="#FFFFFF" align="center"><font class="bodytext9">TIME</font></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a class="black_9" href="link1">Some Text here</a></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9"><img src="img/colors/pink.gif"></font></td>
<td width="5%" bgcolor="#FFFFFF" align="center"></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9">Another Text</font></td>
<td width="5%" bgcolor="#FFFFFF" align="center"></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9"><img src="img/colors/white.gif"></font></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a class="black_9" href="link2">Here is also Text</a></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a href="LINKtoWeb" class=list><u>STRING TO CAPTURE</u></a></td>
<td width="4%" bgcolor="#FFFFFF" align="center"><a target="_new" href="AnotherLink"><img src="img/img2.gif" border="0"></a></td>
</tr>

This is a fixed format: between the <tr> tags there are 12 lines, each starting with <td> and followed by other tags. I want to extract the text in each line, e.g.:

1-Jun-2013
Sat
TIME
Some Text here
...
STRING TO CAPTURE

I also want to extract the link on the line that contains "STRING TO CAPTURE", which is:

LINKtoWeb

I think Python would be well suited to this task, but I am too new to Python to get it working, and I hope the Python experts here can show me how. I have no idea where to start; I searched around and found that this could be a solution:

use YAML;
my $data = Load(http://www.website.com);
say $data->{"<tr>"}->{"<td>"}->{"STRING TO CAPTURE"};

But I don't know how to deal with all the text in these 12 lines.

  • Use a module like BeautifulSoup or Scrapy. (Commented May 30, 2013 at 6:10)
  • BeautifulSoup or lxml can do the job. (Commented May 30, 2013 at 6:14)
  • That code you have is Perl. (Commented May 30, 2013 at 6:18)
  • I need to run this process on my server when the website is loaded. Can the tools you suggested be used for that purpose, and what are the steps? (Commented May 30, 2013 at 6:21)

1 Answer

Download and install BeautifulSoup, then:

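import re
import urllib  # Python 2; on Python 3 use urllib.request.urlopen instead

import BeautifulSoup  # BeautifulSoup 3; with bs4 use "from bs4 import BeautifulSoup"

# Download the page and collect every text node in it.
html = urllib.urlopen('http://www.website.com').read()
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)

def get_stuff(element):
    # Skip text that is never rendered: styles, scripts, head content.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments.
    elif re.match('<!--.*-->', str(element)):
        return False
    return True

# Keep only the text nodes a visitor would actually see.
visible_texts = filter(get_stuff, texts)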

source - BeautifulSoup Grab Visible Webpage Text
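
The snippet above returns all visible text on the page, not the specific row the question asks about. As a rough sketch of that last step, the following uses Python 3's standard-library html.parser instead of BeautifulSoup (a swapped-in technique, chosen only so that nothing has to be installed). The names RowCollector, first_row_containing and SAMPLE_HTML are hypothetical, and the sample is an abbreviated copy of the row from the question with an opening <tr> assumed. It collects the text and first link of every <td>, then returns the first row whose cells include the target string.

```python
from html.parser import HTMLParser

# Abbreviated copy of the row from the question, plus a second row that
# repeats the target string. In practice the HTML would be the downloaded page.
SAMPLE_HTML = """
<tr>
<td width="10%" bgcolor="#FFFFFF"><font class="bodytext9">1-Jun-2013</font></td>
<td width="4%" bgcolor="#FFFFFF" align=center><font class="bodytext9">Sat</font></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a class="black_9" href="link1">Some Text here</a></td>
<td width="5%" bgcolor="#FFFFFF" align="center"></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a href="LINKtoWeb" class=list><u>STRING TO CAPTURE</u></a></td>
</tr>
<tr>
<td><a href="OtherLink"><u>STRING TO CAPTURE</u></a></td>
</tr>
"""


class RowCollector(HTMLParser):
    """Collects, for every <tr>, the text and first link of each <td>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list per row; each cell is a (text, href) pair
        self._cell = None   # text fragments of the <td> being read, else None
        self._href = None   # first href seen inside the current <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:           # tolerate a <td> with no opening <tr>
                self.rows.append([])
            self._cell, self._href = [], None
        elif tag == "a" and self._cell is not None and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.rows[-1].append(("".join(self._cell).strip(), self._href))
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:      # ignore text outside table cells
            self._cell.append(data)


def first_row_containing(html, target):
    """Return the cells of the first row that has a cell equal to target."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    for row in parser.rows:
        if any(text == target for text, _ in row):
            return row
    return None


row = first_row_containing(SAMPLE_HTML, "STRING TO CAPTURE")
for text, href in row:
    print(text, href)
```

Because rows are scanned in document order, later occurrences of the string are ignored. Empty cells come back as empty strings, so the cell positions stay aligned with the fixed 12-column layout.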

6 Comments

So for my server and website, do I need to install it somewhere?
Install this Python package on the machine where you will run the crawling script. After that, import BeautifulSoup should work without error...
from bs4 import BeautifulSoup: BeautifulSoup is provided through a package called bs4, which also provides some other functionality, including UnicodeDammit.
My server supports Python 2.7 (it is installed on all servers). Is that OK for BeautifulSoup to run? Where do I need to copy BeautifulSoup so that I can use "import BeautifulSoup" in my code? Sorry for my stupidity.
Yes, it should work. Please follow these instructions for installing: crummy.com/software/BeautifulSoup/bs4/doc/…