extract text from website source code

Question

I want to extract information from a website link:

http://www.website.com

There is a string that appears a few times: "STRING TO CAPTURE", but I want to capture only its first occurrence. It will be inside the following structure:

<td width="10%" bgcolor="#FFFFFF"><font class="bodytext9">1-Jun-2013</font></td>
<td width="4%" bgcolor="#FFFFFF" align=center><font class="bodytext9">Sat</font></td>
<td width="4%" bgcolor="#FFFFFF" align="center"><font class="bodytext9">TIME</font></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a class="black_9" href="link1">Some Text here</a></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9"><img src="img/colors/pink.gif"></font></td>
<td width="5%" bgcolor="#FFFFFF" align="center"></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9">Another Text</font></td>
<td width="5%" bgcolor="#FFFFFF" align="center"></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9"><img src="img/colors/white.gif"></font></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a class="black_9" href="link2">Here is also Text</a></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a href="LINKtoWeb" class=list><u>STRING TO CAPTURE</u></a></td>
<td width="4%" bgcolor="#FFFFFF" align="center"><a target="_new" href="AnotherLink"><img src="img/img2.gif" border="0"></a></td>
</tr>

This is a fixed format: between <tr> and </tr> there are 12 lines, each starting with <td> and containing other tags. I want to extract the text in each line, e.g.:

1-Jun-2013
Sat
TIME
Some Text here
...
STRING TO CAPTURE

I also want to extract the link on the line that contains "STRING TO CAPTURE", which is:

LINKtoWeb

In my opinion, Python could be well suited to this task, but I am too new to Python to get it working; I hope the Python experts here can show me how. I have no idea where to start. I searched around and found that this could be a solution:

use YAML;
my $data = Load(http://www.website.com);
say $data->{"<tr>"}->{"<td>"}->{"STRING TO CAPTURE"};

But I don't know how to deal with all the text in these 12 lines.

I need to run this process on my server when the website is loaded. Can the tools you suggest be used for that purpose, and what are the steps?
– user1314404, Commented May 30, 2013 at 6:21

Community · Accepted Answer · 2017-05-23 10:25:52Z

1

Download and install BeautifulSoup, then:

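# Python 3; on Python 2 use: from urllib2 import urlopen
from urllib.request import urlopen

from bs4 import BeautifulSoup, Comment

html = urlopen('http://www.website.com').read()
soup = BeautifulSoup(html, 'html.parser')
texts = soup.find_all(string=True)  # every text node in the document

def get_stuff(element):
    # Skip text that lives in containers a browser does not display
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments
    if isinstance(element, Comment):
        return False
    return True

visible_texts = list(filter(get_stuff, texts))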

source - BeautifulSoup Grab Visible Webpage Text
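
The snippet above collects all visible text on the page. For the specific task in the question (the text of every cell in the first row that contains "STRING TO CAPTURE", plus that cell's link), a minimal sketch with bs4 might look like the following. The helper name first_matching_row and the shortened sample row are illustrative assumptions, not part of the original answer; in practice the html string would be the downloaded page source.

```python
from bs4 import BeautifulSoup

# Shortened sample row based on the question; a real page would be downloaded.
html = """
<table><tr>
<td width="10%" bgcolor="#FFFFFF"><font class="bodytext9">1-Jun-2013</font></td>
<td width="4%" bgcolor="#FFFFFF" align=center><font class="bodytext9">Sat</font></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a class="black_9" href="link1">Some Text here</a></td>
<td width="5%" bgcolor="#FFFFFF" align="center"></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a href="LINKtoWeb" class=list><u>STRING TO CAPTURE</u></a></td>
</tr></table>
"""

def first_matching_row(html, needle):
    """Hypothetical helper: return (cell_texts, href) for the first table row
    that holds a link whose text contains `needle`, or (None, None)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all walks the document in order, so the first hit is the first occurrence
    for link in soup.find_all("a", href=True):
        if needle in link.get_text():
            row = link.find_parent("tr")
            if row is None:
                continue  # link is not inside a table row
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            return cells, link["href"]
    return None, None

cells, href = first_matching_row(html, "STRING TO CAPTURE")
print(cells)
print(href)
```

Empty cells come back as empty strings, so the list keeps one entry per <td> and the positions stay aligned with the columns of the original row.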

edited May 23, 2017 at 10:25

CommunityBot

answered May 30, 2013 at 6:25

Srikar Appalaraju

6 Comments

user1314404 Over a year ago

So for my server and website, do I need to install it somewhere?

Srikar Appalaraju Over a year ago

Install this Python package on the machine where you will be running the crawling Python script. import BeautifulSoup should then work without error...

Balthazar Rouberol Over a year ago

Use: from bs4 import BeautifulSoup. BeautifulSoup is provided through a package called bs4, which also provides some other functionality, among them UnicodeDammit.

user1314404 Over a year ago

My server supports Python 2.7 (it is installed on all their servers). Is that OK for BeautifulSoup to run? Where do I need to copy BeautifulSoup so that I can use "import BeautifulSoup" in my code? Sorry for my stupidity.

Srikar Appalaraju Over a year ago

Yes, it should work. Please follow these instructions for installing: crummy.com/software/BeautifulSoup/bs4/doc/…
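
If installing a third-party package on the server turns out to be impractical, the same extraction can be approximated with Python's standard library alone. This is a swapped-in technique, not what the answer proposes: a rough sketch using html.parser, assuming Python 3 (on Python 2.7 the module is named HTMLParser) and assuming every row is wrapped in explicit <tr> and <td> tags. The names RowTextParser and first_row_containing are hypothetical.

```python
from html.parser import HTMLParser

class RowTextParser(HTMLParser):
    """Collect, for every <tr>, one {"text", "href"} record per <td>."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, each a list of cell records
        self._cells = None   # cells of the row currently being parsed
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._cells.append({"text": "", "href": None})
            self._in_td = True
        elif tag == "a" and self._in_td:
            # Remember the link target found inside the current cell
            self._cells[-1]["href"] = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._cells is not None:
            self.rows.append(self._cells)
            self._cells = None
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._cells[-1]["text"] += data.strip()

def first_row_containing(html, needle):
    """Return (cell_texts, href) for the first cell containing `needle`."""
    parser = RowTextParser()
    parser.feed(html)
    parser.close()
    for row in parser.rows:
        for cell in row:
            if needle in cell["text"]:
                return [c["text"] for c in row], cell["href"]
    return None, None

SAMPLE = """
<tr>
<td width="10%"><font class="bodytext9">1-Jun-2013</font></td>
<td width="15%"><a class="black_9" href="link1">Some Text here</a></td>
<td width="15%"><a href="LINKtoWeb" class=list><u>STRING TO CAPTURE</u></a></td>
</tr>
"""

texts, href = first_row_containing(SAMPLE, "STRING TO CAPTURE")
print(texts)
print(href)
```

The standard-library parser is less forgiving of malformed markup than BeautifulSoup, so this sketch is best treated as a fallback for pages whose table structure is as regular as the one in the question.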
