PHP Extract Text from Webpage

Question

Is it possible to use PHP to connect to a URL such as http://en.wikipedia.org/wiki/Wiki and extract every word that starts with a given prefix, such as "Exa" or "ins", so that the resulting PHP page prints all the words it found? For example, with "Exa", the word "Example" would be printed each time it occurs on the page. The same goes for words that start with "ins".

Your question is very broad, and almost impossible to answer in a post. Consider breaking this task down into chunks and working on each one separately, and asking for help as necessary.
– eykanal, Commented May 9, 2011 at 18:12
Also, just an FYI: you'll want to check if accessing a website via PHP is against their terms/conditions.
– sdleihssirhc, Commented May 9, 2011 at 18:14

cutsoy · Accepted Answer · 2011-05-09 18:20:32Z

2

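// Fetch the page and drop the HTML tags so class names and attributes are not matched.
$data = strip_tags(file_get_contents($url));
$matches = array();
// \b anchors each prefix to the start of a word; \w* takes the rest of that word.
// preg_match_all collects every occurrence (preg_match stops at the first one).
preg_match_all('/\b(?:Exa|ins)\w*/', $data, $matches);
// $matches[0] holds the full matched words.
foreach ($matches[0] as $word) {
    echo "Match: '" . $word . "'\r\n";
}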

Probably something like this, though I'm not so sure about the regex, I haven't tested it yet...

Edit: I changed it, it should work now... (\B => \b and strip_tags to prevent HTML-classes from being matched).

edited May 9, 2011 at 18:20

answered May 9, 2011 at 18:13

cutsoy


invertedSpear · Accepted Answer · 2011-05-09 18:11:16Z

1

I don't have a full answer with example to give you, but yes, you should be able to read the whole page into a string variable and then do normal string operations on it. It will read in all the HTML, so you will probably need to do a lot of regex to eliminate tags if you don't want them.
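As a rough sketch of the tag clean-up step: strip_tags removes the tags themselves but keeps the text inside script and style elements, so it may help to drop those blocks first. The function name html_to_text is made up for this example, and the regex is a simplification that will not handle every malformed page.

```php
<?php
// Hypothetical helper (not from the answers above): reduce an HTML string to plain text.
function html_to_text(string $html): string
{
    // Drop <script> and <style> blocks entirely; strip_tags alone would keep their contents.
    $clean = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $html);
    if ($clean === null) {
        $clean = $html; // regex failure: fall back to the raw input
    }
    // Pad every tag with a space so adjacent elements do not fuse into one word.
    $clean = str_replace('<', ' <', $clean);
    // Remove the tags, then turn entities such as &amp; back into characters.
    $text = html_entity_decode(strip_tags($clean), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo html_to_text('<style>p{color:red}</style><p>Example</p><p>inside &amp; out</p>'), "\n";
```

The resulting string can then be searched with a regex or with the ordinary string functions.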

answered May 9, 2011 at 18:11

invertedSpear


Oswald · Accepted Answer · 2011-05-09 18:09:17Z

0

Read the page into a string using file_get_contents. Use one of the various string functions to examine the page.
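One possible way to do the "string functions" part without a regex, as a minimal sketch: split the text into words with str_word_count and compare the start of each word with strncmp. The function name words_with_prefix and the sample HTML are assumptions made for illustration. In real use the string would come from file_get_contents($url).

```php
<?php
// Hypothetical helper: return every word in $text that starts with one of $prefixes.
function words_with_prefix(string $text, array $prefixes): array
{
    $found = [];
    // Format 1 makes str_word_count return the words as an array.
    foreach (str_word_count($text, 1) as $word) {
        foreach ($prefixes as $prefix) {
            // strncmp is case-sensitive, so "Exa" matches "Example" but not "example".
            if (strncmp($word, $prefix, strlen($prefix)) === 0) {
                $found[] = $word;
                break; // one hit per word is enough
            }
        }
    }
    return $found;
}

// Sample input standing in for a downloaded page.
$text = strip_tags('<p>An Example of insects; another example, instantly.</p>');
foreach (words_with_prefix($text, ['Exa', 'ins']) as $word) {
    echo "Match: '" . $word . "'\n";
}
```

Note that str_word_count follows the current locale's idea of a letter, so pages with non-ASCII text may need a different splitting step.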

answered May 9, 2011 at 18:09

Oswald


2 Comments

Oswald Over a year ago

Yep. But just now I realized that people do not always ask the question that they want to have answered, so I changed the answer to match the question that I presume Tereive wanted to ask.

cutsoy Over a year ago

@Viswanathan Yeah, I don't think so. He probably does want some help about how to do that, not just "no" or "yes"...

John Parker · Accepted Answer · 2011-05-09 18:17:06Z

0

Yes, this is possible. A potential approach would be to:

1. Use something like fopen (if allow_url_fopen is enabled; failing that, use cURL) to grab the external web page content.
2. Remove the (presumably not required) HTML tags via strip_tags.
3. Use strtok to tokenise and iterate over the remaining content, checking for whatever conditions you require.
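The steps above could look roughly like the sketch below. The names fetch_page and tokens_with_prefix, the delimiter list, and the choice of file_get_contents over fopen are all assumptions for the example, and error handling is kept to a minimum.

```php
<?php
// Hypothetical helper: fetch a page, using the URL wrappers if enabled, else cURL.
function fetch_page(string $url): string
{
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        $body = @file_get_contents($url);
        if ($body !== false) {
            return $body;
        }
    }
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        $body = curl_exec($ch);
        if ($body !== false) {
            return $body;
        }
    }
    return ''; // nothing could be fetched
}

// Hypothetical helper: strip the tags, then walk the text token by token with strtok.
function tokens_with_prefix(string $html, array $prefixes): array
{
    $text = strip_tags($html);
    $delims = " \t\r\n.,;:!?()[]{}\"'"; // assumed set of word separators
    $found = [];
    for ($tok = strtok($text, $delims); $tok !== false; $tok = strtok($delims)) {
        foreach ($prefixes as $prefix) {
            if (strncmp($tok, $prefix, strlen($prefix)) === 0) {
                $found[] = $tok;
                break;
            }
        }
    }
    return $found;
}

// Usage with an inline sample; swap in fetch_page($url) for a real page.
print_r(tokens_with_prefix('<p>An <b>Example</b> of insects, instantly.</p>', ['Exa', 'ins']));
```

strtok keeps its position in internal state, so the loop should run to completion before strtok is used on another string.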

answered May 9, 2011 at 18:17

John Parker


2 Comments

dooby Over a year ago

@middaparka: I know this is possible; it is what I'm doing already. But if you load a web page that uses an iframe and JavaScript to generate its content, those strings do not appear when fopen() is called. Is there a way to get the strings generated by the JavaScript function? In other words, I want to get the text programmatically rather than by copying and pasting it.

John Parker Over a year ago

@dooby What you're talking about isn't possible - the browser executes the JavaScript, etc. so you'd need to emulate (or indeed use) a browser. Incidentally, you should create your own question rather than adding a comment against an existing question's answer, especially as it sounds like you're attempting to solve a subtly different problem.
