0

Is it possible to use PHP to connect to a URL such as http://en.wikipedia.org/wiki/Wiki and extract every word that starts with a given prefix, such as "Exa" or "ins", so that the resulting PHP page prints all the words it found? For example, with "Exa", the word "Example" would be printed each time an instance of "Example" is found. The same goes for words that start with "ins".

2
  • Your question is very broad, and almost impossible to answer in a post. Consider breaking this task down into chunks and working on each one separately, and asking for help as necessary. Commented May 9, 2011 at 18:12
  • 1
    Also, just an FYI: you'll want to check if accessing a website via PHP is against their terms/conditions. Commented May 9, 2011 at 18:14

4 Answers

2
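$url = 'http://en.wikipedia.org/wiki/Wiki'; // URL from the question
$data = strip_tags(file_get_contents($url));
$matches = array();
// \b anchors at a word start; (?:Exa|ins) groups the prefixes; \w* takes the rest of the word.
// preg_match_all collects every occurrence (preg_match stops at the first one).
preg_match_all('/\b(?:Exa|ins)\w*/', $data, $matches);
foreach ($matches[0] as $word) {
    echo "Match: '" . $word . "'\r\n";
}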

Probably something like this, though I'm not so sure about the regex, I haven't tested it yet...

Edit: I changed it, it should work now... (\B => \b and strip_tags to prevent HTML-classes from being matched).


1

I don't have a full answer with an example, but yes: you can read the whole page into a string variable and then use normal string operations on it. The string will contain all of the HTML, so if you don't want the tags you will probably need a fair amount of regex work to strip them out.

0

Read the page into a string using file_get_contents. Use one of the various string functions to examine the page.
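A minimal sketch of that idea, assuming the page has already been read into a string. The helper name `wordsWithPrefix` and the sample HTML are hypothetical, and the literal string stands in for the result of `file_get_contents($url)` so the example runs without network access. It relies on `str_word_count` and `strncmp`, so the prefix match is case-sensitive.

```php
<?php
// Hypothetical helper: returns every word in $text that starts with one of $prefixes.
function wordsWithPrefix(string $text, array $prefixes): array
{
    $found = [];
    // With format 1, str_word_count returns the words of the string as an array.
    foreach (str_word_count($text, 1) as $word) {
        foreach ($prefixes as $prefix) {
            // strncmp compares only the first strlen($prefix) bytes, case-sensitively.
            if (strncmp($word, $prefix, strlen($prefix)) === 0) {
                $found[] = $word;
                break;
            }
        }
    }
    return $found;
}

// Stand-in for a fetched page; a real script would use file_get_contents($url) here.
$html = '<p>For Example, this instance is insightful. Another Example.</p>';
$words = wordsWithPrefix(strip_tags($html), ['Exa', 'ins']);
echo implode("\n", $words), "\n";
```

Repeated words are kept on purpose, since the question asks for a word to be printed each time it occurs.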

2 Comments

Yep. But just now I realized that people do not always ask the question that they want to have answered, so I changed the answer to match the question that I presume Tereive wanted to ask.
@Viswanathan Yeah, I don't think so. He probably does want some help about how to do that, not just "no" or "yes"...
0

Yes, this is possible. A potential approach would be to:

  1. Use something like fopen (if allow_url_fopen is enabled; failing that, use cURL) to grab the external web page content.

  2. Remove the (presumably not required) HTML tags via strip_tags.

  3. Use strtok to tokenise and iterate over the remaining content, checking for whatever conditions you require.
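Steps 2 and 3 above could be sketched roughly as follows. The helper name `tokensStartingWith`, the delimiter set, and the sample markup are assumptions made for illustration; the literal `$page` string stands in for content fetched in step 1 via fopen or cURL.

```php
<?php
// Hypothetical helper: tokenise the visible text of $html and keep tokens that start with $prefix.
function tokensStartingWith(string $html, string $prefix): array
{
    $text = strip_tags($html);               // step 2: drop the HTML tags
    $delimiters = " \t\r\n.,;:!?()\"";       // assumed set of word separators
    $found = [];
    $token = strtok($text, $delimiters);     // step 3: first token
    while ($token !== false) {
        // strpos(...) === 0 means the token begins with the prefix (case-sensitive).
        if (strpos($token, $prefix) === 0) {
            $found[] = $token;
        }
        $token = strtok($delimiters);        // next token from the same string
    }
    return $found;
}

// Stand-in for the fetched page content.
$page = '<h1>Wiki</h1><p>An Example page: it installs nothing (Exactly nothing).</p>';
print_r(tokensStartingWith($page, 'Exa'));
print_r(tokensStartingWith($page, 'ins'));
```

One caveat with this sketch: strip_tags does not insert whitespace where a tag was removed, so text from adjacent elements (here "Wiki" and "An") can run together into a single token.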

2 Comments

@middaparka: I know this is possible; it is what I'm doing already. But if you load a web page that uses an iframe and JavaScript to generate its content, those strings do not appear when fopen() is called. Is there a way to get the strings generated by the JavaScript function? In other words, I want to get the text programmatically rather than by copying and pasting it.
@dooby What you're talking about isn't possible - the browser executes the JavaScript, etc. so you'd need to emulate (or indeed use) a browser. Incidentally, you should create your own question rather than adding a comment against an existing question's answer, especially as it sounds like you're attempting to solve a subtly different problem.
