Simple PHP code for extracting data from the HTML source code

Question

I know I could use XPath, but it wouldn't work in this case because the site's navigation is too complex.

I can only use the source code.

I have searched all over the place and couldn't find a simple PHP solution that would:

Open the HTML source code page (I already have an exact source code page URL).
Select and extract the text between two known strings. The text is not wrapped in a div, but I know the start and end markers.

So, basically, I need to extract the text between

knownhtmlcodestart> Text to extract <knownhtmlcodeend

What I'm trying to achieve in the end is this:

Go to a source code URL.
Extract the text between two codes.
Store the data temporarily (for a duration I set manually) on my web server in a simple text file.
Define the waiting time and then repeat the whole process again.
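
A minimal sketch of the four steps above, assuming PHP 7.4+ and plain string functions. The helper names `textBetween` and `cachedExtract`, the cache path, and the 300-second lifetime are all hypothetical placeholders, and the page is supplied inline so the sketch runs without network access.

```php
<?php
// Hypothetical helpers; markers, cache path and lifetime are placeholders.

/** Return the text between $start and $end in $html, or null if a marker is missing. */
function textBetween(string $html, string $start, string $end): ?string
{
    $from = strpos($html, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = strpos($html, $end, $from);
    return $to === false ? null : substr($html, $from, $to - $from);
}

/**
 * Return the cached text if the cache file is younger than $ttl seconds;
 * otherwise call $fetch() for fresh HTML, extract, and overwrite the file.
 */
function cachedExtract(callable $fetch, string $cacheFile, int $ttl, string $start, string $end): ?string
{
    clearstatcache(true, $cacheFile); // make sure filemtime() is not stale
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
        return file_get_contents($cacheFile);
    }
    $html = $fetch();
    $text = is_string($html) ? textBetween($html, $start, $end) : null;
    if ($text !== null) {
        file_put_contents($cacheFile, $text, LOCK_EX); // same file every time
    }
    return $text;
}

// Demo with an inline page. In real use the callable might be, for example,
// fn() => file_get_contents('https://example.com/page') (assumed URL).
$cacheFile = sys_get_temp_dir() . '/extract_demo.txt';
@unlink($cacheFile);
$fetch = fn() => 'x knownhtmlcodestart> Text to extract <knownhtmlcodeend y';
echo cachedExtract($fetch, $cacheFile, 300, 'knownhtmlcodestart>', '<knownhtmlcodeend'), "\n";
```

The "repeat after a waiting time" step is left to whatever calls the script: either a scheduler such as cron, or simply calling `cachedExtract` on every page request and letting the file age decide whether to refetch.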

The website that I'm going to extract data from changes dynamically, so the new data would always be stored in the same file.

Then I would use that data (but that's a question for another time).

I would appreciate it if anyone could lead me to a simple solution.

I'm not asking anyone to write the code for me, but if someone has done something similar, sharing the code here would be helpful.

Thanks

You can use regex to extract the text as mentioned above. There is a regex solution at stackoverflow.com/questions/2403122/… if you are interested.
– Kishen Nagaraju, Commented Mar 24, 2021 at 18:43
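
As a rough illustration of the regex route suggested in this comment (not the code from the linked question), the sketch below builds a pattern from the two markers. The helper name `regexBetween` is hypothetical, and the marker strings are the placeholders from the question.

```php
<?php
// Hypothetical helper; the markers are the placeholders from the question.
function regexBetween(string $html, string $start, string $end): ?string
{
    // preg_quote() escapes regex metacharacters in the markers, plus the '/' delimiter.
    // (.*?) is lazy, so it stops at the first end marker after the start marker.
    // The 's' modifier lets '.' match newlines, so the text may span several lines.
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/s';
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

$html = "<p>knownhtmlcodestart> Text to\nextract <knownhtmlcodeend</p>";
var_dump(regexBetween($html, 'knownhtmlcodestart>', '<knownhtmlcodeend'));
```

Quoting the markers matters because real HTML fragments usually contain characters such as `/`, `.` or `?` that would otherwise be read as regex syntax.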

Eriks Klotins · Accepted Answer · 2021-03-24 19:06:21Z

1

I (shamefully) found the following function useful for extracting content from HTML. Regexes sometimes get too complex when you need to extract something large, e.g. a whole <table>.

/*
   $str    - string to search in
   $start  - string marking the start of the sequence you want to extract
   $end    - string marking the end of it (null = extract to the end of $str)
   $offset - starting position in case you need to find multiple occurrences
   Returns the string between `$start` and `$end` plus the indexes of its start
   and end, or false if either marker is not found.
*/
function strExt($str, $start, $end = null, $offset = 0)
{
    $p1 = mb_strpos($str, $start, $offset);
    if ($p1 === false) return false;
    $p1 += mb_strlen($start);

    // Search from $p1 (not $p1 + 1) so an end marker directly after $start is found.
    $p2 = $end === null ? mb_strlen($str) : mb_strpos($str, $end, $p1);
    if ($p2 === false) return false; // end marker missing

    return [
        'str'   => mb_substr($str, $p1, $p2 - $p1),
        'start' => $p1,
        'end'   => $p2,
    ];
}
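
The `$offset` parameter is what makes repeated extraction possible. As a sketch of that idea, the hypothetical helper `extractAll` below (not part of the answer's code) walks through a string and collects every span between the two markers. It assumes the mbstring extension is available, as the function above does.

```php
<?php
// Hypothetical helper, not part of the answer's code: collects every match.
function extractAll(string $str, string $start, string $end): array
{
    $found = [];
    $offset = 0;
    while (($p1 = mb_strpos($str, $start, $offset)) !== false) {
        $p1 += mb_strlen($start);
        $p2 = mb_strpos($str, $end, $p1);
        if ($p2 === false) {
            break; // unterminated start marker: stop rather than guess
        }
        $found[] = mb_substr($str, $p1, $p2 - $p1);
        $offset = $p2 + mb_strlen($end); // resume after the end marker
    }
    return $found;
}

// The last cell has no closing tag, so it is skipped.
$html = '<td>alpha</td><td>beta</td><td>gamma';
print_r(extractAll($html, '<td>', '</td>'));
```

Resuming after the end marker, rather than after the start marker, keeps matches from overlapping.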

Phaelax · Accepted Answer · 2021-03-24 19:11:43Z

1

This assumes the opening and closing tags are on the same line (as in your example). If the tags can be on separate lines, it wouldn't be difficult to adapt this.

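// file_get_contents() needs a full URL including the scheme;
// a bare 'website.com' would be treated as a local file name.
$html = file_get_contents('https://website.com');
if ($html === false) {
    exit('Could not fetch the page');
}

$start = "knownhtmlcodestart>";
$end   = "<knownhtmlcodeend";
$text  = null;

foreach (explode("\n", $html) as $line) {
    $t1 = strpos($line, $start);
    if ($t1 === false)   // strict check: position 0 is a valid match
        continue;

    $c1 = $t1 + strlen($start);        // first character after the start marker
    $c2 = strpos($line, $end, $c1);    // end marker must come after it

    if ($c2 !== false) {
        $text = substr($line, $c1, $c2 - $c1);
        break;
    }
}

echo $text ?? 'Markers not found';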
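
The answer says the approach can be adapted when the tags sit on separate lines. One possible adaptation, sketched here under that assumption, is to skip the line split and search the whole document, then collapse the line breaks inside the result. The helper name `betweenMarkers` is hypothetical, and the sample HTML is made up for the demo.

```php
<?php
// Hypothetical helper: same idea as the loop, applied to the whole document.
function betweenMarkers(string $html, string $start, string $end): ?string
{
    $c1 = strpos($html, $start);
    if ($c1 === false) {
        return null;
    }
    $c1 += strlen($start);
    $c2 = strpos($html, $end, $c1);
    if ($c2 === false) {
        return null;
    }
    // Collapse line breaks and runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', substr($html, $c1, $c2 - $c1)));
}

$html = "<div>\nknownhtmlcodestart>\n  Text to\n  extract\n<knownhtmlcodeend\n</div>";
echo betweenMarkers($html, 'knownhtmlcodestart>', '<knownhtmlcodeend'), "\n";
```

Whether to normalize the whitespace is a design choice: drop the `preg_replace` and `trim` calls if the original line breaks matter.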
