
I am scraping the DOM of a static site with PHP and pulling out specific bits of data so I can store them in a database.

For this example I am storing the inner HTML of an element in $domString. I can see that the string is 'Description', but when I compare $domString to 'Description' in the code, there isn't a match.

if($domString == 'Description') {
    // This is not happening, even though I know
    // $domString contains 'Description' :(
}

I have stripped whitespace and so on. When I var_dump() them both, I get this:

string(45) "Description"
string(11) "Description"

Running them both through bin2hex() as Álvaro G. Vicario suggests returns the following two values respectively:

3c74642076616c69676e3d22746f702220636f6c7370616e3d2232223e4465736372697074696f6e3c2f74643e
4465736372697074696f6e

I need a way to strip whatever is padding out that first string.

  • What if you do trim($domString) == 'Description'? Commented Apr 3, 2014 at 12:01
  • @AbhikChakraborty I guess this isn't the problem, because whitespace is usually shown in var_dump(). Commented Apr 3, 2014 at 12:04
  • This might happen if the two strings have different encodings. Commented Apr 3, 2014 at 12:04
  • Are there any soft hyphen chars in there? Commented Apr 3, 2014 at 12:04
  • Try mb_detect_encoding($str) on both strings, then use mb_convert_encoding($domString, /*same_encoding*/) to convert both of them to the same encoding and see if they then compare as identical. Commented Apr 3, 2014 at 12:10

3 Answers

The number in parentheses is the total byte count. Obviously, a 45-byte string cannot be identical to an 11-byte one.

You can use bin2hex() to inspect the exact bytes. I also suggest you don't look at the output as rendered HTML; view the raw page source instead (in most browsers you can hit Ctrl+U).
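
A minimal sketch of that kind of inspection. The value assigned to $domString here is an assumption: it is what the first hex dump in the question decodes to, standing in for the real scraped data.

```php
<?php
// Assumed stand-in for the scraped inner HTML (decoded from the question's first hex dump).
$domString = '<td valign="top" colspan="2">Description</td>';
$expected  = 'Description';

// strlen() counts bytes, the same number var_dump() prints in parentheses.
$lenDom = strlen($domString);
$lenExp = strlen($expected);

// bin2hex() exposes every byte, including ones a browser would not display.
$hexDom = bin2hex($domString);
$hexExp = bin2hex($expected);

// htmlspecialchars() escapes the markup so a browser shows the tags
// as text instead of rendering them.
$visible = htmlspecialchars($domString);

echo $lenDom, ' ', $hexDom, PHP_EOL;
echo $lenExp, ' ', $hexExp, PHP_EOL;
echo $visible, PHP_EOL;
```

Escaping the value before echoing it is an alternative to Ctrl+U when the dump has to be read in the rendered page.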

Edit: the question of why two given strings render as the same words after being processed by a web browser is better answered by looking at the real raw data (as opposed to just looking at the output produced by the browser).

Edit #2:

var_dump( hex2bin('3c74642077696474683d223832222076616c69676e3d22746f70223e547970653c2f74643e') );

... prints this:

string(37) "<td width="82" valign="top">Type</td>"

Do you want to strip HTML tags or something? Did you see the raw HTML?
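
If stripping the tags is indeed the goal, one possible approach is sketched below. The helper name plainText() is hypothetical, and the sample value is the string that the question's first hex dump decodes to; this is an illustration under those assumptions, not necessarily the right normalisation for every scraped cell.

```php
<?php
// Hypothetical helper: reduce an HTML fragment to comparable plain text.
function plainText(string $html): string
{
    // Drop the markup, e.g. <td ...> and </td>.
    $text = strip_tags($html);
    // Turn entities such as &amp; back into the characters they stand for.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Remove leading and trailing whitespace.
    return trim($text);
}

// Assumed stand-in for the scraped inner HTML.
$domString = '<td valign="top" colspan="2">Description</td>';

var_dump(plainText($domString) === 'Description'); // bool(true)
```

The comparison uses === so that no type juggling is involved once both sides are plain strings.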


8 Comments

Yeah, the byte strings are different. How can I make this comparison, though? Is it reasonable to want to normalise this?
How can you compare two fruits? It depends on your data and your definition of equal. Is a peach different to an apple? What if you ask whether they're spherical?
I'm saying that, as a human, if I see a string 'Description' and another string 'Description', I would like to get a positive match, in the same way that their rendering on the screen matches. Is this a totally insane request?
To begin with, I'd inspect the actual data. Why guess?
bin2hex() shows: 3c74642077696474683d223832222076616c69676e3d22746f70223e547970653c2f74643e and 4465736372697074696f6e

You should ask why this happens:

string(45) "Description"
string(11) "Description"

The second one is 11 bytes, the first one is 45! Why? There must be some hidden (not displayed) characters or symbols in the first string. That's why these strings are not equal.

Try this one: "Remove control characters from PHP string".
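
A rough sketch of what removing control characters could look like. The helper name stripControlChars() is hypothetical and is not taken from the linked question. Note that the bin2hex() output in the question shows the extra bytes are HTML markup, which this approach leaves untouched.

```php
<?php
// Hypothetical helper: remove ASCII control characters (0x00-0x1F and 0x7F).
function stripControlChars(string $s): string
{
    // preg_replace() returns null on error; fall back to the input in that case.
    return preg_replace('/[\x00-\x1F\x7F]/', '', $s) ?? $s;
}

// A string polluted with a NUL byte and a trailing CRLF.
$withControls = "Descrip\x00tion\r\n";
var_dump(stripControlChars($withControls) === 'Description'); // bool(true)

// HTML tags are ordinary printable characters, so they survive.
$withTags = '<td valign="top" colspan="2">Description</td>';
var_dump(stripControlChars($withTags) === $withTags); // bool(true)
```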


One solution is to use a regex like this:

function clean($string) {
    $string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.
    return preg_replace('/[^A-Za-z0-9\-\;\,\?\*\%\@\$\!\(\)\#\=\&]/', '', $string); // Removes every character not in the allowed set.
}

Adapt the character class to your needs: add any special character you want to keep, escaped like \# or \=, and leave out the ones you want removed.
