
Is there a way to use preg_replace() to append the string "utm=some&medium=stuff" to every URL found in $html_text?

$html_text = 'Lorem ipsum <a href="http://www.me.com">dolor sit</a> amet, 
              <a href="http://www.me.com/page.php?id=10">consectetur</a> elit.';

So the result should be

href="http://www.me.com" ›››››
href="http://www.me.com?utm=some&medium=stuff"

href="http://www.me.com/page.php?id=10" ›››››
href="http://www.me.com/page.php?id=10&utm=some&medium=stuff"

So, if the URL already contains a question mark (as the second URL does), it should add an ampersand "&" instead of a question mark "?" in front of "utm=some..."

Ideally, it would only alter URLs for the domain me.com.

5 Answers

4

This is a little bit tricky, but the following code should work if your URLs are all enclosed in quotation marks (single or double). It will also handle fragment identifiers (like #section-2).

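$url_modifier = 'utm=some&medium=stuff';
// Pass the delimiter so that "#" is escaped on every PHP version
$url_modifier_domain = preg_quote('www.me.com', '#');

$html_text = preg_replace_callback(
              '#((?:https?:)?//'.$url_modifier_domain.'(/[^\'"\#]*)?)(?=[\'"\#])#i',
              // "use" imports the variable, so this also works inside a function
              function($matches) use ($url_modifier) {
                // No path at all: add "/" before the query string
                if (!isset($matches[2])) return $matches[1]."/?$url_modifier";
                $q = strpos($matches[2],'?');
                // Path without a query string
                if ($q===false) return $matches[1]."?$url_modifier";
                // URL already ends in "?"
                if ($q==strlen($matches[2])-1) return $matches[1].$url_modifier;
                // Existing query string: append with "&"
                return $matches[1]."&$url_modifier";
              },
              $html_text);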

Input:

<a href="http://www.me.com">Lorem</a>
<a href="http://www.me.com/">ipsum</a>
<a href="http://www.me.com/#section-2">dolor</a>
<a href="http://www.me.com/path-to-somewhere/file.php">sit</a>
<a href="http://www.me.com/?">amet</a>,
<a href="http://www.me.com/?foo=bar">consectetur</a>
<a href="http://www.me.com/?foo=bar#section-3">elit</a>.

Output:

<a href="http://www.me.com/?utm=some&medium=stuff">Lorem</a>
<a href="http://www.me.com/?utm=some&medium=stuff">ipsum</a>
<a href="http://www.me.com/?utm=some&medium=stuff#section-2">dolor</a>
<a href="http://www.me.com/path-to-somewhere/file.php?utm=some&medium=stuff">sit</a>
<a href="http://www.me.com/?utm=some&medium=stuff">amet</a>,
<a href="http://www.me.com/?foo=bar&utm=some&medium=stuff">consectetur</a>
<a href="http://www.me.com/?foo=bar&utm=some&medium=stuff#section-3">elit</a>.
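
For reuse, the same idea can be wrapped in a function. This is only a sketch under the same assumption as above (URLs enclosed in quotation marks); the function name add_tracking_params and its parameters are hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: appends $modifier to every quoted URL on $domain.
function add_tracking_params(string $html, string $domain, string $modifier): string
{
    // Quote the domain for use inside a '#'-delimited pattern.
    $quoted  = preg_quote($domain, '#');
    $pattern = '#((?:https?:)?//' . $quoted . '(/[^\'"\#]*)?)(?=[\'"\#])#i';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($modifier): string {
            $path = $m[2] ?? '';                  // unset when the URL has no path
            if ($path === '') {
                return $m[1] . '/?' . $modifier;  // bare domain
            }
            $q = strpos($path, '?');
            if ($q === false) {
                return $m[1] . '?' . $modifier;   // no query string yet
            }
            if ($q === strlen($path) - 1) {
                return $m[1] . $modifier;         // URL already ends in "?"
            }
            return $m[1] . '&' . $modifier;       // extend the existing query
        },
        $html
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $html;
}
```

It would be called as add_tracking_params($html_text, 'www.me.com', 'utm=some&medium=stuff'). Passing the modifier through "use" avoids relying on a global variable.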

1 Comment

Thank you very much! I ended up using this solution.
1

This is a trivial task using DOMDocument:

$html_text = 'Lorem ipsum <a href="http://www.me.com">dolor sit</a> amet, <a href="http://www.me.com/page.php?id=10">consectetur</a> elit.';

$html = new DOMDocument();
$html->loadHtml($html_text);

foreach ($html->getElementsByTagName('a') as $element)
{
    $href = $element->getAttribute('href');
    if (!empty($href)) // only edit the attribute if it's set
    {
        // check if we need to append with ? or &
        if (strpos($href, '?') === false)
            $href .= '?';
        else
            $href .= '&';

        // append querystring
        $href .= 'utm=some&medium=stuff';

        // set attribute
        $element->setAttribute('href', $href);
    }
}

// output altered code
echo $html->C14N();

Fiddle: http://phpfiddle.org/lite/code/wvq-ujk
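
The loop above changes every link, whereas the question asks for me.com only and fragments such as #section-2 would end up in front of the appended query. One possible way to cover both is a small helper based on parse_url(). This is a sketch, not part of the original answer; the name append_query and the host list are assumptions.

```php
<?php
// Hypothetical helper: returns $href with $query appended, but only
// when the URL's host is one of $hosts. Other URLs come back unchanged.
function append_query(string $href, string $query, array $hosts): string
{
    $host = parse_url($href, PHP_URL_HOST);
    // Skip relative URLs, unparsable URLs and foreign hosts.
    if (!is_string($host) || !in_array(strtolower($host), $hosts, true)) {
        return $href;
    }

    // Split off the fragment so the query lands in front of it.
    $parts = explode('#', $href, 2);
    $base  = $parts[0];

    $last = substr($base, -1);
    if ($last === '?' || $last === '&') {
        $sep = '';                                   // separator already present
    } else {
        $sep = strpos($base, '?') === false ? '?' : '&';
    }

    $url = $base . $sep . $query;
    return isset($parts[1]) ? $url . '#' . $parts[1] : $url;
}
```

Inside the foreach loop, the attribute could then be set with $element->setAttribute('href', append_query($href, 'utm=some&medium=stuff', ['me.com', 'www.me.com'])).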

1 Comment

Thanks! This works great and the code is easy to read as well.
1

You can achieve this by using preg_replace with two patterns and two replacements:

<?php
$add = "utm=some&medium=stuff";
// (?:www\.)? makes the "www." prefix optional; the lookaheads use [^"]*
// so they only inspect the current URL, not the rest of the text.
$patterns = array(
                '/(https?:\/\/(?:www\.)?me\.com(?=[^"]*\?)[^"]*)/',  # positive lookahead: the URL contains a ?
                '/(https?:\/\/(?:www\.)?me\.com(?![^"]*\?)[^"]*)/'   # negative lookahead: the URL contains no ?
        );
$replacements = array(
                    '$1&'.$add, # replacement when the first pattern matches
                    '$1?'.$add  # replacement when the second pattern matches
            );
$str = 'Lorem ipsum <a href="http://www.me.com">dolor sit</a> amet, <a href="http://www.me.com/page.php?id=10">consectetur</a> elit.';
$str = preg_replace($patterns, $replacements, $str);
echo $str;

/* Output:
Lorem ipsum <a href="http://www.me.com?utm=some&medium=stuff">dolor sit</a> amet, <a href="http://www.me.com/page.php?id=10&utm=some&medium=stuff">consectetur</a> elit.
*/
?>

I liked the other answers that use DOM-based solutions, so I timed each snippet on the following input:

<a href="http://www.me.com">Lorem</a>
<a href="http://www.me.com/">ipsum</a>
<a href="http://www.me.com/#section-2">dolor</a>
<a href="http://www.me.com/path-to-somewhere/file.php">sit</a>
<a href="http://www.me.com/?">amet</a>,
<a href="http://www.me.com/?foo=bar">consectetur</a>
<a href="http://www.me.com/?foo=bar#section-3">elit</a>.

With microtime:

$ts = microtime(true);
// codes
printf("%.10f\n", microtime(true) - $ts);

The results are shown below (in seconds, as printed by the snippet above):

@squeamish ossifrage:  0.0001089573
@Cobra_Fast:           0.0003509521
@Emissary:             0.0094890594
@Me:                   0.0000669956

I found that interesting: the regex-based solutions performed well.
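
A single microtime() measurement of a snippet this short is likely to be noisy. A slightly more robust sketch averages many runs using hrtime(), which is available from PHP 7.3. The function name average_seconds and the timed callback are placeholders, not the code that produced the numbers above.

```php
<?php
// Hypothetical helper: average wall-clock time of one call, in seconds.
function average_seconds(callable $snippet, int $iterations = 10000): float
{
    $start = hrtime(true);                 // nanoseconds from a monotonic clock
    for ($i = 0; $i < $iterations; $i++) {
        $snippet();
    }
    return (hrtime(true) - $start) / $iterations / 1e9;
}

// Stand-in workload; replace the body with the snippet being measured.
$html = '<a href="http://www.me.com/?foo=bar">consectetur</a>';
$avg = average_seconds(function () use ($html) {
    preg_replace('/foo/', 'bar', $html);
});

printf("%.10f\n", $avg);
```

Averaging over many iterations reduces the influence of one-off costs such as the first compilation of a pattern, so the results are not directly comparable with the single-run figures above.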

2 Comments

Your solution is indeed very slick! But there seems to be a bug in the domain checking pattern. This does not work: '/(https?:\/\/(?:www)?me.com(?=.*?\?)[^"]*)/', '/(https?:\/\/(?:www)?me.com(?!.*?\?)[^"]*)/' This works: '/(https?:\/\/(?=.*?\?)[^"]*)/', '/(https?:\/\/(?!.*?\?)[^"]*)/'
@Maria Oh yes, I just forgot to put a backslash before the dot: me\.com. I've updated the answer.
0

If you'd like to abstract all the nasty parsing away from your script, you can always use a DOM parser, of which there are many available. For this example I've opted for Simple HTML DOM as it's the only one I'm actually familiar with (it's admittedly not the most efficient library, but you aren't doing anything intensive).

include 'simple_html_dom.php';
$html = str_get_html($htmlString);

foreach($html->find('a') as $a){
    $url = strtolower($a->href);
    if( strpos($url, 'http://me.com')     === 0 ||
        strpos($url, 'http://www.me.com') === 0 ||
        strpos($url, 'http://') !== 0 // local url
    ){
        $url = explode('?', $url, 2);
        if(count($url)<2) $qry = array();
        else parse_str($url[1], $qry);
        $qry = array_merge($qry, array(
            'utm'    => 'some',
            'medium' => 'stuff'
        ));
        $parts = array();
        foreach($qry as $key => $val)
            $parts[] = "{$key}={$val}";
        $a->href = sprintf("%s?%s", $url[0], implode('&', $parts));
    }
}

echo $html;

In this example I've assumed that me.com is your website and that local paths should also qualify. I am also assuming that query strings are likely to be simple key:value pairs. In its current form, if a URL already has one of your query parameters, its value is overwritten. If you'd like to retain the existing values instead, swap the order of the arguments in the array_merge call.

input:

<a href="http://me.com/">test</a> 
<a href="http://WWW.me.com/">test</a> 
<a href="local.me.com.php">test</a> 
<a href="http://notme.com">test</a> 
http://me.com/not-a-link
<a href="http://me.com/?id=10&utm=bla">test</a>

output:

<a href="http://me.com/?utm=some&medium=stuff">test</a> 
<a href="http://www.me.com/?utm=some&medium=stuff">test</a> 
<a href="local.me.com.php?utm=some&medium=stuff">test</a> 
<a href="http://notme.com">test</a> 
http://me.com/not-a-link 
<a href="http://me.com/?id=10&utm=some&medium=stuff">test</a>
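
The effect of the argument order in array_merge can be checked in isolation. The sketch below uses the query string from the last input line; http_build_query() is swapped in here for the manual loop purely for brevity, and the variable names are illustrative.

```php
<?php
// Existing query string from the last test link.
parse_str('id=10&utm=bla', $existing);
$tracking = ['utm' => 'some', 'medium' => 'stuff'];

// Later arguments win for duplicate string keys.
$overwrite = array_merge($existing, $tracking); // tracking values replace existing ones
$retain    = array_merge($tracking, $existing); // existing values are kept

echo http_build_query($overwrite, '', '&'), "\n"; // id=10&utm=some&medium=stuff
echo http_build_query($retain, '', '&'), "\n";    // utm=bla&medium=stuff&id=10
```

Note that http_build_query() also URL-encodes keys and values, which the manual loop in the answer does not do.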


0

If you have problems with DOMDocument and utf8, try the following:

$html_text = '<p>This is a text with special chars ÄÖÜ <a 
href="http://example.com/This-is-my-Page" 
target="_self">here</a>.</p>';
$html_text .= '<p>continue</p>';

$html = new DOMDocument('1.0', 'utf-8');

// Set charset-header for DOMDocument
$html_prepared = '<html>'
  . '<head>'
  . '<meta http-equiv="content-type" content="text/html; charset=UTF-8">'
  . '</head>'
  . '<body>'
  . '<div>' . $html_text . '</div>'
  . '</body>';


$html->loadHtml($html_prepared);


foreach ($html->getElementsByTagName('a') as $element)
{
    $href = $element->getAttribute('href');
    if (!empty($href)) // only edit the attribute if it's set
    {
        // check if we need to append with ? or &
        if (strpos($href, '?') === false)
            $href .= '?';
        else
            $href .= '&';

        // append querystring
        $href .= 'utm=some&medium=stuff';

        // set attribute
        $element->setAttribute('href', $href);
    }
}


// 1) Remove doctype-declaration
$html->removeChild($html->firstChild);
// 2) Remove head
$html->firstChild->removeChild($html->firstChild->firstChild);
// 3) Only keep body's first Child
$html->replaceChild($html->firstChild->firstChild->firstChild, $html->firstChild);

print $html->saveHTML();
