1

I have 800 entries that are very similar, and each of them needs some processing. The format is like this:

<td class="description">
Describing text.
Might very well be 2 paragraphs
</td>

I need to modify the text inside the cell. I tried preg_replace('/(.+)</td>/'), but ran into two problems.

  1. I can't fetch only what's inside the parentheses; the match also includes the cell tags.
  2. It fetches everything up to the last </td> in the document. I just want it to stop at the first occurrence of </td>.

Thanks in advance

2

6 Answers

1

First of all, .+ will grab everything; it won't just start at <td>. You will want to add a pattern that matches the opening tag of the table cell:

<td[^>]*?>

(Note: [^>]* matches any characters other than >, so the match stops at the first > it finds.)

Also, .+ and .* are greedy, meaning that they grab as much as possible. To change this behavior, add a ? after the quantifier, like so: .+?. This makes it match only as much as it needs to.

So, you will have

<td[^>]*>(.*?)<\/td>
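
To make the greedy versus lazy difference concrete, here is a minimal sketch. The sample string and variable names are made up for illustration; only preg_match_all() and the two quantifiers come from the discussion above.

```php
<?php
// Hypothetical sample input: two cells on a single line.
$html = '<td class="description">First</td><td class="description">Second</td>';

// Greedy: .* runs to the LAST </td>, so both cells collapse into one match.
preg_match_all('/<td[^>]*>(.*)<\/td>/s', $html, $greedy);

// Lazy: .*? stops at the FIRST </td>, giving one match per cell.
preg_match_all('/<td[^>]*>(.*?)<\/td>/s', $html, $lazy);

print_r($greedy[1]); // one element containing everything between the outer tags
print_r($lazy[1]);   // two elements, one per cell
```

The captured text is in index 1 of the result array, because the parentheses form the first capturing group.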

This was a lesson on regex, but I really think you shouldn't be using regex for this. Regex can break pretty easily once you have nested tables or anything more complicated than simple HTML.


1

Don't use regular expressions for HTML processing!

If you still want to try it: wrap the tags in non-capturing groups (?:), put a capturing group around the content so that group 1 holds only the text, and use a lazy quantifier *? to match only up to the first closing tag.

(?:<td[^>]*>)(.*?)(?:</td>)

This requires dot-all mode (the s modifier in PHP) and may still fail if, for example, an attribute value in the opening tag contains a closing angle bracket.
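
The angle-bracket failure is easy to reproduce. The following is a small sketch with an invented title attribute; it is not taken from the asker's data.

```php
<?php
// Hypothetical cell whose title attribute contains a literal ">".
$html = '<td title="a > b">Text</td>';

preg_match('/<td[^>]*>(.*?)<\/td>/s', $html, $m);

// [^>]* stops at the ">" inside the attribute value,
// so the remainder of the opening tag leaks into the captured group.
var_dump($m[1]);
```

Here the capture is the tail of the attribute followed by the cell text rather than the cell text alone, which is the kind of silent corruption that makes a real parser the safer choice.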


0

If you're certain that there is no HTML in the table cells, the following non-regex code may help:

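// $entries contains all of the table cell entries.
$newentries = "";
// explode() replaces split(), which was removed in PHP 7.
$cells = explode("</td>", $entries);
// foreach replaces each(), which was removed in PHP 8.
foreach ($cells as $data) {
    // Skip fragments without an opening tag, such as the text after the last </td>.
    $pos = strpos($data, ">");
    if ($pos === false) {
        continue;
    }
    $newentries .= "<td class=\"description\">";
    $text = substr($data, $pos + 1);
    // perform modifications on $text
    // e.g. $text = "<b>" . $text . "</b>";
    $newentries .= $text;
    $newentries .= "</td>";
}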

// $newentries now contains the modified cell entries.

This probably isn't 100% what you're looking for, but maybe it will help.


0

You may use:

preg_replace(
  '/<td (.*?)>(.*?)<\/td>/sm',
  '<td class="description"><strong>$2</strong></td>',
  $data
)

If what you are trying to do with the text inside is complicated, use a callback function with preg_replace_callback().
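
A callback version might look like the sketch below. The sample input and the use of strtoupper() are placeholders for whatever processing is actually needed; the pattern is the one shown above without the m modifier, which has no effect here.

```php
<?php
// Hypothetical sample input in the question's format.
$data = "<td class=\"description\">\nDescribing text.\n</td>";

$result = preg_replace_callback(
    '/<td (.*?)>(.*?)<\/td>/s',
    function (array $m): string {
        // $m[1] holds the attributes, $m[2] the cell content.
        // strtoupper() stands in for the real modification.
        return '<td ' . $m[1] . '>' . strtoupper($m[2]) . '</td>';
    },
    $data
);

echo $result;
```

Unlike the plain replacement string, the callback keeps the original attributes and can run arbitrary PHP on the cell text.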


0

As everyone else has said: regex is a poor fit, at least here!

So, basic Regex is

#<td[^>]*>(.*?)</td>#s

(Note that I used the s modifier. Without it, . does not match newlines, so the regex would not match a cell that spans several lines.)
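
The effect of the s modifier can be checked with a quick sketch. The input mirrors the multi-line cell from the question; the variable names are illustrative.

```php
<?php
// Multi-line cell, as in the question.
$html = "<td class=\"description\">\nDescribing text.\nMight very well be 2 paragraphs\n</td>";

// Without s, . does not match "\n", so the pattern cannot span lines.
$without = preg_match('#<td[^>]*>(.*?)</td>#', $html, $a);

// With s, . matches newlines too, and the whole cell content is captured.
$with = preg_match('#<td[^>]*>(.*?)</td>#s', $html, $b);

var_dump($without, $with);
```

preg_match() returns 0 for the first call and 1 for the second, and the captured content keeps its leading and trailing newlines, so a trim() may be wanted afterwards.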

Now, this regex is not strictly correct, even though it may be okay for your purposes. To be stricter, you have to know that > is allowed inside attribute values, so the pattern above can break on such input. A version that matches any number of double-quoted attributes explicitly:

#<td((?:\s+\w+="[^"]*")*)\s*>(.*?)</td>#s

I think this will be quite robust if you're dealing with XML. Note that the cell content is now in group 2. It may still break on rare occasions that I can't think of right now.


0
$d = new DOMDocument();
$d->loadHTML($htmlstring);
$x = new DOMXPath($d);
$tds = $x->query("//td[@class='description']//text()");
// Iterate over the node list directly; item() indices run from 0 to length - 1.
foreach ($tds as $node) {
    // Replace each text node's content with its upper-cased version.
    $node->replaceData(0, $node->length, strtoupper($node->data));
}
var_dump($d->saveHTML());
