Regex to automate some html tagging

Question

I have 800 entries that are very similar, but they all need some processing. The format is like this:

<td class="description"> Describing text. Might very well be 2 paragraphs </td>

I need to modify the text inside each cell. I've tried preg_replace('/(.+)</td>/'), but it has two problems:

It doesn't fetch only what's inside the parentheses; it also fetches the cell tags.
It fetches everything up to the last </td> in the document. I just want it to stop at the first occurrence of </td>.

Thanks in advance

Don't mix HTML and Regex! stackoverflow.com/questions/1732348/…
– Daniel Brückner, Commented Jul 14, 2010 at 13:54

Donald Miner · Accepted Answer · 2010-07-14 13:53:30Z

1

First of all, .+ will grab everything; it won't just start at <td>. You will want to add a pattern that matches the opening tag of the table cell:

<td[^>]*?>

(Note: [^>]* matches any run of characters other than >, so the match stops at the first > it finds.)

Also, .+ and .* are greedy, meaning that they will grab as much as possible. To change this behavior, add a ? after the quantifier, like so: .+?. This makes it match only as much as it needs to.
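To make the greedy/lazy difference concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration; only preg_match itself is standard PHP.

```php
<?php
// Hypothetical sample input: two description cells on one line.
$html = '<td class="description">First</td><td class="description">Second</td>';

// Greedy: (.+) runs to the end of the string, then backtracks to the LAST </td>.
preg_match('/<td[^>]*>(.+)<\/td>/', $html, $greedy);

// Lazy: (.+?) stops at the FIRST </td> that lets the pattern succeed.
preg_match('/<td[^>]*>(.+?)<\/td>/', $html, $lazy);

echo $greedy[1], "\n"; // First</td><td class="description">Second
echo $lazy[1], "\n";   // First
```

The greedy version swallows the boundary between the two cells, which is the "goes to the last </td>" problem from the question.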

So, you will have

<td[^>]*>(.*?)<\/td>

This was a lesson on regex, but I really think you shouldn't be using regex for this. Regex can break pretty easily once you have nested tables or anything more complicated than simple HTML.
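As a rough illustration of the nested-table problem, the sketch below runs a lazy pattern of this shape (with the s modifier added so . can cross line breaks) against a cell that contains another table. The input string is a hypothetical example.

```php
<?php
// Hypothetical input: a description cell that itself contains a small table.
$html = '<td class="description">Outer <table><tr><td>Inner</td></tr></table> tail</td>';

preg_match('/<td[^>]*>(.*?)<\/td>/s', $html, $m);

// The lazy match ends at the inner cell's </td>, so the capture is cut short.
echo $m[1], "\n"; // Outer <table><tr><td>Inner
```

A regex has no notion of which </td> closes which <td>, so it cannot be tuned to get this case right in general.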

answered Jul 14, 2010 at 13:53

Donald Miner


Community · Accepted Answer · 2017-05-23 12:01:15Z

D̨͙̯̹̼ỏ͇̥̱͚̲͖̣͢ǹ̶̥͉̳͈͈̏̉ͧ'ͧͬ͏̪̩͓̳̬̱ͅt͇̝̖ͦ̏̏̍̉͠ ͙̺̹͚͎̐̒ͥ͑̀ṷ͍̖͕̐ͫ̚s̤͖͇̲̪͊͋̉ͨͪ̚e͚̲͎͓̟͊̍ ̲̬̩͇̗̭̌̊̑̊͝r̷̦͔̞̜̬ͦe̔̓͒͊̌g̹̘̬̭ͨ̐̽̐̂u̼̹̔ͣ͑͐̓͋l͈̤̘͉̰̏͌̚a̵̤̞̥̋rͭ ̦̝͓̟̣̯̄́̎̀̔ͥe̢̟̥̹̊̅̌̅̋x̠̠̲͚̝͋ͪp̧̽̉ṟ͉̏͌̊̐ͅe͖͎̞͇̽͛̀s͓͈̒s̴͚̮̹ͧ̽i̐ͪ̈́̏̑o͇͓̎n͎̐̃ͨ͢s̜͉̼̹͇̐ͥ̏̈́̽̔͐ ̛̑ͧf̩̋ͨ͑ö̮̗̩́̏̀ͩ̆r̮͓͊̌ ̸̪͈̫̬̭̻̮͊ͧ͂ͬ̌H͎̤̟͙̞ͪ͐̃̿ͮͭͅT͚̉͑͛̉M̴̦͖͇͔͚̙ͭͭ̽L͗ͦ̋̓͑ ͍͈͙̞͍̻̉̆͆̃͘p̓̉̃͆͛ͦ́͟r͕͙ͭͭͦ͡ő̹͍̳̳ͯ̐c̵̙͇͋̅è͖̘̲̰͉͉̺͛́ͪͩ̋͜s̾͑ͬͬ͐̋̀s̜̼̰̞̺͗ͫ̒ͫͧͥͅḭ̪ͫ͋ͫ̚n̿͐҉̺̩̟̻̳g͑̀̑̆̈̾!̠̓ͭ̈͜

If you still want to try it ... put the tags in non-capturing groups (?:), capture only the content between them, and use a lazy quantifier *? to match only up to the first closing tag.

(?:<td[^>]*>)(.*?)(?:</td>)

This requires dot-all mode and may still fail if, for example, an attribute value contains a closing angle bracket.
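Here is a small sketch of why dot-all mode matters. The input is a hypothetical two-line cell, and the content is wrapped in a capturing group so the cell text can be read from group 1.

```php
<?php
// Hypothetical cell whose text spans two lines.
$html = "<td class=\"description\">Line one\nLine two</td>";

// Tags in non-capturing groups, content in a capturing group.
$pattern = '(?:<td[^>]*>)(.*?)(?:</td>)';

// Without the s modifier, . does not match "\n", so the pattern finds nothing.
$plainCount = preg_match('#' . $pattern . '#', $html, $plain);

// With the s (dot-all) modifier, . also matches newlines.
$dotAllCount = preg_match('#' . $pattern . '#s', $html, $dotAll);

var_dump($plainCount);  // int(0)
var_dump($dotAllCount); // int(1)
echo $dotAll[1], "\n";  // prints both lines of the cell text
```

Using # as the delimiter avoids having to escape the / in </td>.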

Fosco · Accepted Answer · 2010-07-14 14:03:42Z

0

If you're certain that there is no HTML in the table cells, the following non-regex code may help:

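// $entries contains all of the table cell entries.
$newentries = "";
// explode() replaces split(), which was removed in PHP 7.
$cells = explode("</td>", $entries);
// foreach replaces the list()/each() idiom; each() was removed in PHP 8.
foreach ($cells as $data) {
    // Skip the fragment after the last </td>, which has no opening tag.
    $pos = strpos($data, ">");
    if ($pos === false) {
        continue;
    }
    $newentries .= "<td class=\"description\">";
    $text = substr($data, $pos + 1);
    // perform modifications on $text
    // e.g. $text = "<b>" . $text . "</b>";
    $newentries .= $text;
    $newentries .= "</td>";
}

// $newentries now contains the modified cell entries.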

This probably isn't 100% what you're looking for, but maybe it will help.

answered Jul 14, 2010 at 14:03

Fosco


Narcis Radu · Accepted Answer · 2010-07-14 14:13:58Z

0

You may use:

$data = preg_replace(
  '/<td (.*?)>(.*?)<\/td>/sm',
  '<td class="description"><strong>$2</strong></td>',
  $data
);

If what you are trying to do with the text inside is complicated, use a callback function.
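A minimal sketch of the callback approach with preg_replace_callback follows. The sample data and the transformation inside the callback (trim plus ucfirst) are placeholders made up for illustration; swap in whatever processing you actually need.

```php
<?php
// Hypothetical input; the callback body is only a placeholder transformation.
$data = '<td class="description">first entry</td>' . "\n"
      . '<td class="description">second entry</td>';

$result = preg_replace_callback(
    '/<td (.*?)>(.*?)<\/td>/s',
    function (array $m): string {
        // $m[1] holds the attributes, $m[2] the cell text.
        $text = ucfirst(trim($m[2]));
        return '<td ' . $m[1] . '><strong>' . $text . '</strong></td>';
    },
    $data
);

echo $result, "\n";
```

The callback receives the same groups that $1 and $2 refer to in a plain replacement string, but you can run arbitrary PHP on them before returning the new cell.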

edited Jul 14, 2010 at 14:13

answered Jul 14, 2010 at 14:06

Narcis Radu


NikiC · Accepted Answer · 2010-07-14 14:23:04Z

0

As everyone else has said: regex is a bad choice, at least here!

So, the basic regex is

#<td[^>]*>(.*?)</td>#s

(Note that I used the s modifier; without it, . does not match newlines, so the regex would fail on cells that span several lines.)

Now, this regex is not strictly correct, even though it may be okay for your purposes. To be stricter, you have to know that > is allowed inside attribute values, so this regex may break things.

#<td(\s+\w+="[^"]+")\s*>(.*?)</td>#s

I think this will now be quite safe if you're dealing with XML. But sure, it may still break on rare occasions that I can't think of right now.
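To see the difference between the two patterns, here is a small sketch with a hypothetical cell whose title attribute contains a > character. Note that the stricter pattern, as written, expects exactly one attribute in name="value" form.

```php
<?php
// Hypothetical cell whose title attribute contains a > character.
$html = '<td title="a>b">Text</td>';

// Simple pattern: [^>]* stops at the > inside the attribute value.
preg_match('#<td[^>]*>(.*?)</td>#s', $html, $simple);

// Stricter pattern: the attribute must look like name="value",
// so the > inside the quotes is consumed as part of the value.
preg_match('#<td(\s+\w+="[^"]+")\s*>(.*?)</td>#s', $html, $strict);

echo $simple[1], "\n"; // b">Text
echo $strict[2], "\n"; // Text
```

In the stricter pattern the cell text moves to group 2, because group 1 now holds the attribute.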

answered Jul 14, 2010 at 14:23

NikiC


Wrikken · Accepted Answer · 2010-07-14 14:30:41Z

0

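$d = new DOMDocument();
$d->loadHTML($htmlstring);
$x = new DOMXPath($d);
// Select every text node inside the description cells.
$tds = $x->query("//td[@class='description']//text()");
// DOMNodeList is zero-indexed, so iterate from 0 to length - 1.
for ($i = 0; $i < $tds->length; $i++) {
    $node = $tds->item($i);
    // Replace the node's own text with its upper-cased form.
    $node->replaceData(0, $node->length, strtoupper($node->data));
}
var_dump($d->saveHTML());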

answered Jul 14, 2010 at 14:30

Wrikken
