The title of my question is a bit complicated, I know, but here is basically what I want to do:

Say I have this piece of text:

[table]
[tr]
[td]test str 1[/td]
[td]test str 2[/td]
[/tr]
[/table]

Is there a regex that lets me find:

  • A string that is between a [td] and a [/td] tag
  • Where the entire part from [td] to [/td] is itself between the [table] and [/table] tags
  • Where the text between the [table] and [td] tags does not contain a [/table] tag
  • Where the text between the [/td] and [/table] tags does not contain a [table] tag

It might sound obvious, but the regex has to be safe, because it will be used to handle user input. All the tags are converted to HTML, so if a user were to enter a [td] outside of a table, it could affect the tables used for the layout of my site's pages.

So it should match "test str 1" first and "test str 2" on the next pass, but only if each string is inside td tags that are themselves inside a pair of table tags with no other table tag in between.

This is as close as I've gotten:

/\[table(.*?)\]((?!\[\/table\]).*?)\[td(.*?)\](.*?)\[\/td\]((?!\[table(.*?)\]).*?)\[\/table\]/si

But I think I'm missing something in the parts where the table tags should not be there, so between the table and td tags.
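
For illustration, the four conditions listed above can be sketched with a "tempered" pattern, where a negative lookahead is checked before every single character rather than once at the start of a lazy group. This is a minimal sketch under stated assumptions, not a drop-in answer: it uses Python's re module purely for demonstration (the lookahead, \b, and the case-insensitive and dot-matches-newline flags also exist in PHP's PCRE, but the pattern was not tested there), it splits the work into two passes, which a plugin that accepts a single pattern may not allow, and the names TABLE_RE, TD_RE, and cells_inside_tables are hypothetical.

```python
import re

# A [table ...] opener, then a body in which no [table] or [/table]
# tag may start at any position, then the closing [/table].
TABLE_RE = re.compile(
    r"\[table\b[^\]]*\]((?:(?!\[/?table\b).)*)\[/table\]",
    re.IGNORECASE | re.DOTALL,
)

# A [td ...] cell whose content contains no further [td] or [/td] tag.
TD_RE = re.compile(
    r"\[td\b[^\]]*\]((?:(?!\[/?td\b).)*)\[/td\]",
    re.IGNORECASE | re.DOTALL,
)


def cells_inside_tables(text):
    """Return the td contents found inside non-nested table blocks."""
    cells = []
    for table in TABLE_RE.finditer(text):
        # Only the body of a matched table is searched for cells,
        # so a [td] outside any table is never reported.
        cells.extend(m.group(1) for m in TD_RE.finditer(table.group(1)))
    return cells


sample = (
    "[table]\n[tr]\n[td]test str 1[/td]\n"
    "[td]test str 2[/td]\n[/tr]\n[/table]"
)
print(cells_inside_tables(sample))
print(cells_inside_tables("[td]stray[/td] [table][td]ok[/td][/table]"))
```

The difference from the attempt above is that ((?!\[\/table\]).*?) checks the lookahead only once, at the position where the group starts, whereas (?:(?!\[/?table\b).)* repeats the check for each character it consumes.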

  • Don't regex html. Just write a parser, or use a library. Commented Sep 1, 2012 at 1:00
  • You have a better way to parse this non HTML stuff? @zellio Commented Sep 1, 2012 at 1:03
  • If I were going to parse a non-regular language, I would use a parser. And this is just HTML; changing the <> to [] doesn't change that. It takes user input and is converted to HTML. Commented Sep 1, 2012 at 1:04
  • I agree that regex is not the right tool for the job, but for some things there are no parsers ^^ (I don't know whether that is the case here, though, because I have no idea where that stuff is coming from) Commented Sep 1, 2012 at 1:05
  • Well, I can't use a parser because I need to do this in a PHP environment that only accepts regex. I'm writing a plugin for a forum software, which will only accept regex written in a PHP environment. It's a real pain, but I know it should be possible and I think my regex is really close to a solution; I just can't find the missing link. Commented Sep 1, 2012 at 1:06

1 Answer

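HTML, with its arbitrarily nested opening and closing tags, is a context-free language, whereas regular expressions in the formal sense describe only regular languages. If you look at the Chomsky hierarchy of formal languages, you'll see that regular languages sit strictly below context-free ones, so what you're trying to do isn't possible in any reliable way.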
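
To make the distinction concrete, here is a minimal sketch of the kind of stack-based scan that nested tags call for. It is an illustration only, written in Python rather than the PHP the question targets; the names TAG_RE and td_texts are hypothetical, it only knows the three tags from the question, and it simply ignores closing tags that do not match the innermost open tag.

```python
import re

# Tokenizer for the three tags used in the question (an assumption).
TAG_RE = re.compile(r"\[(/?)(table|tr|td)\b[^\]]*\]", re.IGNORECASE)


def td_texts(text):
    """Collect td contents that are enclosed by an open table."""
    stack = []  # (tag name, index just past the opening tag)
    cells = []
    for m in TAG_RE.finditer(text):
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if not closing:
            stack.append((name, m.end()))
        elif stack and stack[-1][0] == name:
            _, start = stack.pop()
            # Keep the cell only if some table is still open around it.
            if name == "td" and any(n == "table" for n, _ in stack):
                cells.append(text[start:m.start()])
        # A closing tag that does not match the innermost open tag
        # is skipped in this sketch.
    return cells


nested = (
    "[table][tr][td]outer [table][tr][td]inner[/td][/tr][/table]"
    "[/td][/tr][/table]"
)
print(td_texts(nested))
print(td_texts("[td]stray[/td][table][td]ok[/td][/table]"))
```

The stack is what tracks nesting to an arbitrary depth; the regex here serves only as a tokenizer.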