
I am trying to use a RegEx to search through a long string, and I am having trouble coming up with an expression. I am trying to search through some HTML for a set of tags beginning with a tag containing a certain value and ending with a different tag containing another value. The code I am currently using to attempt this is as follows:

matcher = new RegExp(".*(<[^>]+" + startText + "((?!" + endText + ").)*" + endText + ")", 'g');

data.replace(matcher, "$1");

The strangeness around the middle ( ((?!endText).)* ) is borrowed from another thread, found here, that seems to describe my problem. The issue I am facing is that the expression matches the beginning tag, but it does not find the ending tag and instead includes the remainder of the data. Also, the lookaround in the middle slowed the expression down a lot. Any suggestions as to how I can get this working?

EDIT: I understand that parsing HTML in RegEx isn't the best option (makes me feel dirty), but I'm in a time-crunch and any other alternative I can think of will take too long. It's hard to say exactly what the markup I will be parsing will look like, as I am creating it on the fly. The best I can do is to say that I am looking at a large table of data that is collected for a range of items on a range of dates. Both of these ranges can vary, and I am trying to select a certain range of dates from a single row. The approximate form of startText and endText is @@ASSET_ID@@_<YYYY_MM_DD>. The idea is to find the code that corresponds to this range of cells. (This edit could quite possibly have made this even more confusing, but I'm not sure how much more information I could really give without explaining the entire application).

EDIT: Well, this was a stupid question. Apparently, I just forgot to add .* after the last paren. Can't believe I spent so long on this! Thanks to those of you that tried to help!

  • 2
    Aside from the flood of comments that are on their way about not parsing HTML with Regex (which you shouldn't do - it's not a Regular language), we are at the very least going to need to see sample data - what are you replacing, what is your start and end text, expected output, actual output, etc etc. Commented Aug 12, 2013 at 21:54
  • 3
    Don't listen to the trolls. Every tool has its time and place. I'll take a look at your question and try to help you out, give me a minute. Commented Aug 12, 2013 at 22:07
  • 1
    @Suamere: what 'trolls'? The reason that posts asking about parsing HTML with regex get lots of (valid) comments about not parsing HTML with regex is because it's the wrong tool for the job, for precisely the reason given by Frankie. And, Crash: please post your solution as an answer to your question. That way it might be of benefit to other users in future (given the specificity of the regular expression this is, perhaps, unlikely, but it's never a bad thing to answer a question). Commented Aug 12, 2013 at 22:18
  • 1
    Every tool has a place (click that). Don't fall in with the trolls who blindly dismiss parsing HTML with Regex. Commented Aug 12, 2013 at 22:20
  • 1
    Not sure why your last edit says you added .* after the last parenthesis. Your situation was obviously difficult to describe, but I still suggest taking a look at my answer to help clean up your data. I'd be curious to know more about what you were dealing with and why .* solved your issue. It really shouldn't ever be used and is very slow. But if it works, use it. Commented Aug 12, 2013 at 22:31

1 Answer


First of all, why is there a .* (dot asterisk) at the beginning? If you have text like the following:

This is my Text

And you want "my Text" pulled out, you do my\sText. You don't have to do the .*.
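
A minimal sketch of that point, using the sample sentence above. The variable names are only for illustration; the comparison shows that the leading .* adds a capture group without changing which text is found.

```javascript
// The sample sentence from above.
const text = "This is my Text";

// Match only the part we want; no leading .* is needed.
const direct = text.match(/my\sText/);

// The ".*(...)" style finds the same text, but the overall match
// swallows everything before it, so a capture group is required.
const wrapped = text.match(/.*(my\sText)/);

console.log(direct[0]);  // "my Text"
console.log(wrapped[0]); // "This is my Text"
console.log(wrapped[1]); // "my Text"
```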

That being said, since all you'll be matching now is what you need, you don't need the main Capture Group around "Everything". This: .*(xxx) is a huge no-no, and can almost always be replaced with this: xxx. In other words, your regex could be replaced with:

<[^>]+xxx((?!zzz).)*zzz

From there I examine what it's doing.

  1. You are looking for an HTML opening delimiter <. You consume it.
  2. You consume at least one character that is NOT a closing HTML delimiter, but can consume many. This is important, because if your tag is <table border=2>, then you have, at minimum, so far consumed <t, if not more.
  3. You are now looking for the StartText. If that StartText is table, you'll never find it, because you have already consumed the t. So replace that + with a *.
  4. The regex still succeeds as long as what follows is NOT the closing text, but because the asterisk is greedy, it starts from the VERY END of the document. I suggest making it lazy by adding a ?.
  5. When the backtracking fails, it will look for the closing text and gather it successfully.

The result of that logic:

<[^>]*xxx((?!zzz).)*?zzz
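
To make step 3 concrete, here is a small sketch using the <table border=2> tag from the list, with table standing in for the start text. The variable names are made up for the example.

```javascript
// The tag from step 2, with "table" as the start text.
const tag = "<table border=2>";

// With +, at least one character after "<" must be consumed before
// "table", so the "t" is gone and the start text can no longer match.
const withPlus = /<[^>]+table/;

// With *, zero characters may be consumed, so "table" right after "<" matches.
const withStar = /<[^>]*table/;

console.log(withPlus.test(tag)); // false
console.log(withStar.test(tag)); // true
```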

If you're going to use a dot anyway, which is okay for new Regex writers but not recommended for seasoned ones, I'd go with this:

<[^>]*xxx.*?zzz

So for JavaScript, your code would say:

matcher = new RegExp("<[^>]*" + startText + ".*?" + endText, 'gi');

I put the IgnoreCase "i" in there for good measure, but you may or may not want that.
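
Here is a rough, runnable sketch of that line in use. The markup and the two marker values are made up, loosely following the @@ASSET_ID@@_<YYYY_MM_DD> shape from the question, and escapeRegExp is a hypothetical helper I added so that any special characters in the markers are treated literally. Note that in JavaScript the dot does not match line breaks, so this assumes the range sits on one line.

```javascript
// Hypothetical helper (not from the question): escape regex metacharacters
// so startText and endText are matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Made-up markers and markup, for illustration only.
const startText = "@@A1@@_2013_08_01";
const endText = "@@A1@@_2013_08_03";
const data =
  '<tr><td id="@@A1@@_2013_07_31">0</td>' +
  '<td id="@@A1@@_2013_08_01">1</td>' +
  '<td id="@@A1@@_2013_08_02">2</td>' +
  '<td id="@@A1@@_2013_08_03">3</td>' +
  '<td id="@@A1@@_2013_08_04">4</td></tr>';

// Same shape as the pattern above: tag containing startText,
// then lazily up to the first endText.
const matcher = new RegExp(
  "<[^>]*" + escapeRegExp(startText) + ".*?" + escapeRegExp(endText),
  "gi"
);

const found = data.match(matcher);
console.log(found.length); // 1
console.log(found[0]);
// <td id="@@A1@@_2013_08_01">1</td><td id="@@A1@@_2013_08_02">2</td><td id="@@A1@@_2013_08_03
```

The match stops at the end of endText, not at the end of the tag that contains it, so the rest of that closing cell would still need to be handled separately.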


2 Comments

You may want to give a brief explanation of the difference between replace, match, and search statements in here - as I think that's what OP missed.
Right, after posting this I noticed he was possibly doing the replace incorrectly. I almost edited my answer for that, except that he didn't state the purpose of the replace or mention it at all in his question. So if he used my answer, he could just say OriginalSource = Regex Result (pseudocode). But I think he's left by now since he found his answer with .*
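
For anyone landing here later, a short sketch of the difference between search, match, and replace that these comments refer to. The sample string and the id values are made up. The last two lines show one possible reason the trailing .* mattered in the question: replace only substitutes the matched span and leaves everything after it in place.

```javascript
// Made-up sample: some text, two cells, some trailing text.
const data = 'aaa <td id="s">1</td><td id="e">2</td> bbb';
const pattern = /<[^>]*id="s".*?id="e"/;

// search: index of the first match (or -1).
const index = data.search(pattern);

// match: the matched text itself.
const matched = data.match(pattern)[0];

// replace: returns a NEW string with only the matched span substituted.
const removed = data.replace(pattern, "");

// The ".*(...)" + "$1" style from the question: the prefix is dropped,
// but everything after the match is left in place...
const keepsTail = data.replace(/.*(<[^>]*id="s".*?id="e")/, "$1");

// ...unless a trailing .* makes the match cover the rest of the line too.
const onlyGroup = data.replace(/.*(<[^>]*id="s".*?id="e").*/, "$1");

console.log(index);     // 4
console.log(matched);   // <td id="s">1</td><td id="e"
console.log(removed);   // aaa >2</td> bbb
console.log(keepsTail); // <td id="s">1</td><td id="e">2</td> bbb
console.log(onlyGroup); // <td id="s">1</td><td id="e"
```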
