
I have a collection of strings, and all I want from the regex is to collect every one that starts with http.

href="http://www.test.com/cat/1-one_piece_episodes/"href="http://www.test.com/cat/2-movies_english_subbed/"href="http://www.test.com/cat/3-english_dubbed/"href="http://www.exclude.com"

This is my regular expression pattern:

href="(.*?)[^#]"

and it returns this:

href="http://www.test.com/cat/1-one_piece_episodes/"
href="http://www.test.com/cat/2-movies_english_subbed/"
href="http://www.xxxx.com/cat/3-english_dubbed/"
href="http://www.exclude.com"

What is the pattern for excluding the last match, or more generally for excluding any match that contains the exclude domain, like href="http://www.exclude.com"?

EDIT: for multiple exclusions:

href="((?:(?!"|\bexclude\b|\bxxxx\b).)*)[^#]"
  • Would you want the URL http://www.test.com/fish/exclude included? What about http://www.exclude.co.uk or http://www.exclude.test.com? Commented Aug 5, 2011 at 12:24

3 Answers


@ridgerunner and I would change the regex to:

href="((?:(?!\bexclude\b)[^"])*)[^#]"

It matches all href attributes as long as they don't end in # and don't contain the word exclude.

Explanation:

href="     # Match href="
(          # Capture...
 (?:       # the following group:
  (?!      # Look ahead to check that the next part of the string isn't...
   \b      # the entire word
   exclude # exclude
   \b      # (\b are word boundary anchors)
  )        # End of lookahead
  [^"]     # If successful, match any character except for a quote
 )*        # Repeat as often as possible
)          # End of capturing group 1
[^#]"      # Match a non-# character and the closing quote.
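The commented layout above can be kept in real code. A minimal sketch, assuming .NET and its RegexOptions.IgnorePatternWhitespace option; the two-link input is a shortened version of the question's sample, and the # inside the character class is escaped here as a precaution so that it cannot be read as the start of a comment:

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as above, written across several lines with inline comments.
// In a verbatim string (@"..."), a doubled quote "" stands for one quote character.
string pattern = @"
    href=""               # match href followed by the opening quote
    (                     # capture group 1
      (?:
        (?!\bexclude\b)   # fail if the word exclude starts at this position
        [^""]             # otherwise consume one character that is not a quote
      )*
    )
    [^\#]""               # one character that is not a hash, then the closing quote
";

var regex = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);

// Shortened sample input: one link to keep, one link to exclude.
string input = "href=\"http://www.test.com/cat/3-english_dubbed/\"" +
               "href=\"http://www.exclude.com\"";

foreach (Match m in regex.Matches(input))
{
    Console.WriteLine(m.Value);            // the whole href="..." attribute
    Console.WriteLine(m.Groups[1].Value);  // the captured URL part
}
```

One thing to be aware of: because the final character is consumed by [^#] outside the capturing group, group 1 holds the URL without its last character (here, without the trailing slash). Use m.Value if the whole attribute is needed.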

To allow multiple "forbidden words":

href="((?:(?!\b(?:exclude|this|too)\b)[^"])*)[^#]"
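As a rough illustration of the multi-word variant, the sketch below builds the alternation from a list of words and runs it over sample input. BuildHrefRegex is a hypothetical helper name, and the input mixes the URLs from the question with the xxxx URL from its expected output; neither is part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: builds the pattern shown above from a list of forbidden words.
// Regex.Escape keeps characters such as '.' in a word from being read as regex syntax.
static Regex BuildHrefRegex(IEnumerable<string> forbiddenWords)
{
    string alternation = string.Join("|", forbiddenWords.Select(w => Regex.Escape(w)));
    return new Regex("href=\"((?:(?!\\b(?:" + alternation + ")\\b)[^\"])*)[^#]\"");
}

// Sample input (assumed for illustration): two links to keep, two to drop.
string input =
    "href=\"http://www.test.com/cat/1-one_piece_episodes/\"" +
    "href=\"http://www.test.com/cat/2-movies_english_subbed/\"" +
    "href=\"http://www.xxxx.com/cat/3-english_dubbed/\"" +
    "href=\"http://www.exclude.com\"";

Regex regex = BuildHrefRegex(new[] { "exclude", "xxxx" });

// Collect the full href="..." text of every surviving match.
List<string> kept = regex.Matches(input).Cast<Match>().Select(m => m.Value).ToList();

foreach (string href in kept)
{
    Console.WriteLine(href);
}
```

Because of the \b anchors, only whole words are rejected: a URL containing excluded or myexclude would still be matched.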

5 Comments

parsing "href="((?:(?!"|\bexclude\b).)*[^#]"" - Not enough )'s. It's OK now, I just read the explanation: href="((?:(?!"|\bexclude\b).)*)[^#]"
Additional question, sir: what if I also want to exclude the string xxxx?
@vrynxzent: Sorry, I had dropped a closing parenthesis. But you have found the correct solution, obviously :)
+1 for this awesome explanation! I know regex editors that do this, but somehow their output always left me perplexed. Yours is so much more succinct!
@ridgerunner: Thanks! I had planned to do this but forgot it completely when writing the regex...

Your input doesn't look like a valid string literal (unless you escape the quotes in it), but you can also do this without a regex:

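using System;
using System.Collections.Generic;

string input = "href=\"http://www.test.com/cat/1-one_piece_episodes/\"href=\"http://www.test.com/cat/2-movies_english_subbed/\"href=\"http://www.test.com/cat/3-english_dubbed/\"href=\"http://www.exclude.com\"";

List<string> matches = new List<string>();

// Split on the literal "href" and keep every piece that does not mention the excluded domain.
foreach (var match in input.Split(new string[] { "href" }, StringSplitOptions.RemoveEmptyEntries)) {
   if (!match.Contains("exclude.com"))
      matches.Add("href" + match);  // put back the "href" that Split removed
}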



Will this do the job?

href="(?!http://[^/"]+exclude\.com)(.*?)[^#]"
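To see how this host-based lookahead differs from forbidding the word exclude anywhere in the URL, here is a small comparison sketch. It assumes .NET, writes the dot in exclude.com escaped, and uses sample URLs taken from the comment under the question plus one bare-domain URL added for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Word-based exclusion: rejects any href whose value contains the word "exclude".
var wordBased = new Regex("href=\"((?:(?!\\bexclude\\b)[^\"])*)[^#]\"");

// Host-based exclusion: rejects only hrefs whose host part ends in exclude.com
// (dot escaped here, which is an assumption about the intended pattern).
var hostBased = new Regex("href=\"(?!http://[^/\"]+exclude\\.com)(.*?)[^#]\"");

string[] samples =
{
    "href=\"http://www.exclude.com\"",
    "href=\"http://www.test.com/fish/exclude\"",
    "href=\"http://www.exclude.co.uk\"",
    "href=\"http://exclude.com\"",   // bare domain, added for illustration
};

foreach (string s in samples)
{
    // True means the pattern matches, so the link would be collected.
    Console.WriteLine($"{s}  word-based: {wordBased.IsMatch(s)}  host-based: {hostBased.IsMatch(s)}");
}
```

The word-based pattern rejects all four samples. The host-based one rejects only http://www.exclude.com: it lets the path /fish/exclude and the exclude.co.uk host through, and it also lets the bare http://exclude.com through, because [^/"]+ requires at least one character between http:// and exclude.com. Which behaviour is right depends on the answer to the comment under the question.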
