2

I have some website source stream I am trying to parse. My current Regex is this:

Regex pattern = new Regex (
@"<a\b             # Begin start tag
    [^>]+?             # Lazily consume up to id attribute
    id\s*=\s*['""]?thread_title_([^>\s'""]+)['""]?  # $1: id
    [^>]+?             # Lazily consume up to href attribute
    href\s*=\s*['""]?([^>\s'""]+)['""]?             # $2: href
    [^>]*              # Consume up to end of open tag
    >                  # End start tag
    (.*?)                                           # $3: name
    </a\s*>            # Closing tag",
RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace );

But it doesn't match the links anymore. I included a sample string here.

Basically I am trying to match these:

<a href="http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/" id="thread_title_3046631">How to Get a Travel Visa</a>

"http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/" is the **Link**
"3046631" is the **TopicId**
"How to Get a Travel Visa" is the **Title**

The sample I posted contains at least 3 such links; I didn't count the rest.

Also I use RegexHero (online and free) to see my matching interactively before adding it to code.

2
  • @Joan Venge For reference: stackoverflow.com/questions/1732348/… Commented Sep 25, 2011 at 4:40
  • Thanks pst, haven't seen that one. Commented Sep 25, 2011 at 4:42

3 Answers

4

For completeness, here is how it's done with the Html Agility Pack, which is a robust HTML parser for .NET (also available through NuGet, so installing it takes about 20 seconds).

Loading the document, parsing it, and finding the 3 links is as simple as:

// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
string linkIdPrefix = "thread_title_";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://jsbin.com/upixof");
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.Descendants("a")
                              .Where(link => link.Id.StartsWith(linkIdPrefix));

That's it, really. Now you can easily get the data:

foreach (var link in threadLinks)
{
    string href = link.GetAttributeValue("href", null);
    string id = link.Id.Substring(linkIdPrefix.Length); // remove "thread_title_"
    string text = link.InnerHtml; // or link.InnerText
    Console.WriteLine("{0} - {1}", id, href);
}

3

This is quite simple: the markup changed, and the href attribute now appears before the id:

<a\b             # Begin start tag
    [^>]+?             # Lazily consume up to href attribute
    href\s*=\s*['""]?([^>\s'""]+)['""]?             # $1: href
    [^>]+?             # Lazily consume up to id attribute
    id\s*=\s*['""]?thread_title_([^>\s'""]+)['""]?  # $2: id
    [^>]*              # Consume up to end of open tag
    >                  # End start tag
    (.*?)                                           # $3: name
    </a\s*>            # Closing tag

Note that:

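  • This fragility is the main reason parsing HTML with a regex is a bad idea: a change as small as swapping two attributes breaks the pattern.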
  • The group numbers have changed. You can use named groups instead, while you're at it: (?<ID>[^>\s'""]+) instead of ([^>\s'""]+).
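  • The quotes are still doubled ("") because the pattern comes from a C# verbatim string; if you paste it elsewhere as is, the repeated quote is harmless inside a character set.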
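
To illustrate the named-groups suggestion, here is a minimal sketch of the reordered pattern with every capture named. The class name ThreadLinkParser, the method Extract, and the group names Link, TopicId and Title are chosen for illustration and are not from the original code. It assumes href still comes before id, so it is just as fragile as the numbered version if the markup changes again.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ThreadLinkParser
{
    // Same pattern as above (href before id), but with named groups
    // so the calling code no longer depends on group order.
    private static readonly Regex Pattern = new Regex(
        @"<a\b                    # Begin start tag
            [^>]+?                # Lazily consume up to href attribute
            href\s*=\s*['""]?(?<Link>[^>\s'""]+)['""]?                  # Link
            [^>]+?                # Lazily consume up to id attribute
            id\s*=\s*['""]?thread_title_(?<TopicId>[^>\s'""]+)['""]?    # TopicId
            [^>]*                 # Consume up to end of open tag
            >                     # End start tag
            (?<Title>.*?)         # Title
            </a\s*>               # Closing tag",
        RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    // Returns one (TopicId, Title, Link) tuple per matching anchor.
    public static List<Tuple<string, string, string>> Extract(string html)
    {
        var results = new List<Tuple<string, string, string>>();
        foreach (Match m in Pattern.Matches(html))
        {
            results.Add(Tuple.Create(
                m.Groups["TopicId"].Value,
                m.Groups["Title"].Value,
                m.Groups["Link"].Value));
        }
        return results;
    }

    public static void Main()
    {
        // Sample anchor taken from the question.
        string html = "<a href=\"http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/\" id=\"thread_title_3046631\">How to Get a Travel Visa</a>";
        foreach (var link in Extract(html))
        {
            Console.WriteLine("{0} - {1} - {2}", link.Item1, link.Item2, link.Item3);
        }
    }
}
```

Because the groups are looked up by name, swapping the href and id parts of the pattern later would not require touching the code that reads the matches.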

Example on regex hero.

9 Comments

Thanks, in your example link, is it modified? When I open it, it says 0 matches.
@JoanVenge - That is strange... I'll let it be, it already failed me thrice, but I think the idea is clear anyway :) Thanks!
Regex Hero truncates the target string when using the permalink feature if it's longer than 4,000 characters. It occurs to me that I should probably raise the limit. @Joan - If you copy and paste your original html, then Kobi's regular expression should work.
I raised the limit to 500,000 characters. So this should work... regexhero.net/tester/?id=2509fab5-243f-4fa3-aeb2-61658ae38f7b
@Joan and Kobi - You're welcome. And you're absolutely right in using HTML Agility Pack in the scenario. It's what I would do as well. By the way, I'm working on a new tool called XML Hero which will help with things like this.
1

Don't do that (well, almost, but it's not for everyone). Parsers are meant for that type of thing.

1 Comment

Thanks but I need a quickfix, not a major change. Besides it's a personal tool no one uses anyway. Also I see many instances of similar practice in production code, so I think even most programmers don't follow these good practices.
