Simple regex help using C# (Regex pattern included)

Question

I have a web page's source stream that I am trying to parse. My current Regex is this:

Regex pattern = new Regex (
@"<a\b             # Begin start tag
    [^>]+?             # Lazily consume up to id attribute
    id\s*=\s*['""]?thread_title_([^>\s'""]+)['""]?  # $1: id
    [^>]+?             # Lazily consume up to href attribute
    href\s*=\s*['""]?([^>\s'""]+)['""]?             # $2: href
    [^>]*              # Consume up to end of open tag
    >                  # End start tag
    (.*?)                                           # $3: name
    </a\s*>            # Closing tag",
RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace );

But it doesn't match the links anymore. I included a sample string here.

Basically I am trying to match these:

<a href="http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/" id="thread_title_3046631">How to Get a Travel Visa</a>

"http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/" is the **Link**
"3046631" is the **TopicId**
"How to Get a Travel Visa" is the **Title**

In the sample I posted there are at least 3; I didn't count the others.

Also, I use Regex Hero (online and free) to test my patterns interactively before adding them to code.

@Joan Venge For reference: stackoverflow.com/questions/1732348/…
– user166390, Commented Sep 25, 2011 at 4:40

Kobi · Accepted Answer · 2011-09-25 04:43:49Z

For completeness, here's how it's done with the Html Agility Pack, which is a robust HTML parser for .NET (also available through NuGet, so installing it takes about 20 seconds).

Loading the document, parsing it, and finding the 3 links is as simple as:

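// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
string linkIdPrefix = "thread_title_";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://jsbin.com/upixof"); // download and parse the page
// Keep only the <a> elements whose id starts with the prefix
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.Descendants("a")
                              .Where(link => link.Id.StartsWith(linkIdPrefix));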

That's it, really. Now you can easily get the data:

foreach (var link in threadLinks)
{
    string href = link.GetAttributeValue("href", null);
    string id = link.Id.Substring(linkIdPrefix.Length); // remove "thread_title_"
    string text = link.InnerHtml; // or link.InnerText
    Console.WriteLine("{0} - {1}", id, href);
}
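If the HTML is already in memory (the question mentions a source stream rather than a URL), the same approach should work by parsing a string with `HtmlDocument.LoadHtml` instead of `HtmlWeb.Load`. The following is a minimal sketch under that assumption; the class name `ThreadLinkParser` and the method `Parse` are made up for illustration and are not part of the Html Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper, named here only for illustration.
public static class ThreadLinkParser
{
    private const string LinkIdPrefix = "thread_title_";

    // Returns one (Id, Href, Title) entry per <a> whose id starts with the prefix.
    public static List<(string Id, string Href, string Title)> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse an in-memory string instead of fetching a URL

        return doc.DocumentNode.Descendants("a")
            .Where(link => link.Id.StartsWith(LinkIdPrefix, StringComparison.Ordinal))
            .Select(link => (
                Id: link.Id.Substring(LinkIdPrefix.Length), // remove "thread_title_"
                Href: link.GetAttributeValue("href", null),
                Title: link.InnerText))
            .ToList();
    }
}
```

Because the parser looks attributes up by name, this version does not care whether `href` comes before or after `id` in the markup.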

Kobi · Accepted Answer · 2011-09-25 18:19:12Z

This is quite simple: the markup changed, and the href attribute now appears before the id. The updated pattern is:

<a\b             # Begin start tag
    [^>]+?             # Lazily consume up to href attribute
    href\s*=\s*['""]?([^>\s'""]+)['""]?             # $1: href
    [^>]+?             # Lazily consume up to id attribute
    id\s*=\s*['""]?thread_title_([^>\s'""]+)['""]?  # $2: id
    [^>]*              # Consume up to end of open tag
    >                  # End start tag
    (.*?)                                           # $3: name
    </a\s*>            # Closing tag

Note that:

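This fragility is the main reason matching HTML with a regex is a bad idea: a change in attribute order was enough to break the pattern.
The group numbers have changed. You can use named groups instead, while you're at it: (?<ID>[^>\s'""]+) instead of ([^>\s'""]+).
The quotes are still doubled ("") as in the C# verbatim string; this should be harmless inside character sets if you paste the pattern into a regex tester.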
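The named-group suggestion can be sketched in C# roughly as follows. This is an illustration, not the exact code from the answer: the group name `ID` comes from the note above, while `Link`, `Title`, the class `ThreadLinkRegex`, and the method `Extract` are names assumed here. The pattern still expects href to appear before id.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named here only for illustration.
public static class ThreadLinkRegex
{
    // Same pattern as above (href before id), with named groups so the
    // calling code no longer depends on group numbers.
    private static readonly Regex Pattern = new Regex(
        @"<a\b                                                   # Begin start tag
            [^>]+?                                               # Lazily consume up to href attribute
            href\s*=\s*['""]?(?<Link>[^>\s'""]+)['""]?           # Link
            [^>]+?                                               # Lazily consume up to id attribute
            id\s*=\s*['""]?thread_title_(?<ID>[^>\s'""]+)['""]?  # ID
            [^>]*                                                # Consume up to end of open tag
            >                                                    # End start tag
            (?<Title>.*?)                                        # Title
            </a\s*>                                              # Closing tag",
        RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    // Returns one (Id, Link, Title) entry per matching anchor.
    public static List<(string Id, string Link, string Title)> Extract(string html)
    {
        var results = new List<(string Id, string Link, string Title)>();
        foreach (Match m in Pattern.Matches(html))
        {
            results.Add((m.Groups["ID"].Value, m.Groups["Link"].Value, m.Groups["Title"].Value));
        }
        return results;
    }
}
```

Reading groups by name means a later reordering of the groups inside the pattern would not silently swap the link and the id in the calling code, although the pattern itself would still need updating if the markup changes again.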

Example on Regex Hero.

edited Sep 25, 2011 at 18:19

answered Sep 25, 2011 at 4:12

Kobi

9 Comments

Joan Venge Over a year ago

Thanks, in your example link, is it modified? When I open it, it says 0 matches.

Kobi Over a year ago

@JoanVenge - That is strange... I'll let it be, it already failed me thrice, but I think the idea is clear anyway :) Thanks!

Steve Wortham Over a year ago

Regex Hero truncates the target string when using the permalink feature if it's longer than 4,000 characters. It occurs to me that I should probably raise the limit. @Joan - If you copy and paste your original html, then Kobi's regular expression should work.

Steve Wortham Over a year ago

I raised the limit to 500,000 characters. So this should work... regexhero.net/tester/?id=2509fab5-243f-4fa3-aeb2-61658ae38f7b

Steve Wortham Over a year ago

@Joan and Kobi - You're welcome. And you're absolutely right in using HTML Agility Pack in this scenario. It's what I would do as well. By the way, I'm working on a new tool called XML Hero which will help with things like this.

Community · Accepted Answer · 2017-05-23 11:47:52Z

Don't do that (well, almost, but it's not for everyone). Parsers are meant for that type of thing.

edited May 23, 2017 at 11:47

CommunityBot

answered Sep 25, 2011 at 4:00

Icarus

1 Comment

Joan Venge Over a year ago

Thanks, but I need a quick fix, not a major change. Besides, it's a personal tool no one else uses anyway. Also, I see many instances of similar practice in production code, so I think even most programmers don't follow these good practices.
