0

I'm trying to pull out a string between two other strings. To make it more complicated, the preceding contents will often differ.

[Screenshot of the HTML source]

The string I'm trying to retrieve is Christchurch.

The regex I have so far is (?<=300px">).*(?=</td). It pulls out the string I'm looking for, but it also returns dozens of other strings throughout the LARGE text file I'm searching.

What I'd like to do is limit the prefix so that the search starts at Office: and runs all the way to 300px">, but the contents between those two strings will sometimes differ depending on user preferences.

To put it in crude non regex terms I want to do the following: Starting at Office: all the way to 300px> find the string that starts here and ends with </td. Thus resulting in Christchurch.


4 Answers

3

Have you considered using HtmlAgilityPack instead? It's a NuGet package for handling HTML that copes with malformed HTML pretty well. Most people on Stack Overflow would recommend against using regex for HTML; see: RegEx match open tags except XHTML self-contained tags

Here's how you'd do it for your example:

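using System;
using HtmlAgilityPack; // This is a NuGet package!

var html = @"<tr >
               <td align=""right"" valign=""top""><strong>Office:</strong>&nbsp; </td>
               <td align=""left"" class=""stippel"" style=""white-space: wrap;max-width:300px"">Christchurch </td>
             </tr>";

// Parse the fragment into a document tree.
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

// XPath queries run against DocumentNode, not against HtmlDocument itself.
var node = htmlDoc.DocumentNode.SelectSingleNode("//td[@class='stippel']");
// SelectSingleNode returns null when nothing matches, hence the null-conditional.
Console.WriteLine(node?.InnerText.Trim());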

I haven't tested this code but it should do what you need.


1 Comment

The advantage to this is that you can probably look for the tag with your class in it and just pull out its value.
0

I guess you need something like this:

office.*\n.*|(?<=300px">).*(?=<\/td)

5 Comments

Similar to the answer offered by adamdc78 this is not exactly what I want. I only want to retrieve the string Christchurch.
What does "Starting at Office: all the way to 300px> find the string that starts here and ends with </td" mean?
Can you not see the source code in my initial post?
I can't see how this doesn't work for you! Doesn't it retrieve only Christchurch? I re-checked your initial post and I don't see where you are stuck now.
I've included a link to a screenshot of the html code I'm working with. I could not attach a picture because my rep is too low.
0

The issue you're encountering is that * is greedy. Use the lazy/reluctant version *?.

Office:[\s\S]*?300px">(.*?)</td

This solution uses a group match rather than look-arounds.
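To make the distinction between the whole match and the captured group concrete, here is a minimal sketch in Python's re module, used purely for illustration since the pattern itself is unchanged. The names SAMPLE, OFFICE_CELL and extract_office are hypothetical, and the first table row in SAMPLE is an added assumption, included only to show that an unrelated 300px cell before Office: is skipped.

```python
import re

# Markup based on the question; the first row is an assumed extra cell
# that also ends in 300px"> but does not belong to the Office label.
SAMPLE = '''<tr>
  <td align="left" style="max-width:300px">Some other value </td>
</tr>
<tr>
  <td align="right" valign="top"><strong>Office:</strong>&nbsp; </td>
  <td align="left" class="stippel" style="white-space: wrap;max-width:300px">Christchurch </td>
</tr>'''

# Lazy [\s\S]*? walks from "Office:" to the nearest following 300px">,
# and the lazy group (.*?) captures the cell text up to </td.
OFFICE_CELL = re.compile(r'Office:[\s\S]*?300px">(.*?)</td')


def extract_office(text):
    """Return the cell text that follows the Office: label, or None."""
    match = OFFICE_CELL.search(text)
    if match is None:
        return None
    # group(0) is the whole span from "Office:" to "</td";
    # group(1) is only the captured cell contents.
    return match.group(1).strip()


print(extract_office(SAMPLE))  # prints: Christchurch
```

The full match still spans everything from Office: to </td, which is why a tool that shows only the whole match appears to return too much. Reading group 1 gives just the value.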

4 Comments

That will not help. As mentioned in the post, the regex '(?<=300px">).*(?=</td)' works and does return Christchurch, but in my 40 KB text file it also returns hundreds of other matches. I want to start searching from Office [any and all characters all the way through] 300px and THEN retrieve the value.
Still not exactly what I'm looking for. The revised regex returns everything from Office to <td, inclusive.
I've included a link to a screenshot of the html code I'm working with. I could not attach a picture because my rep is too low.
It will match everything, but the first group will be the one you want.
0

Thanks to the posts from adamdc78 and greg, I have been able to come up with the regex below. This is exactly what I needed.

Thanks for your help.

(?<=office.*\n.*300px">).*(?=<\/td)

1 Comment

Welcome to Stack Overflow: you should accept their answers (since they helped) rather than adding a thank-you answer.
