Get value between unknown string

Question

I'm trying to pull out a string that sits between two other strings. To make it more complicated, the preceding contents will often differ.

The string I'm trying to retrieve is Christchurch.

The regex I have so far is (?<=300px">).*(?=</td). It pulls out the string I'm looking for, but it also returns dozens of other strings throughout the LARGE text file I'm searching.

What I'd like to do is limit the prefix so the search starts from Office: and runs all the way to 300px">, but the contents between those two strings will sometimes differ depending on user preferences.

To put it in crude non-regex terms, I want to do the following: starting at Office: and going all the way to 300px>, find the string that starts there and ends with </td. That should result in Christchurch.

Greg the Incredulous · Accepted Answer · 2018-04-13 03:33:27Z

3

Have you considered using HtmlAgilityPack instead? It's a NuGet package for parsing HTML that copes pretty well with malformed markup. Most people on Stack Overflow would recommend against using regex for HTML - see here: RegEx match open tags except XHTML self-contained tags

Here's how you'd do it for your example:

using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

var html = @"<tr >
               <td align=""right"" valign=""top""><strong>Office:</strong>&nbsp; </td>
               <td align=""left"" class=""stippel"" style=""white-space: wrap;max-width:300px"">Christchurch </td>
             </tr>";

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

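// SelectSingleNode is defined on HtmlNode, so query through the document's root node
var node = htmlDoc.DocumentNode.SelectSingleNode("//td[@class='stippel']");
// InnerText returns the cell's text without markup; Trim drops the trailing space
Console.WriteLine(node.InnerText.Trim());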

I haven't tested this code but it should do what you need.

edited Apr 13, 2018 at 3:33

answered Mar 3, 2015 at 0:35

Greg the Incredulous


1 Comment

adamdc78 Over a year ago

The advantage to this is that you can probably look for the tag with your class in it and just pull out its value.

chouaib · Accepted Answer · 2015-03-03 00:50:59Z

0

I guess you need something like this:

office.*\n.*|(?<=300px">).*(?=<\/td)

answered Mar 3, 2015 at 0:50

chouaib


5 Comments

Mike Stephens Over a year ago

As with the answer offered by adamdc78, this is not exactly what I want. I only want to retrieve the string Christchurch.

chouaib Over a year ago

"Starting at Office: all the way to 300px> find the string that starts here and ends with </td" - what does that mean?

Mike Stephens Over a year ago

Can you not see the source code in my initial post?

chouaib Over a year ago

I can't see how this doesn't work for you! Doesn't it retrieve only Christchurch? I re-checked your initial post and I don't see where you're stuck now.

Mike Stephens Over a year ago

I've included a link to a screenshot of the html code I'm working with. I could not attach a picture because my rep is too low.

adamdc78 · Accepted Answer · 2015-03-03 01:13:50Z

0

The issue you're encountering is that * is greedy. Use the lazy/reluctant version *?.

Office:[\s\S]*?300px">(.*?)</td

This solution uses a group match rather than look-arounds.
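
To make the group-versus-whole-match distinction concrete, here is a minimal C# sketch. It assumes .NET's System.Text.RegularExpressions and a sample fragment modeled on the markup quoted in this thread; the real file may look different. The helper name ExtractOffice and the extra "Auckland" row are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the text of the first cell whose opening tag
// ends in 300px"> after the "Office:" label, or an empty string if none is found.
static string ExtractOffice(string html)
{
    // [\s\S]*? lazily crosses line breaks up to the first 300px"> after Office:
    // (.*?) lazily captures everything up to the next </td
    var match = Regex.Match(html, @"Office:[\s\S]*?300px"">(.*?)</td");

    // Groups[0] is the whole match (Office: ... </td);
    // Groups[1] is only the captured value.
    return match.Success ? match.Groups[1].Value.Trim() : string.Empty;
}

// Sample input, assumed for illustration only
var html = @"<tr>
  <td style=""max-width:300px"">Auckland </td>
</tr>
<tr>
  <td align=""right"" valign=""top""><strong>Office:</strong>&nbsp; </td>
  <td align=""left"" class=""stippel"" style=""white-space: wrap;max-width:300px"">Christchurch </td>
</tr>";

Console.WriteLine(ExtractOffice(html)); // prints "Christchurch"
```

Reading match.Value instead of match.Groups[1].Value would give the full span from Office: to </td, which is why the pattern can look as if it returns too much.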

edited Mar 3, 2015 at 1:13

answered Mar 3, 2015 at 0:31

adamdc78


4 Comments

Mike Stephens Over a year ago

That will not help. As mentioned in the post, the regex '(?<=300px">).*(?=</td)' works and does return Christchurch, but in my 40 KB text file it also returns hundreds of other matches. I want to start searching from Office [any and all characters all the way through] 300px and THEN retrieve the value.

Mike Stephens Over a year ago

Still not exactly what I'm looking for. The revised regex returns everything from Office to <td, inclusive.

Mike Stephens Over a year ago

I've included a link to a screenshot of the html code I'm working with. I could not attach a picture because my rep is too low.

adamdc78 Over a year ago

It will match everything, but the first group will be the one you want.

Mike Stephens · Accepted Answer · 2015-03-03 01:39:17Z

0

Thanks to the posts from adamdc78 and greg, I have been able to come up with the regex below. This is exactly what I needed.

Thanks for your help.

(?<=office.*\n.*300px">).*(?=<\/td)
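 
One way this pattern might be applied in C# is sketched below. It depends on two assumptions: the regex engine supports variable-length lookbehind (as .NET's does), and case-insensitive matching is enabled, since the lowercase office would not otherwise match Office:. The sample input is made up and places exactly one line break between the label and the value cell, which is what the single \n in the pattern requires.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input, assumed for illustration: one line break between the two cells
var html = "<td><strong>Office:</strong>&nbsp; </td>\n" +
           "<td style=\"max-width:300px\">Christchurch </td>";

// The lookbehind must see "office", the rest of that line, a newline,
// and then the next line up to 300px"> immediately before the value.
var pattern = @"(?<=office.*\n.*300px"">).*(?=</td)";

// IgnoreCase lets the lowercase "office" in the pattern match "Office:"
var match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);

Console.WriteLine(match.Value.Trim()); // prints "Christchurch"
```

If the label and the value can be separated by more than one line break, the capture-group pattern with [\s\S]*? is likely the more forgiving choice.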

answered Mar 3, 2015 at 1:39

Mike Stephens


1 Comment

chouaib Over a year ago

Welcome to Stack Overflow: you should accept their answers (since they helped) rather than adding a thank-you answer.
