How to extract values from HTML using RegEx?

Question

Given the following HTML:

<p><span class="xn-location">OAK RIDGE, N.J.</span>, <span class="xn-chron">March 16, 2011</span> /PRNewswire/ -- Lakeland Bancorp, Inc. (Nasdaq:   <a href='http://studio-5.financialcontent.com/prnews?Page=Quote&Ticker=LBAI' target='_blank' title='LBAI'> LBAI</a>), the holding company for Lakeland Bank, today announced that it redeemed <span class="xn-money">$20 million</span> of the Company's outstanding <span class="xn-money">$39 million</span> in Fixed Rate Cumulative Perpetual Preferred Stock, Series A that was issued to the U.S. Department of the Treasury under the Capital Purchase Program on <span class="xn-chron">February 6, 2009</span>, thereby reducing Treasury's investment in the Preferred Stock to <span class="xn-money">$19 million</span>. The Company paid approximately <span class="xn-money">$20.1 million</span> to the Treasury to repurchase the Preferred Stock, which included payment for accrued and unpaid dividends for the shares. &#160;This second repayment, or redemption, of Preferred Stock will result in annualized savings of <span class="xn-money">$1.2 million</span> due to the elimination of the associated preferred dividends and related discount accretion. &#160;A one-time, non-cash charge of <span class="xn-money">$745 thousand</span> will be incurred in the first quarter of 2011 due to the acceleration of the Preferred Stock discount accretion. &#160;The warrant previously issued to the Treasury to purchase 997,049 shares of common stock at an exercise price of <span class="xn-money">$8.88</span>, adjusted for stock dividends and subject to further anti-dilution adjustments, will remain outstanding.</p>

I'd like to get the values inside the <span> elements. I'd also like to get the value of the class attribute on the <span> elements.

Ideally I could just run some HTML through a function and get back a dictionary of extracted entities (based on the <span> parsing defined above).

The above code is a snippet from a larger source HTML file, which fails to parse with an XML parser. So I'm looking for a possible regular expression to help extract the information of interest.

What programming language are you using? There are libraries that will take HTML that isn't valid XML and still allow you to use XPath expressions etc. to query the information.
– a'r, Commented Mar 16, 2011 at 15:26

Varun Chatterji · Accepted Answer · 2011-03-16 16:16:42Z

10

Use this tool (free): http://www.radsoftware.com.au/regexdesigner/

Use this Regex:

"<span[^>]*>(.*?)</span>"

The values in Group 1 (for each match) will be the text that you need.

In C# it will look like:

            Regex regex = new Regex("<span[^>]*>(.*?)</span>");
            string toMatch = "<span class=\"ajjsjs\">Some text</span>";
            if (regex.IsMatch(toMatch))
            {
                MatchCollection collection = regex.Matches(toMatch);
                foreach (Match m in collection)
                {
                    string val = m.Groups[1].Value;
                    //Do something with the value
                }
            }

Amended to answer the comment:

            Regex regex = new Regex("<span class=\"(.*?)\">(.*?)</span>");
            string toMatch = "<span class=\"ajjsjs\">Some text</span>";
            if (regex.IsMatch(toMatch))
            {
                MatchCollection collection = regex.Matches(toMatch);
                foreach (Match m in collection)
                {
                    // "class" is a reserved word in C#, so the variable needs another name
                    string className = m.Groups[1].Value;
                    string val = m.Groups[2].Value;
                    //Do something with the class and value
                }
            }
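The question asked for a function that returns a dictionary of extracted entities. A minimal sketch of that, reusing the amended pattern above, might look like the following. The names `SpanExtractor` and `Extract` are hypothetical, and grouping the values into a list per class is an assumption, made because the same class (for example `xn-money`) appears several times in the sample HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class SpanExtractor
{
    // Same pattern as the amended answer. RegexOptions.Singleline lets "."
    // match line breaks, in case a span's text wraps across lines.
    private static readonly Regex SpanPattern = new Regex(
        "<span class=\"(.*?)\">(.*?)</span>",
        RegexOptions.Singleline);

    // Maps each class attribute value to the texts of the spans carrying it.
    public static Dictionary<string, List<string>> Extract(string html)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (Match m in SpanPattern.Matches(html))
        {
            string className = m.Groups[1].Value;
            string text = m.Groups[2].Value.Trim();

            // Create the list the first time a class name is seen.
            if (!result.TryGetValue(className, out List<string> values))
            {
                values = new List<string>();
                result[className] = values;
            }
            values.Add(text);
        }
        return result;
    }
}
```

Run against the HTML in the question, this should produce the keys `xn-location`, `xn-chron` and `xn-money`. Like the pattern it is built on, it assumes `class` is the only attribute on each span and that spans are not nested.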

edited Mar 16, 2011 at 16:16

answered Mar 16, 2011 at 15:53

Varun Chatterji


3 Comments

Varun Chatterji Over a year ago

My sample code will not work for nested spans, but then again there are none in the sample html you supplied....

Paul Fryer Over a year ago

This works good for getting the value, thanks. Do you have any ideas how I could get the value of the "class" attribute also?

Paul Fryer Over a year ago

this is exactly what I'm looking for - you rock! Thanks

Mr. Llama · Answer · 2011-03-16 15:39:35Z

2

Assuming that you have no nested span tags, the following should work:

/<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)<\/span>/

I only did some basic testing on it, but it'll match the class of the span tag (if it exists) along with the data until the tag is closed.
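As a rough sketch of how this pattern could be used from C#, the code below writes it as a verbatim string and checks whether the optional class group took part in each match. `OptionalClassDemo` and `Extract` are hypothetical names, and returning a list of pairs rather than a dictionary is an assumption made so that repeated or missing class names do not collide.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class OptionalClassDemo
{
    // The pattern from this answer as a C# verbatim string
    // (inside @"...", a double quote is escaped by doubling it).
    private static readonly Regex Pattern = new Regex(
        @"<span(?:[^>]+class=""(.*?)""[^>]*)?>(.*?)</span>");

    // Returns (class, text) pairs; the class is null when the span has none.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pattern.Matches(html))
        {
            // Group 1 only participates when a class attribute was present.
            string className = m.Groups[1].Success ? m.Groups[1].Value : null;
            pairs.Add(new KeyValuePair<string, string>(className, m.Groups[2].Value));
        }
        return pairs;
    }
}
```

One thing to be aware of: the pattern matches a bare `<span>` and a span with a class, but a span that has other attributes and no class (for example `<span id="a">`) is not matched at all, because the optional group fails and the pattern then expects `>` immediately after `<span`.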

answered Mar 16, 2011 at 15:39

Mr. Llama


1 Comment

Paul Fryer Over a year ago

Cool, do you have any ideas how I could use this in C# to return a dictionary of values extracted? Thanks.

Community · Answer · 2017-05-23 10:27:46Z

1

I strongly advise you to use a real HTML or XML parser for this instead. You cannot reliably parse HTML or XML with regular expressions--the most you can do is come close, and the closer you get, the more convoluted and time-consuming your regex will be. If you have a large HTML file to parse, it's highly likely to break any simple regex pattern.

Regex like <span[^>]*>(.*?)</span> will work on your example, but there's a LOT of XML-valid code that's difficult or even impossible to parse with regex (for example, nested spans such as <span>foo <span>bar</span></span> will break the above pattern). If you want something that's going to work on other HTML samples, regex isn't the way to go here.
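To make the nested-span problem concrete, here is a small sketch that runs the simple pattern over a nested input. The class name `NestedSpanDemo` and the sample input are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo, not part of the original answer.
public static class NestedSpanDemo
{
    // Returns what the simple pattern captures for the first span it finds,
    // or null when nothing matches.
    public static string FirstCapture(string html)
    {
        Match m = Regex.Match(html, "<span[^>]*>(.*?)</span>");
        return m.Success ? m.Groups[1].Value : null;
    }
}

// NestedSpanDemo.FirstCapture("<span>foo <span>bar</span></span>")
// returns "foo <span>bar"
```

The lazy `(.*?)` stops at the first `</span>` it meets, which belongs to the inner span, so the capture contains the inner opening tag and the outer closing tag is left over. A parser that tracks nesting does not have this problem.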

Since your HTML code isn't XML-valid, consider the HTML Agility Pack, which I've heard is very good.

edited May 23, 2017 at 10:27

Community Bot


answered Mar 16, 2011 at 15:53

Justin Morgan

