Regex multiple lines with html code?

Question

How could I extract the following with regex?

String string = "<h1>1st header</h1>" + "<h2>second header</h2>" +
"<p>some text</p>" + "<hr />";

Pattern p = Pattern.compile("</h1>(\\S+)<hr />", Pattern.MULTILINE);

Output is empty, but why?

Oh, dear! I hear hoof-beats! stackoverflow.com/questions/1732348/…
– Jonathan M, commented May 15, 2012 at 21:50

Community · Accepted Answer · 2017-05-23 10:09:03Z

4

The output is empty because the characters between </h1> and <hr /> include spaces. Your \S+ will fail as soon as it encounters a space.
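To see the failure in isolation, here is a minimal sketch that runs the question's pattern against its own input. The class name `NonWhitespaceDemo` and the helper `originalPatternMatches` are made up for illustration, and the second input (with the spaces removed) is an assumption added only for contrast.

```java
import java.util.regex.Pattern;

class NonWhitespaceDemo {
    // The input string from the question.
    static final String HTML = "<h1>1st header</h1>" + "<h2>second header</h2>"
            + "<p>some text</p>" + "<hr />";

    // Hypothetical helper: true if the question's original pattern finds a match.
    static boolean originalPatternMatches(String input) {
        Pattern p = Pattern.compile("</h1>(\\S+)<hr />", Pattern.MULTILINE);
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        // \S+ stops at the space in "second header", so nothing matches.
        System.out.println(originalPatternMatches(HTML)); // false

        // With no whitespace between </h1> and <hr />, the same pattern matches.
        System.out.println(originalPatternMatches("<h1>a</h1><p>sometext</p><hr />")); // true
    }
}
```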

If you replace \\S+ with, say, .+, it should catch everything in your highly specific example string. However, if you'd like to do this "right", and be able to match arbitrary HTML that doesn't perfectly fit your example, use an HTML parser like the HTML Agility Pack. A parser-based version will be easy, correct, and won't endanger your sanity and/or the universe.
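A minimal sketch of the `.+` variant, assuming the input always contains exactly one `</h1>` followed later by a literal `<hr />`. The class `DotPlusDemo` and the helper `between` are hypothetical names. Note that `Pattern.MULTILINE` only changes how `^` and `$` behave; if the real input spans several lines, `Pattern.DOTALL` is the flag that lets `.` match line terminators, so the sketch uses that instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotPlusDemo {
    // Hypothetical helper: returns the text between </h1> and <hr />, or null if absent.
    static String between(String input) {
        // DOTALL makes '.' match line terminators as well as other characters.
        Pattern p = Pattern.compile("</h1>(.+)<hr />", Pattern.DOTALL);
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String string = "<h1>1st header</h1>" + "<h2>second header</h2>"
                + "<p>some text</p>" + "<hr />";
        // prints: <h2>second header</h2><p>some text</p>
        System.out.println(between(string));
    }
}
```

Because `.+` is greedy, input with more than one `<hr />` would be captured up to the last one; a reluctant `.+?` would stop at the first. Either way this remains tied to the sample string rather than to HTML in general.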

edited May 23, 2017 at 10:09


answered May 15, 2012 at 21:57

Justin Morgan


5 Comments

Neil Coffey Over a year ago

Though you don't have to jump to an HTML parser like a bull at a gate if using a regex genuinely serves your purposes and you're careful about the expression that you use.

Jonathan M Over a year ago

@NeilCoffey, "...and you're careful...", and if you control the HTML you're parsing. If others control it, they will always be able to come up with a legit tag that the regex can't match. That's the main reason to not use regex.

Neil Coffey Over a year ago

Well, maybe... if you're operating in an environment where somebody is deliberately trying to break your HTML parsing for some reason then that's obviously a different scenario to the case of parsing some HTML documents 'as they are'. I don't disagree that there are scenarios where you need to be wary of using regex to parse HTML. But there are scenarios where regex provides a succinct, working solution and there's really no need to be paranoid about Angering The God Of HTML Parsers if you opt for the simple solution in such cases. But yes, you need to be aware of the issues as you point out.

Jonathan M Over a year ago

@NeilCoffey, it's really not about angering anyone, or even someone deliberately breaking something. It's just that HTML is widely varied, and if you're trying to scrape, you can't count on anything being consistent. Also, DOM-based solutions are pretty easy to implement these days with good libraries such as mentioned in this answer. It's too easy to do it right to mess with regex.

Justin Morgan Over a year ago

@NeilCoffey - You're right that regex can be the quickest, easiest fix in certain (limited) tasks involving HTML/XML. I'm urging a parser because a) his sample input gives very few clues as to what he's going to be working with, and b) it sounds to me like he's looking for a robust solution. The .+ suggestion will work with his sample string, but a parser is the safe way to go.

Community · Accepted Answer · 2017-05-23 11:47:33Z

3

The regex \S+ will not match the space in "some text", so the match fails there. Also, don't use regex to parse HTML if you value your sanity.
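As a rough illustration of why regex against HTML is brittle, the sketch below runs the `.+` form of the pattern against a few equivalent spellings of the closing rule. The class `FragilityDemo`, the helper `matches`, and the variant inputs are assumptions chosen for the example, not taken from the question.

```java
import java.util.regex.Pattern;

class FragilityDemo {
    // The question's pattern with .+ in place of \S+.
    static final Pattern P = Pattern.compile("</h1>(.+)<hr />", Pattern.DOTALL);

    // Hypothetical helper: true if the pattern finds a match in the given markup.
    static boolean matches(String html) {
        return P.matcher(html).find();
    }

    public static void main(String[] args) {
        String body = "<h1>1st header</h1><p>some text</p>";
        System.out.println(matches(body + "<hr />")); // true
        System.out.println(matches(body + "<hr/>"));  // false: no space before the slash
        System.out.println(matches(body + "<hr>"));   // false: no self-closing slash
        System.out.println(matches(body + "<HR />")); // false: the pattern is case-sensitive
    }
}
```

Each variant can be patched into the pattern by hand, but the list of variants is open-ended, which is the case for a parser.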

edited May 23, 2017 at 11:47


answered May 15, 2012 at 21:54

Chris Nava
