How could I extract the text between </h1> and <hr /> with a regex?

String string = "<h1>1st header</h1>" + "<h2>second header</h2>" +
"<p>some text</p>" + "<hr />";

Pattern p = Pattern.compile("</h1>(\\S+)<hr />", Pattern.MULTILINE);

The output is empty, but why?

2 Answers

The output is empty because the characters between </h1> and <hr /> include spaces. Your \S+ will fail as soon as it encounters a space.
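To see the failure in isolation, here is a minimal sketch that runs the question's pattern against the original string and against a variant with the spaces removed. The class name WhyNoMatch, the helper matches, and the space-free variant are illustrative and not part of the question.

```java
import java.util.regex.Pattern;

public class WhyNoMatch {
    // Hypothetical helper: true if the question's pattern finds a match in html.
    static boolean matches(String html) {
        // \S+ accepts only non-whitespace characters, so any space between
        // </h1> and <hr /> prevents the match.
        return Pattern.compile("</h1>(\\S+)<hr />").matcher(html).find();
    }

    public static void main(String[] args) {
        String original = "<h1>1st header</h1>" + "<h2>second header</h2>"
                + "<p>some text</p>" + "<hr />";
        // Same structure, but with no spaces between </h1> and <hr />.
        String noSpaces = "<h1>1st header</h1>" + "<h2>second</h2>"
                + "<p>text</p>" + "<hr />";

        System.out.println(matches(original)); // false
        System.out.println(matches(noSpaces)); // true
    }
}
```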

If you replace \\S+ with, say, .+, it will capture everything between the two tags in your highly specific example string. (Pattern.MULTILINE is not needed here: it only changes how ^ and $ behave.) However, if you'd like to do this "right", and be able to match arbitrary HTML that doesn't perfectly fit your example, use an HTML parser such as jsoup (the HTML Agility Pack is the .NET counterpart). A parser-based version will be easy, correct, and won't endanger your sanity and/or the universe.
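A minimal sketch of the regex fix, assuming the input always looks like the sample string. The class name ExtractBetweenTags and the helper extract are illustrative; the reluctant .+? and the DOTALL flag are additions beyond the plain .+ suggested above, so that the match stops at the first <hr /> and survives line breaks in the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractBetweenTags {
    // Hypothetical helper: returns the text between </h1> and the first
    // <hr />, or null if the pattern does not match.
    static String extract(String html) {
        // Unlike \S, the dot also matches spaces. DOTALL additionally lets
        // it match line terminators (MULTILINE would only affect ^ and $).
        Pattern p = Pattern.compile("</h1>(.+?)<hr />", Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String string = "<h1>1st header</h1>" + "<h2>second header</h2>"
                + "<p>some text</p>" + "<hr />";
        // prints <h2>second header</h2><p>some text</p>
        System.out.println(extract(string));
    }
}
```

This still depends on the exact spelling of the closing tag: a document that writes <hr/> or <hr> instead of <hr /> would not match, which is the kind of variation a parser handles for you.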

5 Comments

Though you don't have to jump to an HTML parser like a bull at a gate if using a regex genuinely serves your purposes and you're careful about the expression that you use.
@NeilCoffey, "...and you're careful...", and if you control the HTML you're parsing. If others control it, they will always be able to come up with a legit tag that the regex can't match. That's the main reason to not use regex.
Well, maybe... if you're operating in an environment where somebody is deliberately trying to break your HTML parsing for some reason then that's obviously a different scenario to the case of parsing some HTML documents 'as they are'. I don't disagree that there are scenarios where you need to be wary of using regex to parse HTML. But there are scenarios where regex provides a succinct, working solution and there's really no need to be paranoid about Angering The God Of HTML Parsers if you opt for the simple solution in such cases. But yes, you need to be aware of the issues as you point out.
@NeilCoffey, it's really not about angering anyone, or even someone deliberately breaking something. It's just that HTML is widely varied, and if you're trying to scrape, you can't count on anything being consistent. Also, DOM-based solutions are pretty easy to implement these days with good libraries such as mentioned in this answer. It's too easy to do it right to mess with regex.
@NeilCoffey - You're right that regex can be the quickest, easiest fix in certain (limited) tasks involving HTML/XML. I'm urging a parser because a) his sample input gives very few clues as to what he's going to be working with, and b) it sounds to me like he's looking for a robust solution. The .+ suggestion will work with his sample string, but a parser is the safe way to go.
The regex \S+ will not match the spaces in "second header" and "some text". Also, don't use regex to parse HTML if you value your sanity.
