I have an HTML text file that contains headings, and I would like to extract only the text inside them.

Example:

<h1 class="title"><a href="dtb.htm#rgn_txt_0001_0001">Fire Safety</a></h1>
<h1><a href="dtb.htm#rgn_txt_0002_0001">About this book</a></h1>
<h1><a href="dtb.htm#rgn_par_0002_0008">1</a></h1>
<h1><a href="dtb.htm#rgn_txt_0003_0001">Contents of this book</a></h1>

I would like to extract only the following text from the HTML code:

Fire Safety, About this book, 1, Contents of this book

I have tried a lot of things, such as:

Pattern pattern = Pattern.compile("<a[^>]href\\s=\\s*\"\\s*([^\"]*)");
Matcher matcher = pattern.matcher(input);

where input is the HTML data.

I either get no results on the console, or sometimes I get only the href value :(

How do I fix this?

Let me know! Thanks!

  • Please Please Please! Don't parse HTML with Regex. Try jsoup.org Commented Dec 18, 2012 at 7:05
  • You cannot parse HTML with Regex, lest this happens again: stackoverflow.com/a/1732454/504685 Commented Dec 18, 2012 at 7:06
  • @RohitJain It's not that you shouldn't parse HTML with RegEx, it's that you can't. Commented Dec 18, 2012 at 7:12
  • OK, why can't I use regex? What is the issue behind it? Moreover, it is not an HTML file, just HTML source code stored in a text file. Commented Dec 18, 2012 at 8:54
  • @user1443051 HTML is a non-regular context free language. You can only describe regular languages with regular expressions though. See any introductory article on formal languages for details. Commented Dec 18, 2012 at 12:25

1 Answer

I would strongly recommend using an HTML parser, such as TagSoup, Jericho, NekoHTML, or HTML Parser.
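To make the parser approach concrete, here is a minimal sketch. It does not use any of the libraries named above. Instead it swaps in the HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator), purely so the example runs without extra dependencies. The class name HeadingTextExtractor and the method extractHeadingText are hypothetical names chosen for illustration, and the sketch assumes the headings of interest are h1 elements, as in the question.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library.
class HeadingTextExtractor {

    /** Returns the text found inside each <h1> element, in document order. */
    static List<String> extractHeadingText(String html) throws IOException {
        List<String> headings = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Non-null only while the parser is inside an <h1> element.
            private StringBuilder current;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.H1) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Collect text only when it belongs to a heading,
                // including text nested in <a> inside the heading.
                if (current != null) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.H1 && current != null) {
                    headings.add(current.toString().trim());
                    current = null;
                }
            }
        };

        // The third argument tells the parser to ignore charset declarations,
        // which is appropriate because the input is already a Java String.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return headings;
    }

    public static void main(String[] args) throws IOException {
        String input =
              "<h1 class=\"title\"><a href=\"dtb.htm#rgn_txt_0001_0001\">Fire Safety</a></h1>\n"
            + "<h1><a href=\"dtb.htm#rgn_txt_0002_0001\">About this book</a></h1>\n"
            + "<h1><a href=\"dtb.htm#rgn_par_0002_0008\">1</a></h1>\n"
            + "<h1><a href=\"dtb.htm#rgn_txt_0003_0001\">Contents of this book</a></h1>";

        System.out.println(String.join(", ", extractHeadingText(input)));
    }
}
```

The callback tracks whether the parser is currently inside an h1 and collects only the text events that arrive there, so the href attribute is never touched. Note also that the pattern in the question captures the href value rather than the link text, and its `\\s=` part requires whitespace before the equals sign, which the sample markup does not have. A dedicated library such as those listed above is likely a better fit for real-world, messy HTML, since the JDK parser targets an old HTML version.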

2 Comments

I don't have any information on the parsers that are available.
@user1443051 ... and that's why NullPointer gave you links to 4 of them.
