Extract Text from HTML String Java

Question

I've got a string of HTML code filled with tags and special characters, for example:

 &lt;p class=&quot;MsoNormal&quot;&gt;&lt;span style=&quot;font-size: 14pt; font-family: TimesNewRoman;&quot;&gt; I Just want this Text here?&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;

or

&lt;div&gt;This is more text i would like. :( &lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;

I'm just wondering if there is any way to extract the text from the HTML strings. I have tried using some regex to replace strings, but it didn't seem like the best way to do it. I have also tried JSoup but didn't have much luck with that.

Any ideas? Regards.

First decode those HTML entities, then you can use JSoup to parse the actual HTML and query it for particular strings.
– RealSkeptic, Commented Oct 18, 2015 at 12:21

AdeelMufti · Accepted Answer · 2015-10-18 12:32:26Z

3

Are you sure you were using JSoup correctly? That would be perfect for this, and I use it all the time to do the same.

Your code would look like this:

import org.jsoup.Jsoup;

String stringWithHtml = "<div>&nbsp;test&nbsp;</div>";
String extractedText = Jsoup.parse(stringWithHtml).text();
// extractedText is now "test"

Make sure the JSoup library is in your classpath.

anas p a · Accepted Answer · 2015-10-18 12:32:46Z

2

You can solve this by combining Jsoup with a regular expression:

  String st = "&lt;p class=&quot;MsoNormal&quot;&gt;&lt;span style=&quot;font-size: 14pt; font-family: TimesNewRoman;&quot;&gt; I Just want this Text here?&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;";
  // text() decodes the entities back into literal markup; the regex then strips the tags
  System.out.println(Jsoup.parse(st).text().replaceAll("\\<.*?>", ""));

1 Comment

samuelmadethis Over a year ago

Worked like a charm, Thank you.

Dale · Accepted Answer · 2015-10-18 12:21:03Z

1

This is possibly a duplicate. A solution could look something like this:

    String inputString = "&lt;div&gt;This is more text i would like. :( &lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;";
    inputString = inputString.replace("&lt;", "<");
    inputString = inputString.replace("&gt;", ">");
    inputString = inputString.replaceAll("<[^>]*>", "");
    System.out.println(inputString);

This removes the HTML tags and keeps the text between them. I wasn't sure whether you wanted the first element or all elements; this assumes all HTML tags should be removed, leaving all of the text in place, including the escaped ampersand (the output still contains &amp;nbsp;). The escaped ampersand could be handled with another replace or a similar strategy.
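
One way the leftover escaped ampersand might be handled is sketched below, using only the standard library. The class and method names (HtmlTextSketch, decodeOnce, extractText) are made up for illustration, and only the few entities that appear in the question are decoded. Mapping &nbsp; to a plain space is a simplification. A real HTML parser such as JSoup will cope with far more input than this.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class HtmlTextSketch {
    // Matches anything that looks like a tag, e.g. <div> or </o:p>.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Decodes one level of escaping for the few entities seen in the question.
    // "&amp;" goes last so "&amp;nbsp;" becomes "&nbsp;" rather than a space in one pass.
    static String decodeOnce(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&nbsp;", " ")   // simplification: a real nbsp is U+00A0
                .replace("&amp;", "&");
    }

    static String extractText(String escapedHtml) {
        String html = decodeOnce(escapedHtml);                   // "&lt;div&gt;" -> "<div>"
        String withoutTags = TAG.matcher(html).replaceAll(" ");  // tags become spaces
        String text = decodeOnce(withoutTags);                   // "&nbsp;" -> " "
        return text.replaceAll("\\s+", " ").trim();              // collapse whitespace
    }

    public static void main(String[] args) {
        String input = "&lt;div&gt;This is more text i would like. :( &lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;";
        System.out.println(extractText(input)); // prints: This is more text i would like. :(
    }
}
```

Decoding twice is a deliberate choice for this input: the first pass turns the escaped markup into real tags, and the second pass handles entities that were themselves escaped, such as &amp;nbsp;.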

Kumaresan Perumal · Accepted Answer · 2015-10-18 12:27:19Z

1

Another option is Aspose. Have a look at the link:

http://www.aspose.com/java/word-component.aspx

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
builder.insertHtml(
        "<P align='right'>Paragraph right</P>" +
                "<b>Implicit paragraph left</b>" +
                "<div align='center'>Div center</div>" +
                "<h1 align='left'>Heading 1 left.</h1>");

doc.save(getMyDir() + "DocumentBuilder.InsertHtml Out.doc");
