1

I've got a string of HTML code filled with tags and special characters, for example:

 <p class="MsoNormal"><span style="font-size: 14pt; font-family: TimesNewRoman;"> I Just want this Text here?<o:p></o:p></span></p>

or

<div>This is more text i would like. :( </div><div> </div>

I'm just wondering if there is any way to extract the text from these HTML strings. I have tried using regex to replace strings, but it didn't seem like the best way to do it. I have also tried JSoup but didn't have much luck with that.

Any ideas? Regards.

1
  • First decode those HTML entities, then you can use JSoup to parse the actual HTML and query it for particular strings. Commented Oct 18, 2015 at 12:21

4 Answers

3

Are you sure you were using JSoup correctly? That would be perfect for this, and I use it all the time to do the same.

Your code would look like this:

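import org.jsoup.Jsoup;

String stringWithHtml = "<div>&nbsp;test&nbsp;</div>";
// parse() builds a document from the string; text() returns its text content without the tags
String extractedText = Jsoup.parse(stringWithHtml).text();
// extractedText is now "test"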

Make sure the JSoup library is in your classpath.

2

You can solve this by combining Jsoup with a regular expression:

  String st="&lt;p class=&quot;MsoNormal&quot;&gt;&lt;span style=&quot;font-size: 14pt; font-family: TimesNewRoman;&quot;&gt; I Just want this Text here?&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;";
  // text() returns the entity-decoded markup as plain text; the regex then strips the tags
  System.out.println(Jsoup.parse(st).text().replaceAll("\\<.*?>",""));

1 Comment

Worked like a charm, thank you.

1

This is actually a possible duplicate. Your solution would look something like this:

    String inputString = "&lt;div&gt;This is more text i would like. :( &lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;";
    inputString = inputString.replace("&lt;", "<");
    inputString = inputString.replace("&gt;", ">");
    inputString = inputString.replaceAll("<[^>]*>", "");
    System.out.println(inputString);

This removes all HTML tags and keeps everything outside them. I wasn't sure whether you wanted only the first element or all elements, so this assumes that every tag should be removed, leaving all text in its place, including the escaped ampersand (here, &amp;nbsp; remains in the output). The escaped ampersand could be handled with another replace or a similar strategy.
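One possible way to handle the escaped ampersand as well is sketched below. It uses only the standard library and assumes that just a handful of entities matter; the class and method names (HtmlTextSketch, decodeBasicEntities, extractText) are made up for illustration, and mapping &nbsp; to a regular space is a simplification. A real HTML parser such as JSoup covers far more cases than this.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class HtmlTextSketch {
    // Matches anything that looks like a tag, e.g. <div> or </span>.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Decodes a few common entities. "&amp;" is handled last so that
    // "&amp;lt;" becomes the literal text "&lt;" instead of "<".
    static String decodeBasicEntities(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&nbsp;", " ")   // simplification: plain space, not U+00A0
                .replace("&amp;", "&");
    }

    static String extractText(String escapedHtml) {
        String html = decodeBasicEntities(escapedHtml);  // undo the outer escaping
        String text = TAG.matcher(html).replaceAll("");  // drop the tags
        text = decodeBasicEntities(text);                // decode entities left in the text
        return text.replaceAll("\\s+", " ").trim();      // collapse runs of whitespace
    }

    public static void main(String[] args) {
        String input = "&lt;div&gt;This is more text i would like. :( &lt;/div&gt;"
                + "&lt;div&gt;&amp;nbsp;&lt;/div&gt;";
        System.out.println(extractText(input));
    }
}
```

With the input from this answer, the sketch prints "This is more text i would like. :(" because the leftover &nbsp; is turned into a space and then trimmed.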

1

Another option is Aspose. Have a look at the link:

http://www.aspose.com/java/word-component.aspx

Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
builder.insertHtml(
        "<P align='right'>Paragraph right</P>" +
                "<b>Implicit paragraph left</b>" +
                "<div align='center'>Div center</div>" +
                "<h1 align='left'>Heading 1 left.</h1>");

doc.save(getMyDir() + "DocumentBuilder.InsertHtml Out.doc");
