Java HTML Stripping

Question

// Method to strip HTML
public static String stripHtml(String inStr) {
  boolean inTag = false;
  char c;
  StringBuffer outStr = new StringBuffer();
  int len = inStr.length();
  for (int i = 0; i < len; i++) {
    c = inStr.charAt(i);
    if (c == '<') {
      inTag = true;
    }
    if (!inTag) {
      outStr.append(c);
    }
    if (c == '>') {
      inTag = false;
    }
  }
  // Print to show that this method is removing the necessary characters
  System.out.println(outStr);
  return outStr.toString();
}

I need every <...> in the input to be removed, brackets and everything in between, while the remaining characters are still printed. For instance:

input:app<html>le
expected:apple

However, it should also remove a lone "<" or ">" when it finds one, and my method isn't doing so.

input:app<le
output:app<le
expected:apple

Please let me know what to fix.

You have several choices. One alternative is to use an HTML parser like jsoup. Another is to use Java String.replaceAll() with a regex. You should also look hard at whether you actually want to "remove if it finds just '<' or '>'". Would this be "malformed HTML", or would you risk corrupting a valid expression like "1 < 2"?
– paulsm4, Commented Dec 1, 2022 at 22:13

queeg · Accepted Answer · 2022-12-01 20:57:40Z

2

Try parsing HTML using an HTML parser like JSoup or TagSoup. Once you have the DOM, on the root element just call getTextContent().

From the API documentation (newer versions of Java behave the same): "This attribute returns the text content of this node and its descendants. [...] no serialization is performed, the returned string does not contain any markup."
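
As a rough illustration of getTextContent() using only the JDK's built-in DOM parser (not jsoup or TagSoup), here is a minimal sketch. The class name TextContentDemo and the helper textOf are made up for this example, and the input has to be well-formed XML, so the sample wraps the text in a hypothetical <root> element:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class TextContentDemo {
    // Hypothetical helper: parses well-formed XML and returns the text of the
    // root element and all its descendants, without any markup.
    static String textOf(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Assumption for this sketch: any parse or setup failure is
            // reported as an unchecked exception.
            throw new IllegalArgumentException("could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf("<root>app<b>l</b>e</root>")); // prints apple
    }
}
```

An XML parser like this one rejects input such as app<html>le, which is why an HTML parser is the safer choice for real-world markup.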


ﾓｷｬﾃﾞ · Accepted Answer · 2022-12-01 21:50:40Z

It works fine with Jsoup, as someone said.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String input = "app<html>le";
Document doc = Jsoup.parse(input);
System.out.println(doc.wholeText());  // or doc.text()

output:

apple

But the example you gave is not a proper XML document and cannot be processed using XML parsers.

You can also modify your program slightly.

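public static String stripHtml(String inStr) {
    boolean inTag = false;
    // StringBuilder suffices here; StringBuffer's synchronization is not needed
    StringBuilder outStr = new StringBuilder();
    int len = inStr.length();
    for (int i = 0; i < len; i++) {
        char c = inStr.charAt(i);
        if (c == '<') {
            inTag = true;        // start skipping
        } else if (c == '>') {
            inTag = false;       // stop skipping; the '>' itself is never appended
        } else if (!inTag) {
            outStr.append(c);    // ordinary text outside a tag
        }
    }
    return outStr.toString();
}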

and

String input = "app<html>le";
System.out.println(stripHtml(input));

output:

apple

Joop Eggen · Accepted Answer · 2022-12-01 22:07:55Z

0

Your requirement is to remove a paired <...> and not to treat a sole < as a tag.

This means that your code may only drop the in-tag characters once it encounters the closing >.

Your code could also use int i2 = inStr.indexOf('>', i+1); to find the closing > when it reaches a <.

However, it is simpler to use a regular expression replacement:
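
The indexOf idea could be sketched roughly as below. This is only one reading of the requirement: a < that has a later > is dropped together with everything up to that >, while a < or > with no partner is dropped on its own, which is what the question's app<le example expects. The class name PairedTagStripper is made up for illustration.

```java
class PairedTagStripper {
    // Hypothetical helper: removes paired <...> spans entirely and deletes
    // any unpaired '<' or '>' while keeping the surrounding text.
    static String stripHtml(String inStr) {
        StringBuilder out = new StringBuilder(inStr.length());
        int i = 0;
        while (i < inStr.length()) {
            char c = inStr.charAt(i);
            if (c == '<') {
                int close = inStr.indexOf('>', i + 1);
                if (close >= 0) {
                    i = close + 1;   // paired: skip the whole <...> span
                } else {
                    i++;             // lone '<': drop only the bracket
                }
            } else if (c == '>') {
                i++;                 // stray '>': drop only the bracket
            } else {
                out.append(c);       // ordinary text
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripHtml("app<html>le")); // prints apple
        System.out.println(stripHtml("app<le"));      // prints apple
    }
}
```

Note that this also turns "1 < 2" into "1  2" (with two spaces), which is the risk raised in the comment on the question.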

public static String stripHtml(String s) {
    return s.replaceAll("<[^>]*>", "");
}

This matches every occurrence of:

<
any character other than >, 0 or more times (*)
>
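
To see what the pattern does and does not match, here is a small sketch. The class and method names are made up; the second method, which additionally deletes leftover lone brackets with [<>], is an assumption based on the question's app<le example rather than part of the regex above.

```java
import java.util.regex.Pattern;

class RegexStripDemo {
    // Same pattern as above, compiled once for reuse.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Assumption: any bracket left over after tag removal is unwanted.
    private static final Pattern STRAY_BRACKET = Pattern.compile("[<>]");

    // Removes only paired <...> spans.
    static String stripTags(String s) {
        return TAG.matcher(s).replaceAll("");
    }

    // Removes paired <...> spans first, then any remaining lone '<' or '>'.
    static String stripTagsAndStrayBrackets(String s) {
        return STRAY_BRACKET.matcher(stripTags(s)).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("app<html>le"));              // prints apple
        System.out.println(stripTags("app<le"));                   // prints app<le
        System.out.println(stripTagsAndStrayBrackets("app<le"));   // prints apple
    }
}
```

The order matters: stripping stray brackets first would break up the paired tags and leave their contents behind.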
