
I am using Java code to extract information from the web for processing, and I am using the jsoup library to clean the HTML tags from the responses I get from websites. In order to extract information from this markup, I have to replace the HTML tags with a rarely used character such as '~'.

So here's my question:

How do I convert this:

<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<h3>This is heading 3</h3>
<h4>This is heading 4</h4>
<h5>This is heading 5</h5>
<h6>This is heading 6</h6>

Into this:

   ~This is heading 1~
   ~This is heading 2~
   ~This is heading 3~
   ~This is heading 4~
   ~This is heading 5~
   ~This is heading 6~

using jsoup?

  • Maybe modify org.jsoup.safety.Cleaner? Commented Oct 4, 2013 at 8:17
  • What have you tried? I mean besides asking us. Commented Oct 4, 2013 at 8:42
  • I tried a certain method, but it only replaced the contents inside a tag, not the entire tag. Commented Oct 4, 2013 at 8:44

1 Answer

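import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Placeholder selector (an assumption based on the headings in the question); add your own.
String cssSelector = "h1, h2, h3, h4, h5, h6";
Document doc = Jsoup.parse(html); // html is a String holding the page source
Elements elms = doc.select(cssSelector);
for (Element elm : elms) {
    System.out.println("~" + elm.text() + "~"); // text of each match, wrapped in '~'
}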

update

If you want to replace ALL tags, you can do this:

html = html.replaceAll("<[^>]*>", "~");
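
To make the effect of that one-liner concrete, here is a minimal standalone sketch that uses only the Java standard library (no jsoup). The class name `TagReplacer` and the method `replaceTags` are hypothetical names chosen for illustration. It assumes the input contains no '>' inside attribute values, comments, or scripts, since a regex does not really parse HTML.

```java
import java.util.regex.Pattern;

public class TagReplacer {
    // Matches anything that looks like a tag: '<', any run of non-'>' characters, then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Replaces every tag-like substring with a single '~'. */
    public static String replaceTags(String html) {
        return TAG.matcher(html).replaceAll("~");
    }

    public static void main(String[] args) {
        String html = "<h1>This is heading 1</h1>\n<h2>This is heading 2</h2>";
        // Each opening and closing tag becomes one '~'.
        System.out.println(replaceTags(html));
    }
}
```

Note that every tag becomes its own '~', so nested markup such as `<p><b>x</b></p>` comes out as `~~x~~`. If adjacent markers should collapse into one, that needs an extra step.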

10 Comments

Probably. What are you trying to select? Can you post a sample of the HTML?
Sorry for the trouble. This is the first time I'm using jsoup.
So what elements do you want to select? There's no h1 or h2 in there.
I want to replace ALL the HTML tags with '~'.
I thought you wanted to replace specific elements.