3

I want to extract the various HTML tags from the source code of a web page. Is there any method in Java to do that, or do HTML parsers support this?

I want to separate out all the HTML tags.

1

5 Answers

1

Java comes with an XML parser with similar methods to the DOM in JavaScript:

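import java.io.StringReader;
import javax.xml.parsers.*;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
// newDocumentBuilder() is an instance method; parse(String) expects a URI, so wrap raw markup in an InputSource
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(new InputSource(new StringReader(html)));
doc.getElementById("someId");
doc.getElementsByTagName("div");
doc.getChildNodes();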

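The document builder accepts several kinds of input (an InputStream, a File, a URI string, or an InputSource wrapping a Reader). Note that parse(String) treats its argument as a URI, not as raw markup, and that the parser only accepts well-formed XML, so this works for XHTML but not for arbitrary HTML.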

http://download.oracle.com/javase/1.5.0/docs/api/org/w3c/dom/Document.html

The CyberNeko HTML parser is also a good option if you need more.
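To list every tag with the built-in DOM approach, something along these lines should work. It is only a minimal sketch: it assumes the input is well-formed XHTML (the JDK parser throws on ordinary tag-soup HTML), and the class name TagLister and the method tagNames are made up for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
public class TagLister {

    // Returns the distinct element names, in order of first appearance.
    // Assumes the markup is well-formed XML (i.e. XHTML).
    public static Set<String> tagNames(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // "*" is the wildcard that matches every element in the document.
        NodeList all = doc.getElementsByTagName("*");

        Set<String> names = new LinkedHashSet<>();
        for (int i = 0; i < all.getLength(); i++) {
            names.add(all.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div><p>Hi</p><p>Bye</p></div></body></html>";
        System.out.println(tagNames(xhtml)); // prints [html, body, div, p]
    }
}
```

If you want every occurrence rather than the distinct names, collect into a List instead of a Set.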


0

Check out CyberNeko HTML Parser.


0

You can use regular expressions. If your HTML is valid XML, you can use an XML parser.
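A rough sketch of the regular-expression route, for what it's worth. The pattern below is my own assumption of what "good enough" looks like: it captures the names of opening and self-closing tags only, and it will be fooled by comments, script bodies, and attribute values that contain ">". The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
public class RegexTagExtractor {

    // "<", a tag name starting with a letter, then anything up to the next ">".
    // Closing tags such as </p> are skipped because "/" is not a letter.
    private static final Pattern OPEN_TAG =
            Pattern.compile("<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>");

    // Returns the lower-cased names of opening and self-closing tags, in order.
    public static List<String> openingTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = OPEN_TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<div class=\"a\"><p>Hi</p><br/></div>";
        System.out.println(openingTags(html)); // prints [div, p, br]
    }
}
```

This is fine for a quick inventory of tags on pages you control; for arbitrary pages a real HTML parser is the safer choice.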

2 Comments

If his HTML is valid XML, then it's actually XHTML.
0

I've used HTMLParser in one project and was pretty happy with it.

Edit: If you check the samples page, the parser sample does pretty much what you're asking for.


0

You can write your own utility method to extract tags.

Scan for each < and the matching /> or > to find a complete tag, then write those tags to another file.
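A bare-bones sketch of that scan, assuming nothing more than "everything between < and the next > is a tag". It does not understand comments, scripts, or quoted attribute values containing ">", so treat it as an illustration rather than a parser. NaiveTagScanner and rawTags are made-up names; writing the result to a file is left out and could be done with java.nio.file.Files.write.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
public class NaiveTagScanner {

    // Collects every "<...>" span verbatim, including closing tags.
    public static List<String> rawTags(String html) {
        List<String> tags = new ArrayList<>();
        int open = html.indexOf('<');
        while (open >= 0) {
            int close = html.indexOf('>', open + 1);
            if (close < 0) {
                break; // unterminated tag: stop scanning
            }
            tags.add(html.substring(open, close + 1));
            open = html.indexOf('<', close + 1);
        }
        return tags;
    }

    public static void main(String[] args) {
        for (String tag : rawTags("<p>Hi <br/>there</p>")) {
            System.out.println(tag);
        }
    }
}
```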

2 Comments

Come on! A typo every now and then is not that bad, but 4 wrong words in the first 6?
Before your edit you had "u can write ur won uitill". You (correctly) replaced "u" with "you" ('though it should be capitalized), "ur" with "your" and "uitill" with "util". "won" should still be "own". And regarding your content: I wouldn't suggest trying to implement a HTML parser from scratch. It's a tough problem and others have already (mostly) solved it.
