I need to extract the text from HTML tags. I have written some code, but the text is not being extracted. My code is below:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class getFontTagText {
    String result = null;

    public static void main(String args[]) {
        try {
            getFontTagText text = new getFontTagText();
            BufferedReader r = new BufferedReader(new FileReader("target.html"));
            Pattern p = Pattern.compile("<FONT FACE=\"Arial\" SIZE=\"1\" COLOR=\"\\W|_000000\" LETTERSPACING=\"0\" KERNING=\"0\">(//AZUZZU Full Service Provision)</FONT>", Pattern.MULTILINE);
            String line;
            System.out.println("Came here");
            while ((line = r.readLine()) != null) {
                Matcher mat = p.matcher(line);

                while (mat.find()) {
                    System.out.println("Came here");
                    String st = mat.group(1);
                    System.out.format("'%s'\n", st);
                }
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

The HTML file is:

     <P ALIGN="LEFT">
         <FONT FACE="Arial" SIZE="1" COLOR="#000000" LETTERSPACING="0" KERNING="0">ZUZZU Full Service Provision</FONT>
     </P>
     <P ALIGN="LEFT">
         <FONT FACE="Arial" SIZE="1" COLOR="#000000" LETTERSPACING="0" KERNING="0">&uuml; &ouml; &auml; &Auml; &Uuml; &Ouml; &szlig;</FONT>
     </P>

mat.group(1) prints 'null' instead of the text. Any help is much appreciated.

1 Answer

I would recommend using jsoup. jsoup is a Java library for extracting and manipulating HTML data using CSS selectors and jQuery-like methods. In your case it could look something like this:

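import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void jsoup() throws IOException {
    File input = new File("C:\\users\\uzochi\\desktop\\html.html");
    Document doc = Jsoup.parse(input, "UTF-8");
    Elements es = doc.select("FONT"); // select every FONT element
    for (Element e : es) {
        System.out.println(e.text()); // text content of the element, without the tags
    }
}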

If you prefer to use a regex, just match the text between > and <, for example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void regex() {
    // Skip the attributes with [^>]*, then lazily capture everything up to the closing tag.
    Pattern pat = Pattern.compile("<FONT [^>]*>(.*?)</FONT>");
    String s = "<html>\n" +
            "<body>\n" +
            "\n" +
            "<P ALIGN=\"LEFT\">\n" +
            "         <FONT FACE=\"Arial\" SIZE=\"1\" COLOR=\"#000000\" LETTERSPACING=\"0\" KERNING=\"0\">ZUZZU Full Service Provision</FONT>\n" +
            "     </P>\n" +
            "     <P ALIGN=\"LEFT\">\n" +
            "         <FONT FACE=\"Arial\" SIZE=\"1\" COLOR=\"#000000\" LETTERSPACING=\"0\" KERNING=\"0\">&uuml; &ouml; &auml; &Auml; &Uuml; &Ouml; &szlig;</FONT>\n" +
            "     </P>\n" +
            "\n" +
            "</body>\n" +
            "</html>";
    Matcher m = pat.matcher(s);
    while (m.find()) {
        String found = m.group(1); // the text between the opening and closing FONT tags
        System.out.println("Found : " + found);
    }
}
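As for why the pattern in the question prints 'null': the most likely cause is the unescaped `|`. At the top level of a regex, `|` splits the whole expression into two alternatives, so the left-hand side (everything up to `COLOR="\W`) matches on its own, and the capturing group, which sits only in the right-hand side, never takes part in the match. The sketch below reproduces that with the question's pattern and compares it with a generic one. The class name `FontTextDemo`, the helper `extract`, and the `FIXED` pattern are illustrative assumptions, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FontTextDemo {

    // The pattern from the question, unchanged. The bare '|' makes it read as
    // "<FONT ... COLOR=\"\W"  OR  "_000000\" ... >(...)</FONT>".
    static final Pattern BROKEN = Pattern.compile(
            "<FONT FACE=\"Arial\" SIZE=\"1\" COLOR=\"\\W|_000000\" LETTERSPACING=\"0\" KERNING=\"0\">(//AZUZZU Full Service Provision)</FONT>",
            Pattern.MULTILINE);

    // Assumed replacement: ignore the attributes and capture the tag body.
    // DOTALL lets the body span line breaks if the whole file is matched at once.
    static final Pattern FIXED = Pattern.compile(
            "<FONT\\b[^>]*>(.*?)</FONT>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Collects group(1) of every match; an entry is null when the group
    // did not participate in that match.
    static List<String> extract(Pattern p, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "<FONT FACE=\"Arial\" SIZE=\"1\" COLOR=\"#000000\" "
                + "LETTERSPACING=\"0\" KERNING=\"0\">ZUZZU Full Service Provision</FONT>";
        System.out.println(extract(BROKEN, line)); // prints [null]
        System.out.println(extract(FIXED, line));  // prints [ZUZZU Full Service Provision]
    }
}
```

Note that the regex returns the raw markup, so entities such as `&uuml;` come back undecoded, whereas jsoup's `text()` decodes them.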

3 Comments

I made some progress with your suggestion, but I am still getting only the word 'Provision' in my output. I want to get the text of all font tags.
I am also getting these > < characters in the output. I want to eliminate those as well.
I made some improvements and edited my answer. Try it and see if it works for you. But I still recommend using an HTML parser like jsoup, which will make parsing HTML very easy.