Extracting text in html using Java Regex

Question

I need to extract the text inside HTML tags. I have written some code, but the text is not being extracted. My code is below:

import java.util.regex.Matcher;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Pattern;
class getFontTagText{
String result = null;
public static void main(String args[]){
    try{
           getFontTagText text = new getFontTagText();
           BufferedReader r = new BufferedReader(new FileReader("target.html"));
           Pattern p = Pattern.compile("<FONT FACE=\"Arial\" SIZE=\"1\" COLOR=\"\\W|_000000\" LETTERSPACING=\"0\" KERNING=\"0\">(//AZUZZU Full Service Provision)</FONT>",Pattern.MULTILINE);
           String line;
           System.out.println("Came here");
           while((line = r.readLine()) != null){
           Matcher mat = p.matcher(line);

           while(mat.find()){
                System.out.println("Came here");
                String st = mat.group(1);
                System.out.format("'%s'\n", st);
            }
        }
    }catch (Exception e){
        System.out.println(e);
    }
}

}

The HTML file looks like this:

     <P ALIGN="LEFT">
         <FONT FACE="Arial" SIZE="1" COLOR="#000000" LETTERSPACING="0" KERNING="0">ZUZZU Full Service Provision</FONT>
     </P>
     <P ALIGN="LEFT">
         <FONT FACE="Arial" SIZE="1" COLOR="#000000" LETTERSPACING="0" KERNING="0">&uuml; &ouml; &auml; &Auml; &Uuml; &Ouml; &szlig;</FONT>
     </P>

mat.group(1) prints 'null' instead of the text. Any help is much appreciated.

Eritrean · Accepted Answer · 2016-05-12 06:47:09Z

1

I would recommend using jsoup, a Java library for extracting and manipulating HTML data with CSS selectors and jQuery-like methods. In your case it could look something like this:

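    import java.io.File;
    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public static void jsoup() throws IOException {
        File input = new File("C:\\users\\uzochi\\desktop\\html.html");
        Document doc = Jsoup.parse(input, "UTF-8");
        Elements es = doc.select("FONT"); // select all FONT elements
        for (Element e : es) {
            System.out.println(e.text()); // text content of each element
        }
    }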

If you prefer to use regex, just match the text between > and <, for example:

public static void regex() {
    // Match any opening FONT tag, then lazily capture everything up to the closing tag.
    Pattern pat = Pattern.compile("<FONT [^>]*>(.*?)</FONT>");
    String s = "<html>\n" +
            "<body>\n" +
            "\n" +
            "<P ALIGN=\"LEFT\">\n" +
            "         <FONT FACE=\"Arial\" SIZE=\"1\" COLOR=\"#000000\" LETTERSPACING=\"0\" KERNING=\"0\">ZUZZU Full Service Provision</FONT>\n" +
            "     </P>\n" +
            "     <P ALIGN=\"LEFT\">\n" +
            "         <FONT FACE=\"Arial\" SIZE=\"1\" COLOR=\"#000000\" LETTERSPACING=\"0\" KERNING=\"0\">&uuml; &ouml; &auml; &Auml; &Uuml; &Ouml; &szlig;</FONT>\n" +
            "     </P>\n" +
            "\n" +
            "</body>\n" +
            "</html>";
    Matcher m = pat.matcher(s);
    while (m.find()) {
        String found = m.group(1); // text between the tags
        System.out.println("Found : " + found);
    }
}
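
For reference, here is a minimal sketch of why the pattern in the question prints 'null'. The unescaped `|` is a top-level alternation, so the regex is read as "everything before the `|`" OR "everything after it". The left half matches the start of the tag up to the `#`, and the capture group sits in the right half, so `group(1)` never takes part in the match. The class name `FontTextDemo`, the field names, and the `extract` helper are made up for illustration, and the `CASE_INSENSITIVE` and `DOTALL` flags are my own additions (an assumption that tags may be lower-case or span several lines).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FontTextDemo {
    // The pattern from the question (without MULTILINE, which only affects ^ and $).
    // The '|' splits it into two alternatives; the capture group is in the second one.
    static final Pattern ORIGINAL = Pattern.compile(
            "<FONT FACE=\"Arial\" SIZE=\"1\" COLOR=\"\\W|_000000\" LETTERSPACING=\"0\" KERNING=\"0\">"
            + "(//AZUZZU Full Service Provision)</FONT>");

    // Generic alternative: any FONT tag, lazily capturing its content.
    // CASE_INSENSITIVE and DOTALL are assumptions, not part of the answer's pattern.
    static final Pattern FONT_TEXT = Pattern.compile(
            "<FONT\\b[^>]*>(.*?)</FONT>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: returns the trimmed content of every FONT element in the input.
    static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = FONT_TEXT.matcher(html);
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "<FONT FACE=\"Arial\" SIZE=\"1\" COLOR=\"#000000\" "
                + "LETTERSPACING=\"0\" KERNING=\"0\">ZUZZU Full Service Provision</FONT>";

        Matcher m = ORIGINAL.matcher(line);
        if (m.find()) {
            // Only the left alternative matched, so the match stops right after '#'.
            System.out.println("matched: " + m.group());
            System.out.println("group(1): " + m.group(1)); // prints "group(1): null"
        }

        System.out.println(extract(line)); // prints "[ZUZZU Full Service Provision]"
    }
}
```

Note that the regex approach returns entities such as `&uuml;` exactly as they appear in the file, whereas jsoup's `text()` decodes them, which is one more reason to prefer a parser here.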

edited May 12, 2016 at 6:47

answered May 10, 2016 at 8:28

3 Comments

Charan Putrevu Over a year ago

I made some progress with your suggestion, but I am still getting only the word 'Provision' in my output. I want to get the text of all font tags.

Charan Putrevu Over a year ago

I am also getting these > < characters in output. I want to eliminate those also.

Eritrean Over a year ago

I made some improvements and edited my answer. Try it and see if it works for you. I still recommend using an HTML parser like jsoup, which makes parsing HTML very easy.
