Parse the HTML code or use regex with Java?

Question

I'm trying to extract values from this piece of HTML:

<ul id="tree-dotlrn_class_instance">
<li>
      <a href="/dotlrn/classes/c033/13000/c12c033a13000gA/">**2011-12 Ampl.Arquit.Computadors Gr.A  (13000)**</a>
<ul>
    <li>
        <a href="/dotlrn/classes/c033/13022/c12c033a13022gA/c12c033a13022gAsT00/">**2011-12 Entorns d'Usuari Gr.A  Sgr.T00 (13022)** </a>
    </li>
    <li>
      <a href="/dotlrn/classes/c033/13036/c12c033a13036gA/c12c033a13036gAsT00/">**2011-12 Eng.Serv.Telemàtics Gr.A  Sgr.T00 (13036)** </a>
    </li>
</ul>
</li>

<li>
      <a href="/dotlrn/classes/c033/13038/c12c033a13038gA/">**2011-12 Intel·lig.Artif.Enginyer.Coneixem. Gr.A  (13038)**</a>
</li>

<li>
      <a href="/dotlrn/classes/c033/13048/c12c033a13048gA/">**2011-12 Processad.Llenguatge Gr.A  (13048)**</a>
<ul>
    <li>
        <a href="/dotlrn/classes/c033/13048/c12c033a13048gA/c12c033a13048gAsL01/">**2011-12 Processad.Llenguatge Gr.A  Sgr.L01 (13048)** </a>
    </li>
    <li>
      <a href="/dotlrn/classes/c033/13048/c12c033a13048gA/c12c033a13048gAsT00/">**2011-12 Processad.Llenguatge Gr.A  Sgr.T00 (13048)** </a>
    </li>
    <li>
      <a href="/dotlrn/classes/c033/13052/c12c033a13052gA/c12c033a13052gAsL02/">**2011-12 Sist.Basats Microprocessadors Gr.A  Sgr.L02 (13052)** </a>
    </li>
</ul>
</li>

<li>
      <a href="/dotlrn/classes/c033/13055/c12c033a13055gAA/">**2011-12 Sist.Informàtics Gr.AA (13055)**</a>
</li>

<li>
      <a href="/dotlrn/classes/c033/14009/c12c033a14009gA/">**2011-12 Administrac. Gestió de Xarxes Gr.A  (14009)**</a>
</li>

<li>
      <a href="/dotlrn/classes/c033/15656/c12c033a15656gA/">**2011-12 Transmissió de Dades Gr.A**  (15656)</a>        
</li>
</ul>

I want everything in bold (between the ** markers), together with its href value, stored in a HashMap. First I tried the Jericho HTML Parser, but I found it too complicated. Then I tried a regex, but I don't know exactly how to write it. Can you help me?

Thanks!

Update: I'm trying this, but it's not the right way.

Source s = new Source(answer);
List<Element> Form1 = s.getAllElements(HTMLElementName.UL);
int tam1 = Form1.size();
for (int j = 0; j < tam1; j++) {
    Element e1 = Form1.get(j);
    if ("tree-dotlrn_class_instance".equals(e1.getAttributeValue("id"))) {
        List<Element> L1 = e1.getAllElements(HTMLElementName.UL);
        for (int k = 0; k < L1.size(); k++) {
            Element e2 = L1.get(k);
            System.out.println("Elemento de la lista L1: " + e2.getContent());
            List<Element> L2 = e2.getAllElements(HTMLElementName.LI);
            for (int m = 0; m < L2.size(); m++) {
                Element e3 = L2.get(m);
                System.out.println("Elemento de la lista L2: " + e3.getContent());
                asignaturas.add(e3.getContent().toString());
                System.out.println("Lista de asignaturas " + m + " " + asignaturas.get(0));
            }
        }
    }
}

Nothing is in bold in your **code block**. Also, HTML is not a regular language, so you can't parse it reliably with regular expressions.
– Richard JP Le Guen, Commented Jan 8, 2013 at 16:42
Just read @m0skit0's link. But there is one valid, very narrow case: the HTML is generated on the fly by some application, and you either own that application or otherwise know when its output changes (or you only need this for a week and assume it won't change in that time). Then you can parse what will be well-formed HTML. It's a pretty selective case, but the example HTML seems to fall into it. It just depends.
– Lee Meador, Commented Jan 8, 2013 at 17:06
Nothing is in bold because it's a code block, where that formatting doesn't apply, but I put the ** around the parts of the code I want to extract. OK, I've now decided to use an HTML parser. How can I do it? I want every href together with the text associated with it in the browser view.
– Carlos del Blanco, Commented Jan 8, 2013 at 17:26
@LeeMeador And will you keep updating the Java app each time the output of the other app changes? Poor foresight IMHO. Do it right the first time and you're done.
– m0skit0, Commented Jan 8, 2013 at 22:02

nicholas.hauschild · Accepted Answer · 2013-01-08 16:48:26Z

5

Take a look at JSoup's selector syntax.

If you are looking for all a elements with an href attribute, you can find them like this:

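import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
String theHtmlInYourExample = "...";
Document doc = Jsoup.parse(theHtmlInYourExample);
Elements links = doc.select("a[href]"); // every <a> that has an href attribute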

From there, you should be able to extract the text of the element and the value of the href attribute to create your HashMap.
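A minimal sketch of that last step, assuming jsoup is on the classpath. The class name CourseLinks and the sample markup in main are illustrative and not from the question. The selector is scoped to the list's id, so links elsewhere on the page are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class CourseLinks {

    /** Maps each href to the link's visible text, for links inside the course tree only. */
    static Map<String, String> extract(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> titlesByHref = new LinkedHashMap<>();
        // Descendant selector: only <a href> elements inside the <ul> with this id.
        for (Element a : doc.select("ul#tree-dotlrn_class_instance a[href]")) {
            // text() returns the link text with whitespace normalized and trimmed.
            titlesByHref.put(a.attr("href"), a.text());
        }
        return titlesByHref;
    }

    public static void main(String[] args) {
        // Hypothetical sample input: one link outside the list, one inside it.
        String html = "<a href=\"/elsewhere/\">Not part of the list</a>"
                + "<ul id=\"tree-dotlrn_class_instance\"><li>"
                + "<a href=\"/dotlrn/classes/c033/13048/c12c033a13048gA/\">"
                + "2011-12 Processad.Llenguatge Gr.A (13048)</a>"
                + "</li></ul>";
        extract(html).forEach((href, title) -> System.out.println(href + " -> " + title));
    }
}
```

The map is keyed by href here because the paths are unique, while two entries could in principle share a title. Swap the arguments to put if you prefer to look links up by title.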

edited Jan 8, 2013 at 16:48

answered Jan 8, 2013 at 16:39


2 Comments

Carlos del Blanco Over a year ago

Not all the <a> elements, only the elements in this list

nicholas.hauschild Over a year ago

The beautiful part of JSoup is that you can use the selector syntax to do just that! Take a look at the link; it should provide plenty of detail to get you further than my small example.

Lee Meador · Answer · 2013-01-08 16:52:57Z

0

Regex:

\<a\s+href\s*\=\s*["']/dotlrn/classes/c033.+\>(.*)\(\d+\)\</a\>

Java String:

"\\<a\\s+href\\s*\\=\\s*[\"']/dotlrn/classes/c033.+\\>(.*)\\(\\d+\\)\\</a\\>"

You probably won't find it reliable, but the first capturing group will be your desired string as long as the pages match what you supplied.
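As a rough sketch of how such a pattern could be applied with java.util.regex: the pattern below is a variant, not the one above. It adds a group for the href, keeps the trailing "(number)" in the title, and allows whitespace before the closing tag, since some links in the sample have a space there. The class name RegexCourseLinks is made up for illustration, and the approach only holds while the generated HTML keeps this exact shape.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCourseLinks {

    // Group 1: an href under /dotlrn/classes/c033.
    // Group 2: the link text, which must end in "(digits)" and contain no tags.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href\\s*=\\s*[\"'](/dotlrn/classes/c033[^\"']*)[\"'][^>]*>"
            + "\\s*([^<]*?\\(\\d+\\))\\s*</a>");

    /** Maps each matching href to its link text. */
    static Map<String, String> extract(String html) {
        Map<String, String> titlesByHref = new LinkedHashMap<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            titlesByHref.put(m.group(1), m.group(2));
        }
        return titlesByHref;
    }

    public static void main(String[] args) {
        // One line taken from the question's sample, without the ** markers.
        String html = "<li><a href=\"/dotlrn/classes/c033/13022/c12c033a13022gA/c12c033a13022gAsT00/\">"
                + "2011-12 Entorns d'Usuari Gr.A  Sgr.T00 (13022) </a></li>";
        extract(html).forEach((href, title) -> System.out.println(href + " -> " + title));
    }
}
```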

Here is a place to test Java regular expressions

answered Jan 8, 2013 at 16:52


ldam · Answer · 2013-01-08 16:54:58Z

0

Why not use the DOM API? You can get attributes and values fairly trivially with it.
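One way this might look with the JDK's built-in DOM and XPath classes, assuming the fragment is well-formed XML (which the sample in the question appears to be). Real-world HTML often is not, and the parser will reject it. The class name DomCourseLinks is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomCourseLinks {

    /** Maps each href to its trimmed link text; the input must be well-formed XML. */
    static Map<String, String> extract(String wellFormedHtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(wellFormedHtml)));

            // All <a href> elements anywhere below the <ul> with the target id.
            NodeList anchors = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "//ul[@id='tree-dotlrn_class_instance']//a[@href]",
                    doc, XPathConstants.NODESET);

            Map<String, String> titlesByHref = new LinkedHashMap<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                titlesByHref.put(a.getAttribute("href"), a.getTextContent().trim());
            }
            return titlesByHref;
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<ul id=\"tree-dotlrn_class_instance\"><li>"
                + "<a href=\"/dotlrn/classes/c033/13048/c12c033a13048gA/\">"
                + "2011-12 Processad.Llenguatge Gr.A  (13048)</a></li></ul>";
        extract(html).forEach((href, title) -> System.out.println(href + " -> " + title));
    }
}
```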

answered Jan 8, 2013 at 16:54


Waleed Almadanat · Answer · 2013-01-08 16:57:51Z

0

You could also try XML pull parsing or DOM, provided the input HTML is well-formed.
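A sketch of the pull-parsing route, using StAX (javax.xml.stream) as the pull API. That choice is an assumption, since the answer does not name a specific parser, and the class name PullCourseLinks is invented for the example. It also relies on the input being well-formed XML and on each link containing text only.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class PullCourseLinks {

    private static final String TARGET_ID = "tree-dotlrn_class_instance";

    /** Maps each href to its trimmed link text, for links inside the target list. */
    static Map<String, String> extract(String wellFormedHtml) {
        Map<String, String> titlesByHref = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(wellFormedHtml));
            int listDepth = 0; // greater than 0 while inside the target <ul>, nested lists included
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("ul".equals(name)) {
                        boolean isTarget = TARGET_ID.equals(reader.getAttributeValue(null, "id"));
                        if (listDepth > 0 || isTarget) {
                            listDepth++;
                        }
                    } else if (listDepth > 0 && "a".equals(name)) {
                        // Read the attribute first: getElementText() advances to </a>.
                        String href = reader.getAttributeValue(null, "href");
                        String title = reader.getElementText().trim();
                        if (href != null) {
                            titlesByHref.put(href, title);
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && listDepth > 0 && "ul".equals(reader.getLocalName())) {
                    listDepth--;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return titlesByHref;
    }

    public static void main(String[] args) {
        String html = "<ul id=\"tree-dotlrn_class_instance\"><li>"
                + "<a href=\"/dotlrn/classes/c033/13000/c12c033a13000gA/\">"
                + "2011-12 Ampl.Arquit.Computadors Gr.A  (13000)</a></li></ul>";
        extract(html).forEach((href, title) -> System.out.println(href + " -> " + title));
    }
}
```

Compared with DOM, this reads the input in a single pass without building a tree, at the cost of tracking the nesting depth by hand.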

answered Jan 8, 2013 at 16:57
