2

I'm trying to extract values from this piece of HTML:

<ul id="tree-dotlrn_class_instance">
<li>
      <a href="/dotlrn/classes/c033/13000/c12c033a13000gA/">**2011-12 Ampl.Arquit.Computadors Gr.A  (13000)**</a>
<ul>
    <li>
        <a href="/dotlrn/classes/c033/13022/c12c033a13022gA/c12c033a13022gAsT00/">**2011-12 Entorns d'Usuari Gr.A  Sgr.T00 (13022)** </a>
    </li>
    <li>
      <a href="/dotlrn/classes/c033/13036/c12c033a13036gA/c12c033a13036gAsT00/">**2011-12 Eng.Serv.Telemàtics Gr.A  Sgr.T00 (13036)** </a>
    </li>
</ul>
</li>

<li>
      <a href="/dotlrn/classes/c033/13038/c12c033a13038gA/">**2011-12 Intel·lig.Artif.Enginyer.Coneixem. Gr.A  (13038)**</a>
</li>

<li>
      <a href="/dotlrn/classes/c033/13048/c12c033a13048gA/">**2011-12 Processad.Llenguatge Gr.A  (13048)**</a>
<ul>
    <li>
        <a href="/dotlrn/classes/c033/13048/c12c033a13048gA/c12c033a13048gAsL01/">**2011-12 Processad.Llenguatge Gr.A  Sgr.L01 (13048)** </a>
    </li>
    <li>
      <a href="/dotlrn/classes/c033/13048/c12c033a13048gA/c12c033a13048gAsT00/">**2011-12 Processad.Llenguatge Gr.A  Sgr.T00 (13048)** </a>
    </li>
    <li>
      <a href="/dotlrn/classes/c033/13052/c12c033a13052gA/c12c033a13052gAsL02/">**2011-12 Sist.Basats Microprocessadors Gr.A  Sgr.L02 (13052)** </a>
    </li>
</ul>
</li>

<li>
      <a href="/dotlrn/classes/c033/13055/c12c033a13055gAA/">**2011-12 Sist.Informàtics Gr.AA (13055)**</a>
</li>

<li>
      <a href="/dotlrn/classes/c033/14009/c12c033a14009gA/">**2011-12 Administrac. Gestió de Xarxes Gr.A  (14009)**</a>
</li>

<li>
      <a href="/dotlrn/classes/c033/15656/c12c033a15656gA/">**2011-12 Transmissió de Dades Gr.A**  (15656)</a>        
</li>
</ul>

I want to put every link text (marked in bold, between the ** markers) together with its href value into a HashMap. First I tried the Jericho HTML Parser, but I found it too complicated; then I tried a regex, but I don't know exactly how to write it. Can you help me?

Thanks!

Update: I'm trying this, but it's not the right way.

Source s = new Source(answer);
List<Element> Form1 = s.getAllElements(HTMLElementName.UL);
int tam1 = Form1.size();
for (int j = 0; j < tam1; j++) {
    Element e1 = Form1.get(j);
    if ("tree-dotlrn_class_instance".equals(e1.getAttributeValue("id"))) {
        List<Element> L1 = e1.getAllElements(HTMLElementName.UL);
        for (int k = 0; k < L1.size(); k++) {
            Element e2 = L1.get(k);
            System.out.println("Element of list L1: " + e2.getContent());
            List<Element> L2 = e2.getAllElements(HTMLElementName.LI);
            for (int m = 0; m < L2.size(); m++) {
                Element e3 = L2.get(m);
                System.out.println("Element of list L2: " + e3.getContent());
                asignaturas.add(e3.getContent().toString());
                System.out.println("List of subjects " + m + " " + asignaturas.get(0));
            }
        }
    }
}
7
  • Never parse HTML/XML with regexes. Commented Jan 8, 2013 at 16:41
  • There's nothing in strong black in your **code block**. Next, HTML is not Regular so you can't use Regular Expressions to parse it reliably. Commented Jan 8, 2013 at 16:42
  • Just read @m0skit0's link. But there is a valid case, which is very narrow: the HTML is generated on the fly by some application, and you either own that application or otherwise know when its output changes (or you only need this for a week and assume it won't change in that time). Then you can parse what will be well-formed HTML. It's a pretty selective case, but the example HTML seems to fall into it. It just depends. Commented Jan 8, 2013 at 17:06
  • Nothing is in bold because it's a code block, where that formatting doesn't apply, so I put ** around the parts of the code I want to extract. OK, I've now decided to use an HTML parser. How can I do it? I want every href together with the text that is shown for it in the browser. Commented Jan 8, 2013 at 17:26
  • @LeeMeador and you will keep updating the Java app each time the output of the other app changes? Poor foresight IMHO. Do it right the first time and you're done. Commented Jan 8, 2013 at 22:02

4 Answers

5

Take a look at JSoup's selector syntax.

If you are looking for all a elements with an href attribute, you can find them like this:

String theHtmlInYourExample = "...";
Document doc = Jsoup.parse(theHtmlInYourExample);
Elements links = doc.select("a[href]");

From there, you should be able to extract the text of the element and the value of the href attribute to create your HashMap.
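
As a rough sketch of that last step (not necessarily what the answerer had in mind), the selector can be scoped to the list from the question so that only its links are collected. The class name `JsoupLinkMap`, the method `extract`, and the choice of link text as key and href as value are assumptions made for illustration; the sample HTML omits the ** markers, on the assumption that they are not part of the real page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class; requires the jsoup library on the classpath.
class JsoupLinkMap {

    // Maps each link text to its href, for links inside the target list only.
    static Map<String, String> extract(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> links = new LinkedHashMap<>(); // keeps document order
        // "ul#id a[href]" matches anchors at any depth below the list,
        // so links in the nested <ul> elements are included as well.
        for (Element a : doc.select("ul#tree-dotlrn_class_instance a[href]")) {
            // text() trims the text and collapses runs of whitespace
            links.put(a.text(), a.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<ul id=\"tree-dotlrn_class_instance\">"
                + "<li><a href=\"/dotlrn/classes/c033/13000/c12c033a13000gA/\">"
                + "2011-12 Ampl.Arquit.Computadors Gr.A  (13000)</a>"
                + "<ul><li><a href=\"/dotlrn/classes/c033/13022/c12c033a13022gA/c12c033a13022gAsT00/\">"
                + "2011-12 Entorns d'Usuari Gr.A  Sgr.T00 (13022) </a></li></ul>"
                + "</li></ul>"
                + "<a href=\"/elsewhere/\">not in the list</a>";
        extract(html).forEach((text, href) -> System.out.println(text + " -> " + href));
    }
}
```

If two links could share the same text, using the href as the key instead would avoid overwriting entries.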


2 Comments

Not all the <a> elements, only the elements in this list
The beautiful part of JSoup is that you can use the selector syntax to do just that! Take a look at the link, it should provide plenty of details to get you further than my small example.
0

Regex:

\<a\s+href\s*\=\s*["']/dotlrn/classes/c033.+\>(.*)\(\d+\)\</a\>

Java String:

"\\<a\\s+href\\s*\\=\\s*[\"']/dotlrn/classes/c033.+\\>(.*)\\(\\d+\\)\\</a\\>"

This probably won't be reliable, but the first capturing group will hold the string you want, as long as the pages match the sample you supplied.
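
Note that the pattern above requires the closing parenthesis to sit directly before `</a>`, so it skips the nested items in the sample, which have a space there, and it does not capture the href. A hedged variant that captures both is sketched below. It is not the answerer's pattern, and it assumes each anchor sits on a single line, that the ** markers are not part of the real page, and that every wanted href starts with `/dotlrn/classes/`. The names `RegexLinkMap` and `extract` are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only suitable for HTML whose shape is known in advance.
class RegexLinkMap {

    // group(1) = href value, group(2) = link text without surrounding whitespace
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href\\s*=\\s*[\"'](/dotlrn/classes/[^\"']+)[\"'][^>]*>\\s*(.*?)\\s*</a>");

    // Maps each link text to its href, in the order the links appear.
    static Map<String, String> extract(String html) {
        Map<String, String> links = new LinkedHashMap<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.put(m.group(2), m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<li>\n"
                + "  <a href=\"/dotlrn/classes/c033/13000/c12c033a13000gA/\">"
                + "2011-12 Ampl.Arquit.Computadors Gr.A  (13000)</a>\n"
                + "</li>";
        extract(html).forEach((text, href) -> System.out.println(text + " -> " + href));
    }
}
```

Because the text group is lazy and `.` does not match line breaks by default, each match stays within one anchor on one line.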

Here is a place to test Java regular expressions


0

Why not use the DOM API? You can get attributes and values fairly trivially with it.
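
A minimal sketch of the DOM approach, using the JDK's built-in XML parser, might look like the following. It only works if the markup is well-formed XML with a single root element, which the sample in the question appears to be but real-world HTML often is not. The names `DomLinkMap` and `extract` are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; assumes the input is well-formed XML.
class DomLinkMap {

    // Maps each link text to its href, for links inside the target list only.
    static Map<String, String> extract(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        Map<String, String> links = new LinkedHashMap<>();
        // Without a DTD the parser does not know that "id" is an ID attribute,
        // so getElementById cannot be used; compare the attribute by hand instead.
        NodeList lists = doc.getElementsByTagName("ul");
        for (int i = 0; i < lists.getLength(); i++) {
            Element ul = (Element) lists.item(i);
            if (!"tree-dotlrn_class_instance".equals(ul.getAttribute("id"))) {
                continue;
            }
            // getElementsByTagName returns descendants at any depth
            NodeList anchors = ul.getElementsByTagName("a");
            for (int j = 0; j < anchors.getLength(); j++) {
                Element a = (Element) anchors.item(j);
                links.put(a.getTextContent().trim(), a.getAttribute("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String xhtml = "<ul id=\"tree-dotlrn_class_instance\"><li>"
                + "<a href=\"/dotlrn/classes/c033/13038/c12c033a13038gA/\">Example course (13038)</a>"
                + "</li></ul>";
        extract(xhtml).forEach((text, href) -> System.out.println(text + " -> " + href));
    }
}
```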


0

You can surely try using XML Pull Parsing or DOM, given that the input HTML is well formed.
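
For the pull-parsing route, one possible sketch uses StAX (`javax.xml.stream`), the pull-style parser that ships with the JDK. This is a swapped-in choice: the answer does not name a specific library, and Android's `XmlPullParser` would look different. The names `PullLinkMap` and `extract` are hypothetical, and the code assumes well-formed input whose anchors contain only text.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class; assumes the input is well-formed XML.
class PullLinkMap {

    // Maps each link text to its href, for links inside the target list only.
    static Map<String, String> extract(String xhtml) {
        Map<String, String> links = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xhtml));
            int ulDepth = 0; // > 0 while inside the target list; counts nested <ul> too
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("ul")) {
                        boolean isTarget = "tree-dotlrn_class_instance"
                                .equals(reader.getAttributeValue(null, "id"));
                        if (ulDepth > 0 || isTarget) {
                            ulDepth++;
                        }
                    } else if (name.equals("a") && ulDepth > 0) {
                        // Read the attribute first: getElementText() advances to </a>
                        String href = reader.getAttributeValue(null, "href");
                        String text = reader.getElementText().trim();
                        if (href != null) {
                            links.put(text, href);
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && ulDepth > 0 && reader.getLocalName().equals("ul")) {
                    ulDepth--;
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return links;
    }

    public static void main(String[] args) {
        String xhtml = "<ul id=\"tree-dotlrn_class_instance\"><li>"
                + "<a href=\"/dotlrn/classes/c033/13038/c12c033a13038gA/\">Example course (13038)</a>"
                + "</li></ul>";
        extract(xhtml).forEach((text, href) -> System.out.println(text + " -> " + href));
    }
}
```

Unlike DOM, this reads the document in a single pass without building a tree, which mostly matters for large inputs.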

