2

Input data: (some nested list with links )

<ul>
    <li><a>1</a>
        <ul>
            <li><a>11</a>
                <ul>
                    <li><a>111</a></li>
                    <li><a>112</a></li>
                    <li><a>113</a>
                    <ul>
                        <li><a>1131</a></li>
                        <li><a>1132</a></li>
                        <li><a>1133</a></li>
                    </ul></li>
                    <li><a>114</a></li>
                    <li><a>115</a></li>
                </ul>
            </li>
            <li><a>12</a>
                <ul>
                    <li><a>121</a>
                    <ul>
                        <li><a>1211</a></li>
                        <li><a>1212</a></li>
                        <li><a>1213</a></li>
                    </ul></li>
                    <li><a>122</a></li>
                </ul>
            </li>
        </ul>
    </li>
</ul>

Output array of strings:

 1,11,111
 1,11,112
 1,11,113,1131
 1,11,113,1132
 1,11,113,1133
 1,11,114
 1,11,115
 1,12,121,1211
 1,12,121,1212
 1,12,121,1213
 1,12,122

Full path with text of element which in

  • without childs.

    What I tried: 1. XML::SAX::ParserFactory

    https://gist.github.com/7266638 Alot of problem here. How to detect if li last, how to save path etc. I think its bad way.

    1. Its totaly not a regexp, cos in real life example html much worse. Alot of tags, divs, spans etc

    Dom? But how?

  • 2
    • I don't understand why 1131, 1132 and 1133 appear after 113 when they are at same level in the tree. Commented Nov 1, 2013 at 15:37
    • Yes! Thank you! I place wrong tag /li Commented Nov 1, 2013 at 16:10

    1 Answer 1

    3

    You can try with XML::Twig module. It saves all text from <a> elements and only prints them when there is no child <ul> under one of <li> elements.

    #!/usr/bin/env perl
    
    use warnings;
    use strict;
    use XML::Twig;
    
    my (@li);
    
    my $twig = XML::Twig->new(
            twig_handlers => {
                    'a' => sub {
                            if ( $_->prev_elt('li') ) { 
                                    push @li, $_->text;
                            }   
                    },  
                    'li' => sub {
                            unless ( $_->children('ul') ) { 
                                    printf qq|%s\n|, join q|,|, @li;
                            }   
                            pop @li;
                    },  
            },  
    )->parsefile( shift );
    

    Run it like:

    perl script.pl xmlfile
    

    That yields:

    1,11,111
    1,11,112
    1,11,113,1131
    1,11,113,1132
    1,11,113,1133
    1,11,114
    1,11,115
    1,12,121,1211
    1,12,121,1212
    1,12,121,1213
    1,12,122
    
    Sign up to request clarification or add additional context in comments.

    1 Comment

    Thank you! Its awesome. Also XML::Twig - is good point for me.

    Your Answer

    By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

    Start asking to get answers

    Find the answer to your question by asking.

    Ask question

    Explore related questions

    See similar questions with these tags.