
For those veterans who haven't tried Hpple, it's great. It uses XPath to search HTML/XML documents, it gets the job done, and it's easy enough for a newbie like me to understand. However, I'm having trouble with one query.

I have this chunk of HTML:

    <ul class="challengesList dailyChallengesList">

<li>
<div class="corner topLeft"></div>
<img id="ctl00_mainContent_dailyChallengesRepeater_ctl00_challengeImage" title="Gunslinger" src="/images/reachstats/challenges/0.png" alt="Gunslinger" style="border-width:0px;">
<div class="info">
<div class="rFloat">
<p id="ctl00_mainContent_dailyChallengesRepeater_ctl00_challengeExpiration" class="timeDisplay dailyExpirationCountdown"><span>0d</span><span>19h</span><span>9m</span><span class="seconds">37s</span></p>
<p>1500cR</p>
</div>
<h5>Gunslinger</h5>
<p class="description">Kill 150 enemies in multiplayer Matchmaking.</p>
<div class="reward">

<div id="ctl00_mainContent_dailyChallengesRepeater_ctl00_progressBox" class="barContainer">
<div id="ctl00_mainContent_dailyChallengesRepeater_ctl00_progressBar" class="bar" style="width:21%;"><span></span></div> 
<p>31/150</p>
</div>
</div>
</div>
<div class="clear"></div>
</li>

<li>
<div class="corner topLeft"></div>
<img id="ctl00_mainContent_dailyChallengesRepeater_ctl01_challengeImage" title="A Great Friend" src="/images/reachstats/challenges/0.png" alt="A Great Friend" style="border-width:0px;">
<div class="info">
<div class="rFloat">
<p id="ctl00_mainContent_dailyChallengesRepeater_ctl01_challengeExpiration" class="timeDisplay dailyExpirationCountdown"><span>0d</span><span>19h</span><span>9m</span><span class="seconds">37s</span></p>
<p>1400cR</p>
</div>
<h5>A Great Friend</h5>
<p class="description">Earn 15 assists today in multiplayer Matchmaking.</p>
<div class="reward">

<div id="ctl00_mainContent_dailyChallengesRepeater_ctl01_progressBox" class="barContainer">
<div id="ctl00_mainContent_dailyChallengesRepeater_ctl01_progressBar" class="bar" style="width:40%;"><span></span></div> 
<p>6/15</p>
</div>
</div>
</div>
<div class="clear"></div>
</li>

<li>
<div class="corner topLeft"></div>
<img id="ctl00_mainContent_dailyChallengesRepeater_ctl02_challengeImage" title="Cannon Fodder" src="/images/reachstats/challenges/2.png" alt="Cannon Fodder" style="border-width:0px;">
<div class="info">
<div class="rFloat">
<p id="ctl00_mainContent_dailyChallengesRepeater_ctl02_challengeExpiration" class="timeDisplay dailyExpirationCountdown"><span>0d</span><span>19h</span><span>9m</span><span class="seconds">37s</span></p>
<p>1000cR</p>
</div>
<h5>Cannon Fodder</h5>
<p class="description">Kill 50 infantry-class foes in the Campaign today.</p>
<div class="reward">

<div id="ctl00_mainContent_dailyChallengesRepeater_ctl02_progressBox" class="barContainer">
<div id="ctl00_mainContent_dailyChallengesRepeater_ctl02_progressBar" class="bar" style="width:0%;"><span></span></div> 
<p>0/50</p>
</div>
</div>
</div>
<div class="clear"></div>
</li>

<li>
<div class="corner topLeft"></div>
<img id="ctl00_mainContent_dailyChallengesRepeater_ctl03_challengeImage" title="Heroic Demon" src="/images/reachstats/challenges/3.png" alt="Heroic Demon" style="border-width:0px;">
<div class="info">
<div class="rFloat">
<p id="ctl00_mainContent_dailyChallengesRepeater_ctl03_challengeExpiration" class="timeDisplay dailyExpirationCountdown"><span>0d</span><span>19h</span><span>9m</span><span class="seconds">37s</span></p>
<p>1500cR</p>
</div>
<h5>Heroic Demon</h5>
<p class="description">Kill 30 Elites in Firefight Matchmaking on Heroic or harder.</p>
<div class="reward">

<div id="ctl00_mainContent_dailyChallengesRepeater_ctl03_progressBox" class="barContainer">
<div id="ctl00_mainContent_dailyChallengesRepeater_ctl03_progressBar" class="bar" style="width:0%;"><span></span></div> 
<p>0/30</p>
</div>
</div>
</div>
<div class="clear"></div>
</li>

</ul>

The nutty part is, I cannot get Hpple to "see" the <div class="reward">. I'm using the following to find it:

NSArray *rawProgress = [doc search:@"//ul[@class='challengesList']"
                                   @"/li/div[@class='info']"
                                   @"/div[@class='reward']/p"];

This always returns an empty array. It's driving me nuts, as the same kind of thing worked for all of the other elements in this project...

Any help would be appreciated :)

EDIT

This works:

NSArray *rawDescriptions = [doc search:@"//ul[@class='challengesList']"
                                       @"/li/div[@class='info']"
                                       @"/p[@class='description']"];

This doesn't:

NSArray *rawProgress = [doc search:@"//ul[@class='challengesList']"
                                   @"/li/div[@class='info']"
                                   @"/div[@class='reward']"
                                   @"/div[@id]//p"];

Furthermore, trying to list the child nodes of rFloat or reward produces a crash :(

5
  • Don't forget to put backquotes around the <div ...> element in the text of your question... fixed it for you. Commented Dec 7, 2010 at 13:14
  • It got unfixed by your edit. I'll leave it to you to put the backquotes in where needed, after 'cannot get Hpple to "see" the'. Commented Dec 7, 2010 at 13:22
  • Can you post more of your input HTML? And triple-check that what you posted is really what's coming in as input? Commented Dec 7, 2010 at 15:37
  • I put in 4/5ths of the input HTML. You can view the full source at: view-source:bungie.net/Stats/Reach/Challenges.aspx?player=Aurum+Aquila Commented Dec 7, 2010 at 15:54
  • Also note, the original page is here: bungie.net/Stats/Reach/Challenges.aspx?player=Aurum+Aquila Commented Dec 7, 2010 at 15:55

2 Answers

1

Your p element is not an immediate child of <div class="reward">: it is nested one level deeper, inside <div class="barContainer">.

With the HTML you provided, the XPath expression

div[@class='info']/div[@class='reward']//p

will work, since // selects descendants at any depth rather than only direct children.
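As a rough illustration of the difference, here is a minimal sketch of how that expression might be used through Hpple. It sticks to the `search:` call from the question (newer Hpple releases name it `searchWithXPathQuery:`). The helper name `ProgressStringsFromHTML` and the `htmlData` parameter are hypothetical, the leading `//` is added here so the expression can be run from the document root, and the code assumes ARC. It only reflects the HTML as posted; it cannot account for whatever the parser does with the live page.

```objective-c
#import <Foundation/Foundation.h>
#import "TFHpple.h"
#import "TFHppleElement.h"

// Hypothetical helper: htmlData is assumed to hold the downloaded page.
static NSArray *ProgressStringsFromHTML(NSData *htmlData) {
    TFHpple *doc = [[TFHpple alloc] initWithHTMLData:htmlData];

    // "/p" would only match p elements that are direct children of div.reward.
    // "//p" matches p at any depth below it, so the intermediate
    // barContainer div no longer gets in the way.
    NSArray *nodes = [doc search:@"//div[@class='info']"
                                 @"/div[@class='reward']//p"];

    NSMutableArray *progress = [NSMutableArray array];
    for (TFHppleElement *node in nodes) {
        NSString *text = [node content];   // e.g. the "31/150" text, if present
        if (text != nil) {
            [progress addObject:text];
        }
    }
    return progress;
}
```

If this still comes back empty against the real page, the XPath itself is probably not the culprit, and it is worth dumping what the parser actually built under div.info before adjusting the expression further.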


3 Comments

Thanks for the recommendation, but this returns a null value. I've added an example of an expression that does work.
@Aurum - @Flack is right that your first XPath, as given, should not return anything, because div[@class='reward'] has no immediate p child element.
But the problem is, when I ask it to list reward's children, there appears to be nothing in it. When I ask it to list info, reward does not appear.
0
  • See this SO question for a similar report on problems with Hpple and a list of alternatives.

You may be seeing a bug. According to this page,

It's classified as an experimental project by the developer, but so far it's "worked for me"

UPDATE: seems to be kinda broken now. Anyone got a better solution?

You may want to enter a bug report, and if the project is still being maintained, maybe the developer will respond with a fix or solution. Or you could leave a comment on this page that recommended hpple, and see if that blogger or one of his readers can address the problem or tell you if hpple is active at all.

You could also see if you can find HyperParser. "It's a simple HTML parser that has API similar to NSXMLParser. Designed specially to parse semi-valid HTML." But it doesn't seem to be there at the link where it used to be.

4 Comments

Yeah, it's an issue with libxml. I tried using it straight up, with the same result. I think the website has malformed HTML... So, I'm thinking about using HTML tidy or scraping the stats from someone else with better HTML.
Could this have anything to do with the fact that the img tags aren't closed?
@Aurum: Seems unlikely, since one of your XPath expressions with li/div[@class='info'] is working, when there is an <img> before that div. But based on what you found out about libxml, why lean on a broken reed? HTMLTidy sounds like a good solution.
According to the site, HyperParser is now part of the BaseAppKit, located here: baseappkit.com
