I'm trying to scrape a website, and I'm new to regular expressions. I have a long character vector, and this is the line I'm targeting:

<h3 class=\"title4\">Results: <span id=\"hitCount.top\">10,079</span></h3>\n  

I want to extract the number between <span id=\"hitCount.top\"> and </span>, in this case 10,079. My approach so far is below, but it is not working:

x <- '<h3 class=\"title4\">Results: <span id=\"hitCount.top\">10,079</span>'
m <- gregexpr(pattern="[<span id=\"hitCount.top\">].+[</span>]", x, ignore.case = FALSE, perl = FALSE,
         fixed = FALSE, useBytes = FALSE)
regmatches(x, m)

Any help will be appreciated.

  • sub(".*<span id=\"hitCount.top\">(.*?)<.*", "\\1", x) Commented Apr 29, 2016 at 10:26
  • Why not obtain the plain text value and get the number from it? It would be "cleaner". Commented Apr 29, 2016 at 10:26
  • You are using character classes. [...] matches any single character listed inside the brackets, not the whole string. Commented Apr 29, 2016 at 10:38
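
To make the last comment concrete, here is a minimal sketch comparing the bracketed pattern from the question with a literal version of the same tags. It assumes the same x as in the question; the names bad and good are only illustrative.

```r
x <- '<h3 class="title4">Results: <span id="hitCount.top">10,079</span>'

# Bracketed version: each [...] matches ONE character from the set, so the
# match can start at the very first "<" and, because .+ is greedy, run on
# to the final ">".
bad <- regmatches(x, regexpr("[<span id=\"hitCount.top\">].+[</span>]", x))

# Literal version: the tags are matched as whole strings and the capture
# group keeps only the text between them.
good <- sub(".*<span id=\"hitCount.top\">(.*?)</span>.*", "\\1", x)

print(bad)   # the whole input line
print(good)  # only the number
```

With this input, the bracketed pattern returns the entire line rather than the number, which is why the original attempt appears to do nothing useful.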

2 Answers

Just to illustrate how easy this becomes if you use the XML package:

> library("XML")
> url = "PATH_TO_HTML"
> parsed_doc = htmlParse(file=url, useInternalNodes = TRUE)
> h3title4 <- getNodeSet(doc = parsed_doc, path = "//h3[@class='title4']")
> plain_text <- sapply(h3title4, xmlValue)
> plain_text
[1] "Results: 10,079"
> sub("\\D*", "", plain_text)
[1] "10,079"

The sub("\\D*", "", plain_text) call removes the leading run of zero or more non-digit characters: \D* matches "Results: " and sub replaces it with an empty string, leaving "10,079".
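
If the count is needed as a number rather than as text, the thousands separator has to go as well. A small sketch, assuming the comma is always a thousands separator (the question does not say so explicitly); count_chr and count_num are illustrative names:

```r
plain_text <- "Results: 10,079"

# Drop the leading non-digits, as in the answer above.
count_chr <- sub("\\D*", "", plain_text)

# Remove the comma and convert to a number.
# Assumption: "," is a thousands separator, not a decimal mark.
count_num <- as.numeric(gsub(",", "", count_chr, fixed = TRUE))

print(count_chr)
print(count_num)
```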

The example HTML I used was

<html>
<body>
<h3 class="title4">Results: <span id="hitCount.top">10,079</span></h3>
<img width="10%" height="10%" src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Green-Up-Arrow.svg/2000px-Green-Up-Arrow.svg.png"/>
</body>
</html>
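
For a version that runs without a file on disk, the same HTML can be parsed from a string and the span selected directly by its id. This is a sketch, not part of the original answer: it assumes the XML package is installed, and the XPath targets the span instead of the enclosing h3, so no regex cleanup is needed afterwards.

```r
library(XML)

# The example document from above, held in a string instead of a file.
html <- '<html>
<body>
<h3 class="title4">Results: <span id="hitCount.top">10,079</span></h3>
</body>
</html>'

# asText = TRUE tells htmlParse that the first argument is HTML content,
# not a path or URL.
doc <- htmlParse(html, asText = TRUE)

# Select the span by its id attribute and take its text content.
hit <- xpathSApply(doc, "//span[@id='hitCount.top']", xmlValue)
print(hit)
```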

Using the stringr library:

> library(stringr)
> str_extract(x, "(?<=<span id=\"hitCount.top\">)(.*?)(?=</span>)")
[1] "10,079"

Using gsub (sub also works here, because the pattern consumes the whole string and so matches only once):

> gsub(".*<span id=\"hitCount.top\">(.*?)</span>.*", "\\1", x)
[1] "10,079"
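
The question's original regexpr/regmatches approach can also be kept, in base R, by switching to PCRE so that lookarounds are available. A minimal sketch, assuming the same x as in the question; the dot in the id is escaped here so that it matches only a literal ".":

```r
x <- '<h3 class="title4">Results: <span id="hitCount.top">10,079</span>'

# perl = TRUE enables lookbehind (?<=...) and lookahead (?=...), so the
# tags anchor the match without becoming part of it.
m <- regexpr("(?<=<span id=\"hitCount\\.top\">).*?(?=</span>)", x, perl = TRUE)
hit <- regmatches(x, m)
print(hit)
```

On a longer character vector, regmatches returns results only for the elements that matched, so elements without the span are silently skipped.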
