Finding a pattern and extracting strings

Question

I'm trying to scrape a website and I'm new to regular expressions. I have a long character vector, and this is the line I'm after:

<h3 class=\"title4\">Results: <span id=\"hitCount.top\">10,079</span></h3>\n

I want to extract the number between <span id=\"hitCount.top\"> and </span>, in this case 10,079. My approach so far is not working:

x <- '<h3 class=\"title4\">Results: <span id=\"hitCount.top\">10,079</span>'
m <- gregexpr(pattern="[<span id=\"hitCount.top\">].+[</span>]", x, ignore.case = FALSE, perl = FALSE,
         fixed = FALSE, useBytes = FALSE)
regmatches(x, m)

Any help will be appreciated.

Why not obtain the plain text value and get the number from it? It would be "cleaner".
– Wiktor Stribiżew, Commented Apr 29, 2016 at 10:26
You are using a character class. It matches any single character listed inside [], not the whole string.
– rock321987, Commented Apr 29, 2016 at 10:38
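
To make the character-class comment concrete, here is a minimal base R sketch, not a definitive fix. It assumes the same input string as in the question; the names `singles`, `pattern` and `hit` are illustrative only.

```r
x <- '<h3 class="title4">Results: <span id="hitCount.top">10,079</span>'

# Inside [...] the characters form a set, so "[</span>]" matches ONE character
# out of <, /, s, p, a, n, > rather than the literal text "</span>".
singles <- regmatches("span", gregexpr("[</span>]", "span"))[[1]]
singles  # every character of "span" matches on its own

# Written without brackets, the tag is matched literally; the parentheses
# capture whatever lies between the opening and closing tags.
pattern <- '<span id="hitCount\\.top">(.*?)</span>'
m <- regexec(pattern, x, perl = TRUE)
hit <- regmatches(x, m)[[1]][2]  # element 1 is the full match, element 2 the capture
hit
```

The dot in `hitCount.top` is escaped here so that it matches a literal dot only; unescaped, it would match any character.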

Wiktor Stribiżew · Accepted Answer · 2016-04-29 11:58:04Z


Just to illustrate how easy this becomes if you use the XML package:

> library("XML")
> url = "PATH_TO_HTML"
> parsed_doc = htmlParse(file=url, useInternalNodes = TRUE)
> h3title4 <- getNodeSet(doc = parsed_doc, path = "//h3[@class='title4']")
> plain_text <- sapply(h3title4, xmlValue)
> plain_text
[1] "Results: 10,079"
> sub("\\D*", "", plain_text)
[1] "10,079"

The sub("\\D*", "", plain_text) call removes the first run of zero or more non-digit characters in the input: \D* matches "Results: " and sub replaces it with an empty string.
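
The behaviour of that call can be checked in isolation. The sketch below hard-codes the plain text value from the session above. The conversion to a number is an assumed follow-up step that is not part of the original answer, and `count_chr` and `count_num` are illustrative names.

```r
plain_text <- "Results: 10,079"  # value taken from the session above

# sub() replaces only the first match; \\D* consumes the leading non-digits
# and stops at the first digit, so the comma inside the number survives.
count_chr <- sub("\\D*", "", plain_text)
count_chr

# Assumed follow-up (not in the original answer): drop the thousands
# separator and convert the result to a number.
count_num <- as.numeric(gsub(",", "", count_chr, fixed = TRUE))
count_num
```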

The example HTML I used was

<html>
<body>
<h3 class="title4">Results: <span id="hitCount.top">10,079</span></h3>
<img width="10%" height="10%" src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Green-Up-Arrow.svg/2000px-Green-Up-Arrow.svg.png"/>
</body>
</html>

edited Apr 29, 2016 at 11:58

answered Apr 29, 2016 at 11:50

Wiktor Stribiżew


rock321987 · Accepted Answer · 2016-04-29 10:26:46Z


Using the stringr library:

> library(stringr)
> str_extract(x, "(?<=<span id=\"hitCount.top\">)(.*?)(?=</span>)")
[1] "10,079"

Using gsub (sub can also be used here instead of gsub)

> gsub(".*<span id=\"hitCount.top\">(.*?)</span>.*", "\\1", x)
[1] "10,079"
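
For comparison, the same lookaround idea can be sketched in base R without stringr. This is one possible variant rather than part of the original answer, and it assumes the same input string; `between` is an illustrative name.

```r
x <- '<h3 class="title4">Results: <span id="hitCount.top">10,079</span>'

# Lookbehind and lookahead need the PCRE engine, hence perl = TRUE.
# The lookbehind has a fixed length, which PCRE requires.
m <- regexpr('(?<=<span id="hitCount\\.top">).*?(?=</span>)', x, perl = TRUE)
between <- regmatches(x, m)
between
```

One difference worth knowing: when the pattern does not match, gsub returns the input string unchanged, while regmatches returns an empty character vector, which makes a failed match easier to detect.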

answered Apr 29, 2016 at 10:26

rock321987
