javascript regex for links and links class

Question

I need to collect all links out of text in javascript with regex, separating the actual content of href and the text of the link. So if the link is

<a href="someplace/topics/us/john.htm" class="r_lapi">John Dow</a>

I want to collect the content of href and "John Dow".

The links have class="r_lapi" in them that would identify the links I'm looking for. What I have right now is:

     var link_regex = new RegExp("/<a[^]*</a>/");
     var match = content.match(link_regex, 'i');
     console.log("match =", match );

Which does absolutely nothing. Any help is very much appreciated.

Why use regex? Why not use the DOM? Are you doing this outside the browser?
– T.J. Crowder, Commented Jun 13, 2014 at 17:09

T.J. Crowder · Accepted Answer · 2014-06-14 07:32:23Z

If you can use the DOM (you've said you want regex, but...)

var i;
var links = document.querySelectorAll("a.r_lapi");
for (i = 0; i < links.length; ++i) {
    // use `links[i].innerHTML` here
}

You've said in a comment that you're trying to do this with regex because you're receiving the link HTML (presumably mixed with a bunch of other stuff) via ajax. You can use the browser to parse it and then look for the links in the parsed result, without adding the HTML to your document, using a disconnected element:

var div, links, i;

// Create an element; note we don't append it anywhere
div = document.createElement('div');

// Fill it in with the HTML
div.innerHTML = text;

// Find relevant links (same as the earlier example)
links = div.querySelectorAll("a.r_lapi");
for (i = 0; i < links.length; ++i) {
    // use `links[i].innerHTML` here
}
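The question asks for both the href and the link text, so here is a minimal sketch of what the loop body might collect. The helper name `collectLinks` and the shape of the returned objects are assumptions for illustration, not part of the original answer. It takes any root that supports `querySelectorAll`, so it should work with `document` as well as with the disconnected `div`.

```javascript
// Sketch only: `collectLinks` is a hypothetical helper name.
// `root` can be `document` or a disconnected element such as `div`.
function collectLinks(root) {
    var links = root.querySelectorAll("a.r_lapi");
    var results = [];
    var i;
    for (i = 0; i < links.length; ++i) {
        results.push({
            // getAttribute returns the raw attribute text; the `href`
            // property would return a resolved, absolute URL instead
            href: links[i].getAttribute("href"),
            // textContent gives the text with any nested markup removed
            text: links[i].textContent
        });
    }
    return results;
}
```

With the disconnected element above, usage would be `collectLinks(div)` after setting `div.innerHTML`.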

Live Example, using this text returned via ajax:

<a href="someplace/topics/us/john.htm" class="r_lapi">John Dow</a>
<a href="foo">Don't pick me</a>
<a href="blahblahblah" class="r_lapi">Jane Bloggs</a>

The only real "gotcha" here is that if the HTML contains image tags, the browser will start downloading those images (even though they won't be shown anywhere). This is true even if you use a document fragment, which is part of why I didn't bother above. (script tags in the text aren't a problem: they aren't executed when you use innerHTML. Beware, though, that they are executed by things like jQuery's html function.)

Or if the data is coming back to you in some other form (like JSON), with the HTML in it, parse the JSON (or whatever) and then run each HTML fragment through the div one at a time:

function handleLinks(data) {
  var div, links, htmlIndex, linkIndex;

  div = document.createElement('div');
  for (htmlIndex = 0; htmlIndex < data.htmlList.length; ++htmlIndex) {
    div.innerHTML = data.htmlList[htmlIndex];
    links = div.querySelectorAll("a.r_lapi");
    for (linkIndex = 0; linkIndex < links.length; ++linkIndex) {
      // Use `links[linkIndex].innerHTML` here
    }
  }
}

Live Example, using this JSON returned via ajax:

{
    "htmlList": [
        "blah blah <a href=\"someplace/topics/us/john.htm\" class=\"r_lapi\">John Dow</a> blah blah",
        "<a href=\"foo\">Don't pick me</a>",
        "Two in this one <a href=\"blahblahblah\" class=\"r_lapi\">Jane Bloggs</a> and <a href=\"blahblahblah\" class=\"r_lapi\">Trevor Bloggs</a>"
    ]
}

If you really need to use regex:

Beware that you cannot do this reliably with regular expressions in JavaScript; you need a parser.

You can get close with a couple of assumptions.

 var link_regex = /<a(?:>|\s[^>]*>)(.*?)<\/a>/i;
 var match = content.match(link_regex);
 if (match) {
     // match[1] holds the content between the opening <a ...> tag and </a>
 }

Live illustration

That looks for this:

1. The literal text <a
2. Either a > immediately following, or at least one whitespace character followed by any number of characters that aren't a >, followed by a >
3. Any number of characters, minimal-match
4. The literal text </a>

The "minimal match" in Step 3 is so we don't get more than we want if we have <a>first</a><a>second</a>.
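To make the difference concrete, here is a small comparison of the minimal-match regex with a greedy variant on that same input. The variable names are only for illustration.

```javascript
var sample = "<a>first</a><a>second</a>";

// Minimal match (.*?): stops at the first </a> it can
var lazyRegex = /<a(?:>|\s[^>]*>)(.*?)<\/a>/i;
// Greedy match (.*): runs on to the last </a> in the string
var greedyRegex = /<a(?:>|\s[^>]*>)(.*)<\/a>/i;

var lazyContent = sample.match(lazyRegex)[1];     // "first"
var greedyContent = sample.match(greedyRegex)[1]; // "first</a><a>second"
```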

I haven't tried to limit the regex by the class, I'll leave that as an exercise for the reader. :-)
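For what it's worth, one way that exercise might go is sketched below. It is not from the original answer, and it leans on extra assumptions: attributes are double-quoted, no attribute value contains a >, and each link sits on a single line. The function name `findClassLinks` is hypothetical.

```javascript
// Sketch only, under the assumptions stated above.
function findClassLinks(content, className) {
    var tagRegex = /<a\s([^>]*)>(.*?)<\/a>/gi;        // group 1: attributes, group 2: content
    var hrefRegex = /(?:^|\s)href="([^"]*)"/i;        // skips things like data-href
    var classRegex = /(?:^|\s)class="([^"]*)"/i;
    var results = [];
    var match, href, cls;

    while ((match = tagRegex.exec(content)) !== null) {
        href = hrefRegex.exec(match[1]);
        cls = classRegex.exec(match[1]);
        // The class attribute may list several classes, so split on whitespace
        if (href && cls && cls[1].split(/\s+/).indexOf(className) !== -1) {
            results.push({ href: href[1], text: match[2] });
        }
    }
    return results;
}
```

Checking the class in a second step, rather than inside one large pattern, keeps the result independent of attribute order.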

Again, though, this is a bad idea. Instead, use the DOM (if you're doing this outside a browser, there are plenty of DOM implementations you can use).

One of the primary assumptions made above is that there is never a > character within an attribute value in the anchor (e.g., <a href="..." data-something="I have a > in me">John Dow</a>). It's perfectly valid to have a > inside an attribute value, so that assumption is invalid.
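A quick sketch of how that goes wrong: running the regex from above on such an anchor ends the opening tag at the > inside the attribute value, so the captured "content" includes the tail of the attribute.

```javascript
var linkRegex = /<a(?:>|\s[^>]*>)(.*?)<\/a>/i;
var tricky = '<a href="..." data-something="I have a > in me">John Dow</a>';

var trickyMatch = tricky.match(linkRegex);
// trickyMatch[1] is ' in me">John Dow', not "John Dow"
var captured = trickyMatch[1];
```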

Thanks a lot for the help! I got regex working. I expect links coming back via ajax, otherwise I would definitely have gone for the solution with querySelectorAll. Also, my links are going to be related to a particular source, of a predictable format, so I'm not expecting special characters in them. I'm set for now!
@lw0: Glad that helped! You still don't have to use regex with data returned via ajax, by the way. I've added a couple of examples showing how to do that.

Bart · Accepted Answer · 2014-06-13 17:18:58Z


If you're in a browser, you really should be using the native DOM.

If you're not, assuming the href does not contain weird characters like > or ", you could use the following regex:

var matches = link.match(/^<a\s+[^>]*href="([^"]+)"[^>]*>([^<]*)<\/a>$/);
matches[1] == "someplace/topics/us/john.htm";
matches[2] == "John Dow";

Please note that this will fail on certain links like

<a href=">">test</a>
<a href="test">John <b>Dow</b></a>

For a complete solution, use an HTML parser.


lw0 Over a year ago

Thank you very much for the response. For some reason the expression you have wasn't working for me. I settled for the following regex, it gets me pretty close to what I need: content.match(/<a\shref="([^"]*)"[^>]*>([^<]*)<\/a>/g);
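One thing to be aware of with that call: when the regex has the g flag, match returns only the full matched strings, not the capture groups. A minimal sketch of getting the href and text for every link is to loop with exec instead. The function name `collectAllLinks` is hypothetical, and this pattern does not filter by class.

```javascript
// Sketch only: same pattern as in the comment above, looped with exec
// so that the capture groups are available for each match.
function collectAllLinks(content) {
    var regex = /<a\shref="([^"]*)"[^>]*>([^<]*)<\/a>/g;
    var results = [];
    var match;
    while ((match = regex.exec(content)) !== null) {
        results.push({ href: match[1], text: match[2] });
    }
    return results;
}
```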
