0

I need to collect all links from some text in JavaScript with a regex, capturing the content of href and the link text separately. So if the link is

<a href="someplace/topics/us/john.htm" class="r_lapi">John Dow</a>

I want to collect the content of href and "John Dow".

The links I'm looking for are identified by class="r_lapi". What I have right now is:

     var link_regex = new RegExp("/<a[^]*</a>/");
     var match = content.match(link_regex, 'i');
     console.log("match =", match );

Which does absolutely nothing. Any help is very much appreciated.

1 Comment

Why use regex? Why not use the DOM? Are you doing this outside the browser? Commented Jun 13, 2014 at 17:09

2 Answers

1

If you can use the DOM (you've said you want regex, but...)

var i;
var links = document.querySelectorAll("a.r_lapi");
for (i = 0; i < links.length; ++i) {
    // use `links[i].innerHTML` here
}

You've said in a comment that you're trying to do this with regex because you're receiving the link HTML (presumably mixed with a bunch of other stuff) via ajax. You can use the browser to parse it and then look for the links in the parsed result, without adding the HTML to your document, using a disconnected element:

var div, links, i;

// Create an element; note we don't append it anywhere
div = document.createElement('div');

// Fill it in with the HTML
div.innerHTML = text;

// Find relevant links (same as the earlier example)
links = div.querySelectorAll("a.r_lapi");
for (i = 0; i < links.length; ++i) {
    // use `links[i].innerHTML` here
}

Live Example, using this text returned via ajax:

<a href="someplace/topics/us/john.htm" class="r_lapi">John Dow</a>
<a href="foo">Don't pick me</a>
<a href="blahblahblah" class="r_lapi">Jane Bloggs</a>
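
The loops above only point at `innerHTML`, while the question asks for both the href and the link text. A minimal sketch of that step, assuming a hypothetical helper named `describeLinks` (not part of the original answer) that accepts whatever `querySelectorAll` returned:

```javascript
// Hypothetical helper: turn a NodeList (or any array-like of anchor
// elements) into plain { href, text } objects.
function describeLinks(links) {
    return Array.prototype.map.call(links, function (link) {
        return {
            // getAttribute returns the href exactly as written in the markup;
            // the `link.href` property would return a resolved absolute URL.
            href: link.getAttribute("href"),
            // textContent is the link's text with any nested tags dropped.
            text: link.textContent
        };
    });
}

// In a browser, with `div` filled in as above:
// var results = describeLinks(div.querySelectorAll("a.r_lapi"));
```

Returning plain objects keeps the rest of the code independent of the DOM, so the same results can be logged, stored, or sent elsewhere.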

The only real "gotcha" here is that if the HTML contains image tags, the browser will start downloading those images (even though they won't be shown anywhere). This is true even if you use a document fragment, which is part of why I didn't bother above. (script tags in the text aren't a problem: they aren't executed when you set innerHTML. Beware, though, that they are executed by things like jQuery's html function.)

Or if the data is coming back to you in some other form (like JSON), with the HTML in it, parse the JSON (or whatever) and then run each HTML fragment through the div one at a time:

function handleLinks(data) {
  var div, links, htmlIndex, linkIndex;

  div = document.createElement('div');
  for (htmlIndex = 0; htmlIndex < data.htmlList.length; ++htmlIndex) {
    div.innerHTML = data.htmlList[htmlIndex];
    links = div.querySelectorAll("a.r_lapi");
    for (linkIndex = 0; linkIndex < links.length; ++linkIndex) {
      // Use `links[linkIndex].innerHTML` here
    }
  }
}

Live Example, using this JSON returned via ajax:

{
    "htmlList": [
        "blah blah <a href=\"someplace/topics/us/john.htm\" class=\"r_lapi\">John Dow</a> blah blah",
        "<a href=\"foo\">Don't pick me</a>",
        "Two in this one <a href=\"blahblahblah\" class=\"r_lapi\">Jane Bloggs</a> and <a href=\"blahblahblah\" class=\"r_lapi\">Trevor Bloggs</a>"
    ]
}

If you really need to use regex:

Beware that you cannot do this reliably with regular expressions in JavaScript; you need a parser.

You can get close with a couple of assumptions.

 var link_regex = /<a(?:>|\s[^>]*>)(.*?)<\/a>/i;
 var match = content.match(link_regex);
 if (match) {
     // Use match[1], which contains the content between <a ...> and </a>
 }

Live illustration

That looks for this:

  1. The literal text <a
  2. Either a > immediately following, or at least one whitespace character followed by any number of characters that aren't a >, followed by a >
  3. Any number of characters, minimal-match
  4. The literal text </a>

The "minimal match" in Step 3 is so we don't get more than we want if we have <a>first</a><a>second</a>.
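
To make the difference concrete, here is a small sketch that runs the same pattern twice on that two-link string, once with the minimal-match quantifier and once with a greedy one. The variable names are only for illustration.

```javascript
var html = "<a>first</a><a>second</a>";

// Minimal match (.*?) stops at the first closing tag.
var lazy = html.match(/<a(?:>|\s[^>]*>)(.*?)<\/a>/i);

// Greedy match (.*) runs on to the last closing tag.
var greedy = html.match(/<a(?:>|\s[^>]*>)(.*)<\/a>/i);

console.log(lazy[1]);   // first
console.log(greedy[1]); // first</a><a>second
```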

I haven't tried to limit the regex by the class; I'll leave that as an exercise for the reader. :-)
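
For readers who want a starting point, one possible sketch follows. It is not the answerer's code. It assumes double-quoted attributes, no > inside attribute values, and links that contain no nested anchors; the function name `collectClassLinks` is hypothetical. A lookahead checks for the class so that href and class can appear in either order, and an exec loop with the g flag collects every match along with its capture groups.

```javascript
// Hypothetical helper: collect { href, text } for every <a> whose class
// attribute contains the token r_lapi.
function collectClassLinks(content) {
    // (?=\s)                    the tag name is exactly "a" (not <abbr>, etc.)
    // (?=[^>]*\sclass="...")    the opening tag has class containing r_lapi
    // [^>]*\shref="([^"]*)"     capture group 1: the href value
    // ([\s\S]*?)                capture group 2: the link content, minimal match
    // Note: the i flag also makes the class name comparison case-insensitive.
    var linkRegex = /<a(?=\s)(?=[^>]*\sclass="(?:[^">]*\s)?r_lapi(?:\s[^">]*)?")[^>]*\shref="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi;
    var results = [];
    var match;
    // With the g flag, each exec call resumes where the previous match ended.
    while ((match = linkRegex.exec(content)) !== null) {
        results.push({ href: match[1], text: match[2] });
    }
    return results;
}

var sample =
    '<a href="someplace/topics/us/john.htm" class="r_lapi">John Dow</a>\n' +
    '<a href="foo">Don\'t pick me</a>\n' +
    '<a href="blahblahblah" class="r_lapi">Jane Bloggs</a>';

console.log(collectClassLinks(sample));
```

The same caveats apply as for the simpler pattern: this only gets close under the stated assumptions and is no substitute for a parser.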

Again, though, this is a bad idea. Instead, use the DOM (if you're doing this outside a browser, there are plenty of DOM implementations you can use).

One of the primary assumptions made above is that there is never a > character within an attribute value in the anchor (e.g., <a href="..." data-something="I have a > in me">John Dow</a>). It's perfectly valid to have a > inside an attribute value, so that assumption is invalid.
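
A short sketch of what goes wrong in that case, using the same pattern as above on a made-up link (the href value `x.htm` is a placeholder for illustration):

```javascript
var linkRegex = /<a(?:>|\s[^>]*>)(.*?)<\/a>/i;

var tricky = '<a href="x.htm" data-something="I have a > in me">John Dow</a>';
var match = tricky.match(linkRegex);

// The pattern treats the > inside the attribute value as the end of the
// opening tag, so the capture starts too early:
// match[1] is ' in me">John Dow' rather than 'John Dow'.
console.log(JSON.stringify(match[1]));
```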


2 Comments

Thanks a lot for the help! I got regex working. I expect links coming back via ajax, otherwise I would have definitely gone for the solution with querySelectorAll. Also, my links are going to be related to a particular source, of a predictable format, so I'm not expecting special characters in them. I'm set for now!
@lw0: Glad that helped! You still don't have to use regex with data returned via ajax, by the way. I've added a couple of examples showing how to do that.
1

If you're in a browser, you really should be using the native DOM.

If you're not, assuming the href does not contain weird characters like > or ", you could use the following regex:

var matches = link.match(/^<a\s+[^>]*href="([^"]+)"[^>]*>([^<]*)<\/a>$/);
// For the example link from the question:
// matches[1] === "someplace/topics/us/john.htm"
// matches[2] === "John Dow"

Please note that this will fail on certain links like

  • <a title=">" href="test">test</a>
  • <a href="test">John <b>Dow</b></a>

For a complete solution, use an HTML parser.
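
As a rough sketch of how this pattern behaves in practice: `match` returns null when nothing matches, so indexing the result directly would throw, and the ^ and $ anchors mean the string must be exactly one link with nothing around it. The helper name `parseLink` below is hypothetical.

```javascript
var linkRegex = /^<a\s+[^>]*href="([^"]+)"[^>]*>([^<]*)<\/a>$/;

// Hypothetical helper: returns { href, text }, or null when the string
// is not a single link in the expected shape.
function parseLink(link) {
    var matches = link.match(linkRegex);
    return matches ? { href: matches[1], text: matches[2] } : null;
}

// A string that is exactly one link: href and text are captured.
var ok = parseLink('<a href="someplace/topics/us/john.htm" class="r_lapi">John Dow</a>');

// Text around the link: null, because of the ^ and $ anchors.
var surrounded = parseLink('See <a href="foo">this</a> page');

// A nested tag inside the link: null, because [^<]* cannot cross <b>.
var nested = parseLink('<a href="test">John <b>Dow</b></a>');

console.log(ok, surrounded, nested);
```

To scan a larger block of text for several links, the anchors would have to be dropped and the g flag added, which is a different pattern from the one shown in this answer.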

1 Comment

Thank you very much for the response. For some reason the expression you have wasn't working for me. I settled for the following regex, which gets me pretty close to what I need: content.match(/<a\shref="([^"]*)"[^>]*>([^<]*)<\/a>/g);
