Get all links from html page using regex

Question

I'm using Google Apps Script to fetch the content of emails from Gmail, and I then need to extract all of the links from the HTML tags. I found some code here on Stack Overflow and implemented it with a regular expression, but it always returns only the first URL (http://vacante2016.eu/tr/17599/51743713/c4f5eadf38eb475d39e3cdeca9201538).

Is there a way to write a loop that searches for the next match of the regular expression, so that all of the URLs are returned one by one?

Here you can see an example with the content of an email that I need to get those links from: https://www.mailinator.com/inbox2.jsp?public_to=get_urls#/#public_showmaildiv

This is my code:

function getURL() {

  var threads = GmailApp.getInboxThreads();
  var message = threads[0].getMessages()[0];
  var content = message.getRawContent();

    var source = (content || '').toString();
    var urlArray = [];
    var url;
    var matchArray;

    // Regular expression to find FTP, HTTP(S) URLs.
    var regexToken = /(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/;

    // Iterate through any URLs in the text.
    while( (matchArray = regexToken.exec( source )) !== null )
    {
      var token = matchArray[0];
      urlArray.push( token );
    }
}

UPDATE: Changing the regex to /(?:ht|f)tps?\:\/\/[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(\/[\S=]*)?/g improved things, but now I also get results like the following when I search for URLs: "http://vacante2016.eu/clk/17599/5=\r\n1743713/150132/bf7639dd7e7aa48c9197a52a8c61e168\"><img" ... I think the regex should also have a condition so that it returns the URL only up to the > symbol.

Also, is there a way to remove the additional characters like =, \r and \n from the found url?

Looks like you forgot /g: var regexToken = /(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/g;. See stackoverflow.com/questions/520611/…
– Wiktor Stribiżew, Commented Aug 8, 2016 at 13:02
If the email is formatted with html, is there a reason as to why you're not just getting the attributes straight from the tags?
– NTL, Commented Aug 8, 2016 at 13:05
@NTL no, there is no reason, but I don't know how to do this...I think that the regex must search for the href property from <a> and <img/> tags
– Valip, Commented Aug 8, 2016 at 13:08
@WiktorStribiżew that fixed it, but now a url response that looks like this : http://vacante2016.eu/tr/17599/51743713/c4f5eadf38eb475d39e3cde=ca9201538 will be truncated after = as follows: http://vacante2016.eu/tr/17599/51743713/c4f5eadf38eb475d39e3cde .. why does this happen?
– Valip, Commented Aug 8, 2016 at 13:11
Well, the /(?:ht|f)tps?\:\/\/[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(\/\S*)?/g should work. Check what you are doing to the links or whether you check against expected contents.
– Wiktor Stribiżew, Commented Aug 8, 2016 at 13:16

Wiktor Stribiżew · Accepted Answer · 2016-08-08 17:26:50Z

4

You need to use a global modifier /g to get multiple matches with RegExp#exec.

Besides, since your input is HTML code, you need to make sure you do not grab < with \S:

/(?:ht|f)tps?:\/\/[-a-zA-Z0-9.]+\.[a-zA-Z]{2,3}(\/[^"<]*)?/g

See the regex demo.

If for some reason this pattern does not match equal signs, add it as an alternative:

/(?:ht|f)tps?:\/\/[-a-zA-Z0-9.]+\.[a-zA-Z]{2,3}(?:\/(?:[^"<=]|=)*)?/g

See another demo (however, the first one should do).
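To make the loop concrete, here is a minimal sketch of the first pattern used with RegExp#exec. With the g flag, each call to exec resumes from the regex's lastIndex, so the loop advances instead of finding the first URL again. The function name extractUrls and the sample HTML are made up for illustration and are not taken from the original email.

```javascript
// Hypothetical helper: collects every URL matched by the first pattern above.
function extractUrls(source) {
  var text = String(source || '');
  // The g flag makes exec() continue from regex.lastIndex on each call.
  var regex = /(?:ht|f)tps?:\/\/[-a-zA-Z0-9.]+\.[a-zA-Z]{2,3}(\/[^"<]*)?/g;
  var urls = [];
  var match;
  while ((match = regex.exec(text)) !== null) {
    urls.push(match[0]); // match[0] is the whole matched URL
  }
  return urls;
}

// Made-up sample input for illustration only.
var sampleHtml = '<a href="http://example.com/a/1">one</a> <img src="https://example.org/b.png">';
var found = extractUrls(sampleHtml);
```

In Apps Script the same function could presumably be called with the message content in place of sampleHtml.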


8 Comments

Valip Over a year ago

The second pattern works perfectly! Last question: is there a way to remove the additional characters like =, \r and \n from the found url such that "http://vacante2016.eu/clk/17599/5=\r\n1743713/150132/bf7639dd7e7aa48c9197a52a8c61e168\" will be "http://vacante2016.eu/clk/17599/51743713/150132/bf7639dd7e7aa48c9197a52a8c61e168\" ?

Wiktor Stribiżew Over a year ago

I don't know if these are literal strings. If they are, you will have to use something like .replace(/\\[rn]|=/g, '').

Valip Over a year ago

They are string literals. I use token.replace(/\\[rn]|=/g, '') and nothing happens. To be sure, I also called token.toString() before using replace.

Wiktor Stribiżew Over a year ago

Then try .replace(/[\r\n=]+/g, "")

Valip Over a year ago

This works only partially, because only the = is removed. I also tried .replace("\r", "") and it does nothing...
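One possible reading, offered as an assumption rather than something the thread confirms: an = immediately followed by \r\n looks like a quoted-printable soft line break, which would be consistent with reading the raw message. The sketch below removes only that sequence and leaves other = characters (for example in query strings) alone. Since the thread never settles whether \r and \n are real control characters or literal backslash text, it handles both. The name stripSoftLineBreaks is hypothetical.

```javascript
// Hypothetical helper: removes "=" + line break sequences, nothing else.
function stripSoftLineBreaks(text) {
  return String(text)
    // "=" followed by a real CR LF pair, or by a lone LF
    .replace(/=\r?\n/g, '')
    // "=" followed by the literal characters backslash-r backslash-n
    .replace(/=\\r\\n/g, '');
}

// The URL from the comment above, once with real control characters...
var withControlChars = 'http://vacante2016.eu/clk/17599/5=\r\n1743713/150132/bf7639dd7e7aa48c9197a52a8c61e168';
// ...and once with literal backslash sequences.
var withLiteralText = 'http://vacante2016.eu/clk/17599/5=\\r\\n1743713/150132/bf7639dd7e7aa48c9197a52a8c61e168';

var cleanedA = stripSoftLineBreaks(withControlChars);
var cleanedB = stripSoftLineBreaks(withLiteralText);
```

Applying this to the whole source before running the URL regex, rather than to each token afterwards, may be more robust, because a break can fall anywhere in the text. In Apps Script it may also be worth trying message.getBody() instead of getRawContent(), since it returns the HTML body rather than the raw message; the thread did not test this.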


NTL · Answer · 2016-08-08 13:47:40Z

-2

I am assuming based on the code you provided that you are able to get the contents of the email as an html string.

function getHref(content){
  var el = document.createElement('html');
  el.innerHTML = content;

  var hrefs = [];

  var elements = el.getElementsByTagName('a');

  for (var i=0; i < elements.length; i++){
    hrefs.push(elements[i].href);
  }

  return hrefs;
}

This will return an array of all the href attributes from anchor tags on the page.


2 Comments

Wiktor Stribiżew Over a year ago

The document object is not accessible in Google Apps Scripts. That framework does not support all the JS features, only some of them.

roma Over a year ago

This only works in browser, client-side. Google apps script is server-side, there is no DOM there at all.
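Since there is no DOM to query, the idea of reading attributes straight from the tags can still be approximated with a regex. This is a rough sketch, not a full HTML parser: it assumes the attribute values are quoted, and the name getLinkAttributes and the sample markup are made up for illustration.

```javascript
// Hypothetical helper: reads href/src attribute values with a regex instead of a DOM.
function getLinkAttributes(html) {
  // Captures the value of a double-quoted (group 1) or single-quoted (group 2) attribute.
  var attrRegex = /\b(?:href|src)\s*=\s*(?:"([^"]*)"|'([^']*)')/gi;
  var values = [];
  var match;
  while ((match = attrRegex.exec(String(html))) !== null) {
    // Only one of the two groups participates in any given match.
    values.push(match[1] !== undefined ? match[1] : match[2]);
  }
  return values;
}

// Made-up sample input for illustration only.
var sample = '<a href="http://example.com/page?id=1">x</a><img src=\'http://example.com/i.png\'/>';
var links = getLinkAttributes(sample);
```

Unquoted attribute values and attributes split across encoded line breaks are not handled here.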
