3

I'm using Google Apps Script to fetch the content of emails from Gmail, and then I need to extract all of the links from the HTML tags. I found some code here on Stack Overflow and implemented it with a regular expression, but it always returns only the first URL (http://vacante2016.eu/tr/17599/51743713/c4f5eadf38eb475d39e3cdeca9201538).

Is there a way to loop so that each iteration finds the next match of the regular expression and all of the matches are returned one by one?

Here is an example of an email whose links I need to extract: https://www.mailinator.com/inbox2.jsp?public_to=get_urls#/#public_showmaildiv

This is my code:

function getURL() {

  var threads = GmailApp.getInboxThreads();
  var message = threads[0].getMessages()[0];
  var content = message.getRawContent();

    var source = (content || '').toString();
    var urlArray = [];
    var url;
    var matchArray;

    // Regular expression to find FTP, HTTP(S) URLs.
    var regexToken = /(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/;

    // Iterate through any URLs in the text.
    while( (matchArray = regexToken.exec( source )) !== null )
    {
      var token = matchArray[0];
      urlArray.push( token );
    }
}

UPDATE: Changing the regex to /(?:ht|f)tps?\:\/\/[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(\/[\S=]*)?/g improved things, but now I also get results like the following when I search for URLs: "http://vacante2016.eu/clk/17599/5=\r\n1743713/150132/bf7639dd7e7aa48c9197a52a8c61e168\"><img" ... I think the regex also needs a condition so that it returns the URL only up to the > symbol.

Also, is there a way to remove the extra characters such as =, \r and \n from the matched URL?

9
  • 1
    Looks like you forgot /g: var regexToken = /(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/g;. See stackoverflow.com/questions/520611/… Commented Aug 8, 2016 at 13:02
  • If the email is formatted with html, is there a reason as to why you're not just getting the attributes straight from the tags? Commented Aug 8, 2016 at 13:05
  • @NTL no, there is no reason, but I don't know how to do this...I think that the regex must search for the href property from <a> and <img/> tags Commented Aug 8, 2016 at 13:08
  • @WiktorStribiżew that fixed it, but now a URL that looks like this: http://vacante2016.eu/tr/17599/51743713/c4f5eadf38eb475d39e3cde=ca9201538 is truncated after the = as follows: http://vacante2016.eu/tr/17599/51743713/c4f5eadf38eb475d39e3cde .. why does this happen? Commented Aug 8, 2016 at 13:11
  • Well, the /(?:ht|f)tps?\:\/\/[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(\/\S*)?/g should work. Check what you are doing to the links or whether you check against expected contents. Commented Aug 8, 2016 at 13:16

2 Answers 2

4

You need to use the global modifier /g to get multiple matches with RegExp#exec.

Besides, since your input is HTML code, you need to make sure that you do not grab < (which \S matches), so use a negated character class instead:

/(?:ht|f)tps?:\/\/[-a-zA-Z0-9.]+\.[a-zA-Z]{2,3}(\/[^"<]*)?/g

See the regex demo.

If for some reason this pattern does not match equal signs, add = as an explicit alternative:

/(?:ht|f)tps?:\/\/[-a-zA-Z0-9.]+\.[a-zA-Z]{2,3}(?:\/(?:[^"<=]|=)*)?/g

See another demo (however, the first one should do).
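To make the looping part concrete, here is a minimal sketch of collecting every match with the first pattern above. The function name extractUrls and the sample string are made up for illustration; in the question's script, source would be the email content instead.

```javascript
// Collect every URL in a string using a global regex and RegExp#exec.
// extractUrls is a hypothetical helper name, not part of any API.
function extractUrls(source) {
  // With /g, each exec() call resumes from lastIndex, so the loop
  // walks through all matches and ends when exec() returns null.
  var regexToken = /(?:ht|f)tps?:\/\/[-a-zA-Z0-9.]+\.[a-zA-Z]{2,3}(\/[^"<]*)?/g;
  var urlArray = [];
  var matchArray;
  while ((matchArray = regexToken.exec(source)) !== null) {
    urlArray.push(matchArray[0]);
  }
  return urlArray;
}

// Sample input (assumed, not taken from the real email).
var sample = '<a href="http://example.com/a/1">one</a> ' +
             '<img src="https://example.org/img.png">';
var urls = extractUrls(sample);
```

Without the /g flag, exec() keeps returning the same first match, which is why the original loop never moved on to the next URL.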


8 Comments

The second pattern works perfectly! Last question...is there a way to remove the extra characters such as =, \r and \n from the matched URL, so that "http://vacante2016.eu/clk/17599/5=\r\n1743713/150132/bf7639dd7e7aa48c9197a52a8c61e168\" becomes "http://vacante2016.eu/clk/17599/51743713/150132/bf7639dd7e7aa48c9197a52a8c61e168\" ?
I don't know if these are literal strings. If they are, you will have to use something like .replace(/\\[rn]|=/g, '').
They are string literals. I use token.replace(/\\[rn]|=/g, '') and nothing happens. To be sure, I also called token.toString() before using replace.
Then try .replace(/[\r\n=]+/g, "")
This works only partially, because only the = is removed. I also tried .replace("\r", "") and it does nothing...
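One possible explanation, offered as an assumption rather than something confirmed in this thread: an = followed by \r\n is what a quoted-printable soft line break looks like, and raw message content would contain those. If that is the case here, it may be simpler to remove the soft breaks from the whole source before running the URL regex than to clean each match afterwards. A minimal sketch, with removeSoftLineBreaks as a hypothetical helper name:

```javascript
// Remove quoted-printable soft line breaks: "=" directly followed by
// CRLF (or a bare LF). Assumes the input is quoted-printable encoded.
function removeSoftLineBreaks(raw) {
  return raw.replace(/=\r?\n/g, '');
}

// Sample based on the fragment quoted in the question.
var rawSample = 'href="http://vacante2016.eu/clk/17599/5=\r\n' +
                '1743713/150132/bf7639dd7e7aa48c9197a52a8c61e168"><img';
var cleaned = removeSoftLineBreaks(rawSample);
```

This leaves any = that is a genuine part of a URL untouched, although a fully encoded message may also contain sequences such as =3D that this sketch does not decode. Reading the message with getBody() instead of getRawContent() might avoid the problem altogether, but that is untested here.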
-2

I am assuming, based on the code you provided, that you are able to get the contents of the email as an HTML string.

function getHref(content){
  var el = document.createElement('html');
  el.innerHTML = content;

  var hrefs = [];

  var elements = el.getElementsByTagName('a');

  for (var i=0; i < elements.length; i++){
    hrefs.push(elements[i].href);
  }

  return hrefs;
}

This will return an array of all the href attributes from anchor tags on the page.

2 Comments

The document object is not accessible in Google Apps Script. That framework does not support all JS features, only some of them.
This only works in a browser, client-side. Google Apps Script runs server-side, so there is no DOM there at all.
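Since there is no DOM available, a rough alternative is to pull attribute values out of the HTML string with a regex. The sketch below is only an approximation under stated assumptions: getAttributeUrls is a hypothetical name, it only handles quoted href and src values, and a regex is not a full HTML parser.

```javascript
// Extract quoted href/src attribute values from an HTML string
// without using the DOM. getAttributeUrls is a hypothetical helper.
function getAttributeUrls(html) {
  // Group 1 captures the opening quote, group 2 the attribute value;
  // \1 requires the same quote character to close the value.
  var attrRegex = /\b(?:href|src)\s*=\s*(["'])(.*?)\1/gi;
  var urls = [];
  var match;
  while ((match = attrRegex.exec(html)) !== null) {
    urls.push(match[2]);
  }
  return urls;
}

// Sample input (assumed for illustration).
var html = '<a href="http://example.com/page">x</a>' +
           "<img src='http://example.com/pic.png'/>";
var found = getAttributeUrls(html);
```

Unlike the DOM version, this returns the attribute text exactly as written, so relative links are not resolved to absolute URLs.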
