2

I'm running a script that returns HTML in the following format into a variable (i.e., var results;):

var results = titleResults[0];
return results;

***RETURNS the below***
<h2>
<a href="https://www.southwest.com/about/southwest/index.html"><b>About Southwest</b></a>
</h2>

How can I parse out only the 'southwest.com' into another variable?

8
  • Do you mean 'southwest.com/about/southwest/index.html' or only 'southwest.com'? Commented Apr 21, 2020 at 16:01
  • 2
    Is this executed on server with apps script or javascript on browser? Commented Apr 21, 2020 at 16:06
  • 1
    I assume this isn't always going to be "southwest.com"; it could be some other site. Will it always have the www. prefix? What can you actually count on being there? e.g. 1) always an anchor, 2) always wrapped in an h2, 3)...? Commented Apr 21, 2020 at 16:09
  • 1
    I want to return 'domain.com' only as the result. It will not always be southwest; it will depend on what my search criteria are. This is executed in Google Apps Script. Commented Apr 21, 2020 at 16:11
  • 1
    @StephenP With my Edit, it won't fail anymore if there's no www, I have updated the regex now. Commented Apr 21, 2020 at 16:35

4 Answers

2

[EDIT] Full apps script code here:

  var html = '<h2><a href="https://www.southwest.com/about/southwest/index.html"><b>About Southwest</b></a></h2>';
  var doc = XmlService.parse(html);
  var root = doc.getRootElement();
  var children = root.getDescendants(); 
  children.forEach(function(ch){
    var chelm = ch.asElement();
    if(chelm && chelm.getAttribute('href'))
    {
      var href = chelm.getAttribute('href');
      var url = href.getValue();
      Logger.log(url);

      var hostname;
      if (url.indexOf("//") > -1)
          hostname = url.split('/')[2];
      else
          hostname = url.split('/')[0];
      hostname = hostname.split('://').pop();   
      hostname = hostname.split('www.').pop();
      hostname = hostname.split('?')[0];
      Logger.log(hostname);
    }
  });

You can use XmlService.parse in Apps Script to get the link node and its href attribute: https://sites.google.com/site/scriptsexamples/learn-by-example/parsing-html

From the href attribute, you can extract the domain:

var hostname;
if (url.indexOf("//") > -1)
    hostname = url.split('/')[2];
else
    hostname = url.split('/')[0];

hostname = hostname.split('://').pop();
hostname = hostname.split('www.').pop();
hostname = hostname.split('?')[0];
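
A minimal sketch of the same steps wrapped in a reusable function, extended to cope with a redirect-style href like the one reported in the comments (/l/?kh=-1&amp;uddg=https%3A%2F%2Fwww.aa.com%2F). The function name hostnameFromHref is hypothetical, and the assumption that the real target always sits in a uddg query parameter comes only from that one example, so treat it as a starting point rather than a complete parser. It uses only plain string methods and decodeURIComponent, so it does not depend on a DOM.

```javascript
// Hypothetical helper, not part of the original answer.
function hostnameFromHref(href) {
  // Raw attribute text may still contain the HTML entity "&amp;".
  var url = href.replace(/&amp;/g, '&');

  // Assumption: a redirect link carries the real, URL-encoded target
  // in a "uddg" query parameter (as in the example from the comments).
  var redirect = url.match(/[?&]uddg=([^&]+)/);
  if (redirect) {
    url = decodeURIComponent(redirect[1]);
  }

  // Same idea as above: take the part after the scheme, before the path.
  var hostname = url.indexOf('//') > -1 ? url.split('/')[2] : url.split('/')[0];

  // Drop any query string or port, then a leading "www." prefix.
  hostname = hostname.split('?')[0].split(':')[0];
  return hostname.replace(/^www\./, '');
}
```

Inside the forEach loop, Logger.log(hostnameFromHref(url)) could then replace the inline hostname steps.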

13 Comments

Thanks - how do I print the hostname (domain.com)?
@BigMike Logger.log(hostname) should print the hostname in the Apps Script logs (Apps Script editor menu > View > Logs). Or, you can use FormApp.getUi().alert(hostname) to see it (replace FormApp with SpreadsheetApp or the relevant G Suite app).
This is working well, thanks. However, the actual HTML results look more like this and it seems to break your code: "<h2 class=""result__title""> <a rel=""nofollow"" class=""result__a"" href=""/l/?kh=-1&amp;uddg=https%3A%2F%2Fwww.aa.com%2F""><b>American</b> <b>Airlines</b> - <b>Airline</b> tickets and cheap flights at AA.com</a> </h2>"
@BigMike Your HTML double quotes are formatted incorrectly (repeated twice). You are probably copy/pasting it from Google Sheets. Please fix the HTML or use single quotes: var html = "<h2 class='result__title'> <a rel='nofollow' class='result__a' href='aa.com/'><b>American</b> <b>Airlines</b> - <b>Airline</b> tickets and cheap flights at AA.com</a> </h2>";
I put your code into a parseHTML() function. It returns: /l/?kh=-1&uddg=https%3A%2F%2Fwww.aa.com%2F for Logger.log(url)
1

var results = `<h2>
<a href="https://www.southwest.com/about/southwest/index.html"><b>About Southwest</b></a>
</h2>`


//Getting southwest.com :

let southwest = results.split("www.")[1].split("/")[0]
// This works only for URLs that contain "www."; without it, [1] is undefined and the second .split throws.
let example = "http://www.example.com/index.html".split("www.")[1].split("/")[0]

console.log(southwest,"\n",example)
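
Splitting on "www." fails when the URL has no such prefix. As a hedged sketch, the variant below keeps the same split-based style but treats "www." as optional. The name domainFromUrl is hypothetical, and the function expects a bare URL string rather than the whole HTML snippet.

```javascript
// Hypothetical helper; "www." is stripped if present, but not required.
function domainFromUrl(url) {
  // Drop the scheme (everything up to "//"), then keep the part before the first "/".
  var host = url.split('//').pop().split('/')[0];
  // Remove a leading "www." only when the host actually starts with it.
  return host.indexOf('www.') === 0 ? host.slice(4) : host;
}
```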

2 Comments

document.getElementById("h2") will fail because the h2 and a are not part of the document — OP's "results" is just a text string.
Got it working: let domain = results.split("www.")[1].split("%2F")[0]; console.log(domain); return domain;
1

I am not sure whether this will work in Google Apps Script, but in browser JavaScript you can extract what you need like this:

  1. Use the result string as the innerHTML of a new HTML element.
  2. Extract the href attribute value from the a element.
  3. Use a regex like \/\/(www\.)?([\w\.]+)\/? with the .match() method to extract the desired output.

Your code should look like this:

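var div = document.createElement("div");
div.innerHTML = results; // the HTML string from the question
let href = div.getElementsByTagName("a")[0].href;
// Group 1 is the optional "www." prefix; group 2 is the domain.
console.log(href.match(/\/\/(www\.)?([\w\.]+)\/?/)[2]);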

Demo:

let result = `<h2>
<a href="https://www.southwest.com/about/southwest/index.html"><b>About Southwest</b></a>
</h2>`;
var div = document.createElement("div");
div.innerHTML= result;
let href = div.getElementsByTagName("a")[0].href;
console.log(href.match(/\/\/(www\.)?([\w\.]+)\/?/)[2]);
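
Because Apps Script runs on the server and has no document object, the DOM steps above are not available there. The sketch below applies the same idea directly to the HTML string: one regex pulls out the first href value, then the answer's regex extracts the domain. The name domainFromHtml is hypothetical. Note that [\w\.]+ does not match hyphens, so a hyphenated domain would be cut short.

```javascript
// Hypothetical helper: works on the raw HTML string, so no DOM is needed.
function domainFromHtml(html) {
  // Step 2 without a DOM: capture the first href="..." (or href='...') value.
  var hrefMatch = html.match(/href\s*=\s*["']([^"']+)["']/i);
  if (!hrefMatch) return null;

  // Step 3: the answer's regex; group 2 is the host without "www.".
  var hostMatch = hrefMatch[1].match(/\/\/(www\.)?([\w\.]+)\/?/);
  return hostMatch ? hostMatch[2] : null;
}
```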

6 Comments

Mine was using let href = div.querySelector('a').href; but I was working to solve the issue of "what if it's not a www prefix"
@StephenP Yes, parsing can be done in many ways, but we need to find the appropriate regex. I've updated my regex to \/\/(www\.)?([\w\.]+)\/?; now it should work with any URL.
I also looked at creating a URL object, like let host = new URL(anchor.href).host to reduce the problem space to only the host string without the protocol or path. Anyway, a definite +1
Yes a good idea too (y), but as I said I think the right regex is the big deal :)
@cнŝdk - Will this code work in Google Scripts app?
0

Create a function:

const extractDomain = url =>
{
    let domain

    domain = url.split('/')[url.indexOf("://") > -1 ? 2 : 0]

    if (domain.indexOf("www.") > -1)
        domain = domain.split('www.')[1]

    domain = domain.split(':')[0];
    domain = domain.split('?')[0];

    return domain
}

Given this HTML:

<h2>
    <a href="https://www.southwest.com/about/southwest/index.html" id="url"><b>About Southwest</b></a>
</h2>

read the href from the element and pass it to the function:

const {href} = document.getElementById('url')

const anotherVariable = extractDomain(href)

Now anotherVariable is "southwest.com".

DEMO: https://jsfiddle.net/cs6bzgfn/

1 Comment

There is no id="url" in OP's results variable; you can't count on that to identify the element, and the whole thing is just a text string, it's not part of the document so document.getElementByAnything will fail.
