Parse specific string from variable HTML result

Question

I'm running a script that returns HTML in the following format and stores it in a variable (i.e., var results;):

var results = titleResults[0];
return results;

***RETURNS the below***
<h2>
<a href="https://www.southwest.com/about/southwest/index.html"><b>About Southwest</b></a>
</h2>

How can I parse out only the 'southwest.com' into another variable?

Do you mean 'southwest.com/about/southwest/index.html' or only 'southwest.com'?
– Ufuk, Commented Apr 21, 2020 at 16:01
Is this executed on a server with Apps Script, or as JavaScript in the browser?
– Réti Opening, Commented Apr 21, 2020 at 16:06
I assume this isn't always going to be "southwest.com"; it could be some other site. Will it always have the www. prefix? What can you actually count on being there? e.g. 1) always an anchor, 2) always wrapped in an h2, 3)...?
– Stephen P, Commented Apr 21, 2020 at 16:09
I want to return only 'domain.com' as the result. It will not always be southwest; it depends on my search criteria. This is executed in Google Apps Script.
– BigMike, Commented Apr 21, 2020 at 16:11
@StephenP With my edit, it won't fail anymore if there's no www; I have updated the regex.
– cнŝdk, Commented Apr 21, 2020 at 16:35

Réti Opening · Accepted Answer · 2020-04-22 09:37:30Z

2

[EDIT] Full apps script code here:

  var html = '<h2><a href="https://www.southwest.com/about/southwest/index.html"><b>About Southwest</b></a></h2>';
  // Parse the snippet as XML; this requires well-formed markup.
  var doc = XmlService.parse(html);
  var root = doc.getRootElement();
  var children = root.getDescendants();
  children.forEach(function(ch) {
    // asElement() returns null for non-element nodes such as text.
    var chelm = ch.asElement();
    if (chelm && chelm.getAttribute('href')) {
      var href = chelm.getAttribute('href');
      var url = href.getValue();
      Logger.log(url);

      // With a scheme ("https://..."), the host is the third '/'-separated piece.
      var hostname;
      if (url.indexOf("//") > -1) {
        hostname = url.split('/')[2];
      } else {
        hostname = url.split('/')[0];
      }
      hostname = hostname.split('://').pop();
      hostname = hostname.split('www.').pop(); // drop a leading "www."
      hostname = hostname.split('?')[0];       // drop any query string
      Logger.log(hostname);
    }
  });

In Apps Script, you can use XmlService.parse to get the link node and its href attribute: https://sites.google.com/site/scriptsexamples/learn-by-example/parsing-html

From the href attribute, you can extract the domain:

var hostname;
if (url.indexOf("//") > -1)
    hostname = url.split('/')[2];
else
    hostname = url.split('/')[0];

hostname = hostname.split('://').pop();
hostname = hostname.split('www.').pop();
hostname = hostname.split('?')[0];
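
To get the domain into another variable, as the question asks, the steps above can be wrapped in a function that returns the value. This is only a minimal sketch: the name extractHostname is made up for illustration, and it assumes the input is a URL string like the one in the question.

```javascript
// Hypothetical helper; the body mirrors the splitting steps shown above.
function extractHostname(url) {
  var hostname;
  if (url.indexOf('//') > -1) {
    // "https://www.site.com/path" -> ["https:", "", "www.site.com", "path"]
    hostname = url.split('/')[2];
  } else {
    // No scheme: the host is everything before the first "/".
    hostname = url.split('/')[0];
  }
  hostname = hostname.split('www.').pop(); // drop a leading "www."
  hostname = hostname.split('?')[0];       // drop any query string
  return hostname;
}

var domain = extractHostname('https://www.southwest.com/about/southwest/index.html');
// domain is now "southwest.com"
```

Returning the value instead of logging it lets the caller store it or pass it on.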

edited Apr 22, 2020 at 9:37

answered Apr 21, 2020 at 16:04

Réti Opening


13 Comments

BigMike Over a year ago

Thanks - how do I print the hostname (domain.com)?

Réti Opening Over a year ago

@BigMike Logger.log(hostname) should print the hostname in the Apps Script logs (Apps Script Editor Menu > View > Logs). Or, you can use FormApp.getUi().alert(hostname) to see it (replace FormApp with SpreadsheetApp or the relevant G Suite app).

BigMike Over a year ago

This is working well, thanks. However, the actual HTML results look more like this and it seems to break your code: "<h2 class=""result__title""> <a rel=""nofollow"" class=""result__a"" href=""/l/?kh=-1&uddg=https%3A%2F%2Fwww.aa.com%2F""><b>American</b> <b>Airlines</b> - <b>Airline</b> tickets and cheap flights at AA.com</a> </h2>"

Réti Opening Over a year ago

@BigMike Your HTML double quotes are formatted incorrectly (repeated twice). You are probably copy/pasting it from Google Sheets. Please fix the HTML or use single quotes: var html = "<h2 class='result__title'> <a rel='nofollow' class='result__a' href='aa.com/'><b>American</b> <b>Airlines</b> - <b>Airline</b> tickets and cheap flights at AA.com</a> </h2>";

BigMike Over a year ago

I put your code into a parseHTML() function. It returns: /l/?kh=-1&uddg=https%3A%2F%2Fwww.aa.com%2F for Logger.log(url)
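
The href in that last comment is a redirect-style link: the real URL appears to sit percent-encoded in the uddg query parameter. A rough sketch of unwrapping it before extracting the host follows. The helper name unwrapRedirect is hypothetical, and the parameter name uddg is taken only from the sample above, so treat both as assumptions.

```javascript
// Hypothetical helper: pulls the target URL out of a redirect-style href
// such as "/l/?kh=-1&uddg=https%3A%2F%2Fwww.aa.com%2F".
function unwrapRedirect(href) {
  var match = href.match(/[?&]uddg=([^&]+)/);
  // Fall back to the original href when there is no uddg parameter.
  return match ? decodeURIComponent(match[1]) : href;
}

var target = unwrapRedirect('/l/?kh=-1&uddg=https%3A%2F%2Fwww.aa.com%2F');
// target is now "https://www.aa.com/"
```

The decoded URL can then go through the same hostname-splitting steps from the answer.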


Ufuk · Accepted Answer · 2020-04-21 16:39:02Z

1

var results = `<h2>
<a href="https://www.southwest.com/about/southwest/index.html"><b>About Southwest</b></a>
</h2>`


// Getting southwest.com:

let southwest = results.split("www.")[1].split("/")[0]
// this approach only works for URLs that contain "www."
let example = "http://www.example.com/index.html".split("www.")[1].split("/")[0]

console.log(southwest,"\n",example)

edited Apr 21, 2020 at 16:39

answered Apr 21, 2020 at 16:04

Ufuk


2 Comments

Stephen P Over a year ago

document.getElementById("h2") will fail because the h2 and a are not part of the document — OP's "results" is just a text string.

BigMike Over a year ago

Got it working: let domain = results.split("www.")[1].split("%2F")[0]; console.log(domain); return domain;

cнŝdk · Accepted Answer · 2020-04-21 16:27:24Z

1

I'm not sure whether this will work in Google Apps Script, but in JavaScript you can extract what you need like this:

1. Use the result string as the innerHTML of a new HTML element.
2. Extract the href attribute value from the a element.
3. Finally, use a regex like \/\/(www\.)?([\w\.]+)\/? with the .match() method to extract the desired output.

This is how your code should look:

var div = document.createElement("div");
div.innerHTML= result;
let href = div.getElementsByTagName("a")[0].href;
// Group 1 is the optional "www." prefix; group 2 is the domain itself.
console.log(href.match(/\/\/(www\.)?([\w\.]+)\/?/)[2]);

Demo:

let result = `<h2>
<a href="https://www.southwest.com/about/southwest/index.html"><b>About Southwest</b></a>
</h2>`;
var div = document.createElement("div");
div.innerHTML= result;
let href = div.getElementsByTagName("a")[0].href;
console.log(href.match(/\/\/(www\.)?([\w\.]+)\/?/)[2]);
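
As the comments point out, the result is just a text string, and Google Apps Script has no browser document to attach it to. A DOM-free variant of the same regex idea might look like the sketch below. The helper name domainFromHtml is hypothetical, the character class is widened to allow hyphens in host names, and matching HTML with a regex assumes a simple, quoted href like the one in the question.

```javascript
// Hypothetical helper: finds the first href="..." in an HTML string and
// returns the host without a leading "www.", or null if nothing matches.
function domainFromHtml(html) {
  var hrefMatch = html.match(/href\s*=\s*["']([^"']+)["']/i);
  if (!hrefMatch) return null;
  // Same pattern as above; group 2 holds the domain.
  var hostMatch = hrefMatch[1].match(/\/\/(www\.)?([\w.-]+)\/?/);
  return hostMatch ? hostMatch[2] : null;
}

var sample = '<h2>\n<a href="https://www.southwest.com/about/southwest/index.html"><b>About Southwest</b></a>\n</h2>';
var sampleDomain = domainFromHtml(sample);
// sampleDomain is now "southwest.com"
```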


edited Apr 21, 2020 at 16:27

answered Apr 21, 2020 at 16:04

cнŝdk


6 Comments

Stephen P Over a year ago

Mine was using let href = div.querySelector('a').href; but I was working to solve the issue of "what if it's not a www prefix"

cнŝdk Over a year ago

@StephenP Yes, parsing it can be done in many ways, but we need to find the appropriate regex. I've updated my regex to \/\/(www\.)?([\w\.]+)\/?, so now it should work with any URL.

Stephen P Over a year ago

I also looked at creating a URL object, like let host = new URL(anchor.href).host to reduce the problem space to only the host string without the protocol or path. Anyway, a definite +1

cнŝdk Over a year ago

Yes a good idea too (y), but as I said I think the right regex is the big deal :)

BigMike Over a year ago

@cнŝdk - Will this code work in Google Scripts app?
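
The URL-object idea mentioned in the comments above can be sketched as follows. It only applies where the standard URL class exists (browsers, Node.js), which may not include Google Apps Script. The helper name hostWithoutWww is hypothetical, and it uses hostname rather than host so that a port number is not included.

```javascript
// Hypothetical helper; requires an environment that provides the URL class.
function hostWithoutWww(href) {
  var hostname = new URL(href).hostname; // e.g. "www.southwest.com"
  return hostname.replace(/^www\./, ''); // strip a leading "www." only
}

var fromUrlObject = hostWithoutWww('https://www.southwest.com/about/southwest/index.html');
// fromUrlObject is now "southwest.com"
```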


Kakajann · Accepted Answer · 2020-04-21 16:08:21Z

0

Create a function:

const extractDomain = url =>
{
    let domain

    domain = url.split('/')[url.indexOf("://") > -1 ? 2 : 0]

    if (domain.indexOf("www.") > -1)
        domain = domain.split('www.')[1]

    domain = domain.split(':')[0];
    domain = domain.split('?')[0];

    return domain
}

<h2>
    <a href="https://www.southwest.com/about/southwest/index.html" id="url"><b>About Southwest</b></a>
</h2>

const {href} = document.getElementById('url')

const anotherVariable = extractDomain(href)

Now anotherVariable is "southwest.com".

DEMO: https://jsfiddle.net/cs6bzgfn/

answered Apr 21, 2020 at 16:08

Kakajann


1 Comment

Stephen P Over a year ago

There is no id="url" in OP's results variable; you can't count on that to identify the element, and the whole thing is just a text string, it's not part of the document so document.getElementByAnything will fail.
