Is there a function or example for converting html string to plaintext without html tags using Google Apps Script?

Question

In JavaScript this solution would do the job:

function strip(html)
{
   var tmp = document.createElement("DIV");
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

However, to my knowledge, document is not available in Google Apps Script. Is there an alternative way to parse HTML and display it as plain text in Google Apps Script?

I have tried using HtmlService:

HtmlService.createHtmlOutput('<b>Hello, world!</b>').getContent();

However, this just returns the text with all the tags.

My expectation would be that input of

'<b>Hello, world!</b>'

Would output

'Hello, world!'

Alan Wells · Accepted Answer · 2019-07-20 19:23:38Z

3

The html tags can be removed in two different ways:

With a regular expression (RegExp)
By converting the HTML to XML and using XmlService to get every element and then the value of each element

The regular expression is the better option because you don't need to find every HTML element, which requires a lot more code.
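As an illustration of the regular-expression route on its own, here is a minimal sketch in plain JavaScript, so it should also run in Apps Script. The function name htmlToPlainText and the list of block-level tags are assumptions made for this example, not part of any Google API. It turns <br> and a few closing block tags into line breaks before stripping the remaining tags, and decodes only a handful of named entities. It will not cope with script or style blocks, comments containing ">", or malformed markup.

```javascript
// Hypothetical helper: strips tags with regular expressions only.
function htmlToPlainText(html) {
  // Only these named entities are decoded; anything else is left as-is.
  var entities = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: ' ' };
  return html
    // Turn <br> and the end of common block elements into line breaks.
    .replace(/<br\s*\/?>/gi, '\n')
    .replace(/<\/(p|div|li|ul|ol|h[1-6]|tr)>/gi, '\n')
    // Drop every remaining tag.
    .replace(/<[^>]*>/g, '')
    // Decode the entities listed above in a single pass.
    .replace(/&(amp|lt|gt|quot|apos|nbsp);/g, function (match, name) {
      return entities[name];
    })
    // Remove trailing spaces on each line and collapse runs of blank lines.
    .replace(/[ \t]+\n/g, '\n')
    .replace(/\n{3,}/g, '\n\n')
    .trim();
}
```

With the input from the question, htmlToPlainText('<b>Hello, world!</b>') returns 'Hello, world!'.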

To keep the line breaks, the HTML must first be converted to XML so that XmlService.getPrettyFormat() can be used. If the HTML tags were removed first with a regular expression, then the code wouldn't know where the line breaks were supposed to be.

XmlService.getPrettyFormat() formats the markup with line breaks, and the tags can then be stripped from the formatted output. But to use XmlService, the HTML string must first be parsed as XML, and there are a couple of things you need to do to the string beforehand to avoid parse errors.

function parseHtml() {

  var html = 'This is just a Test<br><br>Here is my List<br>\
    <ol><li>one</li><li>Two</li><li>Three</li></ol><br>And a bulleted one<br><ul>\
    <li>Bullet One</li><li>Bullet Two</li><li>Bullet Three</li></ul>'; 

  html = '<div>' + html + '</div>';//Wrap in one root element to avoid the "Content is not allowed in prolog." error
  html = html.replace(/<br>/g,"");//Remove <br> tags: they have no ending tag, which causes an error when parsing to XML
  //Logger.log('html: ' + html)

  var document = XmlService.parse(html);

  var output = XmlService.getPrettyFormat().format(document);
  //Logger.log(output);

  output = output.replace(/<[^>]*>/g,"");
  Logger.log(output)
}
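The two preparation steps in the code above (wrapping in a root element and dealing with <br>) can be pulled into a small helper. The following is only a sketch under stated assumptions: prepareHtmlForXml is a hypothetical name, and instead of deleting <br> it rewrites a few void elements in self-closing form, which is a different choice from the answer's code. It also swaps &nbsp; for its numeric form, because &nbsp; is not one of the entities predefined by XML. Markup with unquoted attributes or unclosed tags such as <p> or <li> would still fail to parse.

```javascript
// Hypothetical helper: makes a simple HTML fragment more likely to parse as XML.
function prepareHtmlForXml(html) {
  return '<div>' + html
    // Self-close void elements such as <br> so that every tag is balanced.
    .replace(/<(br|hr|img|input)\b([^>]*?)\s*\/?>/gi, '<$1$2/>')
    // &nbsp; is not predefined in XML, so use the numeric character reference.
    .replace(/&nbsp;/g, '&#160;')
    + '</div>'; // The wrapping <div> gives the document a single root element.
}
```

The result could then be passed to XmlService.parse() in place of the manually edited string.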

Another way to do it, which is provided only as a learning example, is to parse the HTML as XML with XmlService and then loop through all the elements. The following code only goes down through a couple of layers of children.

function parseHtml() {

  var html = 'This is just a Test<br><br>Here is my List<br>\
    <ol><li>one</li><li>Two</li><li>Three</li></ol><br>And a bulleted one<br><ul>\
    <li>Bullet One</li><li>Bullet Two</li><li>Bullet Three</li></ul>'; 

  html = '<div>' + html + '</div>';
  html = html.replace(/<br>/g,"");
  //Logger.log('html: ' + html)

  var allText = "";
  var thisTxt;

  var document = XmlService.parse(html);
  var root = document.getRootElement();
  //Logger.log('root: ' + JSON.stringify(root))

  var content = root.getAllContent();
  //Logger.log('content: ' + JSON.stringify(content))

  var L = content.length;

  for (var i=0;i<L;i++) {
    var thisEl = content[i];
    if (!thisEl) {continue;}

    var theType = thisEl.getType();
    //Logger.log('theType: ' + theType)
    //Logger.log('typeof theType: ' + typeof theType)

    if (theType === XmlService.ContentTypes.ELEMENT) {
      var asElmt = thisEl.asElement();
      var allChildren = asElmt.getChildren();

      if (allChildren) {
        var nmbrOfChildren = allChildren.length;
        //Logger.log('nmbrOfChildren: ' + nmbrOfChildren)
      }

      if (!nmbrOfChildren) {
        thisTxt = asElmt.getValue();

        //Logger.log('thisTxt 43: ' + thisTxt)
        allText = allText + thisTxt  + "\n";
        continue;
      }

      for (var j=0;j<nmbrOfChildren;j++) {

        thisTxt = allChildren[j].getValue();
        if (!thisTxt) {
          continue;
        }

        allText = allText + thisTxt + "\n";

      }
      continue;
    }

    //Logger.log(thisEl.getValue())   
    allText = allText + thisEl.getValue()  + "\n";

  }

  Logger.log('allText: ' + allText + "\n")

}
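The loop above stops after a couple of levels. A recursive walk is one way to reach text at any depth. This is a minimal sketch, not the answer author's code: collectText is a hypothetical helper name, and it assumes that asElement() and asText() on an XmlService content node return null when the node is of a different type, as the XmlService reference describes. Comments and other node types are skipped.

```javascript
// Hypothetical helper: gathers the non-empty text of an element and all
// of its descendants, in document order, one array entry per text node.
function collectText(element, lines) {
  lines = lines || [];
  var content = element.getAllContent();
  for (var i = 0; i < content.length; i++) {
    var child = content[i].asElement(); // assumed null unless an element
    if (child) {
      collectText(child, lines); // descend as deep as the markup goes
      continue;
    }
    var textNode = content[i].asText(); // assumed null unless a text node
    if (textNode) {
      var text = textNode.getValue().trim();
      if (text) {
        lines.push(text);
      }
    }
  }
  return lines;
}

// Possible use inside Apps Script, after the same string fixes as above:
//   var root = XmlService.parse(html).getRootElement();
//   Logger.log(collectText(root).join('\n'));
```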

edited Jul 20, 2019 at 19:23

answered Jul 19, 2019 at 18:43

Alan Wells

6 Comments

George Duke Over a year ago

I get "Content is not allowed in prolog." when I try this with HTML that contains an ordered list: function ParseHtml() { var html = 'This is just a Test<br><br>Here is my List<br><ol><li>one</li><li>Two</li><li>Three</li></ol><br>And a bulleted one<br><ul><li>Bullet One</li><li>Bullet Two</li><li>Bullet Three</li></ul>'; var document = XmlService.parse(html); //Creates an XML document var root = document.getRootElement(); //Gets the document's root element node var text = root.getText(); //Gets the text value of the element node Logger.log('text: ' + text) }

Alan Wells Over a year ago

The "Content is not allowed in prolog." error is because the first part of the content: "This is just a Test" is not wrapped in a beginning and ending tag. You must also replace all of the <br> tags because they don't have an ending tag: html = "<div>" + html + "</div>"; html = html.replace(/<br>/g,"");Logger.log("html:" + html) There is more that you need to do also, but that's a start. You might want to update your original question with the added test code.

TheMaster Over a year ago

XmlService.getPrettyFormat() might be easier.

Alan Wells Over a year ago

I just tried var output = XmlService.getPrettyFormat().format(document);Logger.log(output); but it doesn't remove the HTML tags, which is what the question asked for. But, it's a great way to clean up the indentations and format line breaks. Is there a better way to remove the html tags?

TheMaster Over a year ago

I see. I didn't test at first. After testing, I think we can use regex: Logger.log(output.replace(/<[^>]*>/g,"")); after format

vstepaniuk · Accepted Answer · 2023-08-27 12:50:30Z

0

First, you need to create a temporary Google Doc and get its document ID (docid in the code below).

Then you need to enable the Drive API Advanced Service.

Then you use the following code:

function htmltotext() {
  
  var html = 'Your <b>HTML</b> code here';
  var blob = HtmlService.createHtmlOutput(html).getBlob();

  var docid = 'Your doc id here';

  Drive.Files.update('',docid,blob);

  var doc = DocumentApp.openById(docid);
  var text = doc.getBody().getText();
  doc.saveAndClose();

  Logger.log(text);
  return text;
}

edited Aug 27, 2023 at 12:50

answered Aug 26, 2023 at 13:47

vstepaniuk

Sizerth · Accepted Answer · 2025-01-13 03:19:38Z

0

I know this is an old thread, but I found a way to do this pretty elegantly using the Gmail service (GmailApp). In effect, you create a draft with your HTML using the "htmlBody" option, get the plain-text body of that draft, then delete the draft. It makes your script run a little slower, but it is effective and seemingly foolproof.

var htmlInput = '<html><body><p>Your plaintext here.</p></body></html>';
var draftMsg = GmailApp.createDraft('', 'To be deleted', '', { htmlBody: htmlInput });
var plainText = draftMsg.getMessage().getPlainBody();
draftMsg.deleteDraft();
console.log(plainText);

answered Jan 13 at 3:19

Sizerth
