In browser JavaScript, this function would do the job:

function strip(html)
{
   var tmp = document.createElement("DIV");
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

However, as far as I know, the document object is not available in Google Apps Script. Is there another way to extract and display the plain text from HTML in Google Apps Script?

I have tried this:

HtmlService.createHtmlOutput('<b>Hello, world!</b>').getContent();

However, this just returns the string with all the tags still in it.

I expect an input of

'<b>Hello, world!</b>'

to produce

'Hello, world!'

3 Answers

The HTML tags can be removed in two different ways:

  1. A regular expression (RegExp)
  2. Converting the HTML to XML and using XmlService to visit every element and read the value of each one

The regular expression is the simpler option, because you don't have to visit every HTML element, which takes a lot more code.
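
As a minimal sketch of the regular-expression-only route, the function below removes anything that looks like a tag and then decodes a handful of common entities. The name stripTags and the entity table are assumptions made for illustration, not part of Apps Script. It is not a full HTML parser: it does not handle a ">" inside an attribute value, and it keeps the contents of script and style elements.

```javascript
// Hypothetical helper: strip tags with a regular expression, then decode a
// few common entities in a single pass so nothing is decoded twice.
function stripTags(html) {
  var entities = {
    '&lt;': '<',
    '&gt;': '>',
    '&quot;': '"',
    '&#39;': "'",
    '&nbsp;': ' ',
    '&amp;': '&'
  };
  // Remove everything from "<" up to the next ">".
  var text = html.replace(/<[^>]*>/g, '');
  // Replace only the entities listed above; anything else is left as-is.
  return text.replace(/&(?:lt|gt|quot|#39|nbsp|amp);/g, function (match) {
    return entities[match];
  });
}
```

With the input from the question, stripTags('<b>Hello, world!</b>') returns 'Hello, world!'.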

If you also want to keep the line breaks, convert the HTML to XML first and run it through XmlService.getPrettyFormat(), which puts each element on its own line, and only then strip the tags with a regular expression. If the tags were removed first, the code would have no way to know where the line breaks were supposed to be.

XmlService can only parse well-formed XML, so the HTML string needs a couple of adjustments before parsing in order to avoid errors: wrap it in a single root element, and remove the <br> tags, which have no closing tag.

function parseHtml() {

  var html = 'This is just a Test<br><br>Here is my List<br>\
    <ol><li>one</li><li>Two</li><li>Three</li></ol><br>And a bulleted one<br><ul>\
    <li>Bullet One</li><li>Bullet Two</li><li>Bullet Three</li></ul>'; 

  html = '<div>' + html + '</div>';//To avoid the "Content is not allowed in prolog." error
  html = html.replace(/<br>/g,"");//To avoid an error when parsing to xml
  //Logger.log('html: ' + html)

  var document = XmlService.parse(html);

  var output = XmlService.getPrettyFormat().format(document);
  //Logger.log(output);

  output = output.replace(/<[^>]*>/g,"");
  Logger.log(output)
}
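
If XmlService is not an option, line breaks can also be approximated with regular expressions alone. The following is a rough sketch, assuming that <br> and the closing tags of a few block-level elements mark the end of a line. The name htmlToLines and the list of block tags are assumptions chosen for this example.

```javascript
// Hypothetical helper: turn line-ending tags into newlines before stripping
// the remaining tags, so the text keeps one item per line.
function htmlToLines(html) {
  return html
    // <br>, <br/> and <br /> each end a line.
    .replace(/<br\s*\/?>/gi, '\n')
    // Closing tags of these block-level elements also end a line.
    .replace(/<\/(?:p|div|li|ol|ul|h[1-6])>/gi, '\n')
    // Remove every remaining tag.
    .replace(/<[^>]*>/g, '')
    // Trim each line and drop the empty ones.
    .split('\n')
    .map(function (line) { return line.trim(); })
    .filter(function (line) { return line !== ''; })
    .join('\n');
}
```

Note that this version discards blank lines, so a double <br> collapses into a single line break.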

Another way to do it, provided only as a learning example, is to parse the HTML as XML with XmlService and then loop through all the elements. The following code only descends through a couple of layers of children.

function parseHtml() {

  var html = 'This is just a Test<br><br>Here is my List<br>\
    <ol><li>one</li><li>Two</li><li>Three</li></ol><br>And a bulleted one<br><ul>\
    <li>Bullet One</li><li>Bullet Two</li><li>Bullet Three</li></ul>'; 

  html = '<div>' + html + '</div>';
  html = html.replace(/<br>/g,"");
  //Logger.log('html: ' + html)

  var allText = "";
  var thisTxt;

  var document = XmlService.parse(html);
  var root = document.getRootElement();
  //Logger.log('root: ' + JSON.stringify(root))

  var content = root.getAllContent();
  //Logger.log('content: ' + JSON.stringify(content))

  var L = content.length;

  for (var i=0;i<L;i++) {
    var thisEl = content[i];
    if (!thisEl) {continue;}

    var theType = thisEl.getType();
    //Logger.log('theType: ' + theType)
    //Logger.log('typeof theType: ' + typeof theType)

    if (theType === XmlService.ContentTypes.ELEMENT) {
      var asElmt = thisEl.asElement();
      var allChildren = asElmt.getChildren();

      if (allChildren) {
        var nmbrOfChildren = allChildren.length;
        //Logger.log('nmbrOfChildren: ' + nmbrOfChildren)
      }

      if (!nmbrOfChildren) {
        thisTxt = asElmt.getValue();

        //Logger.log('thisTxt 43: ' + thisTxt)
        allText = allText + thisTxt  + "\n";
        continue;
      }

      for (var j=0;j<nmbrOfChildren;j++) {

        thisTxt = allChildren[j].getValue();
        if (!thisTxt) {
          continue;
        }

        allText = allText + thisTxt + "\n";

      }
      continue;
    }

    //Logger.log(thisEl.getValue())   
    allText = allText + thisEl.getValue()  + "\n";

  }

  Logger.log('allText: ' + allText);

}
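
To reach text at any depth instead of only a couple of layers, the same walk can be written recursively. This is a sketch rather than a tested replacement for the function above: collectText is a hypothetical name, and the code assumes the XmlService calls getAllContent(), getType(), asElement() and getValue() together with the XmlService.ContentTypes enum.

```javascript
// Hypothetical helper: walk an XmlService element recursively and collect
// the trimmed, non-empty text of every text node, in document order.
function collectText(element, lines) {
  lines = lines || [];
  var content = element.getAllContent();
  for (var i = 0; i < content.length; i++) {
    var node = content[i];
    var type = node.getType();
    if (type === XmlService.ContentTypes.ELEMENT) {
      // Descend into child elements, however deeply they are nested.
      collectText(node.asElement(), lines);
    } else if (type === XmlService.ContentTypes.TEXT) {
      var text = node.getValue().trim();
      if (text) {
        lines.push(text);
      }
    }
  }
  return lines;
}

// Intended use inside Apps Script, after preparing the string as shown above:
//   var root = XmlService.parse(html).getRootElement();
//   Logger.log(collectText(root).join('\n'));
```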

6 Comments

I get "Content is not allowed in prolog." when I try this with HTML that contains an ordered list. function ParseHtml() { var html = 'This is just a Test<br><br>Here is my List<br><ol><li>one</li><li>Two</li><li>Three</li></ol><br>And a bulleted one<br><ul><li>Bullet One</li><li>Bullet Two</li><li>Bullet Three</li></ul>'; var document = XmlService.parse(html);//Creates an XML document var root = document.getRootElement();//Gets the documents root element node var text = root.getText();//Gets the text value of the element node Logger.log('text: ' + text) }
The "Content is not allowed in prolog." error is because the first part of the content: "This is just a Test" is not wrapped in a beginning and ending tag. You must also replace all of the <br> tags because they don't have an ending tag: html = "<div>" + html + "</div>"; html = html.replace(/<br>/g,"");Logger.log("html:" + html) There is more that you need to do also, but that's a start. You might want to update your original question with the added test code.
XmlService.getPrettyFormat() might be easier.
I just tried var output = XmlService.getPrettyFormat().format(document);Logger.log(output); but it doesn't remove the HTML tags, which is what the question asked for. But, it's a great way to clean up the indentations and format line breaks. Is there a better way to remove the html tags?
I see. I didn't test at first. After testing, I think we can use regex: Logger.log(output.replace(/<[^>]*>/g,"")); after format

First, create a temporary Google Doc and note its document ID.

Next, enable the Drive API advanced service for the script.

Then use the following code, which overwrites the temporary document with the HTML and reads its text back:

function htmltotext() {
  
  var html = 'Your <b>HTML</b> code here';
  var blob = HtmlService.createHtmlOutput(html).getBlob();

  var docid = 'Your doc id here';

  Drive.Files.update('',docid,blob);

  var doc = DocumentApp.openById(docid);
  var text = doc.getBody().getText();
  doc.saveAndClose();

  Logger.log(text);
  return text;
}

I know this is an old thread, but I found a fairly elegant way to do this using the Gmail service. In effect, you create a draft from your HTML using the "htmlBody" option, read the plain-text body of that draft, then delete the draft. This makes the script run a little slower, but it is effective and seemingly foolproof.

var htmlInput = '<html><body><p>Your plaintext here.</p></body></html>';
var draftMsg = GmailApp.createDraft('', 'To be deleted', '', { htmlBody: htmlInput });
var plainText = draftMsg.getMessage().getPlainBody();
draftMsg.deleteDraft();
console.log(plainText);
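
If this is used in more than one place, it may help to wrap the steps in a function so that the draft is deleted even when reading the body fails. The sketch below uses the same GmailApp calls as the snippet above. The function name htmlToPlainText is hypothetical.

```javascript
// Hypothetical wrapper around the draft approach: the finally block deletes
// the temporary draft whether or not getPlainBody() succeeds.
function htmlToPlainText(html) {
  var draft = GmailApp.createDraft('', 'To be deleted', '', { htmlBody: html });
  try {
    return draft.getMessage().getPlainBody();
  } finally {
    draft.deleteDraft();
  }
}
```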
