Substring text with HTML tags in Javascript

Question

Is there a way to take a substring of text that contains HTML tags in Javascript, without breaking the tags?

For example:

var str = 'Lorem ipsum <a href="#">dolor <strong>sit</strong> amet</a>, consectetur adipiscing elit.'

html_substr(str, 20)
// return Lorem ipsum <a href="#">dolor <strong>si</strong></a>

html_substr(str, 30)
// return Lorem ipsum <a href="#">dolor <strong>sit</strong> amet</a>, co

It seems that you want the substring to ignore the tags, but keep them intact in the final result. I think you'll need to convert the string to DOM elements, traverse through the elements, count the characters in the text nodes, and delete all characters (or text nodes) that exceed your count. Even then I have a feeling that there may be some variation between browsers with respect to white space. Not sure though.
– user113716, Commented May 14, 2011 at 16:58
Posted an answer. Seems to give the result you want, but again there may be some variation between browsers with respect to white space. Not sure.
– user113716, Commented May 14, 2011 at 17:50
Substring HTML code without breaking the HTML, like this: stackoverflow.com/questions/6118904/…
– imxylz, Commented Dec 28, 2012 at 11:54

Community · Accepted Answer · 2017-05-23 11:46:28Z

10

Taking into consideration that parsing html with regex is a bad idea, here is a solution that does just that :)

EDIT: Just to be clear: This is not a valid solution, it was meant as an exercise that made very lenient assumptions about the input string, and as such should be taken with a grain of salt. Read the link above and see why parsing html with regex can never be done.

function htmlSubstring(s, n) {
    var m, r = /<([^>\s]*)[^>]*>/g,
        stack = [],
        lasti = 0,
        result = '';

    //for each tag, while we don't have enough characters
    while ((m = r.exec(s)) && n) {
        //get the text substring between the last tag and this one
        var temp = s.substring(lasti, m.index).substr(0, n);
        //append to the result and count the number of characters added
        result += temp;
        n -= temp.length;
        lasti = r.lastIndex;

        if (n) {
            result += m[0];
            if (m[1].indexOf('/') === 0) {
                //if this is a closing tag, then pop the stack (does not account for bad html)
                stack.pop();
            } else if (m[0].slice(-2) !== '/>') {
                //if this is not a self-closing tag (one ending in "/>"), then push it on the stack
                //(checking m[0] rather than m[1] also catches tags written with a space, like <br />)
                stack.push(m[1]);
            }
        }
    }

    //add the remainder of the string, if needed (there are no more tags in here)
    result += s.substr(lasti, n);

    //fix the unclosed tags
    while (stack.length) {
        result += '</' + stack.pop() + '>';
    }

    return result;

}

Example: http://jsfiddle.net/danmana/5mNNU/

Note: patrick dw's solution may be safer regarding bad html, but I'm not sure how well it handles white spaces.

edited May 23, 2017 at 11:46

CommunityBot

answered May 14, 2011 at 19:21

Dan Manastireanu

5 Comments

Zirak Over a year ago

<img src='blah' title='Yes/No' alt='>>' /> Don't parse html with regular expressions - for every regex you have, one can find the html to break it.

Dan Manastireanu Over a year ago

@Zirak: I know :) Did you actually read the first link in the first sentence I posted? :) Also read my last sentence :P I know this is not the correct solution, but I thought it was an interesting exercise for me, and if I did it anyway, then why not post it.

Zirak Over a year ago

So you know it's bad, yet you suggest it? My example isn't invalid or bad html. It's completely valid. Run it against a validator and it won't make a noise. What's not valid is your regex, because it can't match all valid HTML.

Dan Manastireanu Over a year ago

@Zirak: I never said this was a valid solution, and of course the regex is not valid, it was never meant to be. It was just an exercise that made some wild assumptions about the input string... I'll edit the post and make this clearer

Ruchita Sheth Over a year ago

how can I get the remaining string from the above function
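
Zirak's counter-example from the comments above can be checked directly against the tag pattern used in htmlSubstring. This is only a small sketch: the variable names are illustrative and not part of the original answer.

```javascript
// Same tag pattern as in htmlSubstring above
var tagPattern = /<([^>\s]*)[^>]*>/g;

// Zirak's counter-example: the alt attribute value contains ">" characters
var html = "<img src='blah' title='Yes/No' alt='>>' />";

var match = tagPattern.exec(html);
var matchedTag = match[0];                       // what the regex takes to be the whole tag
var leftover = html.slice(tagPattern.lastIndex); // what would then be counted as text

console.log(matchedTag); // <img src='blah' title='Yes/No' alt='>
console.log(leftover);   // >' />
```

The match stops at the first ">" inside the attribute value, so the rest of the tag would be treated as visible text and counted against the character limit.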

user113716 · Accepted Answer · 2011-05-15 13:20:05Z

7

Usage:

var str = 'Lorem ipsum <a href="#">dolor <strong>sit</strong> amet</a>, consectetur adipiscing elit.';

var res1 = html_substr( str, 20 );
var res2 = html_substr( str, 30 );

alert( res1 ); // Lorem ipsum <a href="#">dolor <strong>si</strong></a>
alert( res2 ); // Lorem ipsum <a href="#">dolor <strong>sit</strong> amet</a>, co

Example: http://jsfiddle.net/2ULbK/4/

Function:

function html_substr( str, count ) {

    var div = document.createElement('div');
    div.innerHTML = str;

    walk( div, track );

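    // Trim each text node against the remaining character budget
    function track( el ) {
        if( count > 0 ) {
            var len = el.data.length;
            count -= len;
            if( count <= 0 ) {
                // count is now zero or negative, so this keeps only the characters still within budget
                el.data = el.substringData( 0, el.data.length + count );
            }
        } else {
            // budget already used up: empty any later text nodes
            el.data = '';
        }
    }

    // Depth-first walk that calls fn on every text node under el
    function walk( el, fn ) {
        var node = el.firstChild;
        if( !node ) return; // nothing to walk, e.g. an empty input string
        do {
            if( node.nodeType === 3 ) {
                fn(node);
            } else if( node.nodeType === 1 && node.childNodes && node.childNodes[0] ) {
                // only recurse into elements that actually have children
                walk( node, fn );
            }
        } while( node = node.nextSibling );
    }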
    return div.innerHTML;
}

edited May 15, 2011 at 13:20

answered May 14, 2011 at 17:48

user113716

5 Comments

Dan Manastireanu Over a year ago

I don't think that simply returning div.innerHTML is enough. Consider what happens if there are more tags after the cut point. They would end up in the final string, but empty... I think that once count<=0 you should remove the remaining elements, instead of setting data = ''

user113716 Over a year ago

@Dan: Yes, that's true. I wasn't sure which OP wanted. It could be that the potential empty tags should be left in place as part of the DOM structure. But you're right, if that's not the case, then you'd do el.parentNode.removeChild(el) instead. EDIT: Actually that would mess up the DOM walk.

Dan Manastireanu Over a year ago

@patrick dw: Here is an updated jsFiddle that removes the remaining nodes

honzahommer Over a year ago

Thanks man. This solution is great. But there is some problem with non-pair tags (img, hr, ...). Works great!

user113716 Over a year ago

@honzahommer: Can you give an example of an HTML string that is giving you trouble? Also, what do you want to do with tags that get emptied completely (ones whose entire content is above the count)? Should those tags be removed, or retained as empty tags?
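
Dan Manastireanu's suggestion in the comments above, removing whatever follows the cut point instead of blanking it, might be sketched as follows. Nodes are only collected during the walk and removed afterwards, which is one way around the concern that removing them mid-walk would disturb the traversal. The names pruneAfterCount and htmlSubstrPruned are hypothetical and do not come from the thread, and the wrapper assumes a browser `document`.

```javascript
// Hypothetical helper: truncates the text under `root` to `count` characters
// and removes every node that starts after the cut point.
function pruneAfterCount(root, count) {
    var remaining = count;
    var toRemove = [];

    function walk(parent) {
        for (var node = parent.firstChild; node; node = node.nextSibling) {
            if (remaining <= 0) {
                // Past the cut: mark for removal, but do not touch the tree yet
                toRemove.push(node);
            } else if (node.nodeType === 3) {
                // Text node: keep at most `remaining` characters
                if (node.data.length > remaining) {
                    node.data = node.data.substring(0, remaining);
                }
                remaining -= node.data.length;
            } else if (node.nodeType === 1) {
                // Element node: descend into its children
                walk(node);
            }
        }
    }

    walk(root);

    // Remove only after the walk has finished, so nextSibling links stay valid
    toRemove.forEach(function (node) {
        node.parentNode.removeChild(node);
    });
    return root;
}

// Hypothetical wrapper with the same signature as html_substr (browser only)
function htmlSubstrPruned(str, count) {
    var div = document.createElement('div');
    div.innerHTML = str;
    return pruneAfterCount(div, count).innerHTML;
}
```

Elements that are empty in the source, such as img or hr, are kept as long as they appear before the cut point. Whether tags after the cut should be removed or kept as empty tags is the open question raised in the comments, so this is one possible choice rather than the expected behavior.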

Michail M. · Accepted Answer · 2012-11-07 10:58:29Z

5

This is a solution for single tags: it makes sure the cut never lands in the middle of a tag.

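function subStrWithoutBreakingTags(str, start, length) {
    var countTags = 0; // greater than 0 while the scan is inside a tag
    var returnString = "";
    var writeLetters = 0;
    // Copy at least `length` characters (tag characters count too), then keep
    // going until the scan is no longer inside a tag. The bounds check stops the
    // loop at the end of the string, e.g. when a "<" is never closed.
    while (start + writeLetters < str.length &&
           !((writeLetters >= length) && (countTags == 0))) {
        var letter = str.charAt(start + writeLetters);
        if (letter == "<") {
            countTags++;
        }
        if (letter == ">") {
            countTags--;
        }
        returnString += letter;
        writeLetters++;
    }
    return returnString;
}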

answered Nov 7, 2012 at 10:58

Michail M.

Mubeen Khan · Accepted Answer · 2019-01-24 10:52:54Z

0

let str = 'Lorem ipsum <a href="#">dolor <strong>sit</strong> amet</a>, consectetur adipiscing elit.';
let plainText = str.replace(/<[^>]+>/g, '');

Extract the plain text with the regular expression above, then use the JS string method ".substr()" to get the desired result.
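
A short sketch of this approach, using the string from the question. It returns plain text only, so the tags do not appear in the result. `substring` is used here in place of `substr`, which is a legacy method; the variable names are illustrative.

```javascript
const str = 'Lorem ipsum <a href="#">dolor <strong>sit</strong> amet</a>, consectetur adipiscing elit.';

// Strip anything that looks like a tag, leaving only the text
const plainText = str.replace(/<[^>]+>/g, '');

// Take the first 20 characters of the plain text
const first20 = plainText.substring(0, 20);

console.log(first20); // Lorem ipsum dolor si
```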

answered Jan 24, 2019 at 10:52

Mubeen Khan

Shaz · Accepted Answer · 2011-05-14 17:33:26Z

-1

Use something similar to: str.replace(/<[^>]*>?/gi, '').substr(0, 20);
I've created an example at: http://fiddle.jshell.net/xpW9j/1/

edited May 14, 2011 at 17:33

answered May 14, 2011 at 17:23

Shaz

1 Comment

user113716 Over a year ago

This doesn't do what OP wants. In the example results, the tags are maintained.

herostwist · Accepted Answer · 2011-05-14 16:51:28Z

-2

Javascript has a sub-string method. It makes no difference if the string contains html.

see http://www.w3schools.com/jsref/jsref_substr.asp

answered May 14, 2011 at 16:51

herostwist

2 Comments

honzahommer Over a year ago

Yes, I know. But my problem is that when I use substr, the html tags would be broken.

herostwist Over a year ago

In that case you're looking at something like recursive regular expressions to balance html tags, but that's going to be hideously complicated to implement.
