15

Do you have a solution for taking a substring of text that contains HTML tags in JavaScript?

For example:

var str = 'Lorem ipsum <a href="#">dolor <strong>sit</strong> amet</a>, consectetur adipiscing elit.'

html_substr(str, 20)
// returns Lorem ipsum <a href="#">dolor <strong>si</strong></a>

html_substr(str, 30)
// returns Lorem ipsum <a href="#">dolor <strong>sit</strong> amet</a>, co
  • It seems that you want the substring to ignore the tags, but keep them intact in the final result. I think you'll need to convert the string to DOM elements, traverse through the elements, count the characters in the text nodes, and delete all characters (or text nodes) that exceed your count. Even then I have a feeling that there may be some variation between browsers with respect to white space. Not sure though. Commented May 14, 2011 at 16:58
  • Posted an answer. Seems to give the result you want, but again there may be some variation between browsers with respect to white spaces. Not sure. Commented May 14, 2011 at 17:50
  • You can take a substring of HTML code without breaking the HTML, like this: stackoverflow.com/questions/6118904/… Commented Dec 28, 2012 at 11:54

6 Answers 6

10

Taking into consideration that parsing html with regex is a bad idea, here is a solution that does just that :)

EDIT: Just to be clear: this is not a valid solution. It was meant as an exercise that makes very lenient assumptions about the input string, so take it with a grain of salt. Read the link above to see why HTML cannot be parsed reliably with regex.

function htmlSubstring(s, n) {
    var m, r = /<([^>\s]*)[^>]*>/g,
        stack = [],
        lasti = 0,
        result = '';

    // for each tag, while we still need more characters
    while ((m = r.exec(s)) && n) {
        // get the text between the previous tag and this one, capped at n characters
        var temp = s.substring(lasti, m.index).substring(0, n);
        // append it to the result and count the number of characters added
        result += temp;
        n -= temp.length;
        lasti = r.lastIndex;

        if (n) {
            result += m[0];
            if (m[1].indexOf('/') === 0) {
                // closing tag: pop the stack (does not account for bad HTML)
                stack.pop();
            } else if (m[0].lastIndexOf('/') !== m[0].length - 2) {
                // not a self-closing tag (<br/> or <br />): push it on the stack
                stack.push(m[1]);
            }
        }
    }

    // add the remainder of the string, if needed (there are no more tags in here)
    result += s.substring(lasti, lasti + n);

    // close any tags that are still open
    while (stack.length) {
        result += '</' + stack.pop() + '>';
    }

    return result;
}

Example: http://jsfiddle.net/danmana/5mNNU/

Note: patrick dw's solution may be safer regarding bad HTML, but I'm not sure how well it handles whitespace.


5 Comments

<img src='blah' title='Yes/No' alt='>>' /> Don't parse html with regular expressions - for every regex you have, one can find the html to break it.
@Zirak: I know :) Did you actually read the first link in the first sentence I posted? :) Also read my last sentence :P I know this is not the correct solution, but I thought it was an interesting exercise for me, and since I did it anyway, why not post it?
So you know it's bad, yet you suggest it? My example isn't invalid or bad HTML. It's completely valid. Run it against a validator and it won't make a noise. What's not valid is your regex, because it can't match all valid HTML.
@Zirak: I never said this was a valid solution, and of course the regex is not valid; it was never meant to be. It was just an exercise that made some wild assumptions about the input string... I'll edit the post and make this clearer.
How can I get the remaining string from the above function?
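
To make the objection in these comments concrete, here is a minimal sketch that runs the answer's tag pattern against the `<img>` example from the first comment. The regex is copied from the answer; the variable names (`tagPattern`, `html`, `matches`) are made up for this illustration.

```javascript
// Tag pattern copied from the answer above.
var tagPattern = /<([^>\s]*)[^>]*>/g;

// Valid HTML from the comment: the alt attribute value contains '>' characters.
var html = "<img src='blah' title='Yes/No' alt='>>' />";

var matches = html.match(tagPattern);

// The pattern stops at the first '>' it meets, which sits inside the alt value,
// so the "tag" it reports is cut short and the rest would be treated as text.
console.log(matches[0]);     // <img src='blah' title='Yes/No' alt='>
console.log(matches.length); // 1
```

The remaining `>' />` would then be counted and emitted as ordinary text, which is the kind of breakage the comment warns about.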
7

Usage:

var str = 'Lorem ipsum <a href="#">dolor <strong>sit</strong> amet</a>, consectetur adipiscing elit.';

var res1 = html_substr( str, 20 );
var res2 = html_substr( str, 30 );

alert( res1 ); // Lorem ipsum <a href="#">dolor <strong>si</strong></a>
alert( res2 ); // Lorem ipsum <a href="#">dolor <strong>sit</strong> amet</a>, co

Example: http://jsfiddle.net/2ULbK/4/


Function:

function html_substr( str, count ) {

    var div = document.createElement('div');
    div.innerHTML = str;

    walk( div, track );

    // trim the text node that crosses the limit and empty every later one
    function track( el ) {
        if( count > 0 ) {
            var len = el.data.length;
            count -= len;
            if( count <= 0 ) {
                // count is now zero or negative, so this keeps only the allowed part
                el.data = el.substringData( 0, el.data.length + count );
            }
        } else {
            el.data = '';
        }
    }

    // depth-first walk that calls fn on every text node
    function walk( el, fn ) {
        // a for loop (instead of do...while) also copes with an element that has no children
        for( var node = el.firstChild; node; node = node.nextSibling ) {
            if( node.nodeType === 3 ) {
                fn(node);
            } else if( node.nodeType === 1 ) {
                walk( node, fn );
            }
        }
    }
    return div.innerHTML;
}

5 Comments

I don't think that simply returning div.innerHTML is enough. Consider what happens if there are more tags after the cut point. They would end up in the final string, but empty... I think that once count<=0 you should remove the remaining elements, instead of setting data = ''
@Dan: Yes, that's true. I wasn't sure which OP wanted. It could be that the potential empty tags should be left in place as part of the DOM structure. But you're right, if that's not the case, then you'd do el.parentNode.removeChild(el) instead. EDIT: Actually that would mess up the DOM walk.
@patrick dw: Here is an updated jsFiddle that removes the remaining nodes
Thanks man. This solution is great. But there is some problem with non-pair tags (img, hr, ...). Works great!
@honzahommer: Can you give an example of an HTML string that is giving you trouble? Also, what do you want to do with tags that get emptied completely (ones whose entire content is above the count)? Should those tags be removed, or retained as empty tags?
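
Following up on the point in these comments about removing the leftover nodes instead of emptying them: below is one possible sketch, not the code from the linked jsFiddle. The name `truncateNode` is hypothetical. It assumes nodes that expose the standard `nodeType`, `childNodes`, `data` and `removeChild` members, and it copies each child list before looping so that removing nodes does not disturb the walk.

```javascript
// Hypothetical helper: trims a subtree to `limit` text characters and removes
// every node that lies entirely after the cut point.
function truncateNode(root, limit) {
    var remaining = limit;

    function visit(node) {
        // Copy the child list first, because removeChild mutates childNodes.
        var children = Array.prototype.slice.call(node.childNodes);
        children.forEach(function (child) {
            if (remaining <= 0) {
                node.removeChild(child);           // past the cut: drop the node
            } else if (child.nodeType === 3) {     // text node
                if (child.data.length > remaining) {
                    child.data = child.data.substring(0, remaining);
                }
                remaining -= child.data.length;
            } else if (child.nodeType === 1) {     // element node
                visit(child);
            }
        });
    }

    visit(root);
    return root;
}
```

In a browser this could be used much like the answer's function: create a `div`, assign the string to `div.innerHTML`, call `truncateNode(div, 20)`, and read `div.innerHTML` back. Elements that start after the cut point are removed rather than left behind empty.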
5

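Here is a solution for single tags. It counts tag characters toward the length and only makes sure the cut does not land inside a tag:

function subStrWithoutBreakingTags(str, start, length) {
    var countTags = 0;
    var returnString = "";
    var writeLetters = 0;
    // continue until `length` characters are written and we are outside a tag;
    // also stop at the end of the string so an unbalanced "<" cannot loop forever
    while ((start + writeLetters < str.length) &&
           !((writeLetters >= length) && (countTags == 0))) {
        var letter = str.charAt(start + writeLetters);
        if (letter == "<") {
            countTags++;
        }
        if (letter == ">") {
            countTags--;
        }
        returnString += letter;
        writeLetters++;
    }
    return returnString;
}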

Comments

0
let str = 'Lorem ipsum <a href="#">dolor <strong>sit</strong> amet</a>, consectetur adipiscing elit.';
let plainText = str.replace(/<[^>]+>/g, '');

Extract the plain text with the regular expression above, then use the string method .substr() to get the desired result.
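
A minimal sketch of this approach, using the string from the question. It uses `substring`, which behaves the same as `substr` when starting at index 0. Note that the tags are dropped entirely, so the result is plain text rather than HTML.

```javascript
var str = 'Lorem ipsum <a href="#">dolor <strong>sit</strong> amet</a>, consectetur adipiscing elit.';

// Strip everything that looks like a tag, then cut the remaining text.
var plainText = str.replace(/<[^>]+>/g, '');
var shortText = plainText.substring(0, 20);

console.log(shortText); // Lorem ipsum dolor si
```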

Comments

-1

Use something similar to: str.replace(/<[^>]*>?/gi, '').substr(0, 20);
I've created an example at: http://fiddle.jshell.net/xpW9j/1/

1 Comment

This doesn't do what OP wants. In the example results, the tags are maintained.
-2

JavaScript has a substring method. It makes no difference whether the string contains HTML.

see http://www.w3schools.com/jsref/jsref_substr.asp

2 Comments

Yes, I know. But my problem is that when I use substr, the HTML tags would be broken.
In that case you're looking at something like recursive regular expressions to balance the HTML tags, but that's going to be hideously complicated to implement.
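
A short illustration of the problem described in the first comment, using the string from the question. A plain substring counts markup characters too, so the cut can land in the middle of a tag.

```javascript
var str = 'Lorem ipsum <a href="#">dolor <strong>sit</strong> amet</a>, consectetur adipiscing elit.';

// The first 20 characters include part of the opening <a> tag.
var cut = str.substring(0, 20);

console.log(cut); // Lorem ipsum <a href=
```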
