Count bytes in textarea using javascript

Question

I need to count how many bytes the contents of a textarea take up when encoded as UTF-8, using JavaScript. Any idea how I would do this?

thanks!

Tgr · Accepted Answer · 2010-05-17 11:10:33Z

19

encodeURIComponent(text).replace(/%[A-F\d]{2}/g, 'U').length
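A rough sketch of why this one-liner works, wrapped in a function for readability (the name utf8LengthViaEscape is only illustrative and not part of the answer). encodeURIComponent emits one %XX escape per UTF-8 byte and leaves only single-byte ASCII characters unescaped, so collapsing every escape to a single character gives the byte count:

```javascript
// Illustrative wrapper; the function name is an assumption, not from the answer.
function utf8LengthViaEscape(text) {
  // "%E2%82%AC" (the three UTF-8 bytes of "€") becomes "UUU", length 3.
  // Unescaped characters are plain ASCII and count as one byte each.
  return encodeURIComponent(text).replace(/%[A-F\d]{2}/g, 'U').length;
}

console.log(utf8LengthViaEscape('abc')); // 3
console.log(utf8LengthViaEscape('é'));   // 2
console.log(utf8LengthViaEscape('€'));   // 3
```

Note that encodeURIComponent throws a URIError when the string contains an unpaired surrogate, so callers that cannot rule that out may want a try/catch around the call.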

answered May 17, 2010 at 11:10

Tgr


4 Comments

broofa Over a year ago

This is pretty slick. The one issue is that it will throw if the string contains an invalid surrogate pattern. E.g. encodeURIComponent('\ud800a'). Just something to be aware of.

Lauri Oherd Over a year ago

How can you insert into textarea a string which contains an invalid surrogate pattern? I tried to insert the text '\ud800a' to this test page (which uses encodeURI -function internally to encode inserted text) but couldn't reproduce such an error situation - instead I saw: document.getElementsByTagName("textarea")[0].value === "\\ud800a".

Satya Prakash Over a year ago

Used for counting length of UTF-8 string.

broofa Over a year ago

@LauriOherd: (very!) late response here, but to answer your question, textareas will accept invalid strings. E.g. textarea.value = '\ud800' && encodeURIComponent(textarea.value) will throw (at least, in Chrome it will).

broofa · Accepted Answer · 2012-08-30 21:51:54Z

18

Combining various answers, the following method should be fast and accurate, and avoids issues with invalid surrogate pairs that can cause errors in encodeURIComponent():

function getUTF8Length(s) {
  var len = 0;
  for (var i = 0; i < s.length; i++) {
    var code = s.charCodeAt(i);
    if (code <= 0x7f) {
      len += 1;
    } else if (code <= 0x7ff) {
      len += 2;
    } else if (code >= 0xd800 && code <= 0xdfff) {
      // Surrogate pair: These take 4 bytes in UTF-8 and 2 chars in UCS-2
      // (Assume next char is the other [valid] half and just skip it)
      len += 4; i++;
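    } else {
      // Remaining BMP code units (0x800..0xffff) take 3 bytes in UTF-8.
      // charCodeAt() never returns a value above 0xffff, so no 4-byte branch is needed here.
      len += 3;
    }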
  }
  return len;
}
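The function above assumes that every surrogate is half of a valid pair. A possible variant that drops that assumption is sketched below. It assumes you want the same count TextEncoder would give, where an unpaired surrogate is encoded as the 3-byte replacement character U+FFFD. The name utf8LengthStrict is hypothetical.

```javascript
// Sketch only: counts UTF-8 bytes without assuming well-formed surrogate pairs.
function utf8LengthStrict(s) {
  let len = 0;
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    if (code <= 0x7f) {
      len += 1;
    } else if (code <= 0x7ff) {
      len += 2;
    } else if (code >= 0xd800 && code <= 0xdbff) {
      // High surrogate: check whether a low surrogate really follows.
      const next = s.charCodeAt(i + 1); // NaN at the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        len += 4; // valid pair, one 4-byte code point
        i++;      // skip the low half
      } else {
        len += 3; // lone high surrogate, counted as U+FFFD
      }
    } else {
      len += 3; // rest of the BMP, including a lone low surrogate
    }
  }
  return len;
}
```

The extra lookahead costs little and keeps a stray surrogate from swallowing the character after it.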

answered Aug 30, 2012 at 21:51

broofa


4 Comments

Akhil Kooliyatt Over a year ago

I had a badly designed situation where I was forced to count the bytes explicitly and handle them myself. On top of the above snippet, I also had to add handling for newline characters, since they also count as 2 bytes.

broofa Over a year ago

@RBz Are you referring to the NEL (U+0085) character? That should be counted properly by this function as 0x7f < NEL < 0x7ff. Regardless, it turns out there is now a TextEncoder API that most JS environments support. See my recent edit to the accepted answer, above.

Akhil Kooliyatt Over a year ago

I am not sure how this works, but if I press Enter it is counted as 1 by this snippet. I read that Chrome used to treat it as 2 and has since been changed to report 1. However, for me it had to be counted as 2, since my back-end database treats it as 2.

broofa Over a year ago

@RBz Be aware there are a perhaps surprising number of line termination characters in Unicode. Some encode as one byte, some as two bytes. So it really depends on what specific character(s) are used/expected. See en.wikipedia.org/wiki/Newline#Unicode .
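For the newline discrepancy discussed in the comments above, one possible sketch follows. It assumes the back end stores every line break as CRLF (two bytes), while textarea.value reports it as a single "\n". The helper name utf8LengthWithCrlf is hypothetical, and the other Unicode line terminators mentioned above are deliberately left untouched.

```javascript
// Sketch under the assumption that the server stores line breaks as "\r\n".
function utf8LengthWithCrlf(text) {
  // Normalize CRLF, lone CR, and lone LF to CRLF before measuring.
  const normalized = text.replace(/\r\n|\r|\n/g, '\r\n');
  return new TextEncoder().encode(normalized).length;
}

console.log(utf8LengthWithCrlf('a\nb'));   // 4
console.log(utf8LengthWithCrlf('a\r\nb')); // 4
```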

broofa · Accepted Answer · 2020-06-23 14:23:24Z

16

[June 2020: The previous answer has been replaced due to it returning incorrect results].

Most modern JS environments (browsers and Node) now support the TextEncoder API, which may be used as follows to count UTF8 bytes:

const textEncoder = new TextEncoder();
textEncoder.encode('⤀⦀⨀').length; // => 9

This is not quite as fast as the getUTF8Length() function mentioned in other answers, below, but should suffice for all but the most demanding use cases. Moreover, it has the benefit of leveraging a standard API that is well-tested, well-maintained, and portable.
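As a small sketch of how this might be packaged for reuse (the function name utf8ByteLength is an assumption, not part of the API), a single TextEncoder instance can be shared across calls:

```javascript
// One shared encoder; TextEncoder always encodes to UTF-8.
const sharedEncoder = new TextEncoder();

// Hypothetical helper: returns the number of UTF-8 bytes in a string.
function utf8ByteLength(text) {
  return sharedEncoder.encode(text).length;
}

console.log(utf8ByteLength('⤀⦀⨀')); // 9
console.log(utf8ByteLength('foo'));  // 3
```

In a browser this would typically be called with the textarea's value, for example utf8ByteLength(textarea.value).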

edited Jun 23, 2020 at 14:23

broofa


answered Feb 16, 2011 at 11:37

derflocki


2 Comments

Didier L Over a year ago

I don't think this implementation is correct since it counts surrogate characters twice: once when encountering the high surrogate, then once when encountering the low one. For example, the following returns 6: getUTF8Length(String.fromCharCode(0xD800, 0xDC00)) although this represents a single character (I must admit I don't know which one, I just combined 2 surrogate char codes…). I am no expert in unicode though…

Sebastian Over a year ago

@Didier L, yes you are right! It should be added to the case list and be accounted for

frank_neff · Accepted Answer · 2011-11-14 14:54:36Z

14

If you have non-bmp characters in your string, it's a little more complicated...

Because JavaScript strings are UTF-16 encoded and a "character" is a 16-bit code unit, characters outside the BMP (which take 4 bytes in UTF-8) are stored as two code units, so the string length no longer matches the number of characters you see:

    <script type="text/javascript">
        var nonBmpString = "foo😀"; // "😀" is U+1F600, outside the BMP
        console.log( nonBmpString.length );
        // will output 5
    </script>

The character "😀" takes 4 bytes in UTF-8. JavaScript treats it as 2 characters (a surrogate pair), because in JS a character is a 16-bit block.

So to correctly get the bytesize of a mixed string, we have to code our own function fixedCharCodeAt();

    function fixedCharCodeAt(str, idx) {
        idx = idx || 0;
        var code = str.charCodeAt(idx);
        var hi, low;
        if (0xD800 <= code && code <= 0xDBFF) { // High surrogate (could change last hex to 0xDB7F to treat high private surrogates as single characters)
            hi = code;
            low = str.charCodeAt(idx + 1);
            if (isNaN(low)) {
                throw new Error('Not a valid character, or memory error!');
            }
            return ((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
        }
        if (0xDC00 <= code && code <= 0xDFFF) { // Low surrogate
            // We return false to allow loops to skip this iteration since should have already handled high surrogate above in the previous iteration
            return false;
            /*hi = str.charCodeAt(idx-1);
            low = code;
            return ((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;*/
        }
        return code;
    }

Now we can count the bytes...

    function countUtf8(str) {
        var result = 0;
        for (var n = 0; n < str.length; n++) {
            var charCode = fixedCharCodeAt(str, n);
            if (typeof charCode === "number") {
                if (charCode < 128) {
                    result = result + 1;
                } else if (charCode < 2048) {
                    result = result + 2;
                } else if (charCode < 65536) {
                    result = result + 3;
                } else if (charCode < 2097152) {
                    result = result + 4;
                } else if (charCode < 67108864) {
                    result = result + 5;
                } else {
                    result = result + 6;
                }
            }
        }
        return result;
    }
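In environments with ES2015 support, the surrogate handling done by fixedCharCodeAt() is available natively: iterating a string with for...of yields whole code points, and codePointAt() returns their numeric value. A minimal sketch under that assumption (the name countUtf8ByCodePoint is illustrative):

```javascript
// Sketch: same byte thresholds as countUtf8(), but using built-in code point iteration.
function countUtf8ByCodePoint(str) {
  let result = 0;
  for (const ch of str) {           // for...of steps through code points, not code units
    const codePoint = ch.codePointAt(0);
    if (codePoint < 0x80) {
      result += 1;
    } else if (codePoint < 0x800) {
      result += 2;
    } else if (codePoint < 0x10000) {
      result += 3;                  // also covers an unpaired surrogate
    } else {
      result += 4;                  // Unicode ends at U+10FFFF, so 4 is the maximum
    }
  }
  return result;
}

console.log(countUtf8ByCodePoint('foo😀')); // 7
```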

By the way... You should not use the encodeURI-method, because, it's a native browser function ;)

Cheers

frankneff.ch / @frank_neff

answered Nov 14, 2011 at 14:54

frank_neff


5 Comments

Nadeem Ullah Over a year ago

Hi Frank, I used your method and it works correctly for multi-byte character strings. I have a textarea where I need to count characters/bytes as the user types. I tried the keypress event, but it does not fire on copy/paste. Can you please suggest a reliable and efficient way to count bytes while the user types? I need to show a count like "300 left..". Thanks & regards, Nadeem

Mathias Bynens Over a year ago

There is no need for the else if (charCode < 67108864) {} bit and the else that follows it. Unicode stops at U+10FFFF and it’s impossible to represent a non-Unicode code point in JavaScript.

user1441149 Over a year ago

This is true according to the RFC3629 specification. But the original specification allows up to six byte characters. I'm not sure which implementation should be respected but I would say this is the correct solution.

Ry- Over a year ago

@DaanBiesterbos: JavaScript uses UTF-16*, though, which can’t represent codepoints (the ones that don’t exist) above U+10FFFF anyway.

Daniel Lidström Over a year ago

@frank_neff What's wrong with using a native browser function?
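On the question raised in the comments above about counting while the user types, one possible approach is the input event, which fires for typing, paste, and cut alike. The sketch below assumes TextEncoder is available; the function names, the limit argument, and the elements passed in are illustrative assumptions rather than anything from the thread.

```javascript
const counterEncoder = new TextEncoder();

// Remaining bytes for a given text and byte limit.
function bytesLeft(text, limit) {
  return limit - counterEncoder.encode(text).length;
}

// Wires a textarea to a label element that shows e.g. "300 left..".
function attachByteCounter(textarea, label, limit) {
  const update = () => {
    label.textContent = bytesLeft(textarea.value, limit) + ' left..';
  };
  textarea.addEventListener('input', update); // typing, paste, cut, drag-and-drop
  update();                                   // show the initial count
}
```

In a page this might be called as attachByteCounter(document.getElementById('someText'), document.getElementById('counter'), 300), where both element ids are placeholders.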

Ryan W · Accepted Answer · 2012-10-29 09:15:30Z

2

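Add a byte-length counting function to String.prototype (note that it counts each character above U+00FF as 2 bytes, which matches double-byte encodings rather than UTF-8):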

String.prototype.Blength = function() {
    var arr = this.match(/[^\x00-\xff]/ig);
    return  arr == null ? this.length : this.length + arr.length;
}

then you can use .Blength() to get the size

answered Oct 29, 2012 at 9:15

Ryan W


Juan Correa · Accepted Answer · 2010-10-04 16:11:12Z

I have been asking myself the same thing. This is the best answer I have stumbled upon:

http://www.inter-locale.com/demos/countBytes.html

Here is the code snippet:

<script type="text/javascript">
 function checkLength() {
    var countMe = document.getElementById("someText").value;
    var escapedStr = encodeURI(countMe);
    var count;
    if (escapedStr.indexOf("%") != -1) {
        // every "%XX" escape stands for exactly one UTF-8 byte
        count = escapedStr.split("%").length - 1;
        if (count == 0) count++;  //perverse case; can't happen with real UTF-8
        // the remaining characters are unescaped ASCII, one byte each
        var tmp = escapedStr.length - (count * 3);
        count = count + tmp;
    } else {
        count = escapedStr.length;
    }
    alert(escapedStr + ": size is " + count);
 }
</script>

The link contains a live example of it to play with. "encodeURI(STRING)" is the building block here, but also look at encodeURIComponent(STRING) (as already pointed out in the previous answer) to see which one fits your needs.

Regards

Lauri Oherd · Accepted Answer · 2012-09-02 08:05:31Z

0

encodeURI(text).split(/%..|./).length - 1

answered Sep 2, 2012 at 8:05

Lauri Oherd


qbolec · Accepted Answer · 2013-04-29 17:42:07Z

0

How about simply:

unescape(encodeURIComponent(utf8text)).length

The trick is that encodeURIComponent seems to work on characters while unescape works on bytes.

answered Apr 29, 2013 at 17:42

qbolec


1 Comment

jvatic Over a year ago

the unescape function is deprecated and obsolete as of JavaScript 1.5

Sankumarsingh · Accepted Answer · 2014-01-19 04:37:09Z

-1

Try the following:

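function b(c) {
     var n = 0;
     for (var i = 0; i < c.length; i++) {
           var p = c.charCodeAt(i);
           if (p < 128) {
                 n++;
           } else if (p < 2048) {
                 n += 2;
           } else if (p >= 0xD800 && p <= 0xDBFF) {
                 // High surrogate: the pair encodes as 4 bytes in UTF-8,
                 // so skip the low half (assumes well-formed pairs)
                 n += 4;
                 i++;
           } else {
                 n += 3;
           }
     }
     return n;
}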

edited Jan 19, 2014 at 4:37

Sankumarsingh


answered Jan 19, 2014 at 3:57

user3211372


akc42 · Accepted Answer · 2016-10-09 08:02:49Z

-2

Just set the meta charset to UTF-8 and it's OK!

<meta charset="UTF-8">
<meta http-equiv="content-type" content="text/html;charset=utf-8">

and js:

if($mytext.length > 10){
 // its okkk :)
}

edited Oct 9, 2016 at 8:02

akc42


answered Oct 9, 2016 at 7:31

Mehdi Mashayekhi

