Chapter 1, Lesson 4

Why .length lies

How JavaScript actually models strings as Unicode, using Intl.Segmenter to count what a user sees and normalization to keep look-alike text comparable.

A user types 280 emojis into a bio field with a “280-character limit” and gets a “too long” error. A different user signs up with a name that contains an accented letter, and the duplicate check misses an existing row that looks identical on screen. Both bugs trace to the same root: string.length in JavaScript counts UTF-16 code units (16-bit storage chunks), not the characters the user perceives. By the end of this lesson you’ll know which of the three available counts to reach for, and how to compare two strings that the user thinks of as identical.

The surprise: `.length` doesn’t count characters

Predict the output of these three lines. Two of them won’t match what you’d guess from looking at the strings.


console.log('hello'.length);
console.log('🇺🇸'.length);
console.log('👨‍👩‍👧‍👦'.length);

The takeaway goes beyond “watch out for emojis.” .length is a storage detail: it reports how many UTF-16 code units the engine uses for the string, not how a human reads it. The rest of this lesson covers the three counts you actually have available, and how to pick the right one for the job.
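To see where those numbers come from, it can help to dump the raw UTF-16 code units. The sketch below uses a hypothetical helper, codeUnits, which is not part of the lesson's code. The flag is written with escape sequences so the example doesn't depend on how your editor stores the emoji.

```typescript
// Hypothetical helper for illustration: list a string's UTF-16 code units in hex.
const codeUnits = (input: string): string[] =>
  Array.from({ length: input.length }, (_, i) =>
    input.charCodeAt(i).toString(16).toUpperCase().padStart(4, '0'),
  );

// The US flag is two regional-indicator code points (U+1F1FA, U+1F1F8).
const flag = '\u{1F1FA}\u{1F1F8}';

console.log(codeUnits('hello').length);   // 5: one code unit per ASCII letter
console.log(codeUnits(flag).join(' '));   // D83C DDFA D83C DDF8
console.log(flag.length);                 // 4: two surrogate pairs
```

Each regional indicator sits above U+FFFF, so UTF-16 needs a surrogate pair for each one, and .length reports all four halves.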

Three counts, three reaches

There’s no single “length of a string” in JavaScript. There are three, and each one answers a different question. So instead of asking “how long is this string?”, ask “do I need code units, code points, or grapheme clusters?”

Start with the same input measured all three ways:

const family = '👨‍👩‍👧‍👦';

family.length;                                            // 11
[...family].length;                                       // 7
[...new Intl.Segmenter('en', { granularity: 'grapheme' })
  .segment(family)].length;                               // 1

Here are the three options and when to reach for each:

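Code units: string.length. The right answer when the value is a key, an index or offset back into the same string, or a limit that is itself defined in UTF-16 code units. It’s fast, allocates nothing, and is correct for ASCII, but wrong for anything the user sees. It is also not a byte count: the size of a UTF-8 payload is new TextEncoder().encode(string).length, which differs from .length as soon as the text contains a non-ASCII character.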
Code points: [...string].length (or Array.from(string).length). Spreading a string iterates by Unicode code point, so surrogate pairs collapse to one. This is closer to what a human would count than .length, but still wrong for any character built from joined sequences. The family above is seven code points but one cluster on screen.
Grapheme clusters: new Intl.Segmenter(locale, { granularity: 'grapheme' }), then count the segments. The right answer for “how many characters does the user see,” and the 2026 default for any length check on a user-facing field.

The table below runs the same three counts against three representative inputs. Notice how the disagreement grows as the input moves from an accented word toward complex emoji:

Input	`.length` (code units)	`[...str].length` (code points)	`Intl.Segmenter` (grapheme clusters)
`'café'` (combining acute)	5	5	4
`'🇺🇸'` (US flag)	4	2	1
`'👨‍👩‍👧‍👦'` (family)	11	7	1

The same string, three different lengths. Pick the column whose intent matches yours.

The table shows the core pattern: each step up in fidelity (code units to code points to grapheme clusters) handles another class of input that the cruder count gets wrong. On ASCII all three counts agree, which is why .length looks fine on 'hello' and then fails without warning the moment a user pastes an emoji or an accented character.

One forward link: Zod’s .length() constraint on string schemas (coming in a later unit) uses .length under the hood, which counts code units, so user-facing length validation needs a custom refinement that runs the segmenter. The fix is a one-liner, and we’ll wire it in when validation lands.
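Until then, the check itself can be sketched without any validation library. The predicate below is a hypothetical helper: the name withinCharacterLimit and the 280 limit are illustrative, not part of Zod or this course's code. A function of this shape is the kind of thing a custom refinement could call.

```typescript
// One shared segmenter; the 'en' locale mirrors the lesson's examples.
const graphemeSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// Hypothetical predicate: true when the user-perceived length fits the limit.
const withinCharacterLimit = (input: string, max: number): boolean => {
  let count = 0;
  for (const _segment of graphemeSegmenter.segment(input)) {
    count += 1;
    if (count > max) return false; // stop early on very long input
  }
  return true;
};

const flags = '\u{1F1FA}\u{1F1F8}'.repeat(280); // 280 US flags

console.log(flags.length <= 280);               // false: 1120 code units
console.log(withinCharacterLimit(flags, 280));  // true: 280 grapheme clusters
```

This is the intro's bio-field bug in miniature: the same input fails a .length check and passes a grapheme check.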

`Intl.Segmenter`: the 2026 user-facing length

Intl.Segmenter has been part of the platform baseline since 2024: it ships in every modern browser this course targets and in Node 16 and up. The course pins Node 24 LTS, so you can use it directly, with no polyfill, no if (typeof Intl.Segmenter !== 'undefined') check, and no fallback path.

The one-liner to memorize:

const countCharacters = (input: string): number =>
  [...new Intl.Segmenter('en', { granularity: 'grapheme' }).segment(input)].length;

countCharacters('🇺🇸');         // 1
countCharacters('👨‍👩‍👧‍👦');     // 1
countCharacters('café');        // 4

The function reads in two parts.

First, the constructor takes a locale and a granularity option. The locale matters for scripts where grapheme rules differ (Thai, Khmer, Devanagari); for English and most Latin-script text the cluster boundaries come out the same either way, but you still pass a locale. 'grapheme' is the mode that segments by user-perceived character; the same constructor supports 'word' and 'sentence' granularities for related jobs.

Second, .segment(input) returns an iterable of { segment, index, ... } objects, one per cluster. There’s no .size or .length shortcut on the result, so the idiomatic count is to spread the result into an array and read its length. That spread-then-.length pattern is the standard form, so when you see it in a codebase you know exactly what it’s doing.
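To make the shape of those objects concrete, here is a small sketch that looks at each segment instead of only counting them. The input string and the clusters variable are made up for illustration.

```typescript
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// 'e' + combining acute, then a US flag (two regional indicators), then '!'.
const input = 'e\u0301\u{1F1FA}\u{1F1F8}!';

const clusters = [...segmenter.segment(input)].map(({ segment, index }) => ({
  index,                      // offset in UTF-16 code units where the cluster starts
  codeUnits: segment.length,  // how many code units the cluster occupies
}));

for (const cluster of clusters) {
  console.log(cluster.index, cluster.codeUnits);
}
// 0 2
// 2 4
// 6 1
```

Note that index is expressed in code units, so it is safe to pass back into slice on the same string.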

The companion granularities are worth recognizing even if you rarely reach for them. granularity: 'word' is the right tool for word-count features, and each segment carries an isWordLike flag so you can skip punctuation and whitespace. granularity: 'sentence' does the same for sentence-count features. Both are rare in SaaS UIs, but knowing they exist means you’ll recognize Intl.Segmenter as the one tool that handles all three jobs when you meet it elsewhere.
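As a rough illustration of the word granularity, the sketch below counts words by keeping only the segments flagged as word-like. The countWords name is hypothetical, and the sample sentence is made up.

```typescript
const wordSegmenter = new Intl.Segmenter('en', { granularity: 'word' });

// Hypothetical helper: punctuation and whitespace segments have isWordLike === false.
const countWords = (input: string): number =>
  [...wordSegmenter.segment(input)].filter((part) => part.isWordLike).length;

console.log(countWords('Hello, world! How are you?')); // 5
console.log(countWords(''));                           // 0
```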

One forward link: the same Intl.* namespace also ships Intl.NumberFormat (mentioned in the previous lesson on money), Intl.Collator for locale-aware sorting, and Intl.DateTimeFormat for date and time rendering. This is the platform-native internationalization surface the course leans on instead of pulling in third-party libraries.

Normalization at the storage boundary

Normalization solves the other half of the intro’s problem: two strings that look identical on screen but compare as different. In this snippet the two literals render the same, yet the equality check fails and the two length checks disagree:

const a = 'caf\u00E9';       // precomposed: 'c', 'a', 'f', 'é'
const b = 'cafe\u0301';      // decomposed:  'c', 'a', 'f', 'e' + combining acute
a === b;                     // false
a.length;                    // 4
b.length;                    // 5

A user would type these as the same word, but they’re different sequences of Unicode code points. The first uses a single precomposed é, one code point that already includes the accent. The second uses a plain e followed by a combining acute accent, two code points that the renderer overlays into one visual cluster. The previous lesson’s === rule still applies: for strings, === compares the two values code unit by code unit, and the code units here genuinely differ.
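If you want to see the difference rather than take it on faith, you can list the code points of each form. This is a minimal sketch with a hypothetical codePoints helper; both strings are written with escapes so a copy-paste can't silently convert one form into the other.

```typescript
// Hypothetical helper: list code points in U+XXXX notation.
const codePoints = (input: string): string[] =>
  [...input].map(
    (ch) => 'U+' + ch.codePointAt(0)!.toString(16).toUpperCase().padStart(4, '0'),
  );

const precomposed = 'caf\u00E9'; // single code point for the accented letter
const decomposed = 'cafe\u0301'; // 'e' followed by a combining acute accent

console.log(codePoints(precomposed).join(' ')); // U+0063 U+0061 U+0066 U+00E9
console.log(codePoints(decomposed).join(' '));  // U+0063 U+0061 U+0066 U+0065 U+0301
console.log(precomposed === decomposed);        // false
```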

The fix is one method call:

a.normalize('NFC') === b.normalize('NFC');   // true

.normalize('NFC') collapses both forms to the same canonical sequence. NFC stands for Normalization Form C (Canonical Composition): it combines characters into their precomposed form wherever possible. It’s the form you want by default for storage, comparison, and search.

There are four normalization forms in total, and you only need a short rule of thumb for each:

NFC (Canonical Composition): combine characters into their precomposed form. This is the senior default for storage, comparison, and search, and the form this lesson teaches.
NFD (Canonical Decomposition): split precomposed characters into base letter plus combining marks. Useful for accent-insensitive search, where you decompose and then strip the combining marks.
NFKC / NFKD (Compatibility forms): collapse characters that look similar but are semantically distinct, so the ﬁ ligature becomes fi and full-width digits become ASCII digits. Useful for fuzzy matching at a search boundary, but the wrong default for storage because they lose information.
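The rules of thumb above can be sketched in a few lines. The stripAccents helper is hypothetical, and its regex only covers the basic Combining Diacritical Marks block (U+0300 to U+036F), which is an assumption that fits Latin-script accents but not every script.

```typescript
// NFD, then drop combining marks: one common way to build an accent-insensitive search key.
const stripAccents = (input: string): string =>
  input.normalize('NFD').replace(/[\u0300-\u036f]/g, '');

console.log(stripAccents('caf\u00E9'));               // cafe

// Compatibility forms fold look-alikes, which loses information.
console.log('\uFB01'.normalize('NFKC'));              // fi  (the ligature becomes two letters)
console.log('\uFF11\uFF12\uFF13'.normalize('NFKC'));  // 123 (full-width digits become ASCII)

// NFC leaves the ligature alone, which is why it is the safer storage default.
console.log('\uFB01'.normalize('NFC') === '\uFB01');  // true
```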

One rule keeps this from becoming a maintenance burden: normalize once, at the database write boundary. Don’t sprinkle .normalize('NFC') at every === and every .length call site downstream. If every value in the table is already NFC, every comparison and every length check works against a canonical form for free. Normalization lives at the same boundary that runs your input validation (where Zod lands in a later unit): one call, in one place, and the rest of the system stays clean.

Scattered .normalize calls are also a sign that the storage boundary isn’t doing its job. If you find yourself reaching for .normalize deep inside a comparison helper or a search function, walk back to the seam where the value entered the system and normalize it there instead.
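One way to picture the boundary is a single helper that every incoming value passes through, whether it is about to be written or about to be compared against stored rows. The sketch below is an assumption about how that could look: toStoredText, existingNames, and isDuplicate are hypothetical names, and a Set stands in for the database table.

```typescript
// Hypothetical write-boundary helper: the one place that normalizes.
const toStoredText = (raw: string): string => raw.normalize('NFC');

// Stand-in for a table of display names. The value arrived decomposed
// and was normalized on the way in.
const existingNames = new Set<string>([toStoredText('Rene\u0301e')]);

// Incoming candidates cross the same boundary before any comparison.
const isDuplicate = (candidate: string): boolean =>
  existingNames.has(toStoredText(candidate));

console.log(isDuplicate('Ren\u00E9e'));        // true: precomposed input matches
console.log(isDuplicate('Rene\u0301e'));       // true: decomposed input matches too
console.log(existingNames.has('Rene\u0301e')); // false: skipping the boundary misses the row
```

Everything downstream of toStoredText can then use plain === and Set lookups without thinking about forms.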

One forward link: Drizzle’s text and varchar columns (in a later unit) don’t normalize for you; Postgres stores whatever bytes you hand it. The normalization belongs in the schema or the action handler, before the row hits the database.

The senior `String.prototype` surface

String.prototype is huge, and most of it is legacy. In 2026 you reach for a small set of methods daily; the long tail you only need to recognize when you meet it in older code. Here’s the working surface:

includes / startsWith / endsWith: substring tests. Reach for these over indexOf(needle) !== -1, which reads as “where is it?” when the question is “is it there?”
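at(-1): last-element access. Cleaner than string[string.length - 1], and negative indices count from the end for any position. Like bracket indexing, it returns a UTF-16 code unit, so on a string that ends in an emoji it gives you half of a surrogate pair rather than the whole character.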
slice(start, end): substring extraction, and the default choice. Prefer it over substring (which silently swaps its arguments if start > end) and substr (deprecated).
split / join: the boundary between a string and an array of segments, always paired. split(sep) breaks the string apart, join(sep) reassembles it.
replaceAll(needle, replacement): the modern way to replace every occurrence of a literal string. It removes the replace(/needle/g, ...) regex boilerplate when the needle isn’t a regex anyway.
trim / trimStart / trimEnd: whitespace cleanup at input boundaries. Pair with the empty-string guard from the previous lesson when converting form input.
padStart / padEnd: fixed-width formatting. Rare in SaaS UIs; occasional in logs and CLI output.
localeCompare(other, locale, options): locale-aware comparison for sorting. The right answer for any user-visible alphabetical sort, because < and > on strings compare by code unit, which puts accented characters in positions no human would call alphabetical.
normalize(form): the boundary tool from the previous section.
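A quick tour of the working surface, as a sketch you can paste into a scratch file. The sample strings are made up; the sort comparison at the end assumes the 'en' locale.

```typescript
const title = 'release-notes.md';

console.log(title.endsWith('.md'));          // true
console.log(title.at(-1));                   // d
console.log(title.slice(0, -3));             // release-notes
console.log(title.split('-').join(' '));     // release notes.md
console.log('a-b-c'.replaceAll('-', '+'));   // a+b+c
console.log('7'.padStart(3, '0'));           // 007

// slice and substring disagree when start > end.
console.log('hello'.slice(3, 1) === '');     // true: slice returns an empty string
console.log('hello'.substring(3, 1));        // el (arguments silently swapped)

// Default sort compares code units; localeCompare sorts the way a reader expects.
const words = ['zebra', '\u00E9clair', 'apple'];
const byCodeUnit = [...words].sort();
const byLocale = [...words].sort((a, b) => a.localeCompare(b, 'en'));

console.log(byCodeUnit.join(', '));          // apple, zebra, éclair
console.log(byLocale.join(', '));            // apple, éclair, zebra
```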

Then there’s the legacy surface: methods you’ll see in older code and AI suggestions, but that don’t earn a place in 2026 code:

substr: relegated to the spec’s legacy annex (Annex B) and marked deprecated on MDN; behaves like slice with a length argument instead of an end index. Use slice.
substring: older sibling of slice that silently swaps its arguments if start > end. No reason to prefer it.
escape / unescape: globals rather than String.prototype methods, but commonly grouped with them. Deprecated for over two decades; use encodeURIComponent / decodeURIComponent for URL escaping.
String.prototype.bold / italics / fontcolor / etc.: HTML-wrapping methods from the 1990s, deprecated; the JSX layer owns markup in this stack.
String.raw: the template-literal escape hatch. This one isn’t legacy; it belongs to the next lesson, where tagged templates get their full treatment.

Skim this list once so you recognize each method on sight in later lessons. There’s no need to memorize a full reference; when you need the complete surface, MDN has it.

Practice: count what the user sees

Now write the canonical segmenter pattern yourself. Implement countCharacters(input) so it returns the number of characters the user actually perceives: the count a bio field with a character limit should enforce.

Implement countCharacters(input) using Intl.Segmenter with granularity: 'grapheme'. The tests cover ASCII, emoji built from surrogate pairs, joined emoji sequences, combining marks, and the empty string. If you use .length, every non-ASCII test fails; the spread form gets past a lone surrogate pair but still fails the joined sequences and the combining marks; only the segmenter passes all six.


The flag-emoji and family-emoji tests are the load-bearing ones: they fail if you reach for .length or the spread form, the same way those counts failed in the table earlier in the lesson. The combining-mark test catches an implementation that uses [...str].length and stops there.

External resources

Intl.Segmenter — MDN

developer.mozilla.org

The full reference, including the word and sentence granularities with worked isWordLike examples.

String.prototype.normalize — MDN

developer.mozilla.org

The four normalization forms with worked examples covering precomposed vs decomposed, compatibility decomposition, and the search-friendly NFKC patterns.

The Absolute Minimum Every Software Developer Must Know About Unicode in 2023

tonsky.me

Nikita Prokopov's modern rewrite of Joel Spolsky's classic — UTF-8, grapheme clusters, and normalization with the same systems mindset this lesson takes.

UniView — interactive Unicode inspector

r12a.github.io

Paste any string to see its code points, names, and grapheme breakdown — the fastest way to understand why an emoji has the length it does.