Chapter 3, Lesson 6

Regex: the modern flavor

The modern JavaScript regular-expression surface, and the senior judgment of when to drop the regex for a parser instead.

Here are two production bugs, both caused by a regex. The first looks like a textbook validator but rejects most of the world’s names. The second hand-rolls a regex to do the work of a method call that was one import away.

const isValidUsername = (input: string) => /^[a-zA-Z0-9]+$/.test(input);

isValidUsername('Smith');   // true
isValidUsername('Müller');  // false
isValidUsername('José');    // false
isValidUsername('小明');     // false

This regex shipped to a sign-up form on launch day. By the next morning the support inbox held thirty complaints from non-English users locked out of their own accounts. The problem is that [a-zA-Z0-9] only matches ASCII, but a “letter” in 2026 means any Unicode letter. Three out of four reasonable names fall through.

const EMAIL_RE = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;

export const subscribe = async (formData: FormData) => {
  const email = String(formData.get('email'));
  if (!EMAIL_RE.test(email)) {
    return { error: 'Invalid email' };
  }
  // ... persist subscriber
};

This regex is shorter than the real RFC 5322 grammar but still longer than the team would want to maintain, and it gets the job wrong in both directions. It rejects valid addresses such as quoted local-parts and internationalized domain names. It also accepts garbage, since user@..com passes. On top of that, it buries the real contract, that this field must be a deliverable email address, behind a wall of escapes. A single method call on a Zod schema does the job, and it brings a localized error message, a JSON Schema shape, and a maintained validator behind it.
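To make both failure directions concrete, here is a minimal sketch that runs the same pattern against a few hand-picked inputs. The sample addresses are illustrative assumptions, not data from the incident.

```typescript
// Same pattern as the subscribe handler above.
const EMAIL_RE = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;

// Hypothetical sample inputs, chosen to show each failure direction.
const samples = [
  'ada@example.com',        // ordinary address
  '"john doe"@example.com', // quoted local-part, valid per RFC 5322
  'ada@bücher.example',     // internationalized domain name
  'user@..com',             // empty domain labels, not deliverable
];

const results = samples.map((sample) => EMAIL_RE.test(sample));
// results: [true, false, false, true]
```

The two middle inputs are false rejections, and the last one is a false accept.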

This lesson covers the regex flavor a 2026 engineer actually writes, and the point where a regex stops being the right tool at all.

Two construction forms

A regex in JavaScript can be written two ways. The first is your default. The second comes out in only one situation.

Literal (default)

const hexColor = /^#[\da-f]{6}$/i;

Slashes wrap the pattern, and flags go after the closing slash. It compiles once at parse time, every editor highlights it, and every reader recognizes it. Use it for any regex whose pattern is fixed when you write it, which is almost all of them.

Constructor (dynamic pattern)

const buildSearchRe = (term: string) => {
  const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp(escaped, 'i');
};

new RegExp(pattern, flags) takes the pattern as a string. Use it when the pattern is genuinely built from a variable, such as a user-supplied search term or a parameterized field name. Two things need care here. First, every backslash in the string has to be doubled: you write new RegExp('\\d+'), not new RegExp('\d+'), because the string literal consumes one backslash before the regex parser ever sees it. Second, you have to escape any user-supplied substring with the three-line helper above. Without that escaping, a user can inject regex metacharacters into your pattern.
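Both points of care can be seen in a few lines. The sketch below is illustrative only: escapeRegExp is a hypothetical name for the replace call from the helper, and the search term stands in for user input.

```typescript
// Hypothetical name for the escaping step used by buildSearchRe.
const escapeRegExp = (text: string): string =>
  text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// 1. Backslashes: the string literal consumes one level of escaping.
const digits = new RegExp('\\d+');   // pattern source is \d+
const notDigits = new RegExp('\d+'); // pattern source is d+

const digitsMatch = digits.test('42');       // true
const notDigitsMatch = notDigits.test('42'); // false, it looks for the letter d

// 2. Metacharacters: an unescaped term is a pattern, not literal text.
const term = 'a.c'; // stand-in for a user-supplied search term
const unescapedHit = new RegExp(term).test('abc');             // true, . matches b
const escapedHit = new RegExp(escapeRegExp(term)).test('abc'); // false
const literalHit = new RegExp(escapeRegExp(term)).test('a.c'); // true
```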

The constructor form is rare in application code, because most patterns are known when you write them. When you do need it, the escape helper above covers you. Reaching for an npm package to do three lines of replace is over-engineering.

The flag surface

Six flags do all the work, and each one has a clear job.

// g — required for .matchAll and for .replaceAll(regex, ...)
// i — case-insensitive; the daily reach for human input
// m — ^ and $ match line boundaries, not just string boundaries
// s — . matches newlines (dotAll); needed for patterns that span lines
// u — Unicode mode; full code-point matching, enables \p{...} escapes
// v — ES2024 Unicode sets mode; supersedes u, adds set ops and \p{RGI_Emoji}

You can combine flags freely (gi, gim, gs), with one rule that catches everyone the first time. Unicode mode (u) and Unicode sets mode (v) are mutually exclusive, so writing /pattern/uv is a syntax error. Default to u. Reach for v only when you need set operations inside a character class, or when you’re matching emoji sequences. Both flags are available everywhere in May 2026, from Node 24 LTS to every current browser, so the only question is which one the pattern needs, not whether your runtime supports it.
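A short sketch of what m, s, and u each change, plus the u/v conflict. The sample strings are made up for illustration, and the conflict is shown through the constructor so the failure happens at runtime rather than at parse time.

```typescript
const text = 'first line\nsecond line';

// m: ^ and $ also match at line boundaries.
const withoutM = /^second/.test(text); // false
const withM = /^second/m.test(text);   // true

// s: . also matches newline characters.
const withoutS = /first.+second/.test(text); // false
const withS = /first.+second/s.test(text);   // true

// u: . consumes a whole code point rather than one UTF-16 code unit.
const emoji = '😀'; // one code point, two UTF-16 code units
const withoutU = /^.$/.test(emoji); // false
const withU = /^.$/u.test(emoji);   // true

// u and v cannot be combined; the constructor throws a SyntaxError.
let combinedFlagsError = '';
try {
  new RegExp('a', 'uv');
} catch (error) {
  combinedFlagsError = error instanceof Error ? error.name : 'unknown';
}
// combinedFlagsError: 'SyntaxError'
```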

Named capture groups

Any time a regex captures structure, give the captures names. Named groups are the default in 2026. You’ll still see indexed access like match[1] and match[2] in older code and code reviews, but learning the named form first keeps you out of a common trap. Referring to captures by position breaks the moment you add or reorder a group in the pattern; referring to them by name does not.

const invoiceRe = /^INV-(?<year>\d{4})-(?<num>\d{4})$/;

const match = 'INV-2026-0042'.match(invoiceRe);
if (match?.groups) {
  const { year, num } = match.groups;
  // year: '2026', num: '0042'
}

The (?<name>...) syntax names a capture group inline. Each name has to be a valid identifier and unique within the pattern. The value of the group is whatever the inner pattern, here \d{4}, captures.

.match(re) returns either a match object or null. On a match, .groups is an object keyed by the names you used in the pattern; on no match, the whole result is null. The match?.groups guard handles both cases. The optional chaining short-circuits when there’s no match, and the truthiness check inside the if is what satisfies TypeScript, which otherwise treats .groups as possibly undefined.

TypeScript types .groups as { [key: string]: string } | undefined. The type system can’t see which names appeared in the pattern, so every key on .groups is just string. Once the guard above narrows away the undefined, the destructure gives you string rather than string | undefined. If you want tighter typing, validate the parsed values at the call site with Zod or a small satisfies shape, rather than chasing template-literal-type tricks on the regex itself.
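One way to get tighter types without Zod is a small hand-written parse function at the call site. This is a sketch under assumptions: InvoiceId and parseInvoiceId are hypothetical names, and converting the captures to numbers is a design choice for the example, not something the pattern requires.

```typescript
// Hypothetical shape for the parsed result.
interface InvoiceId {
  year: number;
  num: number;
}

const INVOICE_RE = /^INV-(?<year>\d{4})-(?<num>\d{4})$/;

// Parse once at the boundary, then pass typed values to the rest of the code.
const parseInvoiceId = (input: string): InvoiceId | null => {
  const groups = INVOICE_RE.exec(input)?.groups;
  if (!groups) return null;

  const { year, num } = groups;
  // With noUncheckedIndexedAccess enabled these are string | undefined.
  if (year === undefined || num === undefined) return null;

  return { year: Number(year), num: Number(num) };
};

const parsed = parseInvoiceId('INV-2026-0042');
// parsed: { year: 2026, num: 42 }
const rejected = parseInvoiceId('INV-26-42');
// rejected: null
```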

Indexed capture groups still exist: /^INV-(\d{4})-(\d{4})$/ read through match[1] and match[2] is the older form. Recognize it in code review, but write named groups yourself. Backreferences work the same way, \k<year> for named groups and \1 for indexed, and the same default applies: prefer the named form.
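As an illustration of a named backreference, the sketch below matches a quoted string whose closing quote must be the same character as its opening quote. The pattern and inputs are made up for the example.

```typescript
// \k<quote> must match exactly what the quote group captured.
const quotedRe = /(?<quote>['"])(?<body>.*?)\k<quote>/;

const body = quotedRe.exec(`say "it's fine" now`)?.groups?.body;
// body: "it's fine" (the inner ' does not close a string opened with ")

const mismatched = quotedRe.test(`"unterminated'`);
// mismatched: false
```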

One hazard is worth knowing about. Any regex with nested unbounded quantifiers like (a+)+, or an .* followed by alternation, can hit ReDoS on hostile input. The engine ends up backtracking through an exploding number of combinations, and a single request can pin a CPU core for seconds. Two habits prevent it: avoid nesting quantifiers, and put a length cap on any user-controlled string before the regex sees it.
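The two habits can be combined in one small guard. This is a minimal sketch: the slug format, the helper name, and the 128-character limit are all assumptions chosen for illustration.

```typescript
// Assumed limit; pick one that fits the field you are validating.
const MAX_SLUG_LENGTH = 128;

// One quantifier, no nesting, no alternation inside a repeated group.
const SLUG_RE = /^[a-z0-9-]+$/;

// The length check runs first, so oversized input never reaches the regex.
const isValidSlug = (input: string): boolean =>
  input.length <= MAX_SLUG_LENGTH && SLUG_RE.test(input);

const ok = isValidSlug('my-first-post');     // true
const tooLong = isValidSlug('a'.repeat(129)); // false, rejected by the cap
const badChars = isValidSlug('My Post');      // false, rejected by the pattern
```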

`\p{Letter}` over `[a-zA-Z]`

This is the habit change to take away from the lesson, and it fixes the username bug that opened it. Compare the two versions below.

Latin-only (broken)

const isLetters = (input: string) => /^[a-zA-Z]+$/u.test(input);

['Müller', '小明', 'José', 'Smith'].map(isLetters);
// → [false, false, false, true]

[a-zA-Z] is exactly the 52 ASCII letters. Anything outside that range falls through: accented Latin, CJK, Cyrillic, Arabic, Devanagari. Three out of four reasonable names get rejected, which is how this kind of bug ends up reported through the support inbox rather than caught in review.

Unicode-aware (correct)

const isLetters = (input: string) => /^\p{Letter}+$/u.test(input);

['Müller', '小明', 'José', 'Smith'].map(isLetters);
// → [true, true, true, true]

\p{Letter} is a property escape, meaning any code point classified as a letter, in any script. The u flag (or v) is required, because it’s what tells the engine to read \p{...} as a property escape rather than a literal p. The regex has the same shape as before, but now its meaning is correct and the ASCII-only bug is gone.
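One caveat is worth checking in your own code. An accented letter can arrive as a single precomposed code point or as a base letter followed by a combining mark, and a combining mark is classified as a Mark, not a Letter. The sketch below assumes you want both spellings accepted and shows two possible approaches; isLettersOrMarks is a hypothetical name.

```typescript
const isLetters = (input: string): boolean => /^\p{Letter}+$/u.test(input);

const precomposed = 'Jos\u00E9'; // é as one code point (U+00E9)
const decomposed = 'Jose\u0301'; // e followed by a combining acute accent (U+0301)

const precomposedOk = isLetters(precomposed); // true
const decomposedOk = isLetters(decomposed);   // false, U+0301 is a Mark

// Option 1: normalize to NFC before testing.
const normalizedOk = isLetters(decomposed.normalize('NFC')); // true

// Option 2: accept combining marks alongside letters.
const isLettersOrMarks = (input: string): boolean =>
  /^[\p{Letter}\p{Mark}]+$/u.test(input);
const withMarksOk = isLettersOrMarks(decomposed); // true
```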

These are the properties worth recognizing on sight:

\p{Letter} matches any letter in any script. This is the default for “this looks like a name.”
\p{Number} matches any numeric character, including digits, Roman numerals, and Arabic-Indic digits.
\p{White_Space} matches any whitespace, including non-ASCII spaces.
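\p{Emoji} matches any emoji code point. Note that this set also includes the ASCII digits, #, and *, because they are the base characters of keycap emoji.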
\p{Script=Latin}, \p{Script=Han}, \p{Script=Cyrillic}, and so on, match a specific script when you intentionally need one.

The rule is simple: any regex over human-entered text should use \p{...}. Save [a-zA-Z] for patterns over data that is ASCII by contract, such as an ASCII hex token, a Base64 chunk, or a protocol identifier whose spec is itself ASCII.
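A few quick checks against the properties listed above. The sample characters are chosen for illustration, and each expected value follows from the character's Unicode classification.

```typescript
const hasNumber = (input: string): boolean => /\p{Number}/u.test(input);

const asciiDigit = hasNumber('7');       // true
const arabicIndicDigit = hasNumber('٣'); // true, U+0663 ARABIC-INDIC DIGIT THREE
const romanNumeral = hasNumber('Ⅷ');     // true, U+2167 ROMAN NUMERAL EIGHT

// White_Space covers non-ASCII spaces such as the no-break space.
const noBreakSpace = /\p{White_Space}/u.test('\u00A0'); // true

// Script filters accept one writing system only.
const isCyrillic = (input: string): boolean => /^\p{Script=Cyrillic}+$/u.test(input);
const cyrillicWord = isCyrillic('Привет'); // true
const latinWord = isCyrillic('Hello');     // false

// \p{Emoji} is broader than it sounds: ASCII digits carry the property too.
const digitIsEmoji = /\p{Emoji}/u.test('7'); // true
```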

The v flag adds one capability worth recognizing: set operations inside a character class.

// Letters that are also in ASCII — the explicit "Latin alphabet" form
const asciiLetter = /^[\p{Letter}&&\p{ASCII}]+$/v;

// Letters that are NOT ASCII — names with accents and non-Latin scripts
const nonAsciiLetter = /^[\p{Letter}--\p{ASCII}]+$/v;

&& is intersection, -- is difference, and a nested class is a union by default. You won’t write these often, and may go weeks without needing them. The point is to recognize && and -- inside a character class when you read one, and not misread them as logical operators.
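Here is a sketch of the set operations and \p{RGI_Emoji} in use. It builds the patterns with the constructor on the assumption that your compiler target may not yet accept the v flag in a literal; on a current target the literal form above works just as well. The sample inputs are illustrative.

```typescript
// String.raw keeps the backslashes intact, so no doubling is needed.
const asciiLetter = new RegExp(String.raw`^[\p{Letter}&&\p{ASCII}]+$`, 'v');
const nonAsciiLetter = new RegExp(String.raw`^[\p{Letter}--\p{ASCII}]+$`, 'v');

const smithAscii = asciiLetter.test('Smith');         // true
const mullerAscii = asciiLetter.test('Müller');       // false, ü is not ASCII
const xiaomingNonAscii = nonAsciiLetter.test('小明');  // true
const mullerNonAscii = nonAsciiLetter.test('Müller'); // false, M is ASCII

// \p{RGI_Emoji} is a property of strings: it can match a multi-code-point emoji.
const rgiEmoji = new RegExp(String.raw`^\p{RGI_Emoji}$`, 'v');
const flagWithV = rgiEmoji.test('🇯🇵');          // true, one flag sequence
const flagWithU = /^\p{Emoji}$/u.test('🇯🇵');    // false, the flag is two code points
```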

The four-method result surface

A regex is a value; the methods are what run it against a string. Four of them cover everything, and an inconsistency between two of those four causes most regex bugs.

pattern.test(string) is the fastest way to get a boolean.

const isHexColor = /^#[\da-f]{6}$/i;

isHexColor.test('#ff8800');   // true
isHexColor.test('#FFFF');     // false
isHexColor.test('not a hex'); // false

This is what you reach for to answer “does this match?” There’s no allocation and no object, just a boolean. The one thing to watch for is the g flag: a regex with g carries a lastIndex cursor across calls, so each .test starts scanning from where the previous one left off, and repeated calls return different answers for the same string. Don’t put g on a pattern you call .test on more than once.
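The lastIndex behavior is easy to see in isolation. A minimal sketch with a throwaway pattern:

```typescript
const stateful = /a/g;

const first = stateful.test('a');  // true, lastIndex is now 1
const second = stateful.test('a'); // false, the scan started at index 1; lastIndex resets to 0
const third = stateful.test('a');  // true again

// Without g there is no cursor, so the answer is stable.
const stateless = /a/;
const calls = [stateless.test('a'), stateless.test('a'), stateless.test('a')];
// calls: [true, true, true]
```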

string.match(pattern) is the inconsistent one. Without g, it returns a single match object, with the full match, .groups, .index, and captures. With g, it returns a plain string[] of full matches and discards the groups and indices entirely.

const re = /INV-(?<year>\d{4})-(?<num>\d{4})/;
'order INV-2026-0042'.match(re)?.groups;
// → { year: '2026', num: '0042' }

const reG = /INV-(?<year>\d{4})-(?<num>\d{4})/g;
'order INV-2026-0001 and INV-2026-0042'.match(reG);
// → ['INV-2026-0001', 'INV-2026-0042']   ← strings only, no groups

Losing the groups like this is exactly the problem the next method was added to fix.

string.matchAll(pattern) requires the g flag and throws a TypeError if you forget it. It returns an iterator of full match objects, each one carrying .groups, .index, and the captures.

const re = /INV-(?<year>\d{4})-(?<num>\d{4})/g;

for (const match of 'INV-2026-0001 and INV-2026-0042'.matchAll(re)) {
  if (!match.groups) continue;
  const { year, num } = match.groups;
  // year: '2026', num: '0001'  then  year: '2026', num: '0042'
}

The iterator works naturally with the for...of loop from the previous lesson, and with Array.from(text.matchAll(re)) when you want an array. For any capturing regex you want to run multiple times against the same string, use .matchAll rather than .match with g.
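When you want plain data rather than a loop, Array.from accepts a mapping function as its second argument. The sketch below assumes a hypothetical InvoiceRef shape and findInvoices helper.

```typescript
// Hypothetical result shape for the example.
interface InvoiceRef {
  year: string;
  num: string;
  index: number;
}

const INVOICE_RE = /INV-(?<year>\d{4})-(?<num>\d{4})/g;

// Collect every match, with its groups and position, into an array.
const findInvoices = (text: string): InvoiceRef[] =>
  Array.from(text.matchAll(INVOICE_RE), (match) => ({
    year: match.groups?.year ?? '',
    num: match.groups?.num ?? '',
    index: match.index ?? 0,
  }));

const found = findInvoices('order INV-2026-0001 and INV-2026-0042');
// found: [
//   { year: '2026', num: '0001', index: 6 },
//   { year: '2026', num: '0042', index: 24 },
// ]
```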

string.replaceAll(pattern, replacement) with a regex also requires g, and the runtime throws a TypeError if it is missing. The strictness is deliberate. The older .replace(/x/, ...) silently replaced only the first occurrence, which shipped to production as a bug constantly, so replaceAll refuses the ambiguous case. The replacement itself can be a string, using $<name> for named groups or $1 for indexed, or a function (match, ...groups) => string when you need logic.

const numbered = 'INV-2026-0001 / INV-2026-0042'.replaceAll(
  /INV-(?<year>\d{4})-(?<num>\d{4})/g,
  (_full, _year, num) => `#${num}`,
);
// → '#0001 / #0042'

When the pattern has named groups, the function form also receives them as a trailing object parameter, after the match offset and the full input string, but the positional (match, ...groups) form is the one you’ll see most. Use the function form when the replacement depends on the captured values.
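Both replacement styles by name, as a short sketch. The output formats are invented for the example, and the cast on the groups object is an assumption the type system cannot verify.

```typescript
const INVOICE_RE = /INV-(?<year>\d{4})-(?<num>\d{4})/g;
const input = 'INV-2026-0001 / INV-2026-0042';

// String replacement: $<name> inserts a named group.
const swapped = input.replaceAll(INVOICE_RE, '$<num>/$<year>');
// swapped: '0001/2026 / 0042/2026'

// Function replacement: the named-groups object is the last argument.
const labeled = input.replaceAll(INVOICE_RE, (...args) => {
  const groups = args[args.length - 1] as { year: string; num: string };
  return `#${groups.num} (${groups.year})`;
});
// labeled: '#0001 (2026) / #0042 (2026)'
```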

Now try the inconsistency for yourself. The bug below is a common one to catch in code review.

Both regexes match the same shape, but the methods do different things. Read carefully, and predict what this program prints.

const text = 'order INV-2026-0001 and INV-2026-0042';
const re = /INV-(?<year>\d{4})-(?<num>\d{4})/g;

const result = text.match(re);

console.log(result?.[0]);
console.log(result?.[0]?.groups);

Lookarounds

A lookaround is an assertion about what sits around the current position, and it consumes no characters. There are four forms: two that look forward and two that look back, each in a positive and a negative version. You reach for one when the surrounding context is part of the match condition but shouldn’t be part of the captured text.

// (?=...)   positive lookahead — assert what follows
// (?!...)   negative lookahead
// (?<=...)  positive lookbehind — assert what precedes
// (?<!...)  negative lookbehind

// "Numbers immediately followed by px, without capturing the px"
const pxValue = /\d+(?=px)/g;
'padding: 16px 24em 32px'.match(pxValue);
// → ['16', '32']

All four forms are available everywhere in 2026, so the only question is whether they earn their place, and most of the time they don’t. Capturing the surrounding context and slicing it off in code usually reads more clearly than a lookaround. Reach for one when the distinction between asserting and capturing is genuinely the point.
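For comparison, here is the same extraction written both ways: once with a lookbehind, once by matching the context and reading a named group. The input string is made up for the example.

```typescript
const text = 'total: $42';

// Lookbehind: assert the $ without including it in the match.
const viaLookbehind = text.match(/(?<=\$)\d+/)?.[0];
// viaLookbehind: '42'

// Capture: match the $ as well, then read only the named group.
const viaCapture = text.match(/\$(?<amount>\d+)/)?.groups?.amount;
// viaCapture: '42'
```

Both produce the same value; the capture version says in its own syntax which part you care about.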

When to drop the regex

Knowing when not to write a regex is the harder judgment, and it’s where a lot of senior value in this area lives. The lesson started with two bugs, and one of them was an entire regex that should never have existed, because Zod’s z.email() does the job. Two situations call for that restraint.

Situation 1: the input is a structured format. Email, URL, JSON, HTML, CSV, Markdown, and ISO dates all have a real specification, and all have a parser one import away. The table below names the parser to reach for in each case.

| Format | Parser (senior reach) |
| --- | --- |
| Email | z.email(): Zod 4 top-level format builder, lands in the forms unit |
| URL | new URL(input) (throws on invalid) or URL.canParse(input) for a boolean |
| JSON | JSON.parse(input): covered in the JSON chapter |
| HTML | DOMParser in the browser, a real HTML parser on the server |
| CSV | a CSV library, never regex |
| Markdown | a Markdown parser |
| ISO date | Temporal.PlainDate.from(input): covered later, in the time chapter |

Those forward links are gentle: you don’t need any of these parsers yet. What you need now is the habit of pausing when you see a structured format, and recognizing that a regex is the wrong tool before you start typing the pattern.

Situation 2: the regex is becoming unreadable. A rough threshold is the point where a reviewer can’t tell what the pattern matches in a single read. Past that point, a small parser is the better choice: a few .indexOf and .slice calls, or a real tokenizer for anything that warrants one. Regex stays useful for short patterns over bounded, unstructured text.
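As a sense of scale for "a few .indexOf and .slice calls", here is a minimal sketch of a hand-written parser. The key = value format, the Setting shape, and the parseSetting name are all hypothetical.

```typescript
// Hypothetical result shape for the example.
interface Setting {
  key: string;
  value: string;
}

// Split at the first "=", so the value may itself contain "=".
const parseSetting = (line: string): Setting | null => {
  const separator = line.indexOf('=');
  if (separator === -1) return null;

  const key = line.slice(0, separator).trim();
  const value = line.slice(separator + 1).trim();
  return key === '' ? null : { key, value };
};

const simple = parseSetting('name = Ada');
// simple: { key: 'name', value: 'Ada' }
const nested = parseSetting('url=https://x.test/?a=b');
// nested: { key: 'url', value: 'https://x.test/?a=b' }
const missing = parseSetting('no separator here');
// missing: null
```

Each step has a name and can be read top to bottom, which is the property the unreadable regex has lost.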

If the text passes both checks, meaning it’s small, unstructured, and bounded, then write modern regex: literal form, the u flag (or v if you need set operations or emoji), named groups, \p{Letter} over [a-zA-Z], and .matchAll over .match with g. That’s the whole decision.

Now apply the rule. The PR below is a validateContactInput function with two hand-rolled regexes. Both use a regex where a parser belongs, and both have subtle bugs that make the case for the parser. Read the file and leave a comment on each.

Review this PR for a teammate. The function is supposed to accept either an email or a URL and tell the caller which. There are two regex-versus-parser bugs to flag.

src/validate-contact.ts

type ContactKind = 'email' | 'url' | null;

export const validateContactInput = (input: string): ContactKind => {
  const emailRe = /^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$/;
  if (emailRe.test(input)) {
    return 'email';
  }

  const urlRe = /^https?:\/\/.+\..+$/;
  if (urlRe.test(input)) {
    return 'url';
  }

  return null;
};

Hand-rolled email regex is a perennial source of false rejections (this one drops IDN domains, quoted local-parts, and any non-ASCII local-part) and false accepts (a..b@example.com and user@-.com both pass). Zod 4’s z.email() is the senior reach: it carries a maintained validator, a localized error message, and a JSON Schema shape, all at the cost of one import:

import { z } from 'zod';

const emailSchema = z.email();
if (emailSchema.safeParse(input).success) {
  return 'email';
}

The course will cover Zod schemas in depth in the forms unit; for now, the takeaway is parser, not regex, for any input that has a specification.

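URL parsing has the same shape: there’s a real spec, a real parser is built into the runtime, and the hand-rolled regex disagrees with that parser in both directions (http://exa mple.com passes this one even though the parser rejects the space in the host, and http://localhost:3000 fails because it contains no dot). The senior reach is URL.canParse(input) for the boolean form: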

if (URL.canParse(input)) {
  return 'url';
}

URL.canParse is the static boolean cousin of new URL(input) and is universally available in 2026. The try { new URL(input); } catch { ... } shape works too, but URL.canParse is the cleaner read. The URL constructor lands properly in the HTTP chapter.
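One design point to weigh before adopting this fix: the URL parser accepts any valid scheme, so URL.canParse('mailto:ada@example.com') is also true. If the field should only hold web links, a protocol check is needed on top. The sketch below is one way to do that; isHttpUrl is a hypothetical helper name, and it uses the try/catch form because it needs the parsed protocol anyway.

```typescript
// The PR's original pattern, kept for comparison.
const urlRe = /^https?:\/\/.+\..+$/;

// Hypothetical helper: parse with the built-in parser, then restrict the scheme.
const isHttpUrl = (input: string): boolean => {
  try {
    const { protocol } = new URL(input);
    return protocol === 'http:' || protocol === 'https:';
  } catch {
    return false; // new URL throws on input the parser rejects
  }
};

const inputs = ['http://localhost:3000', 'http://exa mple.com', 'mailto:ada@example.com'];

const byRegex = inputs.map((input) => urlRe.test(input));
// byRegex: [false, true, false]
const byParser = inputs.map(isHttpUrl);
// byParser: [true, false, false]
```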

External resources

MDN — Regular expression syntax

developer.mozilla.org

Canonical reference for the full regex surface — every flag, every character class, every quantifier, with runnable examples.

MDN — Unicode property escapes

developer.mozilla.org

The full \p{...} property name catalog — Letter, Number, White_Space, Emoji, and every Script= filter.

V8 blog — RegExp v flag

v8.dev

The canonical landing post for the v flag — set operations inside character classes, properties-of-strings, and why u and v are mutually exclusive.

Zod — String formats

zod.dev

The top-level format builders — z.email, z.url, z.iso.datetime — as the parser replacement for regex on structured formats.

regex101.com

Interactive regex playground. Useful as a sanity-check workflow, and as a signal: if you need the tool to read your own regex, the unreadable threshold has been crossed.