Unicode property escapes
The classes \w, \d and \s are not enough for Italian, French, Greek
or emoji text: \w and \d cover only ASCII. Modern JavaScript, with the
u (Unicode) flag, offers property escapes \p{...}: semantic classes
based on the Unicode properties of characters.
Pattern: \p{L}+ (with flag u)
Sample: Ciao caffè über 世界
Matches: Ciao, caffè, über, 世界
\p{L} = "any Letter (Unicode)": it includes accented letters, Chinese
ideograms, Cyrillic, Greek… letters of every script. The most common properties:
\p{L} -- letter (of any alphabet).
\p{N} -- number (Arabic digits, Roman numerals, Indic digits…).
\p{P} -- punctuation.
\p{S} -- symbol (mathematical, currency, emoji…).
\p{Z} -- space/separator.
\p{Script=Latin} -- specifically the Latin alphabet.
\p{Script=Greek} -- the Greek alphabet. And so on.
And the negated versions \P{L}, \P{N}, …
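The properties above can be tried directly in a console. The following is a minimal sketch, assuming the lesson's sample sentence plus a second, made-up string (mixed) that combines Latin, Greek and Cyrillic spellings of the same word:

```javascript
// Sample text from the lesson: Italian, German and Chinese words.
const sample = "Ciao caffè über 世界";

// \p{L}+ with the u flag: one or more letters from any script.
// The g flag collects every match instead of only the first one.
const words = sample.match(/\p{L}+/gu);
console.log(words); // [ 'Ciao', 'caffè', 'über', '世界' ]

// \P{L} (capital P) is the negation: any character that is NOT a letter.
const nonLetters = sample.match(/\P{L}+/gu);
console.log(nonLetters); // [ ' ', ' ', ' ' ]

// Script properties narrow the match to a single writing system.
// "mixed" is an illustrative string, not part of the lesson text.
const mixed = "alpha αλφα альфа";
const greek = mixed.match(/\p{Script=Greek}+/gu);
const latin = mixed.match(/\p{Script=Latin}+/gu);
console.log(greek); // [ 'αλφα' ]
console.log(latin); // [ 'alpha' ]
```

Here String.prototype.match with the g flag returns plain strings; matchAll would be the choice if match positions were also needed.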
Difference from \w and \d
\w matches [A-Za-z0-9_] -- ASCII only, so it stops before the è in "caffè"
[\p{L}\p{N}_] with flag u -- includes accented characters
For a robust parser of Italian text, prefer \p{L} over \w:
città, perché, andrò then match correctly as whole words.
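To see the difference in practice, here is a small sketch that runs both patterns on the same Italian words. The variable names are illustrative, and the accented letters are assumed to be precomposed (a single code point each):

```javascript
const text = "città perché andrò";

// \w is ASCII-only, so every accented letter cuts the match short.
const asciiWords = text.match(/\w+/g);
console.log(asciiWords); // [ 'citt', 'perch', 'andr' ]

// A Unicode-aware counterpart of \w: letters, numbers and underscore.
const unicodeWords = text.match(/[\p{L}\p{N}_]+/gu);
console.log(unicodeWords); // [ 'città', 'perché', 'andrò' ]
```

If the input may contain decomposed accents (a base letter followed by a combining mark), adding \p{M} to the class or normalizing the text first with text.normalize("NFC") is one possible way to keep words whole.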
Unicode properties and browser compatibility
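Unicode properties like \p{L} (letters) or \p{Script=Latin} extend character classes to international alphabets. In JavaScript they work only with the u (or v) flag. Without the flag the engine does not throw: \p is read as an escaped literal p, so \p{L} silently matches the text "p{L}" instead of a letter. A syntax error is raised only in Unicode mode, for example when the property name is not valid. Property escapes were standardized in ECMAScript 2018 and are available in current versions of the major browsers; the v flag is more recent (ECMAScript 2024).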
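The effect of the flag is easy to check in a console. The sketch below compares the same pattern with and without u; the property name NotAProperty is deliberately invalid and is used only to provoke the error:

```javascript
// Without the u flag, \p is just an escaped "p": the pattern matches
// the literal characters p{L} and never matches a real letter.
const noFlag = /\p{L}/;
console.log(noFlag.test("è"));    // false
console.log(noFlag.test("p{L}")); // true

// With the u flag, the same source is a property escape.
const withFlag = /\p{L}/u;
console.log(withFlag.test("è"));  // true

// In Unicode mode an unknown property name is rejected when the
// regex is compiled. "NotAProperty" is a deliberately invalid name.
let compileError = null;
try {
  new RegExp("\\p{NotAProperty}", "u");
} catch (err) {
  compileError = err;
}
console.log(compileError instanceof SyntaxError); // true
```

Because a missing flag produces no error, a pattern that unexpectedly matches nothing is worth checking for a forgotten u.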
Try it
Find every word, including those with accents (città, perché, è…). Use the property escape \p{L} with flag u.
Hint: replace \w+ with \p{L}+ and add the u flag (in addition to g).
Review exercise
Find every Unicode symbol (currencies, math, emoji) in the text, excluding letters and digits.
Hint: \p{S} matches the Symbol category of Unicode. Remember the u flag.
Additional challenge
Find all words consisting of Cyrillic alphabet letters using \p{Script=Cyrillic}.
Hint: use \p{Script=Cyrillic} with the + quantifier and the u flag.