sia.hackernoon.com

Emojis have become more than just a tool for expression; they are a language of their own. However, their implementation in software development brings unique challenges, particularly in accurately measuring their length. This article delves into the complexities of emoji lengths, using JavaScript to provide practical examples and solutions for navigating these challenges.

The Curious Case of Emoji Lengths

Emoji handling varies between programming languages. JavaScript strings are sequences of UTF-16 code units, and the .length property counts those code units rather than the characters we see on screen. A single emoji may occupy two code units (a surrogate pair) or consist of several code points joined into one glyph, so its length can be well above 1 or 2. This section unveils some of the most intriguing examples of emoji lengths and explains the technical reasons behind them, focusing on JavaScript for specific insights.

Surprising Examples and Their Actual Lengths

Single Emojis: The Heart Emoji (❤️)

At first glance, the heart emoji appears to be a single character. However, when we inspect its length in JavaScript, we find a surprising result:
```
console.log('❤️'.length); // Outputs: 2
```
This discrepancy arises because the emoji consists of two code points, each stored as a single UTF-16 code unit: the base character U+2764 (HEAVY BLACK HEART) and the variation selector U+FE0F, which requests the colourful emoji presentation instead of the plain text glyph. This detail illustrates the complexity behind what seems like a straightforward emoji.
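
To see the two components for ourselves, we can print the code points of the string. This is a minimal sketch; the helper name toCodePoints is illustrative and not part of any API.

```javascript
const textHeart = '\u2764';        // base character only
const emojiHeart = '\u2764\uFE0F'; // base character + variation selector-16 (❤️)

// Array.from walks the string by code point; format each one as U+XXXX.
const toCodePoints = (str) =>
  Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );

console.log(textHeart.length);         // Outputs: 1
console.log(emojiHeart.length);        // Outputs: 2
console.log(toCodePoints(emojiHeart)); // Output: ['U+2764', 'U+FE0F']
```

Dropping the variation selector brings the length back to 1, although the heart may then render as a plain text symbol depending on the font.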

Skin Tone Modifiers: Thumbs Up Emoji (👍🏽)

The thumbs-up emoji with a skin tone modifier presents an interesting case, too:
```
console.log('👍🏽'.length); // Outputs: 4
```
Here the base thumbs-up emoji (U+1F44D) is followed by a skin tone modifier (one of U+1F3FB to U+1F3FF). Both code points lie outside the Basic Multilingual Plane, so UTF-16 stores each of them as a surrogate pair of two code units: 👍 (2 code units) + 🏽 (2 code units), for a total of 4. Adding a skin tone therefore doubles the length of the base emoji.
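
The surrogate pairs can be made visible by listing the raw UTF-16 code units. The sketch below assumes the modifier is the medium skin tone (U+1F3FD); toSurrogatePair is a hypothetical helper that applies the standard UTF-16 split for code points above U+FFFF.

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const thumbsUp = '\u{1F44D}\u{1F3FD}'; // 👍🏽 (assumed: medium skin tone)

// List every UTF-16 code unit of the string as hex.
const codeUnits = Array.from({ length: thumbsUp.length }, (_, i) =>
  thumbsUp.charCodeAt(i).toString(16)
);

console.log(codeUnits); // Output: ['d83d', 'dc4d', 'd83c', 'dffd']
console.log(toSurrogatePair(0x1f44d).map((u) => u.toString(16))); // Output: ['d83d', 'dc4d']
```

The first two code units belong to the thumbs-up emoji and the last two to the modifier.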

Zero Width Joiner (ZWJ) Sequences: The Family Emoji (👨‍👩‍👦)

The family emoji showcases the complexity of combining multiple emojis:
```
console.log('👨‍👩‍👦'.length); // Outputs: 8
```
The family emoji is a sequence that combines several emojis (👨 man, 👩 woman, 👦 boy) using invisible Zero Width Joiners (ZWJ, U+200D). Each individual emoji is a code point outside the Basic Multilingual Plane, so UTF-16 stores it as a surrogate pair of two code units; no variation selector or skin tone is involved here. The ZWJ, which asks the renderer to merge its neighbours into a single glyph, fits in one code unit. Thus, we have: man emoji (2 code units) + ZWJ (1 code unit) + woman emoji (2 code units) + ZWJ (1 code unit) + boy emoji (2 code units), for a total of 8.
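
The same arithmetic can be reproduced in code. This is a rough sketch; describeCodePoints is an illustrative helper name, and the string is written with escapes so that every component stays visible.

```javascript
// For each code point, report its U+XXXX label and how many code units it uses.
function describeCodePoints(str) {
  return Array.from(str, (ch) => ({
    codePoint: 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
    codeUnits: ch.length,
  }));
}

const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}'; // 👨‍👩‍👦

const parts = describeCodePoints(family);
console.log(parts.map((p) => `${p.codePoint}:${p.codeUnits}`).join(' '));
// Output: U+1F468:2 U+200D:1 U+1F469:2 U+200D:1 U+1F466:2

// The per-code-point counts add up to the value reported by .length.
const total = parts.reduce((sum, p) => sum + p.codeUnits, 0);
console.log(total, family.length); // Outputs: 8 8
```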

Complex Emojis with Multiple Components: The Woman Astronaut Emoji (👩‍🚀)

Consider the woman astronaut emoji for its composition complexity:
```
console.log('👩‍🚀'.length); // Outputs: 5
```
This emoji is built by joining the 👩 woman emoji and the 🚀 rocket emoji with an invisible Zero Width Joiner (ZWJ). As in the family example, 👩 and 🚀 each take two UTF-16 code units, and the ZWJ that merges them into one glyph adds one more. The sequence therefore comprises 👩 (2 code units) + ZWJ (1 code unit) + 🚀 (2 code units), for a total of 5 code units.
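
Because the joiner is an ordinary code point, a ZWJ sequence can be taken apart with plain string methods. A small sketch, with the constant name ZWJ chosen here for illustration:

```javascript
const ZWJ = '\u200D'; // Zero Width Joiner
const astronaut = '\u{1F469}' + ZWJ + '\u{1F680}'; // 👩‍🚀

console.log(astronaut.length); // Outputs: 5

// Splitting on the joiner recovers the two component emojis.
const components = astronaut.split(ZWJ);
console.log(components); // Output: ['👩', '🚀']

// Without the joiner, the string is just two separate emojis of 2 code units each.
console.log(components.join('').length); // Outputs: 4
```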

Flag Emojis: The United States Flag (🇺🇸)

Consider the encoding intricacies of the US Flag emoji (🇺🇸):
```
console.log('🇺🇸'.length); // Outputs: 4
```
Flag emojis are unique in that they are composed of regional indicator symbols. A pair of these symbols, such as 🇺 (U) and 🇸 (S) for the US flag, spells out the country's ISO 3166-1 alpha-2 code. The regional indicators occupy U+1F1E6 to U+1F1FF, outside the Basic Multilingual Plane, so each one is stored as a surrogate pair of two code units even though it stands for a single letter. Thus, the 🇺🇸 sequence comprises 🇺 (2 code units) + 🇸 (2 code units), for a total of 4.
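
Since the regional indicators run in alphabetical order starting at U+1F1E6 for A, a flag can be assembled from its two-letter code. The function below is a hypothetical helper written for this illustration; it only maps letters to symbols and does not check that the code belongs to a real country.

```javascript
const REGIONAL_INDICATOR_A = 0x1f1e6; // regional indicator symbol letter A

// Map a two-letter code such as 'US' to its pair of regional indicator symbols.
function flagFromCountryCode(code) {
  if (!/^[A-Za-z]{2}$/.test(code)) {
    throw new RangeError('Expected a two-letter country code');
  }
  const codePoints = [...code.toUpperCase()].map(
    (letter) => REGIONAL_INDICATOR_A + letter.charCodeAt(0) - 'A'.charCodeAt(0)
  );
  return String.fromCodePoint(...codePoints);
}

const usFlag = flagFromCountryCode('US');
console.log(usFlag === '\u{1F1FA}\u{1F1F8}'); // Outputs: true
console.log(usFlag.length);                   // Outputs: 4
```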

Counting Emojis as One Character: From Intuition to Precision

In an ideal world, the length of an emoji, no matter how complex, would be counted as one character to align with our visual perception. Initially, developers might attempt straightforward methods, quickly discovering the limitations and complexities of accurately measuring emoji lengths. Let's explore some methods together to find a solution that helps us better understand the problem intuitively.

Using .length property

Initially, one might think the .length property of a string could offer a straightforward count of emojis. As we've seen with our examples, though, this method falls short: .length counts UTF-16 code units rather than visible characters, so any emoji built from surrogate pairs, modifiers, or joiners reports a value greater than 1.

Using spread operator

Attempting to count emojis using the spread operator [...string] offers an insightful perspective:
```
console.log([... '👩‍🚀']); // Output: ['👩', '‍', '🚀']
console.log([... '👩‍🚀'].length); // Outputs: 3
```
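Interestingly, the result is 3, which at first glance might seem unexpected but is actually closer to our visual interpretation than the initial 5 obtained using the .length property. Spreading a string uses the string iterator, which walks the string by code points rather than UTF-16 code units, so each surrogate pair stays together. The operation therefore yields the woman emoji (👩), the Zero Width Joiner (ZWJ), and the rocket emoji (🚀) as three separate elements, which is still not the single character we see.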
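
To get a feel for how far the spread operator takes us, here is a small comparison across the emojis from the earlier examples. It is only a sketch: the sample names and the countCodePoints helper are made up for illustration, and the thumbs-up sample assumes the medium skin tone modifier.

```javascript
const samples = {
  heart: '\u2764\uFE0F',                             // ❤️
  thumbsUp: '\u{1F44D}\u{1F3FD}',                    // 👍🏽
  astronaut: '\u{1F469}\u200D\u{1F680}',             // 👩‍🚀
  family: '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}', // 👨‍👩‍👦
  flag: '\u{1F1FA}\u{1F1F8}',                        // 🇺🇸
};

// Spreading iterates by code point, so surrogate pairs count once.
const countCodePoints = (str) => [...str].length;

const report = Object.entries(samples).map(([name, emoji]) => ({
  name,
  codeUnits: emoji.length,
  codePoints: countCodePoints(emoji),
}));

for (const row of report) {
  console.log(`${row.name}: ${row.codeUnits} code units, ${row.codePoints} code points`);
}
// heart: 2 code units, 2 code points
// thumbsUp: 4 code units, 2 code points
// astronaut: 5 code units, 3 code points
// family: 8 code units, 5 code points
// flag: 4 code units, 2 code points
```

None of the samples comes out as 1, so counting code points alone does not match what the reader sees.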

Using RegExp

Regular Expressions (RegExp) offer a focused way to identify emojis using Unicode properties:
```
const emojiPattern = /[\p{Emoji_Presentation}]/gu;
const matches = '👩‍🚀'.match(emojiPattern);
console.log(matches); // Output: ['👩', '🚀']
console.log(matches.length); // Outputs: 2
```
Applying this RegExp to 👩‍🚀 splits it into its basic emojis ['👩', '🚀'], giving a count of 2. We utilise \p{Emoji_Presentation} because it targets only code points that are displayed as emojis by default, whereas \p{Emoji} also matches regular digits like "1". As we can see, this pattern skips the Zero Width Joiner (ZWJ), but it still counts every component of a composite emoji separately, so it is not suitable for counting complex emojis as single characters.
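
The difference between the two Unicode properties is easy to check. The sketch below also shows a limitation worth knowing: the heart's base character U+2764 has text presentation by default, so \p{Emoji_Presentation} does not match ❤️ at all. The variable names are illustrative.

```javascript
const broad = /\p{Emoji}/u;               // any code point with the Emoji property
const strict = /\p{Emoji_Presentation}/u; // only code points shown as emoji by default

console.log(broad.test('1'));  // Outputs: true
console.log(strict.test('1')); // Outputs: false

const heart = '\u2764\uFE0F'; // ❤️
// match() returns null when nothing matches, so fall back to an empty array.
const strictMatches = heart.match(/\p{Emoji_Presentation}/gu) ?? [];
console.log(strictMatches.length); // Outputs: 0
```

So a pattern based on \p{Emoji_Presentation} can both overcount (composite emojis) and undercount (emojis that rely on a variation selector).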

Using Intl.Segmenter

The Intl.Segmenter API provides a sophisticated mechanism for accurately counting emojis by treating them as whole units, regardless of their complexity:
```
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const emojiString = '👩‍🚀';
const segments = Array.from(segmenter.segment(emojiString));
console.log(segments.map(segment => segment.segment)); // Output: ['👩‍🚀']
console.log(segments.length); // Outputs: 1
```
This approach leverages the concept of grapheme clusters, which are sequences of one or more code points that are displayed as a single, unified character to the user. By using Intl.Segmenter with the granularity option set to 'grapheme', it correctly identifies and counts the woman astronaut emoji (👩‍🚀) as one unit, aligning perfectly with our visual interpretation.
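
Wrapped in small helpers, the same idea can be reused across a project. The following is a minimal sketch, assuming a runtime that ships Intl.Segmenter; countGraphemes and takeGraphemes are hypothetical names, and the thumbs-up sample assumes the medium skin tone modifier.

```javascript
// One segmenter instance can be reused for every call.
const graphemeSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// Count user-perceived characters (grapheme clusters).
function countGraphemes(str) {
  let count = 0;
  for (const _ of graphemeSegmenter.segment(str)) count += 1;
  return count;
}

// Keep at most `max` grapheme clusters without cutting an emoji in half.
function takeGraphemes(str, max) {
  let result = '';
  let taken = 0;
  for (const { segment } of graphemeSegmenter.segment(str)) {
    if (taken === max) break;
    result += segment;
    taken += 1;
  }
  return result;
}

const emojis = [
  '\u2764\uFE0F',                            // ❤️
  '\u{1F44D}\u{1F3FD}',                      // 👍🏽
  '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}', // 👨‍👩‍👦
  '\u{1F469}\u200D\u{1F680}',                // 👩‍🚀
  '\u{1F1FA}\u{1F1F8}',                      // 🇺🇸
];
console.log(emojis.map((emoji) => countGraphemes(emoji))); // Output: [1, 1, 1, 1, 1]

const message = 'Hi \u{1F469}\u200D\u{1F680}!'; // 'Hi 👩‍🚀!'
console.log(message.length);                    // Outputs: 9
console.log(countGraphemes(message));           // Outputs: 5
console.log(takeGraphemes(message, 4));         // Output: 'Hi 👩‍🚀'
```

Truncating with takeGraphemes keeps the astronaut intact, whereas message.slice(0, 4) would cut through the first surrogate pair and leave a broken character at the end.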

Conclusion

The task of accurately counting emoji lengths in JavaScript reveals the nuances of digital communication with Unicode. Through the examination of various methods, from the simple .length property to the comprehensive Intl.Segmenter, we highlight the importance of understanding Unicode encoding. This journey into the encoding and counting of emojis not only reveals challenges specific to JavaScript but also illuminates general aspects of working with text in digital environments.

I hope this exploration has clarified the complexities behind something as seemingly simple as emojis and provided you with practical methods to apply in your projects. May the insights shared here enhance your development work and inspire you to delve deeper into the fascinating interplay between technology and language!

Beyond the Smile: Decoding Emoji Lengths in JavaScript
