$strLenBytes (aggregation)
Definition
New in version 3.4.
Returns the number of UTF-8 encoded bytes in the specified string.
$strLenBytes has the following operator expression syntax:
- { $strLenBytes: <string expression> }
The argument can be any valid expression as long as it resolves to a string. For more information on expressions, see Expressions.
If the argument resolves to a value of null or refers to a missing field, $strLenBytes returns an error.
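One possible way to avoid that error, offered as a sketch rather than as something this page prescribes, is to wrap the field in $ifNull so that null or missing values fall back to an empty string. The variable name safeLengthPipeline, the field name, and the collection name in the comment are illustrative, and treating a missing value as length 0 is an assumption that may not suit every data set.

```javascript
// Sketch: guard $strLenBytes against null or missing fields.
// Assumption: treating null/missing as an empty string (length 0) is acceptable.
const safeLengthPipeline = [
  {
    $project: {
      name: 1,
      // $ifNull substitutes "" when "$name" is null or the field is missing,
      // so $strLenBytes always receives a string in those two cases.
      length: { $strLenBytes: { $ifNull: ["$name", ""] } }
    }
  }
];

// In the mongo shell (illustrative collection name):
// db.food.aggregate(safeLengthPipeline)
```

This guard covers only null and missing values. A field holding a non-string value, such as a number, still does not resolve to a string.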
Behavior
The $strLenBytes operator counts the number of UTF-8 encoded bytes in a string, where each character may use between one and four bytes.
For example, US-ASCII characters are encoded using one byte. Characters with diacritic markings and additional Latin alphabetical characters (i.e. Latin characters outside of the English alphabet) are encoded using two bytes. Chinese, Japanese and Korean characters typically require three bytes, and other planes of Unicode (emoji, mathematical symbols, etc.) require four bytes.
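The byte widths described above can be checked on the client with a short sketch. It assumes a runtime that provides the standard TextEncoder API (modern browsers and Node.js do); the helper name utf8ByteLength is hypothetical, and the sketch illustrates UTF-8 encoding in general rather than how the server implements $strLenBytes.

```javascript
// Sketch: count UTF-8 bytes client-side with the standard TextEncoder API.
// Assumption: the runtime provides a global TextEncoder.
function utf8ByteLength(str) {
  // TextEncoder always encodes to UTF-8; the resulting Uint8Array
  // has one element per encoded byte.
  return new TextEncoder().encode(str).length;
}

// One sample character per width class, written as escapes to avoid
// any ambiguity about how this file itself is encoded.
const samples = [
  "a",          // US-ASCII letter
  "\u00E9",     // é, Latin letter with a diacritic
  "\u5BFF",     // 寿, a CJK character
  "\u{1F600}"   // 😀, an emoji outside the Basic Multilingual Plane
];

const widths = samples.map(utf8ByteLength);
console.log(widths); // 1, 2, 3 and 4 bytes respectively
```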
The $strLenBytes operator differs from the $strLenCP operator, which counts the code points in the specified string regardless of how many bytes each character uses.
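The difference between counting code points and counting bytes can be illustrated outside the database. The sketch below is a client-side analogy, not the operators' implementation; the helper names codePointLength and utf8ByteLength are hypothetical, and a global TextEncoder is assumed.

```javascript
// Sketch: compare code-point count with UTF-8 byte count for one string.
function codePointLength(str) {
  // The string iterator yields one element per Unicode code point,
  // so surrogate pairs are counted once.
  return Array.from(str).length;
}

function utf8ByteLength(str) {
  // Assumption: a global TextEncoder is available (browsers, Node.js).
  return new TextEncoder().encode(str).length;
}

const word = "jalape\u00F1o"; // "jalapeño" with a precomposed ñ (U+00F1)
console.log(codePointLength(word)); // 8
console.log(utf8ByteLength(word));  // 9
```

The two counts differ by one because ñ is a single code point that takes two bytes in UTF-8.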
| Example | Results | Notes |
|---|---|---|
| | 5 | Each character is encoded using one byte. |
| | 12 | Each character is encoded using one byte. |
| | 9 | Each character is encoded using one byte. |
| | 11 | é is encoded using two bytes. |
| | 0 | Empty strings return 0. |
| | 7 | € is encoded using three bytes. λ is encoded using two bytes. |
| | 6 | Each character is encoded using three bytes. |
Example
Single-Byte and Multibyte Character Set
A collection named food contains the following documents:
- { "_id" : 1, "name" : "apple" }
- { "_id" : 2, "name" : "banana" }
- { "_id" : 3, "name" : "éclair" }
- { "_id" : 4, "name" : "hamburger" }
- { "_id" : 5, "name" : "jalapeño" }
- { "_id" : 6, "name" : "pizza" }
- { "_id" : 7, "name" : "tacos" }
- { "_id" : 8, "name" : "寿司" }
The following operation uses the $strLenBytes operator to calculate the length of each name value:
- db.food.aggregate(
- [
- {
- $project: {
- "name": 1,
- "length": { $strLenBytes: "$name" }
- }
- }
- ]
- )
The operation returns the following results:
- { "_id" : 1, "name" : "apple", "length" : 5 }
- { "_id" : 2, "name" : "banana", "length" : 6 }
- { "_id" : 3, "name" : "éclair", "length" : 7 }
- { "_id" : 4, "name" : "hamburger", "length" : 9 }
- { "_id" : 5, "name" : "jalapeño", "length" : 9 }
- { "_id" : 6, "name" : "pizza", "length" : 5 }
- { "_id" : 7, "name" : "tacos", "length" : 5 }
- { "_id" : 8, "name" : "寿司", "length" : 6 }
The documents with _id: 3 and _id: 5 each contain a diacritic character (é and ñ respectively) that requires two bytes to encode. The document with _id: 8 contains two Japanese characters that are encoded using three bytes each. This makes the length greater than the number of characters in name for the documents with _id: 3, _id: 5, and _id: 8.
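To pick out such documents directly, one option is to compare the byte count with the code point count in the same pipeline. The following is a sketch under assumptions: the output field names bytes, codePoints, and multibyte and the variable name multibytePipeline are hypothetical, and it relies on the $strLenCP, $gt, and $match operators behaving as documented elsewhere.

```javascript
// Sketch: keep only documents whose name contains multibyte characters.
// Field names "bytes", "codePoints", and "multibyte" are illustrative.
const multibytePipeline = [
  {
    $project: {
      name: 1,
      bytes: { $strLenBytes: "$name" },
      codePoints: { $strLenCP: "$name" },
      // true when at least one character needs more than one byte,
      // i.e. the byte count exceeds the code point count
      multibyte: { $gt: [{ $strLenBytes: "$name" }, { $strLenCP: "$name" }] }
    }
  },
  // Discard documents whose names are entirely single-byte characters.
  { $match: { multibyte: true } }
];

// In the mongo shell:
// db.food.aggregate(multibytePipeline)
```

Run against the food collection above, this should leave only the documents with _id: 3, _id: 5, and _id: 8, since those are the ones whose byte length exceeds their character count.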