$substrBytes (aggregation)
Definition
New in version 3.4.
Returns the substring of a string. The substring starts with the character at the specified UTF-8 byte index (zero-based) in the string and continues for the number of bytes specified.
$substrBytes has the following operator expression syntax:
- { $substrBytes: [ <string expression>, <byte index>, <byte count> ] }
Field | Type | Description |
---|---|---|
string expression | string | The string from which the substring will be extracted. string expression can be any valid expression as long as it resolves to a string. For more information on expressions, see Expressions. If the argument resolves to a value of null or refers to a field that is missing, $substrBytes returns an empty string. If the argument does not resolve to a string or null, nor refers to a missing field, $substrBytes returns an error. |
byte index | number | Indicates the starting point of the substring. byte index can be any valid expression as long as it resolves to a non-negative integer or a number that can be represented as an integer (such as 2.0). byte index cannot refer to a starting index located in the middle of a multi-byte UTF-8 character. |
byte count | number | Can be any valid expression as long as it resolves to a non-negative integer or a number that can be represented as an integer (such as 2.0). byte count cannot result in an ending index that is in the middle of a UTF-8 character. |
Behavior
The $substrBytes operator uses the indexes of UTF-8 encoded bytes, where each code point, or character, may use between one and four bytes to encode.
For example, US-ASCII characters are encoded using one byte. Characters with diacritic markings and additional Latin alphabetical characters (i.e. Latin characters outside of the English alphabet) are encoded using two bytes. Chinese, Japanese and Korean characters typically require three bytes, and other planes of Unicode (emoji, mathematical symbols, etc.) require four bytes.
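The byte widths described above can be checked with a small sketch. It is plain JavaScript (runnable in Node.js or mongosh) and does not call MongoDB; the helper name `utf8Length` and the sample characters are illustrative assumptions, not part of the operator.

```javascript
// Hypothetical helper: number of UTF-8 bytes needed to encode a string.
const utf8Length = (s) => new TextEncoder().encode(s).length;

// One sample per width: US-ASCII, Latin with diacritic, CJK, emoji.
const samples = ["a", "é", "寿", "😀"];
const widths = samples.map(utf8Length);

console.log(widths); // [ 1, 2, 3, 4 ]
```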
It is important to be mindful of the content in the string expression, because providing a byte index or byte count located in the middle of a UTF-8 character will result in an error.
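To make the boundary rule concrete, here is a minimal sketch that imitates the documented behavior in plain JavaScript. It is not MongoDB's implementation: the function name `substrBytesSketch` is hypothetical, the error types are an assumption, and the sample string is chosen only for illustration. The error texts are the ones quoted on this page.

```javascript
// Hypothetical emulation of the documented rules; not MongoDB's own code.
function substrBytesSketch(str, byteIndex, byteCount) {
  // A null value or a missing field yields an empty string.
  if (str === null || str === undefined) return "";
  if (typeof str !== "string") throw new TypeError("expected a string");

  const bytes = new TextEncoder().encode(str);
  // A UTF-8 continuation byte has the bit pattern 10xxxxxx.
  const isContinuation = (i) => i < bytes.length && (bytes[i] & 0xc0) === 0x80;

  if (isContinuation(byteIndex)) {
    throw new RangeError("Invalid range, starting index is a UTF-8 continuation byte.");
  }
  const end = Math.min(byteIndex + byteCount, bytes.length);
  if (isContinuation(end)) {
    throw new RangeError("Invalid range, ending index is in the middle of a UTF-8 character.");
  }
  return new TextDecoder().decode(bytes.slice(byteIndex, end));
}

// "é" occupies bytes 3-4 and 6-7 of "cafétéria".
console.log(substrBytesSketch("cafétéria", 0, 5)); // café
try {
  substrBytesSketch("cafétéria", 7, 3); // byte 7 is the second byte of "é"
} catch (e) {
  console.log(e.message);
}
```

Checking only the first byte of the range and the byte just past its end is enough here, because a well-formed UTF-8 string can only be split incorrectly at those two positions.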
$substrBytes differs from $substrCP in that $substrBytes counts the bytes of each character, whereas $substrCP counts the code points, or characters, regardless of how many bytes a character uses.
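The difference between counting bytes and counting code points can be illustrated outside the database. The following sketch is plain JavaScript and only approximates the two operators; the variable names are assumptions made for this example.

```javascript
const name = "éclair"; // "é" takes two UTF-8 bytes, the other letters one each

// Byte-based slice, similar in spirit to $substrBytes with index 0, count 3.
const bytes = new TextEncoder().encode(name);
const firstThreeBytes = new TextDecoder().decode(bytes.slice(0, 3));

// Code-point-based slice, similar in spirit to $substrCP with index 0, count 3.
// Array.from iterates a string by code point.
const firstThreeCodePoints = Array.from(name).slice(0, 3).join("");

console.log(firstThreeBytes);      // éc
console.log(firstThreeCodePoints); // écl
```

The same count of 3 selects two characters when counting bytes and three characters when counting code points.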
Example | Results |
---|---|
| Errors with message: "Error: Invalid range, starting index is a UTF-8 continuation byte." |
| Errors with message: "Error: Invalid range, ending index is in the middle of a UTF-8 character." |
Example
Single-Byte Character Set
Consider an inventory collection with the following documents:
- { "_id" : 1, "item" : "ABC1", quarter: "13Q1", "description" : "product 1" }
- { "_id" : 2, "item" : "ABC2", quarter: "13Q4", "description" : "product 2" }
- { "_id" : 3, "item" : "XYZ1", quarter: "14Q2", "description" : null }
The following operation uses the $substrBytes operator to separate the quarter value (containing only single-byte US-ASCII characters) into a yearSubstring and a quarterSubstring. The quarterSubstring field represents the rest of the string from the specified byte index following the yearSubstring. Its byte count is calculated by subtracting the byte index from the length of the string, which is obtained using $strLenBytes.
- db.inventory.aggregate(
- [
- {
- $project: {
- item: 1,
- yearSubstring: { $substrBytes: [ "$quarter", 0, 2 ] },
- quarterSubstring: {
- $substrBytes: [
- "$quarter", 2, { $subtract: [ { $strLenBytes: "$quarter" }, 2 ] }
- ]
- }
- }
- }
- ]
- )
The operation returns the following results:
- { "_id" : 1, "item" : "ABC1", "yearSubstring" : "13", "quarterSubstring" : "Q1" }
- { "_id" : 2, "item" : "ABC2", "yearSubstring" : "13", "quarterSubstring" : "Q4" }
- { "_id" : 3, "item" : "XYZ1", "yearSubstring" : "14", "quarterSubstring" : "Q2" }
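The byte count arithmetic used for the second field can be traced by hand. The sketch below repeats it in plain JavaScript for a single quarter value; it does not query MongoDB, and the variable names are assumptions for illustration.

```javascript
const quarter = "13Q1";
const bytes = new TextEncoder().encode(quarter);
const decoder = new TextDecoder();

const yearBytes = 2; // byte index where the second field starts
// Mirrors { $subtract: [ { $strLenBytes: "$quarter" }, 2 ] }: 4 - 2 = 2.
const remainingBytes = bytes.length - yearBytes;

const yearPart = decoder.decode(bytes.slice(0, yearBytes));
const quarterPart = decoder.decode(bytes.slice(yearBytes, yearBytes + remainingBytes));

console.log(yearPart, quarterPart); // 13 Q1
```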
Single-Byte and Multibyte Character Set
A collection named food contains the following documents:
- { "_id" : 1, "name" : "apple" }
- { "_id" : 2, "name" : "banana" }
- { "_id" : 3, "name" : "éclair" }
- { "_id" : 4, "name" : "hamburger" }
- { "_id" : 5, "name" : "jalapeño" }
- { "_id" : 6, "name" : "pizza" }
- { "_id" : 7, "name" : "tacos" }
- { "_id" : 8, "name" : "寿司sushi" }
The following operation uses the $substrBytes operator to create a three-byte menuCode from the name value:
- db.food.aggregate(
- [
- {
- $project: {
- "name": 1,
- "menuCode": { $substrBytes: [ "$name", 0, 3 ] }
- }
- }
- ]
- )
The operation returns the following results:
- { "_id" : 1, "name" : "apple", "menuCode" : "app" }
- { "_id" : 2, "name" : "banana", "menuCode" : "ban" }
- { "_id" : 3, "name" : "éclair", "menuCode" : "éc" }
- { "_id" : 4, "name" : "hamburger", "menuCode" : "ham" }
- { "_id" : 5, "name" : "jalapeño", "menuCode" : "jal" }
- { "_id" : 6, "name" : "pizza", "menuCode" : "piz" }
- { "_id" : 7, "name" : "tacos", "menuCode" : "tac" }
- { "_id" : 8, "name" : "寿司sushi", "menuCode" : "寿" }