Unicode data
Django supports Unicode data everywhere.
This document tells you what you need to know if you’re writing applications that use data or templates that are encoded in something other than ASCII.
Creating the database
Make sure your database is configured to be able to store arbitrary string data. Normally, this means giving it an encoding of UTF-8 or UTF-16. If you use a more restrictive encoding – for example, latin1 (iso8859-1) – you won’t be able to store certain characters in the database, and information will be lost.
- MySQL users, refer to the MySQL manual for details on how to set or alter the database character set encoding.
- PostgreSQL users, refer to the PostgreSQL manual (section 22.3.2 in PostgreSQL 9) for details on creating databases with the correct encoding.
- Oracle users, refer to the Oracle manual for details on how to set (section 2) or alter (section 11) the database character set encoding.
- SQLite users, there is nothing you need to do. SQLite always uses UTF-8 for internal encoding.
All of Django’s database backends automatically convert strings into the appropriate encoding for talking to the database. They also automatically convert data retrieved from the database into Python strings. You don’t even need to tell Django what encoding your database uses: that is handled transparently.
For more, see the section “The database API” below.
General string handling
Whenever you use strings with Django – e.g., in database lookups, template rendering or anywhere else – you have two choices for encoding those strings. You can use normal strings or bytestrings (starting with a ‘b’).
Warning
A bytestring does not carry any information with it about its encoding. For that reason, we have to make an assumption, and Django assumes that all bytestrings are in UTF-8.
If you pass a bytestring to Django that has been encoded in some other format, things will go wrong in interesting ways. Usually, Django will raise a UnicodeDecodeError at some point.
If your code only uses ASCII data, it’s safe to use your normal strings, passing them around at will, because ASCII is a subset of UTF-8.
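The following sketch uses only the Python standard library to illustrate why the UTF-8 assumption matters. It shows the underlying decoding behavior rather than any Django internals, and the variable names are purely illustrative.

```python
# ASCII-only bytes decode identically under ASCII and UTF-8.
ascii_bytes = b"hello"
same_under_both = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# The same text encoded as Latin-1 is not valid UTF-8.
latin1_bytes = "Orléans".encode("latin-1")

try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    # This is the kind of error that surfaces when non-UTF-8 bytes
    # are treated as UTF-8.
    decode_failed = True

# Decoding with the encoding that was actually used recovers the text.
recovered = latin1_bytes.decode("latin-1")
```

Decoding such data explicitly at the boundary of your application, with the encoding you know it uses, avoids relying on the UTF-8 assumption at all.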
Don’t be fooled into thinking that if your DEFAULT_CHARSET setting is set to something other than 'utf-8' you can use that other encoding in your bytestrings! DEFAULT_CHARSET only applies to the strings generated as the result of template rendering (and email). Django will always assume UTF-8 encoding for internal bytestrings. The reason for this is that the DEFAULT_CHARSET setting is not actually under your control (if you are the application developer). It’s under the control of the person installing and using your application – and if that person chooses a different setting, your code must still continue to work. Ergo, it cannot rely on that setting.
In most cases when Django is dealing with strings, it will convert them to strings before doing anything else. So, as a general rule, if you pass in a bytestring, be prepared to receive a string back in the result.
Translated strings
Aside from strings and bytestrings, there’s a third type of string-like object you may encounter when using Django. The framework’s internationalization features introduce the concept of a “lazy translation” – a string that has been marked as translated but whose actual translation result isn’t determined until the object is used in a string. This feature is useful in cases where the translation locale is unknown until the string is used, even though the string might have originally been created when the code was first imported.
Normally, you won’t have to worry about lazy translations. Just be aware that if you examine an object and it claims to be a django.utils.functional.__proxy__ object, it is a lazy translation. Calling str() with the lazy translation as the argument will generate a string in the current locale.
For more details about lazy translation objects, refer to the internationalization documentation.
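To make the idea of deferred translation concrete, here is a toy sketch in plain Python. It is not Django's lazy translation machinery: the `LazyText` class, the `CATALOGS` dictionary and the `state` dictionary are all hypothetical names invented for this illustration.

```python
# Hypothetical message catalogs, keyed by locale.
CATALOGS = {
    "en": {"Hello": "Hello"},
    "fr": {"Hello": "Bonjour"},
}

# Hypothetical stand-in for "the currently active locale".
state = {"locale": "en"}


def translate(message):
    """Look up ``message`` in the catalog for the active locale."""
    return CATALOGS[state["locale"]].get(message, message)


class LazyText:
    """Toy lazy string: the lookup runs only when str() is called."""

    def __init__(self, message):
        self._message = message

    def __str__(self):
        return translate(self._message)


# Created at "import time", while the locale is still "en".
greeting = LazyText("Hello")

# The locale changes before the object is actually used as a string.
state["locale"] = "fr"
rendered = str(greeting)
```

Because the lookup happens inside `__str__()`, the result reflects the locale that is active when the object is rendered, not the one that was active when it was created.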
Useful utility functions
Because some string operations come up again and again, Django ships with a few useful functions that should make working with string and bytestring objects a bit easier.
Conversion functions
The django.utils.encoding module contains a few functions that are handy for converting back and forth between strings and bytestrings.
- smart_str(s, encoding='utf-8', strings_only=False, errors='strict') converts its input to a string. The encoding parameter specifies the input encoding. (For example, Django uses this internally when processing form input data, which might not be UTF-8 encoded.) The strings_only parameter, if set to True, will result in Python numbers, booleans and None not being converted to a string (they keep their original types). The errors parameter takes any of the values that are accepted by Python’s str() function for its error handling.
- force_str(s, encoding='utf-8', strings_only=False, errors='strict') is identical to smart_str() in almost all cases. The difference is when the first argument is a lazy translation instance. While smart_str() preserves lazy translations, force_str() forces those objects to a string (causing the translation to occur). Normally, you’ll want to use smart_str(). However, force_str() is useful in template tags and filters that absolutely must have a string to work with, not just something that can be converted to a string.
- smart_bytes(s, encoding='utf-8', strings_only=False, errors='strict') is essentially the opposite of smart_str(). It forces the first argument to a bytestring. The strings_only parameter has the same behavior as for smart_str() and force_str(). This is slightly different semantics from Python’s builtin str() function, but the difference is needed in a few places within Django’s internals.
Normally, you’ll only need to use force_str(). Call it as early as possible on any input data that might be either a string or a bytestring, and from then on, you can treat the result as always being a string.
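The "convert early, then work with strings" pattern can be sketched without Django as follows. The `to_text()` helper is a hypothetical, heavily simplified stand-in written for this illustration. It is not Django's implementation and ignores lazy translations and the strings_only option.

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Hypothetical helper: return ``value`` as a ``str``."""
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        # Bytes carry no encoding information, so one must be assumed.
        return bytes(value).decode(encoding, errors)
    # Anything else (numbers, other objects) goes through str().
    return str(value)


def handle_input(raw_name):
    """Normalize once at the boundary; the rest of the code sees only str."""
    name = to_text(raw_name)
    return name.strip()


from_bytes = handle_input(b"  Orl\xc3\xa9ans ")
from_text = handle_input("  Orléans ")
```

Both calls produce the same string, so code after the conversion does not need to care which type it originally received.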
URI and IRI handling
Web frameworks have to deal with URLs (which are a type of IRI). One requirement of URLs is that they are encoded using only ASCII characters. However, in an international environment, you might need to construct a URL from an IRI – very loosely speaking, a URI that can contain Unicode characters. Use these functions for quoting and converting an IRI to a URI:
- The django.utils.encoding.iri_to_uri() function, which implements the conversion from IRI to URI as required by RFC 3987#section-3.1.
- The urllib.parse.quote() and urllib.parse.quote_plus() functions from Python’s standard library.
These two groups of functions have slightly different purposes, and it’s important to keep them straight. Normally, you would use quote() on the individual portions of the IRI or URI path so that any reserved characters such as ‘&’ or ‘%’ are correctly encoded. Then, you apply iri_to_uri() to the full IRI and it converts any non-ASCII characters to the correct encoded values.
Note
Technically, it isn’t correct to say that iri_to_uri() implements the full algorithm in the IRI specification. It doesn’t (yet) perform the international domain name encoding portion of the algorithm.
The iri_to_uri() function will not change ASCII characters that are otherwise permitted in a URL. So, for example, the character ‘%’ is not further encoded when passed to iri_to_uri(). This means you can pass a full URL to this function and it will not mess up the query string or anything like that.
An example might clarify things here:
    >>> from urllib.parse import quote
    >>> from django.utils.encoding import iri_to_uri
    >>> quote('Paris & Orléans')
    'Paris%20%26%20Orl%C3%A9ans'
    >>> iri_to_uri('/favorites/François/%s' % quote('Paris & Orléans'))
    '/favorites/Fran%C3%A7ois/Paris%20%26%20Orl%C3%A9ans'
If you look carefully, you can see that the portion that was generated by quote() in the second example was not double-quoted when passed to iri_to_uri(). This is a very important and useful feature. It means that you can construct your IRI without worrying about whether it contains non-ASCII characters and then, right at the end, call iri_to_uri() on the result.
Similarly, Django provides django.utils.encoding.uri_to_iri() which implements the conversion from URI to IRI as per RFC 3987#section-3.2.
An example to demonstrate:
    >>> from django.utils.encoding import uri_to_iri
    >>> uri_to_iri('/%E2%99%A5%E2%99%A5/?utf8=%E2%9C%93')
    '/♥♥/?utf8=✓'
    >>> uri_to_iri('%A9hello%3Fworld')
    '%A9hello%3Fworld'
In the first example, the UTF-8 characters are unquoted. In the second, the percent-encodings remain unchanged because they lie outside the valid UTF-8 range or represent a reserved character.
Both iri_to_uri() and uri_to_iri() functions are idempotent, which means the following is always true:
    iri_to_uri(iri_to_uri(some_string)) == iri_to_uri(some_string)
    uri_to_iri(uri_to_iri(some_string)) == uri_to_iri(some_string)
So you can safely call either function multiple times on the same URI/IRI without risking double-quoting problems.
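The contrast with plain quoting can be sketched with the standard library alone. The `approx_iri_to_uri()` function below is a rough, hypothetical stand-in that treats '%' and other URL punctuation as safe. It is an assumption made for illustration and should not be taken as Django's implementation.

```python
from urllib.parse import quote

# Assumption: characters left untouched by the stand-in, including "%".
SAFE = "/#%[]=:;$&()+,!?*@'~"


def approx_iri_to_uri(iri):
    """Hypothetical stand-in: escape non-ASCII, keep existing escapes."""
    return quote(iri, safe=SAFE)


iri = "/favorites/François/Paris%20%26%20Orl%C3%A9ans"
once = approx_iri_to_uri(iri)
twice = approx_iri_to_uri(once)

# Plain quote() escapes "%" itself, so applying it twice double-quotes.
plain_twice = quote(quote("é"))
```

Because existing percent-escapes are left alone, a second application has nothing left to change, which is the property that makes repeated calls harmless.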
Models
Because all strings are returned from the database as str objects, model fields that are character based (CharField, TextField, URLField, etc.) will contain Unicode values when Django retrieves data from the database. This is always the case, even if the data could fit into an ASCII bytestring.
You can pass in bytestrings when creating a model or populating a field, and Django will convert them to strings when it needs to.
Taking care in get_absolute_url()
URLs can only contain ASCII characters. If you’re constructing a URL from pieces of data that might be non-ASCII, be careful to encode the results in a way that is suitable for a URL. The reverse() function handles this for you automatically.
If you’re constructing a URL manually (i.e., not using the reverse() function), you’ll need to take care of the encoding yourself. In this case, use the iri_to_uri() and quote() functions that were documented above. For example:
    from urllib.parse import quote
    from django.utils.encoding import iri_to_uri
    def get_absolute_url(self):
        url = '/person/%s/?x=0&y=0' % quote(self.location)
        return iri_to_uri(url)
This function returns a correctly encoded URL even if self.location is something like “Jack visited Paris & Orléans”. (In fact, the iri_to_uri() call isn’t strictly necessary in the above example, because all the non-ASCII characters would have been removed in quoting in the first line.)
Templates
Use strings when creating templates manually:
    from django.template import Template
    t2 = Template('This is a string template.')
But the common case is to read templates from the filesystem. If your template files are not stored with a UTF-8 encoding, adjust the TEMPLATES setting. The built-in django backend provides the 'file_charset' option to change the encoding used to read files from disk.
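As a sketch, a settings fragment for template files stored in Latin-1 might look like the following. The directory path and the chosen encoding are assumptions made up for this example, so adapt both to your project.

```python
# settings.py (fragment)
TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        # Hypothetical template directory, for illustration only.
        "DIRS": ["/path/to/templates"],
        "APP_DIRS": True,
        "OPTIONS": {
            # Encoding used to read template files from disk.
            "file_charset": "iso-8859-1",
        },
    },
]
```

This only affects how template files are read. The encoding of the rendered output is a separate concern.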
The DEFAULT_CHARSET setting controls the encoding of rendered templates. This is set to UTF-8 by default.
Template tags and filters
A couple of tips to remember when writing your own template tags and filters:
- Always return strings from a template tag’s render() method and from template filters.
- Use force_str() in preference to smart_str() in these places. Tag rendering and filter calls occur as the template is being rendered, so there is no advantage to postponing the conversion of lazy translation objects into strings. It’s easier to work solely with strings at that point.
Files
If you intend to allow users to upload files, you must ensure that the environment used to run Django is configured to work with non-ASCII file names. If your environment isn’t configured correctly, you’ll encounter UnicodeEncodeError exceptions when saving files with file names that contain non-ASCII characters.
Filesystem support for UTF-8 file names varies and might depend on the environment. Check your current configuration in an interactive Python shell by running:
    import sys
    sys.getfilesystemencoding()
This should output “UTF-8”.
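A slightly more direct check is to try encoding a non-ASCII file name with the filesystem encoding. The sketch below uses only the standard library, and the `can_encode_filename()` helper and the sample file name are hypothetical names chosen for illustration.

```python
import os
import sys


def can_encode_filename(name):
    """Return True if ``name`` can be encoded for the filesystem."""
    try:
        os.fsencode(name)
    except UnicodeEncodeError:
        # The same class of error the section above warns about.
        return False
    return True


fs_encoding = sys.getfilesystemencoding()
non_ascii_ok = can_encode_filename("résumé.txt")
```

If `non_ascii_ok` is False in the environment that runs your application server, saving uploads with non-ASCII names is likely to fail there as well.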
The LANG environment variable is responsible for setting the expected encoding on Unix platforms. Consult the documentation for your operating system and application server for the appropriate syntax and location to set this variable.
In your development environment, you might need to add a setting to your ~/.bashrc analogous to:
    export LANG="en_US.UTF-8"
Form submission
HTML form submission is a tricky area. There’s no guarantee that the submission will include encoding information, which means the framework might have to guess at the encoding of submitted data.
Django adopts a “lazy” approach to decoding form data. The data in an HttpRequest object is only decoded when you access it. In fact, most of the data is not decoded at all. Only the HttpRequest.GET and HttpRequest.POST data structures have any decoding applied to them. Those two fields will return their members as Unicode data. All other attributes and methods of HttpRequest return data exactly as it was submitted by the client.
By default, the DEFAULT_CHARSET setting is used as the assumed encoding for form data. If you need to change this for a particular form, you can set the encoding attribute on an HttpRequest instance. For example:
    def some_view(request):
        # We know that the data must be encoded as KOI8-R (for some reason).
        request.encoding = 'koi8-r'
        ...
You can even change the encoding after having accessed request.GET or request.POST, and all subsequent accesses will use the new encoding.
Most developers won’t need to worry about changing form encoding, but this is a useful feature for applications that talk to legacy systems whose encoding you cannot control.
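To see why the assumed encoding matters, the following standard-library sketch simulates a legacy client that submits KOI8-R form data and then decodes it two ways. It does not use HttpRequest, and the field name and value are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode

# Simulate a form body produced by a client that encodes as KOI8-R.
body = urlencode({"greeting": "Привет"}, encoding="koi8-r")

# Decoding with the default UTF-8 assumption garbles the value.
wrong = parse_qs(body)["greeting"][0]

# Decoding with the encoding the client actually used recovers it.
right = parse_qs(body, encoding="koi8-r")["greeting"][0]
```

The percent-encoded body is identical in both calls. Only the encoding assumed while decoding differs, which is the same choice the encoding attribute controls for form data.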
Django does not decode the data of file uploads, because that data is normally treated as collections of bytes, rather than strings. Any automatic decoding there would alter the meaning of the stream of bytes.