Unicode

Unicode has been part of all default Python builds since the early Python 2 days. It allows developers to write applications that deal with non-ASCII characters in a straightforward way. But working with unicode requires a basic knowledge of the subject, especially when working with libraries that do not support it.

Werkzeug uses unicode internally everywhere text data is assumed, even though the HTTP standard itself is not unicode aware. Basically all incoming data is decoded from the specified charset (utf-8 by default) so that you don’t operate on bytestrings any more. Outgoing unicode data is then encoded into the target charset again.

Unicode in Python

In Python 2 there are two basic string types: str and unicode. str may carry encoded unicode data but it’s always represented in bytes, whereas the unicode type does not contain bytes but code points. What does this mean? Imagine you have the German umlaut ö. In ASCII you cannot represent that character, but in the latin-1 and utf-8 character sets you can, although the encoded results look different:

  >>> u'ö'.encode('latin1')
  '\xf6'
  >>> u'ö'.encode('utf-8')
  '\xc3\xb6'

So an ö might look totally different depending on the encoding, which makes it hard to work with. The solution is to use the unicode type (as we did above, note the u prefix before the string). The unicode type does not store the bytes for ö but the information that this is a LATIN SMALL LETTER O WITH DIAERESIS.

Doing len(u'ö') will always give us the expected “1” but len('ö') might give different results depending on the encoding of 'ö'.
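The difference between code points and bytes can be checked with the standard library alone. The following is a minimal sketch; the variable names are illustrative, and the literals use explicit u'' prefixes and escapes so the result does not depend on the encoding of the source file:

```python
import unicodedata

# The unicode string holds a single code point, U+00F6.
o_umlaut = u'\xf6'

# The code point carries its identity, not a byte representation.
name = unicodedata.name(o_umlaut)

# Length of the text versus length of its encoded forms.
text_length = len(o_umlaut)
latin1_length = len(o_umlaut.encode('latin-1'))
utf8_length = len(o_umlaut.encode('utf-8'))
```

The text always has length 1, while the encoded length is 1 byte in latin-1 and 2 bytes in utf-8.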

Unicode in HTTP

The problem with unicode is that HTTP does not know what unicode is. HTTP is limited to bytes, but this is not a big problem because Werkzeug automatically decodes and encodes all incoming and outgoing data for us. Basically this means that data sent from the browser to the web application is by default decoded from a utf-8 bytestring into a unicode string. Data sent from the application back to the browser that is not yet a bytestring is then encoded back to utf-8.

Usually this “just works” and we don’t have to worry about it, but there are situations where this behavior is problematic. For example the Python 2 IO layer is not unicode aware. This means that whenever you work with data from the file system you have to decode it properly. The correct way to load a text file from the file system looks like this:

  f = open('/path/to/the_file.txt', 'r')
  try:
      # read the raw bytes, then decode them
      # (assuming the file is utf-8 encoded)
      text = f.read().decode('utf-8')
  finally:
      f.close()

There is also the codecs module which provides an open function that decodes automatically from the given encoding.
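As a rough illustration of the codecs approach, the sketch below writes a small utf-8 file to a temporary location (the temporary file is only there to make the example self-contained) and reads it back through codecs.open:

```python
import codecs
import os
import tempfile

# Create a small utf-8 encoded file to read back.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as raw:
    raw.write(b'sch\xc3\xb6n')  # "schoen" with an o-umlaut, as utf-8 bytes

# codecs.open() decodes while reading, so read() returns text, not bytes.
with codecs.open(path, 'r', encoding='utf-8') as f:
    text = f.read()

os.remove(path)
```

The two bytes of the umlaut come back as one character, so no separate decode step is needed.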

Error Handling

With Werkzeug 0.3 onwards you can further control the way Werkzeug works with unicode. In the past Werkzeug ignored encoding errors silently on incoming data. This decision was made to avoid internal server errors if the user tampered with the submitted data. However there are situations where you want to abort with a 400 BAD REQUEST instead of silently ignoring the error.

All the functions that do internal decoding now accept an errors keyword argument that behaves like the errors parameter of the builtin string method decode. The following values are possible:

  • ignore
    This is the default behavior and tells the codec to silently ignore characters that it doesn’t understand.
  • replace
    The codec will replace unknown characters with a replacement character (U+FFFD REPLACEMENT CHARACTER).
  • strict
    Raise an exception if decoding fails.
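Since these values mirror the errors parameter of the builtin decode method, their effect can be seen without Werkzeug. A minimal sketch, using a latin-1 encoded bytestring that is not valid utf-8 (the variable names are illustrative):

```python
# "schoen" with an o-umlaut, encoded as latin-1; 0xf6 is invalid in utf-8.
data = b'sch\xf6n'

# ignore: the undecodable byte is dropped.
ignored = data.decode('utf-8', 'ignore')

# replace: the undecodable byte becomes U+FFFD.
replaced = data.decode('utf-8', 'replace')

# strict: decoding raises instead of returning text.
try:
    data.decode('utf-8', 'strict')
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```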

Unlike regular Python decoding, Werkzeug does not raise a UnicodeDecodeError if the decoding fails but an HTTPUnicodeError, which is a direct subclass of both UnicodeError and the BadRequest HTTP exception. The reason is that if this exception is not caught by the application but a catch-all for HTTP exceptions exists, a default 400 BAD REQUEST error page is displayed.
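Why the double inheritance helps can be sketched with stand-in classes. The BadRequest and HTTPUnicodeError classes below are simplified stand-ins written for this example, not Werkzeug’s own classes, and decode_strict and handle are hypothetical helpers:

```python
class BadRequest(Exception):
    """Stand-in for an HTTP exception that maps to 400 BAD REQUEST."""
    code = 400


class HTTPUnicodeError(BadRequest, UnicodeError):
    """Stand-in: a decoding failure that is also a 400 error."""


def decode_strict(data, charset='utf-8'):
    # Hypothetical helper: turn the codec error into the HTTP-aware one.
    try:
        return data.decode(charset, 'strict')
    except UnicodeDecodeError as exc:
        raise HTTPUnicodeError(str(exc))


def handle(data):
    # A catch-all for HTTP exceptions also catches the unicode error.
    try:
        return 200, decode_strict(data)
    except BadRequest as exc:
        return exc.code, None
```

Code that only knows about HTTP exceptions ends up with a 400 status, while code that cares about the decoding problem can still catch UnicodeError.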

Werkzeug also provides an extension to the regular codec error handling, called fallback. Often you want to use utf-8 but also support latin1 as a legacy encoding if decoding fails. For this case you can use the fallback error handling. For example you can specify 'fallback:iso-8859-15' to tell Werkzeug it should try iso-8859-15 if utf-8 fails. If this decoding fails too (which should not happen for most legacy charsets such as iso-8859-15) the error is silently ignored as if the error handling was ignore.
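The fallback behavior described above can be approximated in plain Python. The function below is a hypothetical helper written for illustration, not Werkzeug’s implementation; in particular, applying ignore to the fallback charset is an assumption of this sketch:

```python
def decode_with_fallback(data, charset='utf-8', fallback='iso-8859-15'):
    """Hypothetical helper: try charset strictly, then the fallback."""
    try:
        # First attempt: the primary charset, without tolerating errors.
        return data.decode(charset, 'strict')
    except UnicodeDecodeError:
        # Second attempt: the legacy charset; anything it cannot
        # decode is silently dropped (assumption of this sketch).
        return data.decode(fallback, 'ignore')


from_utf8 = decode_with_fallback(b'sch\xc3\xb6n')   # valid utf-8
from_legacy = decode_with_fallback(b'sch\xf6n')     # legacy single-byte data
```

Both inputs decode to the same text, which is the point of the fallback: legacy clients keep working while utf-8 remains the primary charset.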

Further details are available as part of the API documentation of the concrete implementations of the functions or classes working with unicode.

Request and Response Objects

As request and response objects are usually the central entities of Werkzeug powered applications, you can change the default encoding Werkzeug operates on by subclassing these two classes. For example you can easily set the application to utf-7 and strict error handling:

  from werkzeug.wrappers import BaseRequest, BaseResponse

  class Request(BaseRequest):
      charset = 'utf-7'
      encoding_errors = 'strict'

  class Response(BaseResponse):
      charset = 'utf-7'

Keep in mind that the error handling is customizable only for decoding, not for encoding. If Werkzeug encounters an encoding error it will raise a UnicodeEncodeError. It’s your responsibility not to create data that cannot be represented in the target charset (a non-issue with all unicode encodings such as utf-8).
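What such an encoding error looks like can be shown with the standard library. A minimal sketch, using the euro sign as a character that latin-1 cannot represent (the variable names are illustrative):

```python
euro = u'\u20ac'  # EURO SIGN, not part of latin-1

# A unicode encoding such as utf-8 can represent every character.
utf8_bytes = euro.encode('utf-8')

# A legacy charset cannot, and encoding raises UnicodeEncodeError.
try:
    euro.encode('latin-1')
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True
```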

The Filesystem

Changed in version 0.11.

Up until version 0.11, Werkzeug used Python’s stdlib functionality to detect the filesystem encoding. However, several bug reports against Werkzeug have shown that the value of sys.getfilesystemencoding() cannot be trusted under traditional UNIX systems. The usual problems come from misconfigured systems, where LANG and similar environment variables are not set. In such cases, Python would default to ASCII as filesystem encoding, a very conservative default that is usually wrong and causes more problems than it avoids.

Therefore Werkzeug will force the filesystem encoding to UTF-8 and issue a warning whenever it detects that it is running under BSD or Linux, and sys.getfilesystemencoding() is returning an ASCII encoding.
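The check described above could look roughly like the following. This is a sketch under stated assumptions: choose_filesystem_encoding is a hypothetical helper, not the function in werkzeug.filesystem, and the platform test and warning text are illustrative:

```python
import codecs
import sys
import warnings


def choose_filesystem_encoding(reported, platform):
    """Hypothetical helper: return the encoding to use for file names."""
    unix_like = platform.startswith('linux') or 'bsd' in platform
    try:
        # codecs.lookup() normalizes aliases such as ANSI_X3.4-1968.
        is_ascii = codecs.lookup(reported).name == 'ascii'
    except (LookupError, TypeError):
        # Unknown or missing encoding name: leave the reported value alone.
        is_ascii = False
    if unix_like and is_ascii:
        warnings.warn('ASCII filesystem encoding detected, assuming UTF-8.')
        return 'utf-8'
    return reported


encoding = choose_filesystem_encoding(sys.getfilesystemencoding(), sys.platform)
```

Passing the reported encoding and the platform as arguments keeps the decision easy to test without changing the environment.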

See also werkzeug.filesystem.