Unicode

Werkzeug uses strings internally everywhere text data is expected, even though the HTTP standard itself is not Unicode-aware. All incoming data is decoded from the configured charset (UTF-8 by default) so that you don’t work with bytes directly. Outgoing data is encoded into the target charset.

Unicode in Python

Imagine you have the German umlaut ö. ASCII cannot represent that character. The latin-1 and utf-8 character sets both can, but the resulting bytes differ:

>>> "ö".encode("latin1")
b'\xf6'
>>> "ö".encode("utf-8")
b'\xc3\xb6'

An ö looks different depending on the encoding, which makes it hard to work with as bytes. Instead, Python treats strings as Unicode text and stores the character LATIN SMALL LETTER O WITH DIAERESIS rather than the bytes for ö in a specific encoding. A string with one character always has a length of 1, whereas the length of its encoded bytes depends on the encoding.
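The difference between the length of a string and the length of its encoded bytes can be checked with the standard library alone. This is a minimal sketch; the variable names are illustrative and not part of Werkzeug.

```python
import unicodedata

# Written as an escape so the example does not depend on how this
# file itself is encoded or normalized.
char = "\u00f6"  # ö

# Python stores the character, not a particular byte sequence.
name = unicodedata.name(char)

# One character in the string, regardless of encoding.
text_length = len(char)

# The byte length depends on the encoding that is chosen.
latin1_length = len(char.encode("latin1"))
utf8_length = len(char.encode("utf-8"))

print(name, text_length, latin1_length, utf8_length)
```

The string length stays the same whichever encoding is later used to turn it into bytes, which is why working with text as strings is simpler.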

Unicode in HTTP

However, the HTTP spec was written at a time when ASCII bytes were the common way to represent data. To work around this for the modern web, Werkzeug decodes and encodes incoming and outgoing data automatically. Data sent from the browser to the web application is decoded from UTF-8 bytes into a string. Data sent from the application back to the browser is encoded back to UTF-8.
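The round trip described above can be illustrated with the standard library. This sketch does not show Werkzeug's own implementation; it only assumes a hypothetical query string in which a browser has percent-encoded the UTF-8 bytes of ö, and uses urllib.parse to decode it.

```python
from urllib.parse import parse_qs

# Hypothetical raw query string as it might arrive on the wire: plain
# ASCII bytes in which "ö" appears as its two UTF-8 bytes, %C3%B6.
raw_query = b"name=J%C3%B6rg"

# Incoming: the percent-escapes are turned back into bytes and those
# bytes are decoded as UTF-8, giving a normal Python string.
args = parse_qs(raw_query.decode("ascii"), encoding="utf-8", errors="replace")
name = args["name"][0]

# Outgoing: text is encoded back to UTF-8 bytes for the response body.
body = f"Hello, {name}!".encode("utf-8")

print(name, body)
```

Application code in between only ever sees the string, which is the situation Werkzeug's automatic decoding and encoding is meant to provide.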

Error Handling

Functions that do internal encoding or decoding accept an errors keyword argument that is passed to bytes.decode() and str.encode(). The default is 'replace' so that errors are easy to spot. Setting it to 'strict' raises an exception instead, which lets you catch the error and report the bad data to the client.
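The effect of the two error handlers can be seen directly on Python's built-in bytes.decode(). This is a small sketch with made-up input, not code taken from Werkzeug.

```python
# "café" encoded as Latin-1. The lone byte 0xE9 is not valid UTF-8.
bad = b"caf\xe9"

# 'replace' substitutes U+FFFD (the replacement character) for the
# invalid byte, so decoding succeeds and the problem is visible in the text.
replaced = bad.decode("utf-8", errors="replace")

# 'strict' raises instead, so the caller can reject the input.
strict_error = None
try:
    bad.decode("utf-8", errors="strict")
except UnicodeDecodeError as exc:
    strict_error = exc

print(repr(replaced), strict_error)
```

With 'strict', an application could catch the exception and respond with an error status rather than processing text that contains replacement characters.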

Request and Response Objects

In most cases, you should stick with Werkzeug’s default encoding of UTF-8. If you have a specific reason to, you can subclass wrappers.Request and wrappers.Response to change the encoding and error handling.

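from werkzeug.wrappers.request import Request
from werkzeug.wrappers.response import Response

class Latin1Request(Request):
    # Decode incoming data as Latin-1 instead of UTF-8.
    charset = "latin1"
    # Raise on undecodable bytes instead of replacing them.
    encoding_errors = "strict"

class Latin1Response(Response):
    # Encode outgoing data as Latin-1 instead of UTF-8.
    charset = "latin1"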

The error handling can only be changed for the request. Werkzeug will always raise errors when encoding to bytes in the response. It’s your responsibility not to produce text containing characters that the target charset cannot represent. This is not an issue for UTF-8.
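Why this matters for a charset such as Latin-1 can be shown with plain str.encode(), which is strict by default. This is an illustrative sketch with a made-up string; it does not go through a Werkzeug response object.

```python
# The euro sign (U+20AC) exists in Unicode but not in Latin-1.
text = "Price: 5 \u20ac"

# UTF-8 can represent the character, so encoding succeeds.
utf8_bytes = text.encode("utf-8")

# Latin-1 cannot, so strict encoding raises an error.
latin1_error = None
try:
    text.encode("latin1")
except UnicodeEncodeError as exc:
    latin1_error = exc

print(utf8_bytes, latin1_error)
```

If a non-UTF-8 charset is used for responses, text therefore needs to be checked or cleaned before it is written to the response.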

© 2007–2021 Pallets
Licensed under the BSD 3-clause License.
https://werkzeug.palletsprojects.com/en/2.0.x/unicode/