Inbound: fix charset handling in .text, .html, .get_content_text()

Make `AnymailInboundMessage.text`, `.html` and `.get_content_text()`
usually do the right thing for non-UTF-8 messages and attachments.
Fixes a spurious UnicodeDecodeError when receiving a message in a
non-UTF-8 charset (e.g., ISO-8859-1), and improves handling for
inbound messages that were not properly encoded by the sender.

* Decode using the message's (or attachment's) declared charset
  by default, rather than always defaulting to 'utf-8'. (You can
  still override with `get_content_text(charset=...)`.)
* Add `errors` param to `get_content_text()`, defaulting to 'replace'.
  Mis-encoded messages will now use the Unicode replacement character
  rather than raising errors. (Use `get_content_text(errors='strict')`
  for the previous behavior.)
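
The two changes above can be sketched with the standard library alone.
This is an illustration, not Anymail's implementation: `decode_part_text`
is a hypothetical stand-in for `get_content_text()`, and falling back to
'utf-8' when no charset is declared is an assumption made for the example.

```python
from email import message_from_bytes


def decode_part_text(part, charset=None, errors="replace"):
    """Hypothetical stand-in for get_content_text(); not Anymail's code."""
    # Raw bytes, with any Content-Transfer-Encoding already undone
    payload = part.get_payload(decode=True)
    if charset is None:
        # Prefer the part's declared charset.
        # Assumption for this sketch: fall back to utf-8 if none is declared.
        charset = part.get_content_charset("utf-8")
    return payload.decode(charset, errors=errors)


# A correctly declared ISO-8859-1 message: decodes cleanly
latin1_msg = message_from_bytes(
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xe9\r\n"
)
latin1_text = decode_part_text(latin1_msg).strip()

# A mis-encoded message: declares utf-8 but contains a Latin-1 byte
bad_msg = message_from_bytes(
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xe9\r\n"
)
replaced_text = decode_part_text(bad_msg).strip()

# errors='strict' restores the old raise-on-bad-bytes behavior
try:
    decode_part_text(bad_msg, errors="strict")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

# An explicit charset overrides the declared one
overridden_text = decode_part_text(bad_msg, charset="iso-8859-1").strip()
```

With the default `errors='replace'`, the invalid byte in the second
message becomes U+FFFD instead of raising, while passing `charset`
explicitly recovers the intended text when the sender's declaration
is known to be wrong.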
This commit is contained in:
medmunds
2018-04-01 14:18:35 -07:00
parent 97fc869992
commit 3928f6ea5e
3 changed files with 84 additions and 8 deletions

@@ -363,11 +363,22 @@ have these methods:
(Anymail back-ports Python 3.5's :meth:`~email.message.Message.get_content_disposition`
method to all supported versions.)
.. method:: get_content_text(charset='utf-8')
.. method:: get_content_text(charset=None, errors='replace')
Returns the content of the attachment decoded to a `str` in the given charset.
Returns the content of the attachment decoded to Unicode text.
(This is generally only appropriate for text or message-type attachments.)
If provided, charset will override the attachment's declared charset. (This can be useful
if you know the attachment's :mailheader:`Content-Type` has a missing or incorrect charset.)
The errors param is as in :meth:`~bytes.decode`. The default "replace" substitutes the
Unicode "replacement character" for any illegal characters in the text.
.. versionchanged:: 2.1
Changed to use attachment's declared charset by default,
and added errors option defaulting to replace.
.. method:: get_content_bytes()
Returns the raw content of the attachment as bytes. (This will automatically decode