Inbound: fix charset handling in .text, .html, .get_content_text()

Make `AnymailInboundMessage.text`, `.html` and `.get_content_text()` usually do the right thing for non-UTF-8 messages/attachments. Fixes an incorrect UnicodeDecodeError when receiving (e.g.) an ISO-8859-1 encoded message, and improves handling for inbound messages that were not properly encoded by the sender.

* Decode using the message's (or attachment's) declared charset by default (rather than always defaulting to 'utf-8'); you can still override with `get_content_text(charset=...)`.
* Add `errors` param to `get_content_text()`, defaulting to 'replace'. Mis-encoded messages will now use the Unicode replacement character rather than raising errors. (Use `get_content_text(errors='strict')` for the previous behavior.)
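
A rough illustration of the decoding behavior described above, using only the standard library. The helper `decode_content` and its `declared_charset` parameter are hypothetical names made up for this sketch; this is not Anymail's actual implementation, and the final fallback to UTF-8 when no charset is declared is an assumption.

```python
def decode_content(payload, declared_charset=None, charset=None, errors="replace"):
    """Hypothetical stand-in for the charset/errors logic (not Anymail's code)."""
    # An explicit charset argument overrides the declared one.
    # Falling back to UTF-8 when neither is given is an assumption of this sketch.
    effective_charset = charset or declared_charset or "utf-8"
    return payload.decode(effective_charset, errors=errors)


# An ISO-8859-1 encoded body: the byte 0xE9 is not valid UTF-8 here.
latin1_body = "café".encode("iso-8859-1")

# Using the declared charset decodes the text correctly.
with_declared = decode_content(latin1_body, declared_charset="iso-8859-1")

# With no usable charset, errors='replace' substitutes U+FFFD instead of raising.
with_replacement = decode_content(latin1_body)

# errors='strict' gives the older raise-on-bad-bytes behavior.
try:
    decode_content(latin1_body, errors="strict")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```

In this sketch, `with_declared` is the original text, `with_replacement` contains the Unicode replacement character in place of the undecodable byte, and the strict call raises `UnicodeDecodeError`.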
2026-02-05 03:55:20 -05:00 · 2018-04-01 14:18:35 -07:00
parent 97fc869992
commit 3928f6ea5e
3 changed files with 84 additions and 8 deletions
--- a/docs/inbound.rst
+++ b/docs/inbound.rst
@@ -363,11 +363,22 @@ have these methods:
        (Anymail back-ports Python 3.5's :meth:`~email.message.Message.get_content_disposition`
        method to all supported versions.)

-    .. method:: get_content_text(charset='utf-8')
+    .. method:: get_content_text(charset=None, errors='replace')

-        Returns the content of the attachment decoded to a `str` in the given charset.
+        Returns the content of the attachment decoded to Unicode text.
        (This is generally only appropriate for text or message-type attachments.)

+        If provided, charset will override the attachment's declared charset. (This can be useful
+        if you know the attachment's :mailheader:`Content-Type` has a missing or incorrect charset.)
+
+        The errors param is as in :meth:`~bytes.decode`. The default "replace" substitutes the
+        Unicode "replacement character" for any illegal characters in the text.
+
+        .. versionchanged:: 2.1
+
+            Changed to use attachment's declared charset by default,
+            and added errors option defaulting to replace.
+
    .. method:: get_content_bytes()

        Returns the raw content of the attachment as bytes. (This will automatically decode