Skip to content

gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize()#149080

Merged
encukou merged 2 commits into
python:mainfrom
sethmlarson:quadratic-unicodedata-normalize
Jun 2, 2026
Merged

gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize()#149080
encukou merged 2 commits into
python:mainfrom
sethmlarson:quadratic-unicodedata-normalize

Conversation

@sethmlarson
Copy link
Copy Markdown
Contributor

@sethmlarson sethmlarson commented Apr 27, 2026

Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters with a large number of inversions in combing class order.

…ze()

Replace the insertion sort used for canonical ordering of combining
characters with a hybrid approach: insertion sort for short runs (< 20)
and counting sort for longer runs, reducing worst-case complexity from
O(n^2) to O(n). This prevents denial of service via crafted Unicode
strings with many combining characters in alternating CCC order.

Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com>
@StanFromIreland
Copy link
Copy Markdown
Member

Reviewers: Note that there are pending changes from previous reviews.

Comment thread Lib/test/test_unicodedata.py Outdated
Comment thread Lib/test/test_unicodedata.py Outdated
Comment thread Modules/unicodedata.c Outdated
@serhiy-storchaka serhiy-storchaka self-requested a review April 27, 2026 22:16
Copy link
Copy Markdown
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a potential for optimization, but in general LGTM. 👍

Comment thread Modules/unicodedata.c
Py_ssize_t i, o, osize;
int kind;
const void *data;
int input_kind, result_kind;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not reuse the same variable?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, I asked to have two different variables for readability purposes. We could reuse it but when reading the code, it was cleaner when I saw the separation. But it can be reverted if you insist.

Comment thread Modules/unicodedata.c Outdated
@serhiy-storchaka
Copy link
Copy Markdown
Member

Ideas for optimization:

  • We already have the Py_UCS4 output buffer. It is better to sort it, without using more costly PyUnicode_READ and PyUnicode_WRITE.
  • It is perhaps possible to combine sorting routines with the code that determines the length. This will reduce the number of costly _getrecord_ex() calls but requires heavy rewriting.
  • Since Unicode characters only need 21 bits of 32, they can be combined with 8-bit combining in the temporary buffer, reducing the number of costly _getrecord_ex() calls. But this will make the code more difficult to read.

Comment thread Modules/unicodedata.c
Comment thread Modules/unicodedata.c Outdated
Comment thread Misc/NEWS.d/next/Security/2026-04-27-16-36-11.gh-issue-149079.vKl-LM.rst Outdated
Copy link
Copy Markdown
Member

@tim-one tim-one left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure there's ever an end to suggestions, so I'd prefer to ship this already. Good work, good enough,, and thank you for your care and patience!

@StanFromIreland
Copy link
Copy Markdown
Member

@sethmlarson a little ping, there are some suggestions above that need applying.

@serhiy-storchaka
Copy link
Copy Markdown
Member

serhiy-storchaka commented May 13, 2026

Most of my suggestions can wait to the following PRs. Although some of them can be easy to implement, so you can include them in this PR if you wish.

Copy link
Copy Markdown
Member

@encukou encukou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converting @maurycy's comments to GitHub suggestions:

Comment thread Lib/test/test_unicodedata.py Outdated
Comment thread Lib/test/test_unicodedata.py Outdated
Comment thread Modules/unicodedata.c Outdated
@StanFromIreland
Copy link
Copy Markdown
Member

Following Petr's example, I wrote sethmlarson#1 to apply all open suggestions (except #149080 (comment), which I think is fine as-is, and Serhiy's future performance plans).

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Petr Viktorin <encukou@gmail.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
@sethmlarson
Copy link
Copy Markdown
Contributor Author

Sorry for the delay folks, I've merged the changes from @StanFromIreland 🙏

@sethmlarson sethmlarson requested a review from encukou May 26, 2026 21:29
@read-the-docs-community
Copy link
Copy Markdown

@encukou encukou merged commit 991224b into python:main Jun 2, 2026
56 checks passed
@StanFromIreland StanFromIreland added needs backport to 3.10 only security fixes needs backport to 3.11 only security fixes needs backport to 3.12 only security fixes needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes needs backport to 3.15 pre-release feature fixes, bugs and security fixes labels Jun 2, 2026
@miss-islington-app
Copy link
Copy Markdown

Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.10.
🐍🍒⛏🤖

@miss-islington-app
Copy link
Copy Markdown

Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.11.
🐍🍒⛏🤖

@miss-islington-app
Copy link
Copy Markdown

Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12.
🐍🍒⛏🤖

@miss-islington-app
Copy link
Copy Markdown

Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.13.
🐍🍒⛏🤖

@miss-islington-app
Copy link
Copy Markdown

Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.14.
🐍🍒⛏🤖

@miss-islington-app
Copy link
Copy Markdown

Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.15.
🐍🍒⛏🤖

@miss-islington-app
Copy link
Copy Markdown

Sorry, @sethmlarson and @encukou, I could not cleanly backport this to 3.10 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 991224b1e8311c85f198f6dd8208bf8cff7fc26f 3.10

@miss-islington-app
Copy link
Copy Markdown

miss-islington-app Bot commented Jun 2, 2026

Sorry, @sethmlarson and @encukou, I could not cleanly backport this to 3.11 due to a conflict.
See the Backporting merged changes section of the Developer's Guide for instructions on backporting using the cherry_picker tool:

cherry_picker 991224b1e8311c85f198f6dd8208bf8cff7fc26f 3.11

@miss-islington-app
Copy link
Copy Markdown

Sorry, @sethmlarson and @encukou, I could not cleanly backport this to 3.12 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 991224b1e8311c85f198f6dd8208bf8cff7fc26f 3.12

@miss-islington-app
Copy link
Copy Markdown

Sorry, @sethmlarson and @encukou, I could not cleanly backport this to 3.13 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 991224b1e8311c85f198f6dd8208bf8cff7fc26f 3.13

@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented Jun 2, 2026

GH-150775 is a backport of this pull request to the 3.14 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.14 bugs and security fixes label Jun 2, 2026
@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented Jun 2, 2026

GH-150776 is a backport of this pull request to the 3.15 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.15 pre-release feature fixes, bugs and security fixes label Jun 2, 2026
StanFromIreland added a commit that referenced this pull request Jun 2, 2026
…ize() (GH-149080) (#150776)

Replace the insertion sort used for canonical ordering of combining
characters with a hybrid approach: insertion sort for short runs (< 20)
and counting sort for longer runs, reducing worst-case complexity from
O(n^2) to O(n). This prevents denial of service via crafted Unicode
strings with many combining characters in alternating CCC order.
(cherry picked from commit 991224b)

Co-authored-by: Seth Larson <seth@python.org>
Co-authored-by: ch4n3-yoon <ch4n3.yoon@gmail.com>
Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com>
Co-authored-by: Stan Ulbrych <stan@python.org>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Petr Viktorin <encukou@gmail.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
StanFromIreland added a commit that referenced this pull request Jun 2, 2026
…ize() (GH-149080)

Replace the insertion sort used for canonical ordering of combining
characters with a hybrid approach: insertion sort for short runs (< 20)
and counting sort for longer runs, reducing worst-case complexity from
O(n^2) to O(n). This prevents denial of service via crafted Unicode
strings with many combining characters in alternating CCC order.
(cherry picked from commit 991224b)

Co-authored-by: Seth Larson <seth@python.org>
Co-authored-by: ch4n3-yoon <ch4n3.yoon@gmail.com>
Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com>
Co-authored-by: Stan Ulbrych <stan@python.org>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Petr Viktorin <encukou@gmail.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented Jun 2, 2026

GH-150780 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.13 bugs and security fixes label Jun 2, 2026
@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented Jun 2, 2026

GH-150781 is a backport of this pull request to the 3.13 branch.

encukou added a commit that referenced this pull request Jun 2, 2026
…ize() (GH-149080) (#150780)

Replace the insertion sort used for canonical ordering of combining
characters with a hybrid approach: insertion sort for short runs (< 20)
and counting sort for longer runs, reducing worst-case complexity from
O(n^2) to O(n). This prevents denial of service via crafted Unicode
strings with many combining characters in alternating CCC order.
(cherry picked from commit 991224b)

Co-authored-by: Seth Larson <seth@python.org>
Co-authored-by: ch4n3-yoon <ch4n3.yoon@gmail.com>
Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com>
Co-authored-by: Stan Ulbrych <stan@python.org>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs backport to 3.10 only security fixes needs backport to 3.11 only security fixes needs backport to 3.12 only security fixes topic-unicode type-security A security issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants