gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize()#149080
Conversation
…ze() Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters in alternating CCC order. Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com>
|
Reviewers: Note that there are pending changes from previous reviews. |
serhiy-storchaka
left a comment
There was a problem hiding this comment.
There is a potential for optimization, but in general LGTM. 👍
| Py_ssize_t i, o, osize; | ||
| int kind; | ||
| const void *data; | ||
| int input_kind, result_kind; |
There was a problem hiding this comment.
Why not reuse the same variable?
There was a problem hiding this comment.
IIRC, I asked to have two different variables for readability purposes. We could reuse it but when reading the code, it was cleaner when I saw the separation. But it can be reverted if you insist.
|
Ideas for optimization:
|
tim-one
left a comment
There was a problem hiding this comment.
I'm not sure there's ever an end to suggestions, so I'd prefer to ship this already. Good work, good enough,, and thank you for your care and patience!
|
@sethmlarson a little ping, there are some suggestions above that need applying. |
|
Most of my suggestions can wait to the following PRs. Although some of them can be easy to implement, so you can include them in this PR if you wish. |
encukou
left a comment
There was a problem hiding this comment.
Converting @maurycy's comments to GitHub suggestions:
|
Following Petr's example, I wrote sethmlarson#1 to apply all open suggestions (except #149080 (comment), which I think is fine as-is, and Serhiy's future performance plans). |
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com> Co-authored-by: Petr Viktorin <encukou@gmail.com> Co-authored-by: Serhiy Storchaka <storchaka@gmail.com> Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
|
Sorry for the delay folks, I've merged the changes from @StanFromIreland 🙏 |
Documentation build overview
118 files changed ·
|
|
Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.10. |
|
Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.11. |
|
Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12. |
|
Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.13. |
|
Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.14. |
|
Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.15. |
|
Sorry, @sethmlarson and @encukou, I could not cleanly backport this to |
|
Sorry, @sethmlarson and @encukou, I could not cleanly backport this to |
|
Sorry, @sethmlarson and @encukou, I could not cleanly backport this to |
|
Sorry, @sethmlarson and @encukou, I could not cleanly backport this to |
|
GH-150775 is a backport of this pull request to the 3.14 branch. |
|
GH-150776 is a backport of this pull request to the 3.15 branch. |
…ize() (GH-149080) (#150776) Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters in alternating CCC order. (cherry picked from commit 991224b) Co-authored-by: Seth Larson <seth@python.org> Co-authored-by: ch4n3-yoon <ch4n3.yoon@gmail.com> Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com> Co-authored-by: Stan Ulbrych <stan@python.org> Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com> Co-authored-by: Petr Viktorin <encukou@gmail.com> Co-authored-by: Serhiy Storchaka <storchaka@gmail.com> Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
…ize() (GH-149080) Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters in alternating CCC order. (cherry picked from commit 991224b) Co-authored-by: Seth Larson <seth@python.org> Co-authored-by: ch4n3-yoon <ch4n3.yoon@gmail.com> Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com> Co-authored-by: Stan Ulbrych <stan@python.org> Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com> Co-authored-by: Petr Viktorin <encukou@gmail.com> Co-authored-by: Serhiy Storchaka <storchaka@gmail.com> Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
|
GH-150780 is a backport of this pull request to the 3.13 branch. |
|
GH-150781 is a backport of this pull request to the 3.13 branch. |
…ize() (GH-149080) (#150780) Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters in alternating CCC order. (cherry picked from commit 991224b) Co-authored-by: Seth Larson <seth@python.org> Co-authored-by: ch4n3-yoon <ch4n3.yoon@gmail.com> Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com> Co-authored-by: Stan Ulbrych <stan@python.org> Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com> Co-authored-by: Serhiy Storchaka <storchaka@gmail.com> Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters with a large number of inversions in combing class order.
unicodedata.normalize("NFC")canonical ordering #149079