Preserve trailing whitespace after the final semicolon #859

Open
Sanjays2402 wants to merge 1 commit into andialbrecht:master from Sanjays2402:fix/preserve-trailing-whitespace

@Sanjays2402

What

sqlparse.parse() is meant to be lossless — joining the returned
statements should reproduce the input exactly. That held for trailing
spaces/tabs after the final ;, but not for trailing whitespace
containing a newline:

>>> import sqlparse
>>> s = "select 1;\n"
>>> "".join(str(x) for x in sqlparse.parse(s)) == s
False          # got "select 1;": the trailing "\n" is dropped

Result of "".join(map(str, parse(input))) before and after the fix:

input                     before           after
"select 1;"               "select 1;"      "select 1;"
"select 1; " (space)      "select 1; "     "select 1; "
"select 1;\n"             "select 1;"      "select 1;\n"
"select 1;\r\n"           "select 1;"      "select 1;\r\n"
"select 1;\n\n"           "select 1;"      "select 1;\n\n"
";\n"                     ";"              ";\n"

Trailing spaces were preserved but trailing newlines were not, which is a
surprising inconsistency for a library whose core guarantee is that parsing
never mutates the SQL.

Root cause

In StatementSplitter.process(), once a ; sets consume_ws = True, a
following newline token is deliberately treated as "not whitespace" for
end-of-statement detection (see the existing comment: "It will count
newline token as a non whitespace"). This is what makes "a;\nb;" split
into two statements.

The side effect: that newline starts a fresh statement buffer. When the
whitespace is at the very end of the input there is no following
statement, so the buffer stays all-whitespace and is discarded by the final
not all(t.is_whitespace ...) guard — silently losing the exact input.

Fix

Hold each completed statement back by one segment. A held statement is
emitted only once the next real (non-whitespace) token confirms a new
statement has actually begun, so inter-statement newline placement is
byte-for-byte unchanged:

>>> [str(x) for x in sqlparse.parse("a;\nb;")]
['a;', '\nb;']          # identical to before

At end of stream, any leftover all-whitespace tokens are reattached to the
held statement instead of being dropped, making the trailing case lossless.
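The hold-back idea can be illustrated with a toy model. This is only a sketch under stated assumptions, not the code in statement_splitter.py: the regex tokenizer, the string-based statements, and the names split_lossy and split_holdback are all hypothetical and exist only to show why the trailing newline was lost and how holding one statement back recovers it.

```python
import re

# Toy tokenizer (an assumption, not sqlparse's lexer): newline tokens,
# runs of other whitespace, ";" and runs of everything else.
TOKEN_RE = re.compile(r"\r\n|\r|\n|[^\S\r\n]+|;|[^\s;]+")
NEWLINES = {"\r\n", "\r", "\n"}


def is_inline_ws(tok):
    """Whitespace that may trail a ';' inside the same statement."""
    return tok.isspace() and tok not in NEWLINES


def split_lossy(sql):
    """Model of the old behavior: a trailing all-whitespace buffer is dropped."""
    buf, consume_ws = [], False
    for tok in TOKEN_RE.findall(sql):
        if consume_ws and not is_inline_ws(tok):
            yield "".join(buf)          # statement ended; start a new buffer
            buf, consume_ws = [], False
        buf.append(tok)
        if tok == ";":
            consume_ws = True
    if buf and not all(t.isspace() for t in buf):
        yield "".join(buf)              # whitespace-only leftovers vanish here


def split_holdback(sql):
    """Model of the fix: hold one completed statement until a real token arrives."""
    held, buf, consume_ws = None, [], False
    for tok in TOKEN_RE.findall(sql):
        if consume_ws and not is_inline_ws(tok):
            held = "".join(buf)         # hold the completed statement back
            buf, consume_ws = [], False
        if held is not None and not tok.isspace():
            yield held                  # a real token confirms a new statement
            held = None
        buf.append(tok)
        if tok == ";":
            consume_ws = True
    if held is not None:
        yield held + "".join(buf)       # only whitespace followed: reattach it
    elif buf and not all(t.isspace() for t in buf):
        yield "".join(buf)


print(list(split_lossy("select 1;\n")))     # ['select 1;']
print(list(split_holdback("select 1;\n")))  # ['select 1;\n']
print(list(split_holdback("a;\nb;")))       # ['a;', '\nb;']
```

In this model the split points are identical between the two functions; the only difference is what happens to whitespace left over after the last statement.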

Unchanged behaviors (verified):

  • Whitespace-only input (" ", "\n") still yields zero statements.
  • Statement count is unchanged in every case.
  • sqlparse.split() is unaffected — it .strip()s each statement anyway.

Tests

Added a parametrized regression test
tests/test_split.py::test_split_preserves_trailing_whitespace covering
\n, \r\n, multiple/mixed trailing whitespace, the multi-statement case,
and a bare ;\n.

Proof it guards the bug (stash the source fix, keep the test):

# without the fix
7 failed in 0.04s
# with the fix
7 passed in 0.01s

Also verified with a 200k-iteration round-trip fuzzer asserting
"".join(map(str, parse(sql))) == sql (excluding whitespace-only inputs):
31+ failures before, 0 after.
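The fuzzer itself is not included in this description. A minimal sketch of what such a round-trip check might look like follows; the token alphabet, the helper names random_sql and fuzz_round_trip, and the default iteration count are all assumptions made for illustration. The parser is passed in as a callable so the sketch does not depend on any particular sqlparse version.

```python
import random

# Hypothetical token alphabet; a real fuzzer would likely use a richer one.
ALPHABET = ["select", "1", "a", ";", " ", "\t", "\n", "\r\n"]


def random_sql(rng, max_tokens=8):
    """Build a random input by concatenating tokens from ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(rng.randint(1, max_tokens)))


def fuzz_round_trip(parse, iterations=1000, seed=0):
    """Return the inputs for which joining parse(sql) does not reproduce sql."""
    rng = random.Random(seed)
    failures = []
    for _ in range(iterations):
        sql = random_sql(rng)
        if not sql.strip():
            continue  # whitespace-only inputs are excluded from the check
        if "".join(str(stmt) for stmt in parse(sql)) != sql:
            failures.append(sql)
    return failures


# Stand-in parsers used only to exercise the harness:
lossless = lambda sql: [sql]            # trivially round-trips
lossy = lambda sql: [sql.rstrip()]      # drops trailing whitespace
print(len(fuzz_round_trip(lossless)))   # 0
```

Against the library, the same harness would be called as fuzz_round_trip(sqlparse.parse, iterations=200_000), with an empty result meaning every generated input round-tripped.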

Verification

  • Full suite: 494 passed, 2 xfailed, 1 xpassed (487 → 494; +7 new; no
    regressions).
  • ruff check clean on both changed files.
  • Diff: +41 / −1 across statement_splitter.py, tests/test_split.py,
    and a CHANGELOG bullet.

parse() is meant to be lossless: joining the returned statements should
reproduce the input exactly. This held for trailing spaces/tabs after the
last ";" but not for trailing whitespace containing a newline --
str(parse("select 1;\n")) was "select 1;", dropping the "\n".

Root cause is in StatementSplitter.process(). Once a ";" arms consume_ws,
a following newline token is deliberately treated as "not whitespace" for
end-of-statement detection (so that "a;\nb;" splits into two statements),
which starts a new statement buffer for the trailing whitespace. When that
whitespace is at the very end of the input there is no following statement,
so the buffer stays all-whitespace and is discarded by the final
"not all whitespace" guard -- silently losing the exact input.

Fix: hold each completed statement back by one segment. A held statement is
emitted only once the next real (non-whitespace) token confirms a new
statement has begun -- so inter-statement newline placement is byte-for-byte
unchanged ("a;\nb;" still yields ["a;", "\nb;"]). At end of stream, any
leftover all-whitespace tokens are reattached to the held statement instead
of being dropped, making the trailing case lossless too. Whitespace-only
input still yields zero statements, and split() is unaffected because it
strips each statement.

Verified with a 200k-iteration round-trip fuzzer (str(parse(sql)) == sql):
31+ failures before, 0 after; full suite 494 passed.

Add parametrized regression test test_split_preserves_trailing_whitespace
covering "\n", "\r\n", multiple/mixed trailing whitespace, the multi-
statement case and a bare ";\n"; it fails on all 7 cases without the fix.
Copilot AI review requested due to automatic review settings July 3, 2026 04:21

Copilot AI left a comment

Pull request overview

This PR fixes a losslessness gap in sqlparse.parse() where trailing whitespace containing newlines after the final semicolon (;) could be dropped when re-joining parsed statements, breaking the expected round-trip property for inputs like "select 1;\n".

Changes:

  • Update StatementSplitter.process() to hold back the most recently completed statement and, at end-of-stream, reattach any trailing all-whitespace tokens to it instead of discarding them.
  • Add a parametrized regression test ensuring ''.join(map(str, sqlparse.parse(s))) == s for a variety of trailing-newline and multi-statement cases.
  • Document the behavioral fix in the changelog.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File                                    Description
sqlparse/engine/statement_splitter.py   Holds one completed statement back and reattaches trailing all-whitespace at EOF to preserve byte-for-byte round-trips.
tests/test_split.py                     Adds regression coverage for trailing newline/whitespace preservation when joining parse() results.
CHANGELOG                               Notes the bug fix and the restored str(parse(sql)) == sql behavior for inputs ending with newline after ;.
