mirror of https://github.com/postgres/postgres.git synced 2025-11-19 13:42:17 +03:00

Get rid of O(N^2) script-parsing overhead in pgbench.

pgbench wants to record the starting line number of each command
in its scripts.  It was computing that by scanning from the script
start and counting newlines, so that O(N^2) work had to be done
for an N-command script.  In a script with 50K lines, this adds
up to about 10 seconds on my machine.

To add insult to injury, the results were subtly wrong, because
expr_scanner_offset() scanned to find the NUL that flex inserts
at the end of the current token --- and before the first yylex
call, no such NUL has been inserted.  So we ended up computing the
script's last line number, not its first one.  This was visible only
in case of \gset at the start of a script, which perhaps accounts
for the lack of complaints.
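
To illustrate the failure mode described above, here is a minimal sketch,
not the pgbench code itself.  The helper line_of_first_nul() is a
hypothetical name, and the NUL that flex would store after the current
token is simulated by writing it into the buffer by hand.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: count newlines from the start of buf up to the
 * first NUL, and return the 1-based line number of that position.
 * This mimics "scan to the NUL that marks the end of the current token".
 */
static int
line_of_first_nul(const char *buf)
{
    int         lineno = 1;
    const char *p;

    for (p = buf; *p != '\0'; p++)
    {
        if (*p == '\n')
            lineno++;
    }
    return lineno;
}
```

If no token has been lexed yet, the only NUL in the buffer is the string
terminator, so the scan runs to the end of the script and reports the line
there.  Once a NUL sits just past the first token, the same scan stops on
the first line.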

To fix, steal an idea from plpgsql and track the current lexer
ending position and line count as we advance through the script.
(It's a bit simpler than plpgsql since we never need to back up.)
Also adjust a couple of other places that were invoking scans
from script start when they didn't really need to.  I made a new
psqlscan function psql_scan_get_location() that replaces both
expr_scanner_offset() and expr_scanner_get_lineno(), since in
practice expr_scanner_get_lineno() was only being invoked to find
the line number of the current lexer end position.
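
The incremental tracking described above can be sketched in isolation as
follows.  This is a simplified illustration under stated assumptions:
LineTracker, tracker_init() and tracker_advance() are hypothetical names,
and the caller passes the current end position explicitly instead of
relying on the NUL that flex stores after the current token.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the scanner state; not the real PsqlScanState. */
typedef struct LineTracker
{
    const char *buf;            /* start of the whole script */
    const char *cur_line_ptr;   /* how far newlines have been counted */
    int         cur_line_no;    /* 1-based line number at cur_line_ptr */
} LineTracker;

static void
tracker_init(LineTracker *t, const char *buf)
{
    t->buf = buf;
    t->cur_line_ptr = buf;
    t->cur_line_no = 1;
}

/*
 * Report the line number and offset of "end", which must not be before
 * the position reported by the previous call.  Only the text between the
 * previous position and "end" is examined, so the total work over a whole
 * script is linear in its length rather than quadratic.
 */
static void
tracker_advance(LineTracker *t, const char *end, int *lineno, size_t *offset)
{
    const char *nl;

    assert(end >= t->cur_line_ptr);

    while ((nl = memchr(t->cur_line_ptr, '\n',
                        (size_t) (end - t->cur_line_ptr))) != NULL)
    {
        t->cur_line_no++;
        t->cur_line_ptr = nl + 1;
    }
    t->cur_line_ptr = end;

    *lineno = t->cur_line_no;
    *offset = (size_t) (end - t->buf);
}
```

Because the position only ever moves forward, each call pays for the newly
lexed text alone.  That is also why a scheme like this cannot support
backing up without extra bookkeeping.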

Reported-by: Daniel Vérité <daniel@manitou-mail.org>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/84a8a89e-adb8-47a9-9d34-c13f7150ee45@manitou-mail.org
Tom Lane
2025-02-27 10:53:38 -05:00
parent e167191dc1
commit c8c74ad7e1
6 changed files with 89 additions and 55 deletions

@@ -1079,6 +1079,10 @@ psql_scan_setup(PsqlScanState state,
    /* Set lookaside data in case we have to map unsafe encoding */
    state->curline = state->scanbuf;
    state->refline = state->scanline;

    /* Initialize state for psql_scan_get_location() */
    state->cur_line_no = 0;     /* yylex not called yet */
    state->cur_line_ptr = state->scanbuf;
}

/*
@@ -1136,6 +1140,10 @@ psql_scan(PsqlScanState state,
    /* And lex. */
    lexresult = yylex(NULL, state->scanner);

    /* Notify psql_scan_get_location() that a yylex call has been made. */
    if (state->cur_line_no == 0)
        state->cur_line_no = 1;

    /*
     * Check termination state and return appropriate result info.
     */
@@ -1311,6 +1319,52 @@ psql_scan_in_quote(PsqlScanState state)
        state->start_state != xqs;
}

/*
 * Return the current scanning location (end+1 of last scanned token),
 * as a line number counted from 1 and an offset from string start.
 *
 * This considers only the outermost input string, and therefore is of
 * limited use for programs that use psqlscan_push_new_buffer().
 *
 * It would be a bit easier probably to use "%option yylineno" to count
 * lines, but the flex manual says that has a performance cost, and only
 * a minority of programs using psqlscan have need for this functionality.
 * So we implement it ourselves without adding overhead to the lexer itself.
 */
void
psql_scan_get_location(PsqlScanState state,
                       int *lineno, int *offset)
{
    const char *line_end;

    /*
     * We rely on flex's having stored a NUL after the current token in
     * scanbuf.  Therefore we must specially handle the state before yylex()
     * has been called, when obviously that won't have happened yet.
     */
    if (state->cur_line_no == 0)
    {
        *lineno = 1;
        *offset = 0;
        return;
    }

    /*
     * Advance cur_line_no/cur_line_ptr past whatever has been lexed so far.
     * Doing this prevents repeated calls from being O(N^2) for long inputs.
     */
    while ((line_end = strchr(state->cur_line_ptr, '\n')) != NULL)
    {
        state->cur_line_no++;
        state->cur_line_ptr = line_end + 1;
    }
    state->cur_line_ptr += strlen(state->cur_line_ptr);

    /* Report current location. */
    *lineno = state->cur_line_no;
    *offset = state->cur_line_ptr - state->scanbuf;
}

/*
* Push the given string onto the stack of stuff to scan.
*