I wrote a query for a report that needed to summarize table data for different subsets of an ARRAY column. Once I had a working query, I recreated it in my code using an SQL query builder (the awesome jOOQ in this case). Unfortunately the output from the generated SQL was different from the handwritten query. Some of the grouping columns had null in them even though the original query showed the correct values. After some help from the jOOQ users mailing list, I found that the generated query I wrote was inadvertently using bind variables for the array indexes used in the GROUPING SETS. This apparently affects how the query is executed and leads to unexpected output. I wrote an SQL script runnable in psql to demonstrate the issue. The script creates a table with sample data and then executes 2 queries against it. The first one has everything inlined and produces the expected output. The second one uses a prepared statement and bind variables to reproduce the issue. I'm not really sure if this is a bug in query validation or a bug in executing the query or if it's just unreasonable expectations on my part but the output of the 2 queries seems to violate the principle of least surprise. - Aner -- bind-group-by-test.sql -- CREATE TABLE bind_group_by ( path text [] PRIMARY KEY, value int ); INSERT INTO bind_group_by (path, value) VALUES (ARRAY ['A1', 'B1', 'C1', 'D1'], 0), (ARRAY ['A1', 'B1', 'C1', 'D2'], 1), (ARRAY ['A1', 'B1', 'C1', 'D3'], 2), (ARRAY ['A1', 'B1', 'C2', 'D1'], 3), (ARRAY ['A1', 'B1', 'C2', 'D2'], 4), (ARRAY ['A1', 'B1', 'C2', 'D3'], 5), (ARRAY ['A1', 'B1', 'C3', 'D1'], 6), (ARRAY ['A1', 'B1', 'C3', 'D2'], 7), (ARRAY ['A1', 'B1', 'C3', 'D3'], 8), (ARRAY ['A1', 'B2', 'C1', 'D1'], 9), (ARRAY ['A1', 'B2', 'C1', 'D2'], 10), (ARRAY ['A1', 'B2', 'C1', 'D3'], 11), (ARRAY ['A1', 'B2', 'C2', 'D1'], 12), (ARRAY ['A1', 'B2', 'C2', 'D2'], 13), (ARRAY ['A1', 'B2', 'C2', 'D3'], 14), (ARRAY ['A1', 'B2', 'C3', 'D1'], 15), (ARRAY ['A1', 'B2', 'C3', 'D2'], 16), (ARRAY ['A1', 'B2', 'C3', 'D3'], 17), (ARRAY ['A1', 'B3', 'C1', 'D1'], 18), (ARRAY ['A1', 'B3', 'C1', 'D2'], 19), (ARRAY ['A1', 'B3', 'C1', 'D3'], 20), (ARRAY ['A1', 'B3', 'C2', 'D1'], 21), (ARRAY ['A1', 'B3', 'C2', 'D2'], 22), (ARRAY ['A1', 'B3', 'C2', 'D3'], 23), (ARRAY ['A1', 'B3', 'C3', 'D1'], 24), (ARRAY ['A1', 'B3', 'C3', 'D2'], 25), (ARRAY ['A1', 'B3', 'C3', 'D3'], 26), (ARRAY ['A2', 'B1', 'C1', 'D1'], 27), (ARRAY ['A2', 'B1', 'C1', 'D2'], 28), (ARRAY ['A2', 'B1', 'C1', 'D3'], 29), (ARRAY ['A2', 'B1', 'C2', 'D1'], 30), (ARRAY ['A2', 'B1', 'C2', 'D2'], 31), (ARRAY ['A2', 'B1', 'C2', 'D3'], 32), (ARRAY ['A2', 'B1', 'C3', 'D1'], 33), (ARRAY ['A2', 'B1', 'C3', 'D2'], 34), (ARRAY ['A2', 'B1', 'C3', 'D3'], 35), (ARRAY ['A2', 'B2', 'C1', 'D1'], 36), (ARRAY ['A2', 'B2', 'C1', 'D2'], 37), (ARRAY ['A2', 'B2', 'C1', 'D3'], 38), (ARRAY ['A2', 'B2', 'C2', 'D1'], 39), (ARRAY ['A2', 'B2', 'C2', 'D2'], 40), (ARRAY ['A2', 'B2', 'C2', 'D3'], 41), (ARRAY ['A2', 'B2', 'C3', 'D1'], 42), (ARRAY ['A2', 'B2', 'C3', 'D2'], 43), (ARRAY ['A2', 'B2', 'C3', 'D3'], 44), (ARRAY ['A2', 'B3', 'C1', 'D1'], 45), (ARRAY ['A2', 'B3', 'C1', 'D2'], 46), (ARRAY ['A2', 'B3', 'C1', 'D3'], 47), (ARRAY ['A2', 'B3', 'C2', 'D1'], 48), (ARRAY ['A2', 'B3', 'C2', 'D2'], 49), (ARRAY ['A2', 'B3', 'C2', 'D3'], 50), (ARRAY ['A2', 'B3', 'C3', 'D1'], 51), (ARRAY ['A2', 'B3', 'C3', 'D2'], 52), (ARRAY ['A2', 'B3', 'C3', 'D3'], 53), (ARRAY ['A3', 'B1', 'C1', 'D1'], 54), (ARRAY ['A3', 'B1', 'C1', 'D2'], 55), (ARRAY ['A3', 'B1', 'C1', 'D3'], 56), (ARRAY ['A3', 'B1', 'C2', 'D1'], 57), (ARRAY ['A3', 'B1', 'C2', 'D2'], 58), (ARRAY ['A3', 'B1', 'C2', 'D3'], 59), (ARRAY ['A3', 'B1', 'C3', 'D1'], 60), (ARRAY ['A3', 'B1', 'C3', 'D2'], 61), (ARRAY ['A3', 'B1', 'C3', 'D3'], 62), (ARRAY ['A3', 'B2', 'C1', 'D1'], 63), (ARRAY ['A3', 'B2', 'C1', 'D2'], 64), (ARRAY ['A3', 'B2', 'C1', 'D3'], 65), (ARRAY ['A3', 'B2', 'C2', 'D1'], 66), (ARRAY ['A3', 'B2', 'C2', 'D2'], 67), (ARRAY ['A3', 'B2', 'C2', 'D3'], 68), (ARRAY ['A3', 'B2', 'C3', 'D1'], 69), (ARRAY ['A3', 'B2', 'C3', 'D2'], 70), (ARRAY ['A3', 'B2', 'C3', 'D3'], 71), (ARRAY ['A3', 'B3', 'C1', 'D1'], 72), (ARRAY ['A3', 'B3', 'C1', 'D2'], 73), (ARRAY ['A3', 'B3', 'C1', 'D3'], 74), (ARRAY ['A3', 'B3', 'C2', 'D1'], 75), (ARRAY ['A3', 'B3', 'C2', 'D2'], 76), (ARRAY ['A3', 'B3', 'C2', 'D3'], 77), (ARRAY ['A3', 'B3', 'C3', 'D1'], 78), (ARRAY ['A3', 'B3', 'C3', 'D2'], 79), (ARRAY ['A3', 'B3', 'C3', 'D3'], 80); SELECT 'inlined' AS query; SELECT path[1], path[2], path[3], path, sum(value) FROM bind_group_by GROUP BY GROUPING SETS ( (path[1], path[2], path[3], path), (path[1], path[2], path[3]), (path[1], path[2]), (path[1]), () ) ORDER BY 1, 2, 3, 4 ; SELECT 'prepared' AS query; PREPARE prepared_group_by (int, int, int, int, int, int, int, int, int, int, int, int) AS SELECT path[$1], path[$2], path[$3], path, sum(value) FROM bind_group_by GROUP BY GROUPING SETS ( (path[$4], path[$5], path[$6], path), (path[$7], path[$8], path[$9]), (path[$10], path[$11]), (path[$12]), () ) ORDER BY 1, 2, 3, 4 ; EXECUTE prepared_group_by (1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 1); -- cleanup DEALLOCATE prepared_group_by; DROP TABLE bind_group_by; |
Aner Perez <[hidden email]> writes:
> [ these queries don't give the same results: ] > SELECT path[1], path[2], path[3], path, sum(value) > FROM bind_group_by > GROUP BY GROUPING SETS ( > (path[1], path[2], path[3], path), > (path[1], path[2], path[3]), > (path[1], path[2]), > (path[1]), > () > ) > ORDER BY 1, 2, 3, 4 > ; > PREPARE prepared_group_by (int, int, int, int, int, int, int, int, int, > int, int, int) AS > SELECT path[$1], path[$2], path[$3], path, sum(value) > FROM bind_group_by > GROUP BY GROUPING SETS ( > (path[$4], path[$5], path[$6], path), > (path[$7], path[$8], path[$9]), > (path[$10], path[$11]), > (path[$12]), > () > ) > ORDER BY 1, 2, 3, 4 > ; > EXECUTE prepared_group_by (1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 1); I think you're kind of shooting yourself in the foot here by writing a query with more than one possible interpretation. Since you have GROUP BY items that are both the whole "path" array and specific elements of it, it's not clear whether a targetlist reference such as "path[1]" is meant to refer to the GROUP BY item "path[1]", or to be a subscripting operator applied to the GROUP BY item "path". It appears that the parser makes the first choice in your first query, although that seems like an implementation detail that I wouldn't care to bet on going forward. In the second query, the parser is certainly not going to think that "path[$1]" matches "path[$4]" or any of the other single-element GROUP BY items, so it decides that it means subscripting the "path" item. Now the result will be NULL for any grouping set that doesn't include the "path" item. I haven't gone through the results one-by-one, but just by eyeball I think this is sufficient to explain the discrepancies. (I note that the SQL standard only allows GROUP BY items to be simple column references. Maybe that's not an arbitrary restriction but is meant to forestall this sort of ambiguity? Hard to be sure.) regards, tom lane |
Thanks Tom, I figured as much about the second query but I thought it would be safe to use the first version with the inlined indexing. I'm not sure if you're saying that the same query without the unindexed path column in the select would be safe. Like this: -- Do not GROUP BY or SELECT on path by itself -- SELECT path[1], path[2], path[3], path[4], sum(value) FROM bind_group_by GROUP BY GROUPING SETS ( (path[1], path[2], path[3], path[4]), (path[1], path[2], path[3]), (path[1], path[2]), (path[1]), () ) ORDER BY 1, 2, 3, 4; Or if using indexed path elements in the GROUP BY is the issue and I should put the array indexing in a subselect and do the aggregation in the outer select. Like this: -- Safer Subselect Version -- SELECT level1, level2, level3, level4, sum(value) FROM ( SELECT path[1] as level1, path[2] as level2, path[3] as level3, path[4] as level4, value FROM bind_group_by ) AS expanded GROUP BY GROUPING SETS ( (level1, level2, level3, level4), (level1, level2, level3), (level1, level2), (level1), () ) ORDER BY 1, 2, 3, 4; Thanks for the insight, - Aner On Wed, Sep 9, 2020 at 1:13 PM Tom Lane <[hidden email]> wrote: Aner Perez <[hidden email]> writes: |
Aner Perez <[hidden email]> writes:
> Or if using indexed path elements in the GROUP BY is the issue and I should > put the array indexing in a subselect and do the aggregation in the outer > select. Like this: > -- Safer Subselect Version -- > SELECT level1, level2, level3, level4, sum(value) > FROM ( > SELECT path[1] as level1, path[2] as level2, path[3] as level3, path[4] > as level4, value > FROM bind_group_by > ) AS expanded > GROUP BY GROUPING SETS ( > (level1, level2, level3, level4), > (level1, level2, level3), > (level1, level2), > (level1), > () > ) > ORDER BY 1, 2, 3, 4; Yeah, that one looks a lot safer from here. There's no question about which expressions are supposed to match what. It should end up with the same plan, too. regards, tom lane |
Of course, right when I hit send I thought of another option that makes the SQL a little less readable but perhaps gets rid of the ambiguity. Using ordinals in the GROUP BY: SELECT path[1], path[2], path[3], path[4], sum(value) FROM bind_group_by GROUP BY GROUPING SETS ( (1, 2, 3, 4), (1, 2, 3), (1, 2), (1), () ) ORDER BY 1, 2, 3, 4; Since I'm generating the SQL I'm not too worried about the readability and it works with bind variables too. I promise I'll stop thinking after this one ;-) Thanks! - Aner On Wed, Sep 9, 2020 at 2:45 PM Tom Lane <[hidden email]> wrote: Aner Perez <[hidden email]> writes: |
Free forum by Nabble | Edit this page |