[Fix](nereids)make agg output unchanged after normalized repeat #36207

Merged
merged 2 commits into from
Jun 13, 2024

feiniaofeiafei
Contributor

@feiniaofeiafei feiniaofeiafei commented Jun 12, 2024

The NormalizeRepeat rule can change the output of the aggregate (LogicalAggregate), so slots that the parent plan expects may no longer be produced.
For example:

         SELECT
             col_int_undef_signed2 AS C1 ,
             col_int_undef_signed2
         FROM
             normalize_repeat_name_unchanged
         GROUP BY
         GROUPING SETS (
         (col_int_undef_signed2),
         (col_int_undef_signed2))

Before fixing the bug, the plan is:

LogicalResultSink[97] ( outputExprs=[C1#7, col_int_undef_signed2#1] )
      +--LogicalProject[94] ( distinct=false, projects=[C1#7, C1#7], excepts=[] )
         +--LogicalAggregate[93] ( groupByExpr=[C1#7, GROUPING_ID#8], outputExpr=[C1#7, GROUPING_ID#8], hasRepeat=true )
            +--LogicalRepeat ( groupingSets=[[C1#7], [C1#7]], outputExpressions=[C1#7, GROUPING_ID#8] )
               +--LogicalProject[91] ( distinct=false, projects=[col_int_undef_signed2#1 AS `C1`#7], excepts=[] )
                  +--LogicalOlapScan (  )

As a result, LogicalResultSink cannot find the column col_int_undef_signed2#1 in its child's output, and planning fails with the error: Input slot(s) not in childs output: col_int_undef_signed2#1 in plan: LogicalResultSink[97] ( outputExprs=[C1#7, col_int_undef_signed2#1] )
child output is: [C1#7]
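
To illustrate why the sink fails, here is a minimal Java sketch (Java 16+ for records) of a slot check of this kind. `Slot` and `SlotCheck.missingSlots` are hypothetical names used only for illustration, not the Nereids classes; the sketch assumes the check simply compares the ExprIds required by the parent against the ExprIds in the child's output.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a slot: a column name plus an ExprId.
record Slot(String name, int exprId) {
    @Override
    public String toString() {
        return name + "#" + exprId;
    }
}

final class SlotCheck {
    private SlotCheck() {}

    // Returns the required slots whose ExprId does not appear in the child's output.
    static List<Slot> missingSlots(List<Slot> required, List<Slot> childOutput) {
        Set<Integer> available = new LinkedHashSet<>();
        for (Slot s : childOutput) {
            available.add(s.exprId());
        }
        List<Slot> missing = new ArrayList<>();
        for (Slot s : required) {
            if (!available.contains(s.exprId())) {
                missing.add(s);
            }
        }
        return missing;
    }
}
```

With the sink requiring `[C1#7, col_int_undef_signed2#1]` and the project producing `[C1#7, C1#7]`, the only missing slot is `col_int_undef_signed2#1`, which matches the slot named in the error above.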

This PR keeps the aggregate output unchanged after NormalizeRepeat is applied. After the fix, the plan is:

LogicalResultSink[97] ( outputExprs=[C1#7, col_int_undef_signed2#1] )
      +--LogicalProject[94] ( distinct=false, projects=[C1#7, C1#7 as `col_int_undef_signed2`#1], excepts=[] )
         +--LogicalAggregate[93] ( groupByExpr=[C1#7, GROUPING_ID#8], outputExpr=[C1#7, GROUPING_ID#8], hasRepeat=true )
            +--LogicalRepeat ( groupingSets=[[C1#7], [C1#7]], outputExpressions=[C1#7, GROUPING_ID#8] )
               +--LogicalProject[91] ( distinct=false, projects=[col_int_undef_signed2#1 AS `C1`#7], excepts=[] )
                  +--LogicalOlapScan (  )
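
The idea behind the fix can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the actual patch: `NamedExpr`, `SlotRef`, `AliasExpr`, and `OutputPreserver.keepOriginalExprIds` are hypothetical names, and the sketch assumes the original and normalized output lists line up position by position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified expression model; not the real Nereids classes.
interface NamedExpr {
    int exprId();
    String name();
}

record SlotRef(String name, int exprId) implements NamedExpr {}

// An alias exposes `child` under the given ExprId and name.
record AliasExpr(int exprId, NamedExpr child, String name) implements NamedExpr {}

final class OutputPreserver {
    private OutputPreserver() {}

    // Wraps each normalized output whose ExprId changed in an alias that
    // restores the original ExprId and name, so parents still find their slots.
    static List<NamedExpr> keepOriginalExprIds(List<NamedExpr> original,
                                               List<NamedExpr> normalized) {
        if (original.size() != normalized.size()) {
            throw new IllegalArgumentException("output lists must have the same size");
        }
        List<NamedExpr> result = new ArrayList<>(normalized.size());
        for (int i = 0; i < normalized.size(); i++) {
            NamedExpr e = normalized.get(i);
            NamedExpr o = original.get(i);
            if (e.exprId() != o.exprId()) {
                e = new AliasExpr(o.exprId(), e, o.name());
            }
            result.add(e);
        }
        return result;
    }
}
```

For an original output of `[C1#7, col_int_undef_signed2#1]` and a normalized output of `[C1#7, C1#7]`, the second entry becomes an alias of `C1#7` carrying ExprId 1 and the name `col_int_undef_signed2`, which corresponds to the `C1#7 as col_int_undef_signed2#1` projection in the plan above.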
@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@feiniaofeiafei
Contributor Author

run buildall

Comment on lines +479 to +482
// Make the output ExprId unchanged
if (!e.getExprId().equals(originalAggOutput.get(i).getExprId())) {
    e = new Alias(originalAggOutput.get(i).getExprId(), e, originalAggOutput.get(i).getName());
}
Contributor

Why is the output changed in normalizeToUseSlotRef? Could we ensure that it is not changed in normalizeToUseSlotRef in the first place?

@feiniaofeiafei
Contributor Author

run buildall

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jun 13, 2024
Contributor

PR approved by at least one committer and no changes requested.

Contributor

PR approved by anyone and no changes requested.

@doris-robot

TPC-H: Total hot run time: 40006 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 7bdefd27021f969365b34831e4ed75b75e4eaad7, data reload: false

------ Round 1 ----------------------------------
q1	17604	4383	4365	4365
q2	2012	200	186	186
q3	10461	1087	1011	1011
q4	10185	866	817	817
q5	7492	2708	2651	2651
q6	224	138	135	135
q7	980	593	590	590
q8	9224	2081	2102	2081
q9	8791	6557	6484	6484
q10	9037	3732	3762	3732
q11	448	238	233	233
q12	454	238	229	229
q13	17776	2977	3015	2977
q14	267	223	221	221
q15	512	476	472	472
q16	504	376	378	376
q17	985	740	757	740
q18	8126	7544	7389	7389
q19	7008	1468	1465	1465
q20	686	327	314	314
q21	4956	3194	4046	3194
q22	414	344	346	344
Total cold run time: 118146 ms
Total hot run time: 40006 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4456	4250	4263	4250
q2	394	270	265	265
q3	3040	2880	2942	2880
q4	1991	1730	1757	1730
q5	5496	5564	5461	5461
q6	227	129	133	129
q7	2271	1841	1861	1841
q8	3300	3430	3441	3430
q9	8709	8755	8770	8755
q10	4187	3719	3816	3719
q11	603	518	491	491
q12	814	642	639	639
q13	16114	3202	3178	3178
q14	318	281	263	263
q15	543	480	485	480
q16	512	433	460	433
q17	1830	1513	1500	1500
q18	8009	8164	7818	7818
q19	1810	1546	1585	1546
q20	2125	1892	1825	1825
q21	7110	4822	4816	4816
q22	644	571	553	553
Total cold run time: 74503 ms
Total hot run time: 56002 ms
@doris-robot

TPC-DS: Total hot run time: 173295 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 7bdefd27021f969365b34831e4ed75b75e4eaad7, data reload: false

query1	920	381	393	381
query2	6440	2545	2315	2315
query3	6636	206	205	205
query4	18914	17189	17412	17189
query5	3576	465	452	452
query6	239	163	158	158
query7	4584	306	305	305
query8	335	295	308	295
query9	8742	2482	2452	2452
query10	570	312	292	292
query11	10627	10003	9996	9996
query12	122	91	87	87
query13	1636	378	379	378
query14	12149	7715	6355	6355
query15	231	197	189	189
query16	7740	276	278	276
query17	1877	534	515	515
query18	1523	276	266	266
query19	200	162	153	153
query20	93	84	81	81
query21	212	128	119	119
query22	4394	4109	3902	3902
query23	33861	33672	33761	33672
query24	11646	2939	2920	2920
query25	654	401	386	386
query26	1730	158	160	158
query27	2885	331	338	331
query28	7450	2182	2148	2148
query29	972	647	642	642
query30	270	158	154	154
query31	990	760	753	753
query32	99	53	54	53
query33	759	311	308	308
query34	985	488	493	488
query35	758	628	614	614
query36	1165	977	1008	977
query37	175	78	77	77
query38	2912	2811	2825	2811
query39	878	832	862	832
query40	265	128	129	128
query41	60	56	57	56
query42	121	101	115	101
query43	593	547	526	526
query44	1225	725	718	718
query45	192	160	159	159
query46	1071	738	695	695
query47	1837	1744	1726	1726
query48	364	302	297	297
query49	906	402	409	402
query50	759	395	396	395
query51	6724	6682	6626	6626
query52	102	92	99	92
query53	358	287	282	282
query54	870	448	453	448
query55	75	72	76	72
query56	278	259	257	257
query57	1145	1073	1074	1073
query58	254	241	245	241
query59	3625	3288	3145	3145
query60	284	269	271	269
query61	124	94	89	89
query62	605	435	433	433
query63	317	287	287	287
query64	9646	2233	1706	1706
query65	3247	3116	3103	3103
query66	1196	321	328	321
query67	15610	15001	14918	14918
query68	4496	540	542	540
query69	510	435	406	406
query70	1152	1143	1144	1143
query71	442	276	275	275
query72	6965	5455	5861	5455
query73	766	333	337	333
query74	5912	5524	5498	5498
query75	3403	2667	2667	2667
query76	2764	958	932	932
query77	475	289	304	289
query78	10441	9796	9849	9796
query79	2321	527	516	516
query80	935	471	462	462
query81	581	217	217	217
query82	796	106	103	103
query83	262	171	166	166
query84	238	85	85	85
query85	1850	300	265	265
query86	473	323	332	323
query87	3203	3111	3073	3073
query88	4250	2463	2456	2456
query89	473	382	370	370
query90	1789	195	195	195
query91	129	101	100	100
query92	58	50	50	50
query93	2509	508	503	503
query94	1165	191	191	191
query95	410	394	324	324
query96	598	273	268	268
query97	3220	3072	3016	3016
query98	211	203	190	190
query99	1262	853	838	838
Total cold run time: 275861 ms
Total hot run time: 173295 ms
@doris-robot

ClickBench: Total hot run time: 30.49 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 7bdefd27021f969365b34831e4ed75b75e4eaad7, data reload: false

query1	0.04	0.03	0.03
query2	0.08	0.04	0.04
query3	0.23	0.05	0.05
query4	1.67	0.07	0.07
query5	0.47	0.48	0.49
query6	1.13	0.73	0.73
query7	0.02	0.01	0.02
query8	0.05	0.04	0.04
query9	0.54	0.50	0.48
query10	0.55	0.56	0.55
query11	0.15	0.11	0.12
query12	0.15	0.13	0.12
query13	0.59	0.59	0.59
query14	0.78	0.79	0.80
query15	0.84	0.82	0.82
query16	0.35	0.37	0.36
query17	1.02	0.94	0.99
query18	0.23	0.23	0.24
query19	1.88	1.69	1.72
query20	0.02	0.01	0.01
query21	15.45	0.66	0.65
query22	4.64	7.32	1.95
query23	18.33	1.38	1.32
query24	2.13	0.23	0.22
query25	0.13	0.10	0.08
query26	0.26	0.17	0.17
query27	0.07	0.08	0.08
query28	13.17	1.02	1.00
query29	12.59	3.33	3.29
query30	0.26	0.07	0.06
query31	2.84	0.40	0.38
query32	3.27	0.47	0.47
query33	2.87	2.87	2.98
query34	17.08	4.42	4.36
query35	4.47	4.41	4.47
query36	0.65	0.46	0.47
query37	0.18	0.16	0.15
query38	0.14	0.14	0.15
query39	0.04	0.04	0.03
query40	0.18	0.13	0.14
query41	0.10	0.05	0.04
query42	0.05	0.05	0.05
query43	0.04	0.04	0.04
Total cold run time: 109.73 s
Total hot run time: 30.49 s
@feiniaofeiafei
Contributor Author

run feut

@starocean999 starocean999 merged commit 9a125d3 into apache:master Jun 13, 2024
26 of 29 checks passed
feiniaofeiafei added a commit to feiniaofeiafei/doris that referenced this pull request Jun 17, 2024
Co-authored-by: feiniaofeiafei <moailing@selectdb.com>
feiniaofeiafei pushed a commit to feiniaofeiafei/doris that referenced this pull request Jun 17, 2024
feiniaofeiafei pushed a commit to feiniaofeiafei/doris that referenced this pull request Jun 17, 2024
dataroaring pushed a commit that referenced this pull request Jun 17, 2024
Co-authored-by: feiniaofeiafei <moailing@selectdb.com>
morrySnow pushed a commit that referenced this pull request Jun 19, 2024
cherry-pick #36207 to branch-2.0

morningman pushed a commit that referenced this pull request Jun 19, 2024
cherry-pick #36207 to branch-2.1

Co-authored-by: feiniaofeiafei <moailing@selectdb.com>
Labels
approved Indicates a PR has been approved by one committer. dev/2.0.12-merged dev/2.1.4-merged reviewed
5 participants