Returning total number of rows in query

  • Oftentimes I find myself wanting the total number of rows returned by a query even though I may only display 50 or so per page. Instead of doing this in multiple queries, I have been doing it like so:



    SELECT first_name, 
    last_name,
    (SELECT count(1) FROM sandbox.PEOPLE WHERE trunc(birthday) = trunc(sysdate) ) as totalRows
    FROM sandbox.PEOPLE
    WHERE trunc(birthday) = trunc(sysdate);


    It has been recommended to me to do this:



    SELECT first_name, 
    last_name,
    count(*) over () totalRows
    FROM sandbox.PEOPLE
    WHERE trunc(birthday) = trunc(sysdate);


    I am mainly asking which is better performance-wise and, if performance is a wash, whether this really improves the readability of the SQL. It is certainly cleaner and easier to write.
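    As a quick sanity check that the two forms agree, here is a minimal sketch using SQLite as a stand-in for Oracle (window functions need SQLite 3.25+; the `trunc(birthday) = trunc(sysdate)` predicate is replaced with a literal date, and the table and data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT, birthday TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("A", "A", "2024-01-01"), ("B", "B", "2024-01-01"), ("C", "C", "2024-06-15")],
)

# Approach 1: a scalar subquery repeats the filtered count on every row.
sub = conn.execute("""
    SELECT first_name, last_name,
           (SELECT count(1) FROM people WHERE birthday = '2024-01-01') AS totalRows
    FROM people
    WHERE birthday = '2024-01-01'
    ORDER BY first_name
""").fetchall()

# Approach 2: an analytic count over the already-filtered result set.
win = conn.execute("""
    SELECT first_name, last_name,
           count(*) OVER () AS totalRows
    FROM people
    WHERE birthday = '2024-01-01'
    ORDER BY first_name
""").fetchall()

print(sub)
print(win)
```

    Both queries return the same two rows, each carrying totalRows = 2.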


    Do you have a specific question?

    Really don't think this belongs here. Probably better on SO.

    @Mark: Maybe Programmers.SE? There's nothing wrong with the code. I assume @Scott wants to know which is better?

    @Jeremy: programmers makes sense to me.

    The second query looks better; performance-wise I expect it's a wash. How are you doing your pagination?

    Many people might find the analytic functions to be esoteric, so from a maintainability perspective it's a strike against the second option. In any event, doesn't it seem cheesy to repeat the same figure in every row? How about issuing two queries within the same transaction? If you have an index on birthday the count might be satisfied without needing to go to the table.

  • The latter query will be much more efficient: it only requires hitting the table once. You can do a quick test yourself to confirm this.



    I'll create a simple two-column table with 1 million rows where the second column is one of 10 distinct values.



    SQL> create table t (
    2 col1 number,
    3 col2 number
    4 );

    Table created.

    SQL> insert into t
    2 select level, mod(level,10)
    3 from dual
    4 connect by level <= 1000000;

    1000000 rows created.


    Now, I'll run two different queries that retrieve 10% of the data. I've set SQL*Plus not to bother displaying the data, but to display the query plan and the basic execution statistics. With the first query, note that the query plan shows that Oracle has to access the table twice and then do a sort and aggregate. The query does ~10,000 consistent gets, which is a measure of the amount of logical I/O being done. (Note that this is independent of whether the data is cached, so it is a much more stable measure: if you run the same query many times, the consistent gets figure will fluctuate very little.)



    SQL> set autotrace traceonly;
    SQL> select col1
    2 ,(select count(*) from t where col2=3)
    3 from t
    4 where col2 = 3;

    100000 rows selected.


    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 3335345748

    ---------------------------------------------------------------------------
    | Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
    ---------------------------------------------------------------------------
    |   0 | SELECT STATEMENT    |      | 85706 | 2176K |   525   (3)| 00:00:07 |
    |   1 |  SORT AGGREGATE     |      |     1 |    13 |            |          |
    |*  2 |   TABLE ACCESS FULL | T    | 85706 | 1088K |   525   (3)| 00:00:07 |
    |*  3 |  TABLE ACCESS FULL  | T    | 85706 | 2176K |   525   (3)| 00:00:07 |
    ---------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------

    2 - filter("COL2"=3)
    3 - filter("COL2"=3)

    Note
    -----
    - dynamic sampling used for this statement (level=2)


    Statistics
    ----------------------------------------------------------
    32 recursive calls
    1 db block gets
    10465 consistent gets
    0 physical reads
    176 redo size
    2219528 bytes sent via SQL*Net to client
    73850 bytes received via SQL*Net from client
    6668 SQL*Net roundtrips to/from client
    0 sorts (memory)
    0 sorts (disk)
    100000 rows processed


    On the other hand, with the analytic function approach, the query plan shows that we only have to hit the table once. And we only have to do ~1,900 consistent gets, less than 20% of the logical I/O that the first query had to do.



    SQL> select col1,
    2 count(*) over ()
    3 from t
    4 where col2 = 3;

    100000 rows selected.


    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 2291049666

    ---------------------------------------------------------------------------
    | Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
    ---------------------------------------------------------------------------
    |   0 | SELECT STATEMENT    |      | 85706 | 2176K |   525   (3)| 00:00:07 |
    |   1 |  WINDOW BUFFER      |      | 85706 | 2176K |   525   (3)| 00:00:07 |
    |*  2 |   TABLE ACCESS FULL | T    | 85706 | 2176K |   525   (3)| 00:00:07 |
    ---------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------

    2 - filter("COL2"=3)

    Note
    -----
    - dynamic sampling used for this statement (level=2)


    Statistics
    ----------------------------------------------------------
    4 recursive calls
    0 db block gets
    1892 consistent gets
    0 physical reads
    0 redo size
    2219510 bytes sent via SQL*Net to client
    73850 bytes received via SQL*Net from client
    6668 SQL*Net roundtrips to/from client
    1 sorts (memory)
    0 sorts (disk)
    100000 rows processed


    Now, to be fair, you probably won't cut out 80% of your consistent gets by moving to the analytic function approach with this particular query, because it is likely that far fewer than 10% of the rows in your PEOPLE table have a birth date of today. The fewer rows you return, the smaller the performance difference will be.



    Since this is Code Review: the analytic function approach is also much easier to maintain over time, because it doesn't violate the DRY principle. You no longer have to remember to make every change to the main query's predicates in the inline subquery as well.
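    For the pagination use case from the question, the analytic count composes cleanly with limit/offset-style paging, because the window function is evaluated over the full filtered result set before the page is cut. A minimal sketch, again using SQLite (3.25+) as a stand-in for Oracle, with made-up table and data (in Oracle you'd page with OFFSET ... FETCH or ROW_NUMBER instead of LIMIT/OFFSET):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER, col2 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, i % 10) for i in range(1, 1001)])

# count(*) OVER () is computed over all 100 rows matching col2 = 3;
# LIMIT/OFFSET are applied afterwards, so every page carries the full total.
page_size, page = 50, 0
rows = conn.execute("""
    SELECT col1,
           count(*) OVER () AS totalRows
    FROM t
    WHERE col2 = 3
    ORDER BY col1
    LIMIT ? OFFSET ?
""", (page_size, page * page_size)).fetchall()

print(len(rows), rows[0])  # 50 rows on the page; each carries totalRows = 100
```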


    If the total data set is very large and an estimate of the total number of rows is acceptable, then a subquery with a SAMPLE clause might be handy: "select count(*) * 100 from t sample(1) where col2 = 3".
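    The sampling idea can be sketched outside Oracle, too. SQLite has no SAMPLE clause, so the sketch below approximates a 1% sample with a random() filter and scales the sampled count by 100; the table and data are made up, and the estimate is naturally noisy:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER, col2 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, i % 10) for i in range(1, 100001)])

# Exact count for reference: 10,000 of the 100,000 rows have col2 = 3.
exact = conn.execute("SELECT count(*) FROM t WHERE col2 = 3").fetchone()[0]

# Keep each row with probability ~1/100, then scale the count back up.
estimate = conn.execute("""
    SELECT count(*) * 100
    FROM t
    WHERE abs(random()) % 100 = 0
      AND col2 = 3
""").fetchone()[0]

print(exact, estimate)  # estimate lands near 10,000, give or take sampling noise
```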

Licensed under CC-BY-SA with attribution

