How to get the last not-null value in an ordered column of a huge table? Announcing the arrival of Valued Associate #679: Cesar Manara Planned maintenance scheduled April 17/18, 2019 at 00:00UTC (8:00pm US/Eastern)Quickly change NULL column to NOT NULLTSQL change column constraint between null and not nullHow to get the first or last row in ordered result set, depending by column value?Can I use SPARSE somehow on a non-nullable bit column with mostly false values?Columnstore Aggregate Pushdown doesn't work for float/real data typesHow to handle null value number DataWarehouseHow to get 0 in the null value fieldsHOW to work with NULL in a NOT NULL column?How to select the set of last non-NULL values per column over a group?Select last non-null value for given partition - Postgres 10

Fishing simulator

How to politely respond to generic emails requesting a PhD/job in my lab? Without wasting too much time

Need a suitable toxic chemical for a murder plot in my novel

Limit for e and 1/e

Did the new image of black hole confirm the general theory of relativity?

Is there folklore associating late breastfeeding with low intelligence and/or gullibility?

Stopping real property loss from eroding embankment

Why does tar appear to skip file contents when output file is /dev/null?

Problem when applying foreach loop

How to say that you spent the night with someone, you were only sleeping and nothing else?

Is above average number of years spent on PhD considered a red flag in future academia or industry positions?

Can I throw a longsword at someone?

How do you clear the ApexPages.getMessages() collection in a test?

Stop battery usage [Ubuntu 18]

What items from the Roman-age tech-level could be used to deter all creatures from entering a small area?

What is the order of Mitzvot in Rambam's Sefer Hamitzvot?

What's the point in a preamp?

Can a zero nonce be safely used with AES-GCM if the key is random and never used again?

What LEGO pieces have "real-world" functionality?

I'm thinking of a number

I'm having difficulty getting my players to do stuff in a sandbox campaign

Why is there no army of Iron-Mans in the MCU?

Windows 10: How to Lock (not sleep) laptop on lid close?

How is simplicity better than precision and clarity in prose?



How to get the last not-null value in an ordered column of a huge table?



Announcing the arrival of Valued Associate #679: Cesar Manara
Planned maintenance scheduled April 17/18, 2019 at 00:00UTC (8:00pm US/Eastern)Quickly change NULL column to NOT NULLTSQL change column constraint between null and not nullHow to get the first or last row in ordered result set, depending by column value?Can I use SPARSE somehow on a non-nullable bit column with mostly false values?Columnstore Aggregate Pushdown doesn't work for float/real data typesHow to handle null value number DataWarehouseHow to get 0 in the null value fieldsHOW to work with NULL in a NOT NULL column?How to select the set of last non-NULL values per column over a group?Select last non-null value for given partition - Postgres 10



.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;








9















I have the following input:



 id | value 
----+-------
1 | 136
2 | NULL
3 | 650
4 | NULL
5 | NULL
6 | NULL
7 | 954
8 | NULL
9 | 104
10 | NULL


I expect the following result:



 id | value 
----+-------
1 | 136
2 | 136
3 | 650
4 | 650
5 | 650
6 | 650
7 | 954
8 | 954
9 | 104
10 | 104


The trivial solution would be join the tables with a < relation, and then selecting the MAX value in a GROUP BY:



WITH tmp AS (
SELECT t2.id, MAX(t1.id) AS lastKnownId
FROM t t1, t t2
WHERE
t1.value IS NOT NULL
AND
t2.id >= t1.id
GROUP BY t2.id
)
SELECT
tmp.id, t.value
FROM t, tmp
WHERE t.id = tmp.lastKnownId;


However, the trivial execution of this code would create internally the square of the count of the rows of the input table ( O(n^2) ). I expected t-sql to optimize it out - on a block/record level, the task to do is very easy and linear, essentially a for loop ( O(n) ).



However, on my experiments, the latest MS SQL 2016 can't optimize this query correctly, making this query impossible to execute for a large input table.



Furthermore, the query has to run quickly, making a similarly easy (but very different) cursor-based solution infeasible.



Using some memory-backed temporary table could be a good compromise, but I am not sure if it can be run significantly quicker, considered that my example query using subqueries didn't work.



I am also thinking on to dig out some windowing function from the t-sql docs, what could be tricked to do what I want. For example, cumulative sum is doing some very similar, but I couldn't trick it to give the latest non-null element, and not the sum of the elements before.



The ideal solution would be a quick query without procedural code or temporary tables. Alternatively, also a solution with temporary tables is okay, but iterating the table procedurally is not.










share|improve this question






























    9















    I have the following input:



     id | value 
    ----+-------
    1 | 136
    2 | NULL
    3 | 650
    4 | NULL
    5 | NULL
    6 | NULL
    7 | 954
    8 | NULL
    9 | 104
    10 | NULL


    I expect the following result:



     id | value 
    ----+-------
    1 | 136
    2 | 136
    3 | 650
    4 | 650
    5 | 650
    6 | 650
    7 | 954
    8 | 954
    9 | 104
    10 | 104


    The trivial solution would be join the tables with a < relation, and then selecting the MAX value in a GROUP BY:



    WITH tmp AS (
    SELECT t2.id, MAX(t1.id) AS lastKnownId
    FROM t t1, t t2
    WHERE
    t1.value IS NOT NULL
    AND
    t2.id >= t1.id
    GROUP BY t2.id
    )
    SELECT
    tmp.id, t.value
    FROM t, tmp
    WHERE t.id = tmp.lastKnownId;


    However, the trivial execution of this code would create internally the square of the count of the rows of the input table ( O(n^2) ). I expected t-sql to optimize it out - on a block/record level, the task to do is very easy and linear, essentially a for loop ( O(n) ).



    However, on my experiments, the latest MS SQL 2016 can't optimize this query correctly, making this query impossible to execute for a large input table.



    Furthermore, the query has to run quickly, making a similarly easy (but very different) cursor-based solution infeasible.



    Using some memory-backed temporary table could be a good compromise, but I am not sure if it can be run significantly quicker, considered that my example query using subqueries didn't work.



    I am also thinking on to dig out some windowing function from the t-sql docs, what could be tricked to do what I want. For example, cumulative sum is doing some very similar, but I couldn't trick it to give the latest non-null element, and not the sum of the elements before.



    The ideal solution would be a quick query without procedural code or temporary tables. Alternatively, also a solution with temporary tables is okay, but iterating the table procedurally is not.










    share|improve this question


























      9












      9








      9


      1






      I have the following input:



       id | value 
      ----+-------
      1 | 136
      2 | NULL
      3 | 650
      4 | NULL
      5 | NULL
      6 | NULL
      7 | 954
      8 | NULL
      9 | 104
      10 | NULL


      I expect the following result:



       id | value 
      ----+-------
      1 | 136
      2 | 136
      3 | 650
      4 | 650
      5 | 650
      6 | 650
      7 | 954
      8 | 954
      9 | 104
      10 | 104


      The trivial solution would be join the tables with a < relation, and then selecting the MAX value in a GROUP BY:



      WITH tmp AS (
      SELECT t2.id, MAX(t1.id) AS lastKnownId
      FROM t t1, t t2
      WHERE
      t1.value IS NOT NULL
      AND
      t2.id >= t1.id
      GROUP BY t2.id
      )
      SELECT
      tmp.id, t.value
      FROM t, tmp
      WHERE t.id = tmp.lastKnownId;


      However, the trivial execution of this code would create internally the square of the count of the rows of the input table ( O(n^2) ). I expected t-sql to optimize it out - on a block/record level, the task to do is very easy and linear, essentially a for loop ( O(n) ).



      However, on my experiments, the latest MS SQL 2016 can't optimize this query correctly, making this query impossible to execute for a large input table.



      Furthermore, the query has to run quickly, making a similarly easy (but very different) cursor-based solution infeasible.



      Using some memory-backed temporary table could be a good compromise, but I am not sure if it can be run significantly quicker, considered that my example query using subqueries didn't work.



      I am also thinking on to dig out some windowing function from the t-sql docs, what could be tricked to do what I want. For example, cumulative sum is doing some very similar, but I couldn't trick it to give the latest non-null element, and not the sum of the elements before.



      The ideal solution would be a quick query without procedural code or temporary tables. Alternatively, also a solution with temporary tables is okay, but iterating the table procedurally is not.










      share|improve this question
















      I have the following input:



       id | value 
      ----+-------
      1 | 136
      2 | NULL
      3 | 650
      4 | NULL
      5 | NULL
      6 | NULL
      7 | 954
      8 | NULL
      9 | 104
      10 | NULL


      I expect the following result:



       id | value 
      ----+-------
      1 | 136
      2 | 136
      3 | 650
      4 | 650
      5 | 650
      6 | 650
      7 | 954
      8 | 954
      9 | 104
      10 | 104


      The trivial solution would be join the tables with a < relation, and then selecting the MAX value in a GROUP BY:



      WITH tmp AS (
      SELECT t2.id, MAX(t1.id) AS lastKnownId
      FROM t t1, t t2
      WHERE
      t1.value IS NOT NULL
      AND
      t2.id >= t1.id
      GROUP BY t2.id
      )
      SELECT
      tmp.id, t.value
      FROM t, tmp
      WHERE t.id = tmp.lastKnownId;


      However, the trivial execution of this code would create internally the square of the count of the rows of the input table ( O(n^2) ). I expected t-sql to optimize it out - on a block/record level, the task to do is very easy and linear, essentially a for loop ( O(n) ).



      However, on my experiments, the latest MS SQL 2016 can't optimize this query correctly, making this query impossible to execute for a large input table.



      Furthermore, the query has to run quickly, making a similarly easy (but very different) cursor-based solution infeasible.



      Using some memory-backed temporary table could be a good compromise, but I am not sure if it can be run significantly quicker, considered that my example query using subqueries didn't work.



      I am also thinking on to dig out some windowing function from the t-sql docs, what could be tricked to do what I want. For example, cumulative sum is doing some very similar, but I couldn't trick it to give the latest non-null element, and not the sum of the elements before.



      The ideal solution would be a quick query without procedural code or temporary tables. Alternatively, also a solution with temporary tables is okay, but iterating the table procedurally is not.







      sql-server t-sql null window-functions running-totals






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Apr 3 at 3:17









      Paul White

      54.2k14288461




      54.2k14288461










      asked Mar 31 at 17:19









      peterhpeterh

      1,13341532




      1,13341532




















          3 Answers
          3






          active

          oldest

          votes


















          11














          A common solution to this type of problem is given by Itzik Ben-Gan in his article The Last non NULL Puzzle:



          DROP TABLE IF EXISTS dbo.Example;

          CREATE TABLE dbo.Example
          (
          id integer PRIMARY KEY,
          val integer NULL
          );

          INSERT dbo.Example
          (id, val)
          VALUES
          (1, 136),
          (2, NULL),
          (3, 650),
          (4, NULL),
          (5, NULL),
          (6, NULL),
          (7, 954),
          (8, NULL),
          (9, 104),
          (10, NULL);

          SELECT
          E.id,
          E.val,
          lastval =
          CAST(
          SUBSTRING(
          MAX(CAST(E.id AS binary(4)) + CAST(E.val AS binary(4))) OVER (
          ORDER BY E.id
          ROWS UNBOUNDED PRECEDING),
          5, 4)
          AS integer)
          FROM dbo.Example AS E
          ORDER BY
          E.id;


          Demo: db<>fiddle






          share|improve this answer






























            9















            I expected t-sql to optimize it out - on a block/record level, the
            task to do is very easy and linear, essentially a for loop ( O(n) ).




            That's not the query that you wrote. It may not be equivalent to the query that you wrote depending on some otherwise minor detail of the table schema. You're expecting too much from the query optimizer.



            With the right indexing you can get the algorithm that you seek through the following T-SQL:



            SELECT t1.id, ca.[VALUE] 
            FROM dbo.[BIG_TABLE(FOR_U)] t1
            CROSS APPLY (
            SELECT TOP (1) [VALUE]
            FROM dbo.[BIG_TABLE(FOR_U)] t2
            WHERE t2.ID <= t1.ID AND t2.[VALUE] IS NOT NULL
            ORDER BY t2.ID DESC
            ) ca; --ORDER BY t1.ID ASC


            For each row, the query processor traverses the index backwards and stops when it finds a row with a non null value for [VALUE]. On my machine this finishes in about 90 seconds for 100 million rows in the source table. The query runs longer than necessary because some amount of time is wasted on the client discarding all of those rows.



            It's not clear to me if you need ordered results or what you plan on doing with such a large result set. The query can be adjusted to meet the actual scenario. The biggest advantage of this approach is that it does not require a sort in the query plan. That can help for larger result sets. One disadvantage is that performance will not be optimal if there are a lot of NULLs in the table because many rows will be read from the index and discarded. You should be able to improve performance with a filtered index that excludes NULLs for that case.



            Sample data for the test:



            DROP TABLE IF EXISTS #t;

            CREATE TABLE #t (
            ID BIGINT NOT NULL
            );

            INSERT INTO #t WITH (TABLOCK)
            SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
            FROM master..spt_values t1
            CROSS JOIN master..spt_values t2
            OPTION (MAXDOP 1);

            DROP TABLE IF EXISTS dbo.[BIG_TABLE(FOR_U)];

            CREATE TABLE dbo.[BIG_TABLE(FOR_U)] (
            ID BIGINT NOT NULL,
            [VALUE] BIGINT NULL
            );

            INSERT INTO dbo.[BIG_TABLE(FOR_U)] WITH (TABLOCK)
            SELECT 10000 * t1.ID + t2.ID, CASE WHEN (t1.ID + t2.ID) % 3 = 1 THEN t2.ID ELSE NULL END
            FROM #t t1
            CROSS JOIN #t t2;

            CREATE UNIQUE CLUSTERED INDEX ADD_ORDERING ON dbo.[BIG_TABLE(FOR_U)] (ID);





            share|improve this answer























            • Thanks the answer! I have missing data in my input, I need to approximate them based on the last known value. The ratio of the not-NULLs is between 0.1% and 1%, I have around 100million records, the latest recent hardware, and mssql 2016.

              – peterh
              Apr 1 at 2:57











            • I used a simple naive queries, without any (indexed) temporary tables, with WITH ... AS constructs. I am happy that ms sql can optimize it if it got enough advices. I am experimenting on.

              – peterh
              Apr 1 at 13:04











            • @peterh: To restate, in your real data about 99% of the rows have a NULLfor [value]?

              – Joe Obbish
              Apr 1 at 22:49











            • Yes. Maybe 99.9. In the real data, also the "value"s are growing with the ids, if it helps.

              – peterh
              Apr 2 at 6:11


















            7














            One method, by using OVER() and MAX() and COUNT() based on this source could be:



            SELECT ID, MAX(value) OVER (PARTITION BY Value2) as value
            FROM
            (
            SELECT ID, value
            ,COUNT(value) OVER (ORDER BY ID) AS Value2
            FROM dbo.HugeTable
            ) a
            ORDER BY ID;


            Result



            Id UpdatedValue
            1 136
            2 136
            3 650
            4 650
            5 650
            6 650
            7 954
            8 954
            9 104
            10 104



            Another method based on this source, closely related to the first example



            ;WITH CTE As 
            (
            SELECT value,
            Id,
            COUNT(value)
            OVER(ORDER BY Id) As Value2
            FROM dbo.HugeTable
            ),

            CTE2 AS (
            SELECT Id,
            value,
            First_Value(value)
            OVER( PARTITION BY Value2
            ORDER BY Id) As UpdatedValue
            FROM CTE
            )
            SELECT Id,UpdatedValue
            FROM CTE2;





            share|improve this answer




















            • 3





              Consider adding details about how these approaches perform with a "huge table".

              – Joe Obbish
              Mar 31 at 22:32











            • Thanks! This is what I finally did, although not COUNT()... OVER, instead MAX(). It became much faster, but it is still slow. Be back soon.

              – peterh
              Apr 1 at 3:02












            Your Answer








            StackExchange.ready(function()
            var channelOptions =
            tags: "".split(" "),
            id: "182"
            ;
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function()
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled)
            StackExchange.using("snippets", function()
            createEditor();
            );

            else
            createEditor();

            );

            function createEditor()
            StackExchange.prepareEditor(
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: false,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: null,
            bindNavPrevention: true,
            postfix: "",
            imageUploader:
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            ,
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            );



            );













            draft saved

            draft discarded


















            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f233610%2fhow-to-get-the-last-not-null-value-in-an-ordered-column-of-a-huge-table%23new-answer', 'question_page');

            );

            Post as a guest















            Required, but never shown

























            3 Answers
            3






            active

            oldest

            votes








            3 Answers
            3






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            11














            A common solution to this type of problem is given by Itzik Ben-Gan in his article The Last non NULL Puzzle:



            DROP TABLE IF EXISTS dbo.Example;

            CREATE TABLE dbo.Example
            (
            id integer PRIMARY KEY,
            val integer NULL
            );

            INSERT dbo.Example
            (id, val)
            VALUES
            (1, 136),
            (2, NULL),
            (3, 650),
            (4, NULL),
            (5, NULL),
            (6, NULL),
            (7, 954),
            (8, NULL),
            (9, 104),
            (10, NULL);

            SELECT
            E.id,
            E.val,
            lastval =
            CAST(
            SUBSTRING(
            MAX(CAST(E.id AS binary(4)) + CAST(E.val AS binary(4))) OVER (
            ORDER BY E.id
            ROWS UNBOUNDED PRECEDING),
            5, 4)
            AS integer)
            FROM dbo.Example AS E
            ORDER BY
            E.id;


            Demo: db<>fiddle






            share|improve this answer



























              11














              A common solution to this type of problem is given by Itzik Ben-Gan in his article The Last non NULL Puzzle:



              DROP TABLE IF EXISTS dbo.Example;

              CREATE TABLE dbo.Example
              (
              id integer PRIMARY KEY,
              val integer NULL
              );

              INSERT dbo.Example
              (id, val)
              VALUES
              (1, 136),
              (2, NULL),
              (3, 650),
              (4, NULL),
              (5, NULL),
              (6, NULL),
              (7, 954),
              (8, NULL),
              (9, 104),
              (10, NULL);

              SELECT
              E.id,
              E.val,
              lastval =
              CAST(
              SUBSTRING(
              MAX(CAST(E.id AS binary(4)) + CAST(E.val AS binary(4))) OVER (
              ORDER BY E.id
              ROWS UNBOUNDED PRECEDING),
              5, 4)
              AS integer)
              FROM dbo.Example AS E
              ORDER BY
              E.id;


              Demo: db<>fiddle






              share|improve this answer

























                11












                11








                11







                A common solution to this type of problem is given by Itzik Ben-Gan in his article The Last non NULL Puzzle:



                DROP TABLE IF EXISTS dbo.Example;

                CREATE TABLE dbo.Example
                (
                id integer PRIMARY KEY,
                val integer NULL
                );

                INSERT dbo.Example
                (id, val)
                VALUES
                (1, 136),
                (2, NULL),
                (3, 650),
                (4, NULL),
                (5, NULL),
                (6, NULL),
                (7, 954),
                (8, NULL),
                (9, 104),
                (10, NULL);

                SELECT
                E.id,
                E.val,
                lastval =
                CAST(
                SUBSTRING(
                MAX(CAST(E.id AS binary(4)) + CAST(E.val AS binary(4))) OVER (
                ORDER BY E.id
                ROWS UNBOUNDED PRECEDING),
                5, 4)
                AS integer)
                FROM dbo.Example AS E
                ORDER BY
                E.id;


                Demo: db<>fiddle






                share|improve this answer













                A common solution to this type of problem is given by Itzik Ben-Gan in his article The Last non NULL Puzzle:



                DROP TABLE IF EXISTS dbo.Example;

                CREATE TABLE dbo.Example
                (
                id integer PRIMARY KEY,
                val integer NULL
                );

                INSERT dbo.Example
                (id, val)
                VALUES
                (1, 136),
                (2, NULL),
                (3, 650),
                (4, NULL),
                (5, NULL),
                (6, NULL),
                (7, 954),
                (8, NULL),
                (9, 104),
                (10, NULL);

                SELECT
                E.id,
                E.val,
                lastval =
                CAST(
                SUBSTRING(
                MAX(CAST(E.id AS binary(4)) + CAST(E.val AS binary(4))) OVER (
                ORDER BY E.id
                ROWS UNBOUNDED PRECEDING),
                5, 4)
                AS integer)
                FROM dbo.Example AS E
                ORDER BY
                E.id;


                Demo: db<>fiddle







                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Apr 1 at 0:30









                Paul WhitePaul White

                54.2k14288461




                54.2k14288461























                    9















                    I expected t-sql to optimize it out - on a block/record level, the
                    task to do is very easy and linear, essentially a for loop ( O(n) ).




                    That's not the query that you wrote. It may not be equivalent to the query that you wrote depending on some otherwise minor detail of the table schema. You're expecting too much from the query optimizer.



                    With the right indexing you can get the algorithm that you seek through the following T-SQL:



                    SELECT t1.id, ca.[VALUE] 
                    FROM dbo.[BIG_TABLE(FOR_U)] t1
                    CROSS APPLY (
                    SELECT TOP (1) [VALUE]
                    FROM dbo.[BIG_TABLE(FOR_U)] t2
                    WHERE t2.ID <= t1.ID AND t2.[VALUE] IS NOT NULL
                    ORDER BY t2.ID DESC
                    ) ca; --ORDER BY t1.ID ASC


                    For each row, the query processor traverses the index backwards and stops when it finds a row with a non null value for [VALUE]. On my machine this finishes in about 90 seconds for 100 million rows in the source table. The query runs longer than necessary because some amount of time is wasted on the client discarding all of those rows.



                    It's not clear to me if you need ordered results or what you plan on doing with such a large result set. The query can be adjusted to meet the actual scenario. The biggest advantage of this approach is that it does not require a sort in the query plan. That can help for larger result sets. One disadvantage is that performance will not be optimal if there are a lot of NULLs in the table because many rows will be read from the index and discarded. You should be able to improve performance with a filtered index that excludes NULLs for that case.



                    Sample data for the test:



                    DROP TABLE IF EXISTS #t;

                    CREATE TABLE #t (
                    ID BIGINT NOT NULL
                    );

                    INSERT INTO #t WITH (TABLOCK)
                    SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
                    FROM master..spt_values t1
                    CROSS JOIN master..spt_values t2
                    OPTION (MAXDOP 1);

                    DROP TABLE IF EXISTS dbo.[BIG_TABLE(FOR_U)];

                    CREATE TABLE dbo.[BIG_TABLE(FOR_U)] (
                    ID BIGINT NOT NULL,
                    [VALUE] BIGINT NULL
                    );

                    INSERT INTO dbo.[BIG_TABLE(FOR_U)] WITH (TABLOCK)
                    SELECT 10000 * t1.ID + t2.ID, CASE WHEN (t1.ID + t2.ID) % 3 = 1 THEN t2.ID ELSE NULL END
                    FROM #t t1
                    CROSS JOIN #t t2;

                    CREATE UNIQUE CLUSTERED INDEX ADD_ORDERING ON dbo.[BIG_TABLE(FOR_U)] (ID);





                    share|improve this answer























                    • Thanks the answer! I have missing data in my input, I need to approximate them based on the last known value. The ratio of the not-NULLs is between 0.1% and 1%, I have around 100million records, the latest recent hardware, and mssql 2016.

                      – peterh
                      Apr 1 at 2:57











                    • I used a simple naive queries, without any (indexed) temporary tables, with WITH ... AS constructs. I am happy that ms sql can optimize it if it got enough advices. I am experimenting on.

                      – peterh
                      Apr 1 at 13:04











                    • @peterh: To restate, in your real data about 99% of the rows have a NULLfor [value]?

                      – Joe Obbish
                      Apr 1 at 22:49











                    • Yes. Maybe 99.9. In the real data, also the "value"s are growing with the ids, if it helps.

                      – peterh
                      Apr 2 at 6:11















                    9















                    I expected t-sql to optimize it out - on a block/record level, the
                    task to do is very easy and linear, essentially a for loop ( O(n) ).




                    That's not the query that you wrote. It may not be equivalent to the query that you wrote depending on some otherwise minor detail of the table schema. You're expecting too much from the query optimizer.



                    With the right indexing you can get the algorithm that you seek through the following T-SQL:



                    SELECT t1.id, ca.[VALUE] 
                    FROM dbo.[BIG_TABLE(FOR_U)] t1
                    CROSS APPLY (
                    SELECT TOP (1) [VALUE]
                    FROM dbo.[BIG_TABLE(FOR_U)] t2
                    WHERE t2.ID <= t1.ID AND t2.[VALUE] IS NOT NULL
                    ORDER BY t2.ID DESC
                    ) ca; --ORDER BY t1.ID ASC


                    For each row, the query processor traverses the index backwards and stops when it finds a row with a non null value for [VALUE]. On my machine this finishes in about 90 seconds for 100 million rows in the source table. The query runs longer than necessary because some amount of time is wasted on the client discarding all of those rows.



                    It's not clear to me if you need ordered results or what you plan on doing with such a large result set. The query can be adjusted to meet the actual scenario. The biggest advantage of this approach is that it does not require a sort in the query plan. That can help for larger result sets. One disadvantage is that performance will not be optimal if there are a lot of NULLs in the table because many rows will be read from the index and discarded. You should be able to improve performance with a filtered index that excludes NULLs for that case.



                    Sample data for the test:



                    DROP TABLE IF EXISTS #t;

                    CREATE TABLE #t (
                    ID BIGINT NOT NULL
                    );

                    INSERT INTO #t WITH (TABLOCK)
                    SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
                    FROM master..spt_values t1
                    CROSS JOIN master..spt_values t2
                    OPTION (MAXDOP 1);

                    DROP TABLE IF EXISTS dbo.[BIG_TABLE(FOR_U)];

                    CREATE TABLE dbo.[BIG_TABLE(FOR_U)] (
                    ID BIGINT NOT NULL,
                    [VALUE] BIGINT NULL
                    );

                    INSERT INTO dbo.[BIG_TABLE(FOR_U)] WITH (TABLOCK)
                    SELECT 10000 * t1.ID + t2.ID, CASE WHEN (t1.ID + t2.ID) % 3 = 1 THEN t2.ID ELSE NULL END
                    FROM #t t1
                    CROSS JOIN #t t2;

                    CREATE UNIQUE CLUSTERED INDEX ADD_ORDERING ON dbo.[BIG_TABLE(FOR_U)] (ID);





                    share|improve this answer























                    • Thanks the answer! I have missing data in my input, I need to approximate them based on the last known value. The ratio of the not-NULLs is between 0.1% and 1%, I have around 100million records, the latest recent hardware, and mssql 2016.

                      – peterh
                      Apr 1 at 2:57











                    • I used a simple naive queries, without any (indexed) temporary tables, with WITH ... AS constructs. I am happy that ms sql can optimize it if it got enough advices. I am experimenting on.

                      – peterh
                      Apr 1 at 13:04











                    • @peterh: To restate, in your real data about 99% of the rows have a NULLfor [value]?

                      – Joe Obbish
                      Apr 1 at 22:49











                    • Yes. Maybe 99.9. In the real data, also the "value"s are growing with the ids, if it helps.

                      – peterh
                      Apr 2 at 6:11













                    9












                    9








                    9








                    I expected t-sql to optimize it out - on a block/record level, the
                    task to do is very easy and linear, essentially a for loop ( O(n) ).




                    That's not the query that you wrote. It may not be equivalent to the query that you wrote depending on some otherwise minor detail of the table schema. You're expecting too much from the query optimizer.



                    With the right indexing you can get the algorithm that you seek through the following T-SQL:



                    SELECT t1.id, ca.[VALUE] 
                    FROM dbo.[BIG_TABLE(FOR_U)] t1
                    CROSS APPLY (
                    SELECT TOP (1) [VALUE]
                    FROM dbo.[BIG_TABLE(FOR_U)] t2
                    WHERE t2.ID <= t1.ID AND t2.[VALUE] IS NOT NULL
                    ORDER BY t2.ID DESC
                    ) ca; --ORDER BY t1.ID ASC


                    For each row, the query processor traverses the index backwards and stops when it finds a row with a non null value for [VALUE]. On my machine this finishes in about 90 seconds for 100 million rows in the source table. The query runs longer than necessary because some amount of time is wasted on the client discarding all of those rows.



                    It's not clear to me if you need ordered results or what you plan on doing with such a large result set. The query can be adjusted to meet the actual scenario. The biggest advantage of this approach is that it does not require a sort in the query plan. That can help for larger result sets. One disadvantage is that performance will not be optimal if there are a lot of NULLs in the table because many rows will be read from the index and discarded. You should be able to improve performance with a filtered index that excludes NULLs for that case.



                    Sample data for the test:



                    DROP TABLE IF EXISTS #t;

                    CREATE TABLE #t (
                    ID BIGINT NOT NULL
                    );

                    INSERT INTO #t WITH (TABLOCK)
                    SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
                    FROM master..spt_values t1
                    CROSS JOIN master..spt_values t2
                    OPTION (MAXDOP 1);

                    DROP TABLE IF EXISTS dbo.[BIG_TABLE(FOR_U)];

                    CREATE TABLE dbo.[BIG_TABLE(FOR_U)] (
                    ID BIGINT NOT NULL,
                    [VALUE] BIGINT NULL
                    );

                    INSERT INTO dbo.[BIG_TABLE(FOR_U)] WITH (TABLOCK)
                    SELECT 10000 * t1.ID + t2.ID, CASE WHEN (t1.ID + t2.ID) % 3 = 1 THEN t2.ID ELSE NULL END
                    FROM #t t1
                    CROSS JOIN #t t2;

                    CREATE UNIQUE CLUSTERED INDEX ADD_ORDERING ON dbo.[BIG_TABLE(FOR_U)] (ID);





                    share|improve this answer














                    I expected t-sql to optimize it out - on a block/record level, the
                    task to do is very easy and linear, essentially a for loop ( O(n) ).




                    That's not the query that you wrote. It may not be equivalent to the query that you wrote depending on some otherwise minor detail of the table schema. You're expecting too much from the query optimizer.



                    With the right indexing you can get the algorithm that you seek through the following T-SQL:



                    SELECT t1.id, ca.[VALUE] 
                    FROM dbo.[BIG_TABLE(FOR_U)] t1
                    CROSS APPLY (
                    SELECT TOP (1) [VALUE]
                    FROM dbo.[BIG_TABLE(FOR_U)] t2
                    WHERE t2.ID <= t1.ID AND t2.[VALUE] IS NOT NULL
                    ORDER BY t2.ID DESC
                    ) ca; --ORDER BY t1.ID ASC


                    For each row, the query processor traverses the index backwards and stops when it finds a row with a non null value for [VALUE]. On my machine this finishes in about 90 seconds for 100 million rows in the source table. The query runs longer than necessary because some amount of time is wasted on the client discarding all of those rows.



                    It's not clear to me if you need ordered results or what you plan on doing with such a large result set. The query can be adjusted to meet the actual scenario. The biggest advantage of this approach is that it does not require a sort in the query plan. That can help for larger result sets. One disadvantage is that performance will not be optimal if there are a lot of NULLs in the table because many rows will be read from the index and discarded. You should be able to improve performance with a filtered index that excludes NULLs for that case.



                    Sample data for the test:



                    DROP TABLE IF EXISTS #t;

                    CREATE TABLE #t (
                    ID BIGINT NOT NULL
                    );

                    INSERT INTO #t WITH (TABLOCK)
                    SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
                    FROM master..spt_values t1
                    CROSS JOIN master..spt_values t2
                    OPTION (MAXDOP 1);

                    DROP TABLE IF EXISTS dbo.[BIG_TABLE(FOR_U)];

                    CREATE TABLE dbo.[BIG_TABLE(FOR_U)] (
                    ID BIGINT NOT NULL,
                    [VALUE] BIGINT NULL
                    );

                    INSERT INTO dbo.[BIG_TABLE(FOR_U)] WITH (TABLOCK)
                    SELECT 10000 * t1.ID + t2.ID, CASE WHEN (t1.ID + t2.ID) % 3 = 1 THEN t2.ID ELSE NULL END
                    FROM #t t1
                    CROSS JOIN #t t2;

                    CREATE UNIQUE CLUSTERED INDEX ADD_ORDERING ON dbo.[BIG_TABLE(FOR_U)] (ID);






                    share|improve this answer












                    share|improve this answer



                    share|improve this answer










                    answered Mar 31 at 22:31









                    Joe ObbishJoe Obbish

                    21.9k43291




                    21.9k43291












                    • Thanks the answer! I have missing data in my input, I need to approximate them based on the last known value. The ratio of the not-NULLs is between 0.1% and 1%, I have around 100million records, the latest recent hardware, and mssql 2016.

                      – peterh
                      Apr 1 at 2:57











                    • I used a simple naive queries, without any (indexed) temporary tables, with WITH ... AS constructs. I am happy that ms sql can optimize it if it got enough advices. I am experimenting on.

                      – peterh
                      Apr 1 at 13:04











                    • @peterh: To restate, in your real data about 99% of the rows have a NULLfor [value]?

                      – Joe Obbish
                      Apr 1 at 22:49











                    • Yes. Maybe 99.9. In the real data, also the "value"s are growing with the ids, if it helps.

                      – peterh
                      Apr 2 at 6:11

















                    • Thanks the answer! I have missing data in my input, I need to approximate them based on the last known value. The ratio of the not-NULLs is between 0.1% and 1%, I have around 100million records, the latest recent hardware, and mssql 2016.

                      – peterh
                      Apr 1 at 2:57











                    • I used a simple naive queries, without any (indexed) temporary tables, with WITH ... AS constructs. I am happy that ms sql can optimize it if it got enough advices. I am experimenting on.

                      – peterh
                      Apr 1 at 13:04











                    • @peterh: To restate, in your real data about 99% of the rows have a NULLfor [value]?

                      – Joe Obbish
                      Apr 1 at 22:49











                    • Yes. Maybe 99.9. In the real data, also the "value"s are growing with the ids, if it helps.

                      – peterh
                      Apr 2 at 6:11
















                    Thanks the answer! I have missing data in my input, I need to approximate them based on the last known value. The ratio of the not-NULLs is between 0.1% and 1%, I have around 100million records, the latest recent hardware, and mssql 2016.

                    – peterh
                    Apr 1 at 2:57





                    Thanks the answer! I have missing data in my input, I need to approximate them based on the last known value. The ratio of the not-NULLs is between 0.1% and 1%, I have around 100million records, the latest recent hardware, and mssql 2016.

                    – peterh
                    Apr 1 at 2:57













                    I used a simple naive queries, without any (indexed) temporary tables, with WITH ... AS constructs. I am happy that ms sql can optimize it if it got enough advices. I am experimenting on.

                    – peterh
                    Apr 1 at 13:04





                    I used a simple naive queries, without any (indexed) temporary tables, with WITH ... AS constructs. I am happy that ms sql can optimize it if it got enough advices. I am experimenting on.

                    – peterh
                    Apr 1 at 13:04













                    @peterh: To restate, in your real data about 99% of the rows have a NULLfor [value]?

                    – Joe Obbish
                    Apr 1 at 22:49





                    @peterh: To restate, in your real data about 99% of the rows have a NULLfor [value]?

                    – Joe Obbish
                    Apr 1 at 22:49













                    Yes. Maybe 99.9. In the real data, also the "value"s are growing with the ids, if it helps.

                    – peterh
                    Apr 2 at 6:11





                    Yes. Maybe 99.9. In the real data, also the "value"s are growing with the ids, if it helps.

                    – peterh
                    Apr 2 at 6:11











                    7














                    One method, by using OVER() and MAX() and COUNT() based on this source could be:



                    SELECT ID, MAX(value) OVER (PARTITION BY Value2) as value
                    FROM
                    (
                    SELECT ID, value
                    ,COUNT(value) OVER (ORDER BY ID) AS Value2
                    FROM dbo.HugeTable
                    ) a
                    ORDER BY ID;


                    Result



                    Id UpdatedValue
                    1 136
                    2 136
                    3 650
                    4 650
                    5 650
                    6 650
                    7 954
                    8 954
                    9 104
                    10 104



                    Another method based on this source, closely related to the first example



                    ;WITH CTE As 
                    (
                    SELECT value,
                    Id,
                    COUNT(value)
                    OVER(ORDER BY Id) As Value2
                    FROM dbo.HugeTable
                    ),

                    CTE2 AS (
                    SELECT Id,
                    value,
                    First_Value(value)
                    OVER( PARTITION BY Value2
                    ORDER BY Id) As UpdatedValue
                    FROM CTE
                    )
                    SELECT Id,UpdatedValue
                    FROM CTE2;





                    share|improve this answer




















                    • 3





                      Consider adding details about how these approaches perform with a "huge table".

                      – Joe Obbish
                      Mar 31 at 22:32











                    • Thanks! This is what I finally did, although not COUNT()... OVER, instead MAX(). It became much faster, but it is still slow. Be back soon.

                      – peterh
                      Apr 1 at 3:02
















                    7














                    One method, by using OVER() and MAX() and COUNT() based on this source could be:



                    SELECT ID, MAX(value) OVER (PARTITION BY Value2) as value
                    FROM
                    (
                    SELECT ID, value
                    ,COUNT(value) OVER (ORDER BY ID) AS Value2
                    FROM dbo.HugeTable
                    ) a
                    ORDER BY ID;


                    Result



                    Id UpdatedValue
                    1 136
                    2 136
                    3 650
                    4 650
                    5 650
                    6 650
                    7 954
                    8 954
                    9 104
                    10 104



                    Another method based on this source, closely related to the first example



                    ;WITH CTE As 
                    (
                    SELECT value,
                    Id,
                    COUNT(value)
                    OVER(ORDER BY Id) As Value2
                    FROM dbo.HugeTable
                    ),

                    CTE2 AS (
                    SELECT Id,
                    value,
                    First_Value(value)
                    OVER( PARTITION BY Value2
                    ORDER BY Id) As UpdatedValue
                    FROM CTE
                    )
                    SELECT Id,UpdatedValue
                    FROM CTE2;





                    share|improve this answer




















                    • 3





                      Consider adding details about how these approaches perform with a "huge table".

                      – Joe Obbish
                      Mar 31 at 22:32











                    • Thanks! This is what I finally did, although not COUNT()... OVER, instead MAX(). It became much faster, but it is still slow. Be back soon.

                      – peterh
                      Apr 1 at 3:02














                    7












                    7








                    7







                    One method, by using OVER() and MAX() and COUNT() based on this source could be:



                    SELECT ID, MAX(value) OVER (PARTITION BY Value2) as value
                    FROM
                    (
                    SELECT ID, value
                    ,COUNT(value) OVER (ORDER BY ID) AS Value2
                    FROM dbo.HugeTable
                    ) a
                    ORDER BY ID;


                    Result



                    Id UpdatedValue
                    1 136
                    2 136
                    3 650
                    4 650
                    5 650
                    6 650
                    7 954
                    8 954
                    9 104
                    10 104



                    Another method based on this source, closely related to the first example



                    ;WITH CTE As 
                    (
                    SELECT value,
                    Id,
                    COUNT(value)
                    OVER(ORDER BY Id) As Value2
                    FROM dbo.HugeTable
                    ),

                    CTE2 AS (
                    SELECT Id,
                    value,
                    First_Value(value)
                    OVER( PARTITION BY Value2
                    ORDER BY Id) As UpdatedValue
                    FROM CTE
                    )
                    SELECT Id,UpdatedValue
                    FROM CTE2;





                    share|improve this answer















                    One method, by using OVER() and MAX() and COUNT() based on this source could be:



                    SELECT ID, MAX(value) OVER (PARTITION BY Value2) as value
                    FROM
                    (
                    SELECT ID, value
                    ,COUNT(value) OVER (ORDER BY ID) AS Value2
                    FROM dbo.HugeTable
                    ) a
                    ORDER BY ID;


                    Result



                    Id UpdatedValue
                    1 136
                    2 136
                    3 650
                    4 650
                    5 650
                    6 650
                    7 954
                    8 954
                    9 104
                    10 104



                    Another method based on this source, closely related to the first example



                    ;WITH CTE As 
                    (
                    SELECT value,
                    Id,
                    COUNT(value)
                    OVER(ORDER BY Id) As Value2
                    FROM dbo.HugeTable
                    ),

                    CTE2 AS (
                    SELECT Id,
                    value,
                    First_Value(value)
                    OVER( PARTITION BY Value2
                    ORDER BY Id) As UpdatedValue
                    FROM CTE
                    )
                    SELECT Id,UpdatedValue
                    FROM CTE2;






                    share|improve this answer














                    share|improve this answer



                    share|improve this answer








                    edited Apr 1 at 13:13

























                    answered Mar 31 at 17:54









                    Randi VertongenRandi Vertongen

                    4,8211924




                    4,8211924







                    • 3





                      Consider adding details about how these approaches perform with a "huge table".

                      – Joe Obbish
                      Mar 31 at 22:32











                    • Thanks! This is what I finally did, although not COUNT()... OVER, instead MAX(). It became much faster, but it is still slow. Be back soon.

                      – peterh
                      Apr 1 at 3:02













                    • 3





                      Consider adding details about how these approaches perform with a "huge table".

                      – Joe Obbish
                      Mar 31 at 22:32











                    • Thanks! This is what I finally did, although not COUNT()... OVER, instead MAX(). It became much faster, but it is still slow. Be back soon.

                      – peterh
                      Apr 1 at 3:02








                    3




                    3





                    Consider adding details about how these approaches perform with a "huge table".

                    – Joe Obbish
                    Mar 31 at 22:32





                    Consider adding details about how these approaches perform with a "huge table".

                    – Joe Obbish
                    Mar 31 at 22:32













                    Thanks! This is what I finally did, although not COUNT()... OVER, instead MAX(). It became much faster, but it is still slow. Be back soon.

                    – peterh
                    Apr 1 at 3:02






                    Thanks! This is what I finally did, although not COUNT()... OVER, instead MAX(). It became much faster, but it is still slow. Be back soon.

                    – peterh
                    Apr 1 at 3:02


















                    draft saved

                    draft discarded
















































                    Thanks for contributing an answer to Database Administrators Stack Exchange!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid


                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.

                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function ()
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f233610%2fhow-to-get-the-last-not-null-value-in-an-ordered-column-of-a-huge-table%23new-answer', 'question_page');

                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Triangular numbers and gcdProving sum of a set is $0 pmod n$ if $n$ is odd, or $fracn2 pmod n$ if $n$ is even?Is greatest common divisor of two numbers really their smallest linear combination?GCD, LCM RelationshipProve a set of nonnegative integers with greatest common divisor 1 and closed under addition has all but finite many nonnegative integers.all pairs of a and b in an equation containing gcdTriangular Numbers Modulo $k$ - Hit All Values?Understanding the Existence and Uniqueness of the GCDGCD and LCM with logical symbolsThe greatest common divisor of two positive integers less than 100 is equal to 3. Their least common multiple is twelve times one of the integers.Suppose that for all integers $x$, $x|a$ and $x|b$ if and only if $x|c$. Then $c = gcd(a,b)$Which is the gcd of 2 numbers which are multiplied and the result is 600000?

                    Ingelân Ynhâld Etymology | Geografy | Skiednis | Polityk en bestjoer | Ekonomy | Demografy | Kultuer | Klimaat | Sjoch ek | Keppelings om utens | Boarnen, noaten en referinsjes Navigaasjemenuwww.gov.ukOffisjele webside fan it regear fan it Feriene KeninkrykOffisjele webside fan it Britske FerkearsburoNederlânsktalige ynformaasje fan it Britske FerkearsburoOffisjele webside fan English Heritage, de organisaasje dy't him ynset foar it behâld fan it Ingelske kultuergoedYnwennertallen fan alle Britske stêden út 'e folkstelling fan 2011Notes en References, op dizze sideEngland

                    Հադիս Բովանդակություն Անվանում և նշանակություն | Դասակարգում | Աղբյուրներ | Նավարկման ցանկ