Cardinality estimation using ordered statisticsStatistical inference, estimation, conceptual troubleVariance of maximum likelihood estimator for discrete distributionWhy is the MLE a special case of the minimum contrast estimator?Expectation of the MLE $e^-frac1overlineX$How to find the estimator using random variables in statisticsCheck if the MLE is unbiased and/or consistentTheoretical variance of an unbiased estimator.Can we estimate entropy with a parametric density model as a sample average of log probabilities?moments estimation using Rayleigh distributionunbiased pool estimator of variance

The magic money tree problem

What is the offset in a seaplane's hull?

Test if tikzmark exists on same page

What is the word for reserving something for yourself before others do?

"to be prejudice towards/against someone" vs "to be prejudiced against/towards someone"

Dragon forelimb placement

Modeling an IPv4 Address

Can I ask the recruiters in my resume to put the reason why I am rejected?

What are these boxed doors outside store fronts in New York?

How do I create uniquely male characters?

Do I have a twin with permutated remainders?

is the intersection of subgroups a subgroup of each subgroup

Why doesn't Newton's third law mean a person bounces back to where they started when they hit the ground?

The use of multiple foreign keys on same column in SQL Server

How to say job offer in Mandarin/Cantonese?

What's the point of deactivating Num Lock on login screens?

What do the dots in this tr command do: tr .............A-Z A-ZA-Z <<< "JVPQBOV" (with 13 dots)

Have astronauts in space suits ever taken selfies? If so, how?

Fencing style for blades that can attack from a distance

Animated Series: Alien black spider robot crashes on Earth

Problem of parity - Can we draw a closed path made up of 20 line segments...

The Two and the One

How did the USSR manage to innovate in an environment characterized by government censorship and high bureaucracy?

What are the differences between the usage of 'it' and 'they'?



Cardinality estimation using ordered statistics


Statistical inference, estimation, conceptual troubleVariance of maximum likelihood estimator for discrete distributionWhy is the MLE a special case of the minimum contrast estimator?Expectation of the MLE $e^-frac1overlineX$How to find the estimator using random variables in statisticsCheck if the MLE is unbiased and/or consistentTheoretical variance of an unbiased estimator.Can we estimate entropy with a parametric density model as a sample average of log probabilities?moments estimation using Rayleigh distributionunbiased pool estimator of variance













0












$begingroup$


In cardinality problem (count-distinct problem) the goal is to estimate the number of unique elements in a set. HyperLogLog is one such algorithm [ref] Another approach is using order statistics, such as what Giroire did in MinCount [ref].



For background, following this [page]



Say we have $n$ iid random variables $X_i sim U(0,1)$ for $i=1,dots,n$. The pdf of the minimum $Y$ and range $R$ for $n>1$ is



$$f_Y(x) = n(1-x)^n-1$$
$$f_R(x) = n(n-1)(1-x)x^n-2$$



We have



$$E[Y]=frac1n+1$$



which, as Giroire also pointed out, we may hope that $1/Y$ may be a good estimate of $n$, but



$$Eleft[frac1Yright]=+ infty$$



Likewise, for the range we have



$$E[R]=fracn-1n+1$$



and we may hope solving for $n$ in terms of observed $r$ may be a good estimator, however



$$Eleft[frac1+R1-Rright]=2n-1$$



Which suggests that a better estimator is then



$$Eleft[frac11-Rright]=n$$



In MinCount Giroire subdivides the (0,1) interval into $m$ buckets of size $1/m$ and finds the (normalized) minimum within each set. This forms the set $M^(k)_i$ (for $k=1$), and derives three different algorithms. For $k=1$ only the logarithm family is valid.



Assuming all $n$ values are uniformly distributed as a first approximation we can make an assumption that exactly $n/m$ are in each bucket, and that each bucket is independent. (They are correlated and the joint distribution across all $m$ buckets should be taken into account, but doing first approximation) From that we can solve for the maximum likelihood estimate of $m$ samples of $Y$



$$hatell(theta,;y)=frac1m sum_i=1^m ln f_Y(y_imidtheta)$$



With $hattheta=frachatnm$ we have a solution for $hatn$ (Labeled subscript 1 for reference)



$$hatn_1 = frac-m^2sum log(1-y_i)$$



Likewise, with same assumptions of equal numbers across buckets and using the expected value of range we could have a second estimate



$$hatn_2 = sum frac11-r_i$$



Question: I know I'm either missing something fundamental or misunderstanding something. Both MinCount and HyperLogLog, and others, use more advanced procedures to estimate cardinality, with HLL using the harmonic means and bias corrections. Is MLE not applicable? Is even using $E[frac11-r_i]$ not valid, although being unbiased?



Even assuming exact counts of $fracnm$ across $m$ buckets, and independence across buckets, the MLE ($hatn_1$) and sum of $frac11-r_i$ ($hatn_2$) seems to perform on par with MinCount for $k=1$ (Logarithm Family) ($hatn_3$) with less complicated algorithm or formulas.



Experiment: $n=10^6$ iid random variables $X_i sim U(0,1)$. Using $m=64$ equally spaced buckets, and estimating $n_1,n_2,n_3$ with same data each. Repeat 10,000 Monte Carlo experiments. Distribution of $n_1,n_2,n_3$ below.



enter image description here










share|cite|improve this question









$endgroup$
















    0












    $begingroup$


    In cardinality problem (count-distinct problem) the goal is to estimate the number of unique elements in a set. HyperLogLog is one such algorithm [ref] Another approach is using order statistics, such as what Giroire did in MinCount [ref].



    For background, following this [page]



    Say we have $n$ iid random variables $X_i sim U(0,1)$ for $i=1,dots,n$. The pdf of the minimum $Y$ and range $R$ for $n>1$ is



    $$f_Y(x) = n(1-x)^n-1$$
    $$f_R(x) = n(n-1)(1-x)x^n-2$$



    We have



    $$E[Y]=frac1n+1$$



    which, as Giroire also pointed out, we may hope that $1/Y$ may be a good estimate of $n$, but



    $$Eleft[frac1Yright]=+ infty$$



    Likewise, for the range we have



    $$E[R]=fracn-1n+1$$



    and we may hope solving for $n$ in terms of observed $r$ may be a good estimator, however



    $$Eleft[frac1+R1-Rright]=2n-1$$



    Which suggests that a better estimator is then



    $$Eleft[frac11-Rright]=n$$



    In MinCount Giroire subdivides the (0,1) interval into $m$ buckets of size $1/m$ and finds the (normalized) minimum within each set. This forms the set $M^(k)_i$ (for $k=1$), and derives three different algorithms. For $k=1$ only the logarithm family is valid.



    Assuming all $n$ values are uniformly distributed as a first approximation we can make an assumption that exactly $n/m$ are in each bucket, and that each bucket is independent. (They are correlated and the joint distribution across all $m$ buckets should be taken into account, but doing first approximation) From that we can solve for the maximum likelihood estimate of $m$ samples of $Y$



    $$hatell(theta,;y)=frac1m sum_i=1^m ln f_Y(y_imidtheta)$$



    With $hattheta=frachatnm$ we have a solution for $hatn$ (Labeled subscript 1 for reference)



    $$hatn_1 = frac-m^2sum log(1-y_i)$$



    Likewise, with same assumptions of equal numbers across buckets and using the expected value of range we could have a second estimate



    $$hatn_2 = sum frac11-r_i$$



    Question: I know I'm either missing something fundamental or misunderstanding something. Both MinCount and HyperLogLog, and others, use more advanced procedures to estimate cardinality, with HLL using the harmonic means and bias corrections. Is MLE not applicable? Is even using $E[frac11-r_i]$ not valid, although being unbiased?



    Even assuming exact counts of $fracnm$ across $m$ buckets, and independence across buckets, the MLE ($hatn_1$) and sum of $frac11-r_i$ ($hatn_2$) seems to perform on par with MinCount for $k=1$ (Logarithm Family) ($hatn_3$) with less complicated algorithm or formulas.



    Experiment: $n=10^6$ iid random variables $X_i sim U(0,1)$. Using $m=64$ equally spaced buckets, and estimating $n_1,n_2,n_3$ with same data each. Repeat 10,000 Monte Carlo experiments. Distribution of $n_1,n_2,n_3$ below.



    enter image description here










    share|cite|improve this question









    $endgroup$














      0












      0








      0





      $begingroup$


      In cardinality problem (count-distinct problem) the goal is to estimate the number of unique elements in a set. HyperLogLog is one such algorithm [ref] Another approach is using order statistics, such as what Giroire did in MinCount [ref].



      For background, following this [page]



      Say we have $n$ iid random variables $X_i sim U(0,1)$ for $i=1,dots,n$. The pdf of the minimum $Y$ and range $R$ for $n>1$ is



      $$f_Y(x) = n(1-x)^n-1$$
      $$f_R(x) = n(n-1)(1-x)x^n-2$$



      We have



      $$E[Y]=frac1n+1$$



      which, as Giroire also pointed out, we may hope that $1/Y$ may be a good estimate of $n$, but



      $$Eleft[frac1Yright]=+ infty$$



      Likewise, for the range we have



      $$E[R]=fracn-1n+1$$



      and we may hope solving for $n$ in terms of observed $r$ may be a good estimator, however



      $$Eleft[frac1+R1-Rright]=2n-1$$



      Which suggests that a better estimator is then



      $$Eleft[frac11-Rright]=n$$



      In MinCount Giroire subdivides the (0,1) interval into $m$ buckets of size $1/m$ and finds the (normalized) minimum within each set. This forms the set $M^(k)_i$ (for $k=1$), and derives three different algorithms. For $k=1$ only the logarithm family is valid.



      Assuming all $n$ values are uniformly distributed as a first approximation we can make an assumption that exactly $n/m$ are in each bucket, and that each bucket is independent. (They are correlated and the joint distribution across all $m$ buckets should be taken into account, but doing first approximation) From that we can solve for the maximum likelihood estimate of $m$ samples of $Y$



      $$hatell(theta,;y)=frac1m sum_i=1^m ln f_Y(y_imidtheta)$$



      With $hattheta=frachatnm$ we have a solution for $hatn$ (Labeled subscript 1 for reference)



      $$hatn_1 = frac-m^2sum log(1-y_i)$$



      Likewise, with same assumptions of equal numbers across buckets and using the expected value of range we could have a second estimate



      $$hatn_2 = sum frac11-r_i$$



      Question: I know I'm either missing something fundamental or misunderstanding something. Both MinCount and HyperLogLog, and others, use more advanced procedures to estimate cardinality, with HLL using the harmonic means and bias corrections. Is MLE not applicable? Is even using $E[frac11-r_i]$ not valid, although being unbiased?



      Even assuming exact counts of $fracnm$ across $m$ buckets, and independence across buckets, the MLE ($hatn_1$) and sum of $frac11-r_i$ ($hatn_2$) seems to perform on par with MinCount for $k=1$ (Logarithm Family) ($hatn_3$) with less complicated algorithm or formulas.



      Experiment: $n=10^6$ iid random variables $X_i sim U(0,1)$. Using $m=64$ equally spaced buckets, and estimating $n_1,n_2,n_3$ with same data each. Repeat 10,000 Monte Carlo experiments. Distribution of $n_1,n_2,n_3$ below.



      enter image description here










      share|cite|improve this question









      $endgroup$




      In cardinality problem (count-distinct problem) the goal is to estimate the number of unique elements in a set. HyperLogLog is one such algorithm [ref] Another approach is using order statistics, such as what Giroire did in MinCount [ref].



      For background, following this [page]



      Say we have $n$ iid random variables $X_i sim U(0,1)$ for $i=1,dots,n$. The pdf of the minimum $Y$ and range $R$ for $n>1$ is



      $$f_Y(x) = n(1-x)^n-1$$
      $$f_R(x) = n(n-1)(1-x)x^n-2$$



      We have



      $$E[Y]=frac1n+1$$



      which, as Giroire also pointed out, we may hope that $1/Y$ may be a good estimate of $n$, but



      $$Eleft[frac1Yright]=+ infty$$



      Likewise, for the range we have



      $$E[R]=fracn-1n+1$$



      and we may hope solving for $n$ in terms of observed $r$ may be a good estimator, however



      $$Eleft[frac1+R1-Rright]=2n-1$$



      Which suggests that a better estimator is then



      $$Eleft[frac11-Rright]=n$$



      In MinCount Giroire subdivides the (0,1) interval into $m$ buckets of size $1/m$ and finds the (normalized) minimum within each set. This forms the set $M^(k)_i$ (for $k=1$), and derives three different algorithms. For $k=1$ only the logarithm family is valid.



      Assuming all $n$ values are uniformly distributed as a first approximation we can make an assumption that exactly $n/m$ are in each bucket, and that each bucket is independent. (They are correlated and the joint distribution across all $m$ buckets should be taken into account, but doing first approximation) From that we can solve for the maximum likelihood estimate of $m$ samples of $Y$



      $$hatell(theta,;y)=frac1m sum_i=1^m ln f_Y(y_imidtheta)$$



      With $hattheta=frachatnm$ we have a solution for $hatn$ (Labeled subscript 1 for reference)



      $$hatn_1 = frac-m^2sum log(1-y_i)$$



      Likewise, with same assumptions of equal numbers across buckets and using the expected value of range we could have a second estimate



      $$hatn_2 = sum frac11-r_i$$



      Question: I know I'm either missing something fundamental or misunderstanding something. Both MinCount and HyperLogLog, and others, use more advanced procedures to estimate cardinality, with HLL using the harmonic means and bias corrections. Is MLE not applicable? Is even using $E[frac11-r_i]$ not valid, although being unbiased?



      Even assuming exact counts of $fracnm$ across $m$ buckets, and independence across buckets, the MLE ($hatn_1$) and sum of $frac11-r_i$ ($hatn_2$) seems to perform on par with MinCount for $k=1$ (Logarithm Family) ($hatn_3$) with less complicated algorithm or formulas.



      Experiment: $n=10^6$ iid random variables $X_i sim U(0,1)$. Using $m=64$ equally spaced buckets, and estimating $n_1,n_2,n_3$ with same data each. Repeat 10,000 Monte Carlo experiments. Distribution of $n_1,n_2,n_3$ below.



      enter image description here







      estimation






      share|cite|improve this question













      share|cite|improve this question











      share|cite|improve this question




      share|cite|improve this question










      asked Mar 29 at 16:58









      sheppa28sheppa28

      318110




      318110




















          0






          active

          oldest

          votes












          Your Answer





          StackExchange.ifUsing("editor", function ()
          return StackExchange.using("mathjaxEditing", function ()
          StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
          StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
          );
          );
          , "mathjax-editing");

          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "69"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          noCode: true, onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













          draft saved

          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f3167352%2fcardinality-estimation-using-ordered-statistics%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          0






          active

          oldest

          votes








          0






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes















          draft saved

          draft discarded
















































          Thanks for contributing an answer to Mathematics Stack Exchange!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid


          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.

          Use MathJax to format equations. MathJax reference.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f3167352%2fcardinality-estimation-using-ordered-statistics%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          Triangular numbers and gcdProving sum of a set is $0 pmod n$ if $n$ is odd, or $fracn2 pmod n$ if $n$ is even?Is greatest common divisor of two numbers really their smallest linear combination?GCD, LCM RelationshipProve a set of nonnegative integers with greatest common divisor 1 and closed under addition has all but finite many nonnegative integers.all pairs of a and b in an equation containing gcdTriangular Numbers Modulo $k$ - Hit All Values?Understanding the Existence and Uniqueness of the GCDGCD and LCM with logical symbolsThe greatest common divisor of two positive integers less than 100 is equal to 3. Their least common multiple is twelve times one of the integers.Suppose that for all integers $x$, $x|a$ and $x|b$ if and only if $x|c$. Then $c = gcd(a,b)$Which is the gcd of 2 numbers which are multiplied and the result is 600000?

          Barbados Ynhâld Skiednis | Geografy | Demografy | Navigaasjemenu

          Σερβία Πίνακας περιεχομένων Γεωγραφία | Ιστορία | Πολιτική | Δημογραφία | Οικονομία | Τουρισμός | Εκπαίδευση και επιστήμη | Πολιτισμός | Δείτε επίσης | Παραπομπές | Εξωτερικοί σύνδεσμοι | Μενού πλοήγησης43°49′00″N 21°08′00″E / 43.8167°N 21.1333°E / 43.8167; 21.133344°49′14″N 20°27′44″E / 44.8206°N 20.4622°E / 44.8206; 20.4622 (Βελιγράδι)Επίσημη εκτίμηση«Σερβία»«Human Development Report 2018»Παγκόσμιος Οργανισμός Υγείας, Προσδόκιμο ζωής και υγιές προσδόκιμο ζωής, Δεδομένα ανά χώρα2003 statistics2004 statistics2005 statistics2006 statistics2007 statistics2008 statistics2009-2013 statistics2014 statisticsStatistical Yearbook of the Republic of Serbia – Tourism, 20152016 statisticsStatistical Yearbook of the Republic of Serbia – Tourism, 2015Πληροφορίες σχετικά με τη Σερβία και τον πολιτισμό τηςΣερβική ΠροεδρίαΕθνικός Οργανισμός Τουρισμού της ΣερβίαςΣερβική ΕθνοσυνέλευσηΣερβίαεε