Cardinality estimation using ordered statistics
In the cardinality estimation (count-distinct) problem, the goal is to estimate the number of unique elements in a set. HyperLogLog is one such algorithm [ref]. Another approach uses order statistics, as Giroire did in MinCount [ref].
For background, I am following this [page].
Say we have $n$ iid random variables $X_i \sim U(0,1)$ for $i=1,\dots,n$. The pdfs of the minimum $Y$ and the range $R$ for $n>1$ are
$$f_Y(x) = n(1-x)^{n-1}$$
$$f_R(x) = n(n-1)(1-x)x^{n-2}$$
We have
$$E[Y]=\frac{1}{n+1}$$
which, as Giroire also pointed out, suggests that $1/Y$ might be a good estimate of $n$; however,
$$E\left[\frac{1}{Y}\right]=+\infty$$
Likewise, for the range we have
$$E[R]=\frac{n-1}{n+1}$$
and we may hope that solving for $n$ in terms of the observed $r$ gives a good estimator; however,
$$E\left[\frac{1+R}{1-R}\right]=2n-1$$
which suggests that a better estimator instead uses
$$E\left[\frac{1}{1-R}\right]=n$$
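These identities are easy to sanity-check numerically. A minimal Monte Carlo sketch in Python with NumPy (the sample size $n=20$, seed, and trial count are arbitrary choices of mine, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 20, 200_000

# `trials` independent draws of the range R = max - min of n Uniform(0,1) samples.
x = rng.random((trials, n))
r = x.max(axis=1) - x.min(axis=1)

print(r.mean())                    # should be near (n-1)/(n+1) = 19/21
print(((1 + r) / (1 - r)).mean())  # should be near 2n - 1 = 39
print((1 / (1 - r)).mean())        # should be near n = 20
```

Note that $1/(1-R)$ is heavy-tailed (its variance is infinite, since $\int_0^1 x^{n-2}/(1-x)\,dx$ diverges), so the last average converges slowly.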
In MinCount, Giroire subdivides the $(0,1)$ interval into $m$ buckets of size $1/m$, finds the (normalized) minimum within each bucket to form the set $M^{(k)}_i$ (here $k=1$), and derives three different algorithms. For $k=1$ only the logarithm family is valid.
Assuming as a first approximation that all $n$ values are uniformly distributed, we can suppose that exactly $n/m$ fall in each bucket and that the buckets are independent. (They are in fact correlated, and the joint distribution across all $m$ buckets should be taken into account, but this is only a first approximation.) From this we can maximize the average log-likelihood of the $m$ samples of $Y$:
$$\hat\ell(\theta;\,y)=\frac{1}{m}\sum_{i=1}^{m}\ln f_Y(y_i\mid\theta)$$
With $\hat\theta=\frac{\hat n}{m}$, solving for $\hat n$ gives (labeled with subscript 1 for reference)
$$\hat n_1 = \frac{-m^2}{\sum \log(1-y_i)}$$
Likewise, with the same assumption of equal counts across buckets, and using the expected value of the range, we get a second estimate:
$$\hat n_2 = \sum \frac{1}{1-r_i}$$
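Under the same equal-count approximation, both estimators are a few lines of code. A sketch, where I simulate the hashed values directly as uniforms (the values of `n`, `m`, and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 100_000, 64

x = rng.random(n)          # stand-in for hashed values in [0, 1)
idx = (x * m).astype(int)  # bucket index; x < 1 so idx lies in 0..m-1

y = np.empty(m)            # normalized minimum per bucket
r = np.empty(m)            # normalized range per bucket
for i in range(m):
    b = (x[idx == i] - i / m) * m  # bucket contents rescaled to (0, 1)
    y[i] = b.min()
    r[i] = b.max() - b.min()

n_hat_1 = -m**2 / np.log(1 - y).sum()  # MLE from the bucket minima
n_hat_2 = (1 / (1 - r)).sum()          # unbiased sum over the bucket ranges
print(n_hat_1, n_hat_2)
```

On typical runs both land within a relative error on the order of $1/\sqrt{m}$ of the true $n$, though $\hat n_2$ inherits the heavy tail of $1/(1-R)$.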
Question: I know I'm either missing something fundamental or misunderstanding something. MinCount, HyperLogLog, and others use more advanced procedures to estimate cardinality, with HLL using harmonic means and bias corrections. Is MLE not applicable here? Is even using $E\left[\frac{1}{1-R}\right]$ invalid, despite being unbiased?
Even assuming exact counts of $\frac{n}{m}$ across the $m$ buckets, and independence across buckets, the MLE ($\hat n_1$) and the sum $\sum \frac{1}{1-r_i}$ ($\hat n_2$) seem to perform on par with MinCount for $k=1$ (logarithm family, $\hat n_3$), with less complicated algorithms and formulas.
Experiment: $n=10^6$ iid random variables $X_i \sim U(0,1)$, with $m=64$ equally spaced buckets; $\hat n_1,\hat n_2,\hat n_3$ are estimated from the same data each time, repeated over 10,000 Monte Carlo runs. The distributions of $\hat n_1,\hat n_2,\hat n_3$ are shown below.
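A scaled-down version of this experiment (smaller $n$ and fewer trials than the post's $10^6$ and 10,000, purely for runtime; all parameters here are my own choices) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, trials = 10_000, 64, 200  # scaled down from n = 10^6, 10,000 trials

est1, est2 = np.empty(trials), np.empty(trials)
for t in range(trials):
    x = rng.random(n)
    idx = (x * m).astype(int)
    y, r = np.empty(m), np.empty(m)
    for i in range(m):
        b = (x[idx == i] - i / m) * m  # bucket contents rescaled to (0, 1)
        y[i] = b.min()
        r[i] = b.max() - b.min()
    est1[t] = -m**2 / np.log(1 - y).sum()  # MLE estimate
    est2[t] = (1 / (1 - r)).sum()          # range-based estimate

# Compare the two sampling distributions around the true n.
print(np.median(est1), np.median(est2))
```

Both medians should sit near the true $n$; a histogram of `est1` and `est2` reproduces the kind of comparison plotted below.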
[Figure: Monte Carlo distributions of $\hat n_1$, $\hat n_2$, $\hat n_3$]
asked Mar 29 at 16:58 by sheppa28 (Mathematics Stack Exchange)