Cardinality estimation using ordered statistics
In the cardinality estimation (count-distinct) problem, the goal is to estimate the number of unique elements in a set. HyperLogLog is one such algorithm [ref]. Another approach uses order statistics, as Giroire did in MinCount [ref].
For background, I am following this [page].
Say we have $n$ iid random variables $X_i \sim U(0,1)$ for $i=1,\dots,n$. The pdfs of the minimum $Y$ and the range $R$ for $n>1$ are
$$f_Y(x) = n(1-x)^{n-1}$$
$$f_R(x) = n(n-1)(1-x)x^{n-2}$$
We have
$$E[Y]=\frac{1}{n+1}$$
which, as Giroire also pointed out, suggests that $1/Y$ might be a good estimate of $n$; however,
$$E\left[\frac{1}{Y}\right]=+\infty$$
Likewise, for the range we have
$$E[R]=\frac{n-1}{n+1}$$
and we may hope that solving for $n$ in terms of the observed $r$ gives a good estimator; however,
$$E\left[\frac{1+R}{1-R}\right]=2n-1$$
which suggests that a better estimator instead uses
$$E\left[\frac{1}{1-R}\right]=n$$
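These identities are easy to sanity-check numerically. A minimal Monte Carlo sketch in Python with NumPy (the sample size $n=20$, seed, and trial count are arbitrary choices of mine, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 20, 200_000

# `trials` independent draws of the range R = max - min of n Uniform(0,1) samples.
x = rng.random((trials, n))
r = x.max(axis=1) - x.min(axis=1)

print(r.mean())                    # should be near (n-1)/(n+1) = 19/21
print(((1 + r) / (1 - r)).mean())  # should be near 2n - 1 = 39
print((1 / (1 - r)).mean())        # should be near n = 20
```

Note that $1/(1-R)$ is heavy-tailed (its variance is infinite, since $\int_0^1 x^{n-2}/(1-x)\,dx$ diverges), so the last average converges slowly.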
In MinCount, Giroire subdivides the $(0,1)$ interval into $m$ buckets of size $1/m$, finds the (normalized) minimum within each bucket to form the set $M^{(k)}_i$ (here $k=1$), and derives three different algorithms. For $k=1$ only the logarithm family is valid.
Assuming as a first approximation that all $n$ values are uniformly distributed, we can suppose that exactly $n/m$ fall in each bucket and that the buckets are independent. (They are in fact correlated, and the joint distribution across all $m$ buckets should be taken into account, but this is only a first approximation.) From this we can maximize the average log-likelihood of the $m$ samples of $Y$:
$$\hat\ell(\theta;\,y)=\frac{1}{m}\sum_{i=1}^{m}\ln f_Y(y_i\mid\theta)$$
With $\hat\theta=\frac{\hat n}{m}$, solving for $\hat n$ gives (labeled with subscript 1 for reference)
$$\hat n_1 = \frac{-m^2}{\sum \log(1-y_i)}$$
Likewise, with the same assumption of equal counts across buckets, and using the expected value of the range, we get a second estimate:
$$\hat n_2 = \sum \frac{1}{1-r_i}$$
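Under the same equal-count approximation, both estimators are a few lines of code. A sketch, where I simulate the hashed values directly as uniforms (the values of `n`, `m`, and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 100_000, 64

x = rng.random(n)          # stand-in for hashed values in [0, 1)
idx = (x * m).astype(int)  # bucket index; x < 1 so idx lies in 0..m-1

y = np.empty(m)            # normalized minimum per bucket
r = np.empty(m)            # normalized range per bucket
for i in range(m):
    b = (x[idx == i] - i / m) * m  # bucket contents rescaled to (0, 1)
    y[i] = b.min()
    r[i] = b.max() - b.min()

n_hat_1 = -m**2 / np.log(1 - y).sum()  # MLE from the bucket minima
n_hat_2 = (1 / (1 - r)).sum()          # unbiased sum over the bucket ranges
print(n_hat_1, n_hat_2)
```

On typical runs both land within a relative error on the order of $1/\sqrt{m}$ of the true $n$, though $\hat n_2$ inherits the heavy tail of $1/(1-R)$.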
Question: I know I'm either missing something fundamental or misunderstanding something. MinCount, HyperLogLog, and others use more advanced procedures to estimate cardinality, with HLL using harmonic means and bias corrections. Is MLE not applicable here? Is even using $E\left[\frac{1}{1-R}\right]$ invalid, despite being unbiased?
Even assuming exact counts of $\frac{n}{m}$ across the $m$ buckets, and independence across buckets, the MLE ($\hat n_1$) and the sum $\sum \frac{1}{1-r_i}$ ($\hat n_2$) seem to perform on par with MinCount for $k=1$ (logarithm family, $\hat n_3$), with less complicated algorithms and formulas.
Experiment: $n=10^6$ iid random variables $X_i \sim U(0,1)$, with $m=64$ equally spaced buckets; $\hat n_1,\hat n_2,\hat n_3$ are estimated from the same data each time, repeated over 10,000 Monte Carlo runs. The distributions of $\hat n_1,\hat n_2,\hat n_3$ are shown below.
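A scaled-down version of this experiment (smaller $n$ and fewer trials than the post's $10^6$ and 10,000, purely for runtime; all parameters here are my own choices) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, trials = 10_000, 64, 200  # scaled down from n = 10^6, 10,000 trials

est1, est2 = np.empty(trials), np.empty(trials)
for t in range(trials):
    x = rng.random(n)
    idx = (x * m).astype(int)
    y, r = np.empty(m), np.empty(m)
    for i in range(m):
        b = (x[idx == i] - i / m) * m  # bucket contents rescaled to (0, 1)
        y[i] = b.min()
        r[i] = b.max() - b.min()
    est1[t] = -m**2 / np.log(1 - y).sum()  # MLE estimate
    est2[t] = (1 / (1 - r)).sum()          # range-based estimate

# Compare the two sampling distributions around the true n.
print(np.median(est1), np.median(est2))
```

Both medians should sit near the true $n$; a histogram of `est1` and `est2` reproduces the kind of comparison plotted below.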
[Figure: Monte Carlo distributions of $\hat n_1$, $\hat n_2$, $\hat n_3$]
asked Mar 29 at 16:58 by sheppa28 (Mathematics Stack Exchange)