Optimal Prediction of the Number of Unseen Species with Multiplicity

Yi Hao

May-29-2025, 12:33:00 GMT–Neural Information Processing Systems

Based on a sample of size n, we consider estimating the number of symbols that appear at least µ times in an independent sample of size a · n, where a is a given parameter. This formulation includes, as a special case, the well-known problem of inferring the number of unseen species, introduced by [Fisher et al.] in 1943 and studied by many others since. Of considerable interest in this line of work is the largest a for which the quantity can be accurately predicted. We completely resolve this problem by determining the limit of estimation to be a ≈ (log n)/µ, with lower and upper bounds matching up to constant factors. For the particular case of µ = 1, this implies the recent result by [Orlitsky et al.] on the unseen species problem. Experimental evaluations show that the proposed estimator performs exceptionally well in practice. Furthermore, the estimator is a linear combination of the symbols' empirical counts, and hence computable in linear time.
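To make the phrase "linear combination of empirical counts" concrete, the sketch below uses the classical Good-Toulmin estimator for the µ = 1 unseen species problem. It is not the estimator proposed in this paper, whose coefficients the abstract does not give. The function names (prevalences, count_at_least, good_toulmin) and the toy sample are assumptions made for illustration. Good-Toulmin is known to be reliable only for a ≤ 1, which is part of why the largest admissible a is the question of interest.

```python
from collections import Counter


def prevalences(sample):
    """Map i -> number of distinct symbols that appear exactly i times."""
    counts = Counter(sample)          # symbol -> empirical count
    return Counter(counts.values())   # count value -> number of symbols


def count_at_least(sample, mu):
    """Number of distinct symbols appearing at least mu times in a sample."""
    return sum(1 for c in Counter(sample).values() if c >= mu)


def good_toulmin(sample, a):
    """Classical Good-Toulmin estimate of the number of new symbols in a
    further sample of size a * n (illustrative stand-in, mu = 1 only).

    The estimate is -sum_i (-a)^i * phi_i, a linear combination of the
    prevalences phi_i, computed in one pass over the sample.
    """
    phi = prevalences(sample)
    return -sum(((-a) ** i) * phi_i for i, phi_i in phi.items())


# Toy sample: a appears 5 times, b and r twice, c and d once.
sample = list("abracadabra")
print(prevalences(sample))
print(count_at_least(sample, 2))
print(good_toulmin(sample, 0.5))
```

Each symbol contributes a coefficient that depends only on its own count, so the whole estimate costs one counting pass plus a sum over distinct count values.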


Country:
- Europe > United Kingdom
  - England (0.14)
- North America > United States
  - California (0.14)

Industry:
- Government > Regional Government (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (0.93)
  - Representation & Reasoning (0.68)