In many real-world scenarios, sequential data are time series sampled from an underlying continuous-time process, so datasets often consist of long, irregularly sampled sequences of varying lengths.
The power of DNNs relies heavily on the quantity and quality of training data. However, collecting and annotating data on a large scale is often expensive and time-consuming.
Although the system is popular, its suitability for assessing entities with constant skill levels, such as LLMs, remains relatively unexplored. We study two fundamental axioms that evaluation methods should satisfy: reliability and transitivity.