Adaptive scheduling for adaptive sampling in POS taggers construction

Manuel Vilares Ferro, Victor M. Darriba Bilbao, Jesús Vilares Ferro

arXiv.org Artificial Intelligence 

However, managing large amounts of information is an expensive, time-consuming and non-trivial activity, especially when expert knowledge is needed. Furthermore, having access to vast databases does not imply that ml algorithms must use them all, and a subset is therefore preferred, provided it does not reduce the quality of the mined knowledge. Such subsets can then supply the same learning power at a far lower computational cost, allowing the training process to be speeded up, although their nature and optimal size are rarely obvious. This justifies the interest in developing efficient sampling techniques, which involves anticipating the link between performance and experience, namely how the accuracy of the system being generated evolves as training material accumulates. At this point, correctness with respect to the working hypotheses and robustness against changes to them should be guaranteed in order to supply a practical solution. The former ensures the effectiveness of the proposed strategy in the framework considered, while the latter enables fluctuations in the learning conditions to be assimilated without compromising correctness, thus providing reliability to our calculations. An area of work that is particularly sensitive to these inconveniences is natural language processing (nlp), the components of which are increasingly based on ml [3, 50].
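To make the idea of sampling against a learning curve concrete, the following is a minimal sketch, not the authors' method: a generic classifier (logistic regression on synthetic data) stands in for a POS tagger, and the sample sizes are arbitrary. It trains on growing subsets of the available data, records the accuracy reached at each size, and reports the gains between consecutive sizes, the quantity a sampling strategy would monitor to decide whether adding more data is still worthwhile.

```python
# Minimal illustration (assumptions: synthetic data, logistic regression as a
# stand-in for a POS tagger, arbitrary sample sizes).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data in place of an annotated corpus (for illustration only).
X, y = make_classification(n_samples=20000, n_features=50, n_informative=20,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

learning_curve = []
for size in (500, 1000, 2000, 4000, 8000, 16000):
    # Train on the first `size` examples only, instead of the full corpus.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:size], y_train[:size])
    acc = accuracy_score(y_test, model.predict(X_test))
    learning_curve.append((size, acc))
    print(f"{size:>6} training examples -> accuracy {acc:.4f}")

# A sampling strategy would stop enlarging the sample once the curve flattens,
# e.g. when the accuracy gain between consecutive sizes drops below a threshold.
gains = [b[1] - a[1] for a, b in zip(learning_curve, learning_curve[1:])]
print("accuracy gains between consecutive sample sizes:", gains)
```

In this toy setting the accuracy gains shrink as the sample grows, which is the behaviour an efficient sampling technique tries to anticipate rather than observe after the fact.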