An overview of neural architectures for self-supervised audio representation learning from masked spectrograms