Unified CNNs and transformers underlying learning mechanism reveals multi-head attention modus vivendi