An efficient encoder-decoder architecture with top-down attention for speech separation