Learning What to Remember: Adaptive Probabilistic Memory Retention for Memory-Efficient Language Models