The technique aims to ease GPU memory constraints that limit how enterprises scale AI inference and long-context applications ...