- Fix mask format conversion (float to boolean) for scaled_dot_product_attention
- Fix mask dimensions for proper broadcasting: [batch, 1, seq_len, seq_len]
- Resolve the conflict between is_causal and a custom mask parameter
- Enable training with optimized attention and KV caching
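
The fixes above can be sketched as follows. This is a minimal illustration, not the PR's actual code: it assumes a float padding mask where 1.0 means "attend" and 0.0 means "masked", converts it to the boolean format `scaled_dot_product_attention` expects, broadcasts it to `[batch, 1, seq_len, seq_len]`, and folds causality into the mask itself so `is_causal` is left at its default (avoiding the conflict between the two parameters):

```python
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 4, 8, 16
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Hypothetical float padding mask: 1.0 = attend, 0.0 = masked out.
float_mask = torch.ones(batch, seq_len)
float_mask[:, -2:] = 0.0  # e.g. last two positions are padding

# Convert float -> boolean (True = attend), the convention SDPA uses
# for boolean attn_mask arguments.
pad = float_mask.to(torch.bool)                       # [batch, seq_len]

# Build a causal mask and combine it with padding, broadcast to the
# shape [batch, 1, seq_len, seq_len] so it spans all heads.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
attn_mask = (causal.unsqueeze(0) & pad.view(batch, 1, seq_len))
attn_mask = attn_mask.unsqueeze(1)                    # [batch, 1, S, S]

# Pass the combined boolean mask; leave is_causal at its default False
# so the two masking mechanisms never conflict.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
print(out.shape)  # torch.Size([2, 4, 8, 16])
```

Because causality is baked into `attn_mask`, the same call works for both training (full sequences) and cached decoding, where the query length and key length differ and `is_causal=True` would be ambiguous.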