Merge pull request #4197 from KoolJBlack/op_kernels_transpose_optimization

Implementation of a recursive VMLA transpose kernel, with a test.
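This message does not say how the recursion is structured. The following is a minimal sketch of the general idea behind a recursive (divide-and-conquer) 2-D transpose, and it is not the code from this pull request. It assumes a row-major matrix stored in a flat buffer. The function name `transpose_recursive` and the `leaf` tile-size parameter are hypothetical. The recursion halves the larger dimension until a tile is small enough to copy directly, which keeps the reads and writes of each tile close together in memory.

```python
def transpose_recursive(src, rows, cols, leaf=16):
    """Return the transpose of a row-major rows x cols matrix held in a flat list.

    Hypothetical sketch: split the larger dimension in half until the tile is at
    most `leaf` along both dimensions, then copy that tile element by element.
    """
    if leaf < 1:
        raise ValueError("leaf must be >= 1")  # leaf=0 would never reach the base case
    dst = [None] * (rows * cols)

    def recurse(r0, r1, c0, c1):
        if r1 - r0 <= leaf and c1 - c0 <= leaf:
            # Base case: small tile, copy src[r][c] (width cols) to dst[c][r] (width rows).
            for r in range(r0, r1):
                for c in range(c0, c1):
                    dst[c * rows + r] = src[r * cols + c]
        elif r1 - r0 >= c1 - c0:
            # Split along rows when the tile is taller than it is wide.
            mid = (r0 + r1) // 2
            recurse(r0, mid, c0, c1)
            recurse(mid, r1, c0, c1)
        else:
            # Otherwise split along columns.
            mid = (c0 + c1) // 2
            recurse(r0, r1, c0, mid)
            recurse(r0, r1, mid, c1)

    recurse(0, rows, 0, cols)
    return dst


# Usage: the 2x3 matrix [[1, 2, 3], [4, 5, 6]] becomes the 3x2 matrix [[1, 4], [2, 5], [3, 6]].
print(transpose_recursive([1, 2, 3, 4, 5, 6], 2, 3))
```

A real kernel would also handle arbitrary permutations of N-D shapes and typed buffers. The sketch covers only the 2-D case, and it stops recursing at a fixed leaf size instead of tuning that size to the cache.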