It is not valid in C++ to store into shadow_matrix1[16] with shadow_matrix1[16 * j]
(for j > 0). Even though there's a valid space in a struct after shadow_matrix1.
Knowing that GCC performs aggressive optimizations that eventually lead
to a wrong code. Code has been changed into union where one can either
use shadow_matrix[4 * 16], or individual shadow_matrix1, shadow_matrix2, etc. GCC pragma
is not needed any longer.
(cherry picked from commit d9eb6a5b20)