This time, the performance increase is got with MSVC and GCC. On non-AES CPU, there were an useless load/store SSE2 register. The last MSVC "hack" is replaced by a portable code and he's more complete (a load is saved).
On my C2Q6600, with 3 thread, I have +16% with MSVC2015 and +20% with GCC 7.3, compared to official 2.4.4 version.