From ce2985b22cc33537e376d896e16409e595f9fc31 Mon Sep 17 00:00:00 2001
From: Janos Follath <janos.follath@arm.com>
Date: Fri, 24 Feb 2023 16:00:21 +0000
Subject: [PATCH 01/45] Add Threat Model Summary

Signed-off-by: Janos Follath <janos.follath@arm.com>
---
 SECURITY.md | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)
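
The workarounds listed in this patch mostly come down to build-time
configuration. A minimal sketch of a user configuration overlay follows,
assuming the build pulls in such a file (for example via
`MBEDTLS_USER_CONFIG_FILE`); the file name `user_config.h` is hypothetical.
The option names are the ones named in the text plus the ChaCha20/Poly1305
modules. Which of them apply depends on the target architecture and the
Mbed TLS version, so treat this as an illustration rather than a recommended
configuration.

```c
/* user_config.h (hypothetical file name): configuration overlay sketch. */

/* Hardware-accelerated AES; enable only the option matching the target. */
#define MBEDTLS_AESNI_C        /* x86 AES-NI */
/* #define MBEDTLS_AESCE_C */  /* Arm crypto extensions */
/* #define MBEDTLS_PADLOCK_C *//* VIA PadLock */

/* ChaCha20/Poly1305 as an alternative to block cipher based AEAD modes. */
#define MBEDTLS_CHACHA20_C
#define MBEDTLS_POLY1305_C
#define MBEDTLS_CHACHAPOLY_C
```

Enabling an acceleration option does not by itself guarantee that the
table-based software code is never used: depending on the version and
platform, the library may fall back to it when the CPU feature is not
detected at run time, so this is worth checking in the documentation of
each option.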

diff --git a/SECURITY.md b/SECURITY.md
index 33bbc2ff30..ae37dab778 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -18,3 +18,63 @@ goes public.
 Only the maintained branches, as listed in [`BRANCHES.md`](BRANCHES.md),
 get security fixes.
 Users are urged to always use the latest version of a maintained branch.
+
+## Threat model
+
+We use the following classification of attacks:
+
+- **Remote Attacks:** The attacker can observe and modify data sent over the
+  network. This includes observing timing of individual packets and potentially
+  delaying legitimate messages.
+- **Timing Attacks:** The attacker can gain information about the time certain
+  sets of instructions in Mbed TLS operations take.
+- **Physical Attacks:** The attacker has access to physical information about
+  the hardware Mbed TLS is running on and/or can alter the physical state of
+  the hardware.
+
+### Remote attacks
+
+Mbed TLS aims to fully protect against remote attacks. Mbed Crypto aims to
+enable the user application in providing full protection against remote
+attacks. Said protection is limited to providing security guarantees offered by
+the protocol in question. (For example Mbed TLS alone won't guarantee that the
+messages will arrive without delay, as the TLS protocol doesn't guarantee that
+either.)
+
+### Timing attacks
+
+Mbed TLS and Mbed Crypto provide limited protection against timing attacks. The
+cost of protecting against timing attacks widely varies depending on the
+granularity of the measurements and the noise present. Therefore the protection
+in Mbed TLS and Mbed Crypto is limited. We are only aiming to provide protection
+against publicly documented attacks.
+
+**Warning!** Block ciphers constitute an exception from this protection. For
+details and workarounds see the section below.
+
+#### Block Ciphers
+
+Currently there are 4 block ciphers in Mbed TLS: AES, CAMELLIA, ARIA and DES.
+The Mbed TLS implementation uses lookup tables, which are vulnerable to timing
+attacks.
+
+**Workarounds:**
+
+- Turn on hardware acceleration for AES. This is supported only on selected
+  architectures and currently only available for AES. See configuration options
+  `MBEDTLS_AESCE_C`, `MBEDTLS_AESNI_C` and `MBEDTLS_PADLOCK_C` for details.
+- Add a secure alternative implementation (typically bitslice implementation or
+  hardware acceleration) for the vulnerable cipher. See the [Alternative
+Implementations Guide](docs/architecture/alternative-implementations.md) for
+  more information.
+- Instead of a block cipher, use ChaCha20/Poly1305 for encryption and data
+  origin authentication.
+
+### Physical attacks
+
+Physical attacks are out of scope. Any attack using information about or
+influencing the physical state of the hardware is considered physical,
+independently of the attack vector. (For example Row Hammer and Screaming
+Channels are considered physical attacks.) If physical attacks are present in a
+use case or a user application's threat model, it needs to be mitigated by
+physical countermeasures.

From 661c88f2ba1ccbb0d95c81743d63c67c897cbe54 Mon Sep 17 00:00:00 2001
From: Janos Follath <janos.follath@arm.com>
Date: Fri, 3 Mar 2023 14:16:12 +0000
Subject: [PATCH 02/45] Threat Model: Improve wording

Signed-off-by: Janos Follath <janos.follath@arm.com>

Co-authored-by: Dave Rodgman <dave.rodgman@arm.com>
---
 SECURITY.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/SECURITY.md b/SECURITY.md
index ae37dab778..50c8ffd980 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -26,8 +26,8 @@ We use the following classification of attacks:
 - **Remote Attacks:** The attacker can observe and modify data sent over the
   network. This includes observing timing of individual packets and potentially
   delaying legitimate messages.
-- **Timing Attacks:** The attacker can gain information about the time certain
-  sets of instructions in Mbed TLS operations take.
+- **Timing Attacks:** The attacker can gain information about the time taken
+  by certain sets of instructions in Mbed TLS operations.
 - **Physical Attacks:** The attacker has access to physical information about
   the hardware Mbed TLS is running on and/or can alter the physical state of
   the hardware.
@@ -47,14 +47,14 @@ Mbed TLS and Mbed Crypto provide limited protection against timing attacks. The
 cost of protecting against timing attacks widely varies depending on the
 granularity of the measurements and the noise present. Therefore the protection
 in Mbed TLS and Mbed Crypto is limited. We are only aiming to provide protection
-against publicly documented attacks.
+against publicly documented attacks, and this protection is not currently complete.
 
-**Warning!** Block ciphers constitute an exception from this protection. For
+**Warning!** Block ciphers do not yet achieve full protection. For
 details and workarounds see the section below.
 
 #### Block Ciphers
 
-Currently there are 4 block ciphers in Mbed TLS: AES, CAMELLIA, ARIA and DES.
+Currently there are four block ciphers in Mbed TLS: AES, CAMELLIA, ARIA and DES.
 The Mbed TLS implementation uses lookup tables, which are vulnerable to timing
 attacks.
 
@@ -63,7 +63,7 @@ attacks.
 - Turn on hardware acceleration for AES. This is supported only on selected
   architectures and currently only available for AES. See configuration options
   `MBEDTLS_AESCE_C`, `MBEDTLS_AESNI_C` and `MBEDTLS_PADLOCK_C` for details.
-- Add a secure alternative implementation (typically bitslice implementation or
+- Add a secure alternative implementation (typically a bitsliced implementation or
   hardware acceleration) for the vulnerable cipher. See the [Alternative
 Implementations Guide](docs/architecture/alternative-implementations.md) for
   more information.

From e57ed98f9e4d3049519cb46aa8ab887877e07d32 Mon Sep 17 00:00:00 2001
From: Janos Follath <janos.follath@arm.com>
Date: Fri, 3 Mar 2023 14:56:38 +0000
Subject: [PATCH 03/45] Threat Model: Miscellaneous clarifications

Signed-off-by: Janos Follath <janos.follath@arm.com>
---
 SECURITY.md | 53 ++++++++++++++++++++++++++---------------------------
 1 file changed, 26 insertions(+), 27 deletions(-)

diff --git a/SECURITY.md b/SECURITY.md
index 50c8ffd980..4ed9d3807c 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -24,8 +24,8 @@ Users are urged to always use the latest version of a maintained branch.
 We use the following classification of attacks:
 
 - **Remote Attacks:** The attacker can observe and modify data sent over the
-  network. This includes observing timing of individual packets and potentially
-  delaying legitimate messages.
+  network. This includes observing the content and timing of individual packets,
+  as well as suppressing or delaying legitimate messages, and injecting messages.
 - **Timing Attacks:** The attacker can gain information about the time taken
   by certain sets of instructions in Mbed TLS operations.
 - **Physical Attacks:** The attacker has access to physical information about
@@ -34,20 +34,19 @@ We use the following classification of attacks:
 
 ### Remote attacks
 
-Mbed TLS aims to fully protect against remote attacks. Mbed Crypto aims to
-enable the user application in providing full protection against remote
-attacks. Said protection is limited to providing security guarantees offered by
-the protocol in question. (For example Mbed TLS alone won't guarantee that the
-messages will arrive without delay, as the TLS protocol doesn't guarantee that
-either.)
+Mbed TLS aims to fully protect against remote attacks and to enable the user
+application in providing full protection against remote attacks. Said
+protection is limited to providing security guarantees offered by the protocol
+in question. (For example Mbed TLS alone won't guarantee that the messages will
+arrive without delay, as the TLS protocol doesn't guarantee that either.)
 
 ### Timing attacks
 
-Mbed TLS and Mbed Crypto provide limited protection against timing attacks. The
-cost of protecting against timing attacks widely varies depending on the
-granularity of the measurements and the noise present. Therefore the protection
-in Mbed TLS and Mbed Crypto is limited. We are only aiming to provide protection
-against publicly documented attacks, and this protection is not currently complete.
+Mbed TLS provides limited protection against timing attacks. The cost of
+protecting against timing attacks widely varies depending on the granularity of
+the measurements and the noise present. Therefore the protection in Mbed TLS is
+limited. We are only aiming to provide protection against publicly documented
+attacks, and this protection is not currently complete.
 
 **Warning!** Block ciphers do not yet achieve full protection. For
 details and workarounds see the section below.
@@ -55,26 +54,26 @@ details and workarounds see the section below.
 #### Block Ciphers
 
 Currently there are four block ciphers in Mbed TLS: AES, CAMELLIA, ARIA and DES.
-The Mbed TLS implementation uses lookup tables, which are vulnerable to timing
-attacks.
+The pure software implementation in Mbed TLS implementation uses lookup tables,
+which are vulnerable to timing attacks.
 
 **Workarounds:**
 
 - Turn on hardware acceleration for AES. This is supported only on selected
   architectures and currently only available for AES. See configuration options
   `MBEDTLS_AESCE_C`, `MBEDTLS_AESNI_C` and `MBEDTLS_PADLOCK_C` for details.
-- Add a secure alternative implementation (typically a bitsliced implementation or
-  hardware acceleration) for the vulnerable cipher. See the [Alternative
-Implementations Guide](docs/architecture/alternative-implementations.md) for
-  more information.
-- Instead of a block cipher, use ChaCha20/Poly1305 for encryption and data
-  origin authentication.
+- Add a secure alternative implementation (typically hardware acceleration) for
+  the vulnerable cipher. See the [Alternative Implementations
+Guide](docs/architecture/alternative-implementations.md) for more information.
+- Use cryptographic mechanisms that are not based on block ciphers. In
+  particular, for authenticated encryption, use ChaCha20/Poly1305 instead of
+  block cipher modes. For random generation, use HMAC\_DRBG instead of CTR\_DRBG.
 
 ### Physical attacks
 
-Physical attacks are out of scope. Any attack using information about or
-influencing the physical state of the hardware is considered physical,
-independently of the attack vector. (For example Row Hammer and Screaming
-Channels are considered physical attacks.) If physical attacks are present in a
-use case or a user application's threat model, it needs to be mitigated by
-physical countermeasures.
+Physical attacks are out of scope (eg. power analysis or radio emissions). Any
+attack using information about or influencing the physical state of the
+hardware is considered physical, independently of the attack vector. (For
+example Row Hammer and Screaming Channels are considered physical attacks.) If
+physical attacks are present in a use case or a user application's threat
+model, it needs to be mitigated by physical countermeasures.

From 5adb2c2328acb5d8280e2bd666860caa3d4ec174 Mon Sep 17 00:00:00 2001
From: Janos Follath <janos.follath@arm.com>
Date: Mon, 6 Mar 2023 14:54:59 +0000
Subject: [PATCH 04/45] Threat Model: reorganise threat definitions

Simplify organisation by placing threat definitions in their respective
sections.

Signed-off-by: Janos Follath <janos.follath@arm.com>
---
 SECURITY.md | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/SECURITY.md b/SECURITY.md
index 4ed9d3807c..7981a44b64 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -23,17 +23,12 @@ Users are urged to always use the latest version of a maintained branch.
 
 We use the following classification of attacks:
 
-- **Remote Attacks:** The attacker can observe and modify data sent over the
-  network. This includes observing the content and timing of individual packets,
-  as well as suppressing or delaying legitimate messages, and injecting messages.
-- **Timing Attacks:** The attacker can gain information about the time taken
-  by certain sets of instructions in Mbed TLS operations.
-- **Physical Attacks:** The attacker has access to physical information about
-  the hardware Mbed TLS is running on and/or can alter the physical state of
-  the hardware.
-
 ### Remote attacks
 
+The attacker can observe and modify data sent over the network. This includes
+observing the content and timing of individual packets, as well as suppressing
+or delaying legitimate messages, and injecting messages.
+
 Mbed TLS aims to fully protect against remote attacks and to enable the user
 application in providing full protection against remote attacks. Said
 protection is limited to providing security guarantees offered by the protocol
@@ -42,6 +37,9 @@ arrive without delay, as the TLS protocol doesn't guarantee that either.)
 
 ### Timing attacks
 
+The attacker can gain information about the time taken by certain sets of
+instructions in Mbed TLS operations.
+
 Mbed TLS provides limited protection against timing attacks. The cost of
 protecting against timing attacks widely varies depending on the granularity of
 the measurements and the noise present. Therefore the protection in Mbed TLS is
@@ -71,6 +69,9 @@ Guide](docs/architecture/alternative-implementations.md) for more information.
 
 ### Physical attacks
 
+The attacker has access to physical information about the hardware Mbed TLS is
+running on and/or can alter the physical state of the hardware.
+
 Physical attacks are out of scope (eg. power analysis or radio emissions). Any
 attack using information about or influencing the physical state of the
 hardware is considered physical, independently of the attack vector. (For

From adc8a0bceff7bb2fb9d8b4dc0cdcc956e5d74a0d Mon Sep 17 00:00:00 2001
From: Janos Follath <janos.follath@arm.com>
Date: Wed, 8 Mar 2023 16:10:39 +0000
Subject: [PATCH 05/45] Threat Model: increase classification detail

Originally for the sake of simplicity there was a single category for
software based attacks, namely timing side channel attacks.

Be more precise and categorise attacks as software based whether or not
they rely on physical information.

Signed-off-by: Janos Follath <janos.follath@arm.com>
---
 SECURITY.md | 54 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 43 insertions(+), 11 deletions(-)

diff --git a/SECURITY.md b/SECURITY.md
index 7981a44b64..c6345d65c8 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -35,22 +35,33 @@ protection is limited to providing security guarantees offered by the protocol
 in question. (For example Mbed TLS alone won't guarantee that the messages will
 arrive without delay, as the TLS protocol doesn't guarantee that either.)
 
-### Timing attacks
+### Local attacks
+
+The attacker is capable of running code on the same hardware as Mbed TLS, but
+there is still a security boundary between them (ie. the attacker can't for
+example read secrets from Mbed TLS' memory directly).
+
+#### Timing attacks
 
 The attacker can gain information about the time taken by certain sets of
-instructions in Mbed TLS operations.
+instructions in Mbed TLS operations. (See for example the [Flush+Reload
+paper](https://eprint.iacr.org/2013/448.pdf).)
+
+(Technically, timing information can be observed over the network or through
+physical side channels as well. Network timing attacks are less powerful than
+local and countermeasures protecting against local attacks prevent network
+attacks as well. If the timing information is gained through physical side
+channels, we consider them physical attacks and as such they are out of scope.)
 
 Mbed TLS provides limited protection against timing attacks. The cost of
 protecting against timing attacks widely varies depending on the granularity of
 the measurements and the noise present. Therefore the protection in Mbed TLS is
-limited. We are only aiming to provide protection against publicly documented
-attacks, and this protection is not currently complete.
+limited. We are only aiming to provide protection against **publicly
+documented** attacks, and this protection is not currently complete.
 
 **Warning!** Block ciphers do not yet achieve full protection. For
 details and workarounds see the section below.
 
-#### Block Ciphers
-
 Currently there are four block ciphers in Mbed TLS: AES, CAMELLIA, ARIA and DES.
 The pure software implementation in Mbed TLS implementation uses lookup tables,
 which are vulnerable to timing attacks.
@@ -67,14 +78,35 @@ Guide](docs/architecture/alternative-implementations.md) for more information.
   particular, for authenticated encryption, use ChaCha20/Poly1305 instead of
   block cipher modes. For random generation, use HMAC\_DRBG instead of CTR\_DRBG.
 
+#### Local non-timing side channels
+
+The attacker code running on the platform has access to some sensor capable of
+picking up information on the physical state of the hardware while Mbed TLS is
+running. This can for example be any analogue to digital converter on the
+platform that is located unfortunately enough to pick up the CPU noise. (See
+for example the [Leaky Noise
+paper](https://tches.iacr.org/index.php/TCHES/article/view/8297).)
+
+Mbed TLS doesn't offer any security guarantees against local non-timing based
+side channel attacks. If local non-timing attacks are present in a use case or
+a user application's threat model, it needs to be mitigated by the platform.
+
+#### Local fault injection attacks
+
+Software running on the same hardware can affect the physical state of the
+device and introduce faults. (See for example the [Row Hammer
+paper](https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf).)
+
+Mbed TLS doesn't offer any security guarantees against local fault injection
+attacks. If local fault injection attacks are present in a use case or a user
+application's threat model, it needs to be mitigated by the platform.
+
 ### Physical attacks
 
 The attacker has access to physical information about the hardware Mbed TLS is
-running on and/or can alter the physical state of the hardware.
+running on and/or can alter the physical state of the hardware (eg. power
+analysis, radio emissions or fault injection).
 
-Physical attacks are out of scope (eg. power analysis or radio emissions). Any
-attack using information about or influencing the physical state of the
-hardware is considered physical, independently of the attack vector. (For
-example Row Hammer and Screaming Channels are considered physical attacks.) If
+Mbed TLS doesn't offer any security guarantees against physical attacks. If
 physical attacks are present in a use case or a user application's threat
 model, it needs to be mitigated by physical countermeasures.

From 389cdf43ab0404741e9433d68b8658ea011d29d6 Mon Sep 17 00:00:00 2001
From: Janos Follath <janos.follath@arm.com>
Date: Wed, 8 Mar 2023 16:38:07 +0000
Subject: [PATCH 06/45] Threat model: explain dangling countermeasures

Signed-off-by: Janos Follath <janos.follath@arm.com>
---
 SECURITY.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/SECURITY.md b/SECURITY.md
index c6345d65c8..95e549f44e 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -110,3 +110,16 @@ analysis, radio emissions or fault injection).
 Mbed TLS doesn't offer any security guarantees against physical attacks. If
 physical attacks are present in a use case or a user application's threat
 model, it needs to be mitigated by physical countermeasures.
+
+### Caveats
+
+#### Out of scope countermeasures
+
+Mbed TLS has evolved organically and a well defined threat model hasn't always
+been present. Therefore, Mbed TLS might have countermeasures against attacks
+outside the above defined threat model.
+
+The presence of such countermeasures don't mean that Mbed TLS provides
+protection against a class of attacks outside of the above described threat
+model. Neither does it mean that the failure of such a countermeasure is
+considered a vulnerability.

From 5e68d3b05f299f9de9a56678a044f06396580419 Mon Sep 17 00:00:00 2001
From: Janos Follath <janos.follath@arm.com>
Date: Wed, 8 Mar 2023 16:53:50 +0000
Subject: [PATCH 07/45] Threat Model: move the block cipher section

The block cipher exception affects both remote and local timing attacks.
Move it to the Caveats section and reference it from both the local
and the remote attack sections.

Signed-off-by: Janos Follath <janos.follath@arm.com>
---
 SECURITY.md | 44 +++++++++++++++++++++++++++-----------------
 1 file changed, 27 insertions(+), 17 deletions(-)
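
To illustrate why the lookup tables mentioned in the block cipher section
leak timing information, here is a small self-contained sketch. It is not
Mbed TLS code: both function names are hypothetical, and the masked full-table
scan is only one generic countermeasure, not the approach the project takes
(the text points to hardware acceleration or alternative implementations).

```c
#include <stddef.h>
#include <stdint.h>

/* Leaky variant: the memory address read depends on the secret index, so
 * cache hits and misses can reveal information about it. */
uint8_t lookup_leaky_sketch(const uint8_t table[256], uint8_t secret_index)
{
    return table[secret_index];
}

/* Sketch of a countermeasure: read every entry and keep only the wanted
 * one using a mask, so the memory access pattern is independent of the
 * secret index. */
uint8_t lookup_ct_sketch(const uint8_t table[256], uint8_t secret_index)
{
    uint8_t result = 0;
    for (size_t i = 0; i < 256; i++) {
        /* diff is zero only when i equals the secret index. */
        uint32_t diff = (uint32_t) i ^ (uint32_t) secret_index;
        /* (diff - 1) wraps to 0xFFFFFFFF when diff == 0; for diff in
         * 1..255 it stays below 256, so bits 8..15 are zero. After the
         * shift the low byte is 0xFF on a match and 0x00 otherwise. */
        uint8_t mask = (uint8_t) ((diff - 1u) >> 8);
        result |= (uint8_t) (table[i] & mask);
    }
    return result;
}
```

The scan touches all 256 entries per lookup, which is considerably slower
than a direct index. An optimising compiler may also transform such code, so
a real implementation would need to inspect the generated instructions.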

diff --git a/SECURITY.md b/SECURITY.md
index 95e549f44e..677e68555d 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -35,6 +35,11 @@ protection is limited to providing security guarantees offered by the protocol
 in question. (For example Mbed TLS alone won't guarantee that the messages will
 arrive without delay, as the TLS protocol doesn't guarantee that either.)
 
+**Warning!** Depending on network latency, the timing of messages might be
+enough to launch some timing attacks. Block ciphers do not yet achieve full
+protection against these. For details and workarounds see the [Block
+Ciphers](#block-ciphers) section.
+
 ### Local attacks
 
 The attacker is capable of running code on the same hardware as Mbed TLS, but
@@ -60,23 +65,7 @@ limited. We are only aiming to provide protection against **publicly
 documented** attacks, and this protection is not currently complete.
 
 **Warning!** Block ciphers do not yet achieve full protection. For
-details and workarounds see the section below.
-
-Currently there are four block ciphers in Mbed TLS: AES, CAMELLIA, ARIA and DES.
-The pure software implementation in Mbed TLS implementation uses lookup tables,
-which are vulnerable to timing attacks.
-
-**Workarounds:**
-
-- Turn on hardware acceleration for AES. This is supported only on selected
-  architectures and currently only available for AES. See configuration options
-  `MBEDTLS_AESCE_C`, `MBEDTLS_AESNI_C` and `MBEDTLS_PADLOCK_C` for details.
-- Add a secure alternative implementation (typically hardware acceleration) for
-  the vulnerable cipher. See the [Alternative Implementations
-Guide](docs/architecture/alternative-implementations.md) for more information.
-- Use cryptographic mechanisms that are not based on block ciphers. In
-  particular, for authenticated encryption, use ChaCha20/Poly1305 instead of
-  block cipher modes. For random generation, use HMAC\_DRBG instead of CTR\_DRBG.
+details and workarounds see the [Block Ciphers](#block-ciphers) section.
 
 #### Local non-timing side channels
 
@@ -123,3 +112,24 @@ The presence of such countermeasures don't mean that Mbed TLS provides
 protection against a class of attacks outside of the above described threat
 model. Neither does it mean that the failure of such a countermeasure is
 considered a vulnerability.
+
+#### Block ciphers
+
+Currently there are four block ciphers in Mbed TLS: AES, CAMELLIA, ARIA and
+DES. The pure software implementation in Mbed TLS implementation uses lookup
+tables, which are vulnerable to timing attacks.
+
+These timing attacks can be physical, local or depending on network latency
+even a remote. The attacks can result in key recovery.
+
+**Workarounds:**
+
+- Turn on hardware acceleration for AES. This is supported only on selected
+  architectures and currently only available for AES. See configuration options
+  `MBEDTLS_AESCE_C`, `MBEDTLS_AESNI_C` and `MBEDTLS_PADLOCK_C` for details.
+- Add a secure alternative implementation (typically hardware acceleration) for
+  the vulnerable cipher. See the [Alternative Implementations
+Guide](docs/architecture/alternative-implementations.md) for more information.
+- Use cryptographic mechanisms that are not based on block ciphers. In
+  particular, for authenticated encryption, use ChaCha20/Poly1305 instead of
+  block cipher modes. For random generation, use HMAC\_DRBG instead of CTR\_DRBG.

From 18ffba6100c8a12380debe128b24ccb649482495 Mon Sep 17 00:00:00 2001
From: Janos Follath <janos.follath@arm.com>
Date: Wed, 8 Mar 2023 19:58:29 +0000
Subject: [PATCH 08/45] Threat Model: improve wording

Signed-off-by: Janos Follath <janos.follath@arm.com>
---
 SECURITY.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/SECURITY.md b/SECURITY.md
index 677e68555d..d0281ace93 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -42,14 +42,14 @@ Ciphers](#block-ciphers) section.
 
 ### Local attacks
 
-The attacker is capable of running code on the same hardware as Mbed TLS, but
-there is still a security boundary between them (ie. the attacker can't for
-example read secrets from Mbed TLS' memory directly).
+The attacker can run software on the same machine. The attacker has
+insufficient privileges to directly access Mbed TLS assets such as memory and
+files.
 
 #### Timing attacks
 
-The attacker can gain information about the time taken by certain sets of
-instructions in Mbed TLS operations. (See for example the [Flush+Reload
+The attacker is able to observe the timing of instructions executed by Mbed
+TLS.(See for example the [Flush+Reload
 paper](https://eprint.iacr.org/2013/448.pdf).)
 
 (Technically, timing information can be observed over the network or through

From 8257d8aa000a51e9e7064ecf891459db0c115652 Mon Sep 17 00:00:00 2001
From: Janos Follath <janos.follath@arm.com>
Date: Wed, 8 Mar 2023 20:07:59 +0000
Subject: [PATCH 09/45] Threat Model: clarify attack vectors

Timing attacks can be launched by any of the three main attacker types
(remote, local and physical). Clarify exactly how these are covered.

Signed-off-by: Janos Follath <janos.follath@arm.com>
---
 SECURITY.md | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/SECURITY.md b/SECURITY.md
index d0281ace93..387221e61f 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -52,17 +52,16 @@ The attacker is able to observe the timing of instructions executed by Mbed
 TLS.(See for example the [Flush+Reload
 paper](https://eprint.iacr.org/2013/448.pdf).)
 
-(Technically, timing information can be observed over the network or through
-physical side channels as well. Network timing attacks are less powerful than
-local and countermeasures protecting against local attacks prevent network
-attacks as well. If the timing information is gained through physical side
-channels, we consider them physical attacks and as such they are out of scope.)
-
 Mbed TLS provides limited protection against timing attacks. The cost of
 protecting against timing attacks widely varies depending on the granularity of
 the measurements and the noise present. Therefore the protection in Mbed TLS is
 limited. We are only aiming to provide protection against **publicly
-documented** attacks, and this protection is not currently complete.
+documented** attacks.
+
+**Remark:** Timing information can be observed over the network or through
+physical side channels as well. Remote and physical timing attacks are covered
+in the [Remote attacks](remote-attacks) and [Physical
+attacks](physical-attacks) sections respectively.
 
 **Warning!** Block ciphers do not yet achieve full protection. For
 details and workarounds see the [Block Ciphers](#block-ciphers) section.

From 6ce259d287b48e0420caf6a9a89aaaf8fb710c2e Mon Sep 17 00:00:00 2001
From: Janos Follath <janos.follath@arm.com>
Date: Tue, 14 Mar 2023 12:47:27 +0000
Subject: [PATCH 10/45] Threat Model: improve wording and grammar

Signed-off-by: Janos Follath <janos.follath@arm.com>
---
 SECURITY.md | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/SECURITY.md b/SECURITY.md
index 387221e61f..dcffa1d9be 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -21,7 +21,7 @@ Users are urged to always use the latest version of a maintained branch.
 
 ## Threat model
 
-We use the following classification of attacks:
+We classify attacks based on the capabilities of the attacker.
 
 ### Remote attacks
 
@@ -32,13 +32,13 @@ or delaying legitimate messages, and injecting messages.
 Mbed TLS aims to fully protect against remote attacks and to enable the user
 application in providing full protection against remote attacks. Said
 protection is limited to providing security guarantees offered by the protocol
-in question. (For example Mbed TLS alone won't guarantee that the messages will
-arrive without delay, as the TLS protocol doesn't guarantee that either.)
+being implemented. (For example Mbed TLS alone won't guarantee that the
+messages will arrive without delay, as the TLS protocol doesn't guarantee that
+either.)
 
-**Warning!** Depending on network latency, the timing of messages might be
-enough to launch some timing attacks. Block ciphers do not yet achieve full
-protection against these. For details and workarounds see the [Block
-Ciphers](#block-ciphers) section.
+**Warning!** Block ciphers do not yet achieve full protection against attackers
+who can measure the timing of packets with sufficient precision. For details
+and workarounds see the [Block Ciphers](#block-ciphers) section.
 
 ### Local attacks
 
@@ -70,14 +70,14 @@ details and workarounds see the [Block Ciphers](#block-ciphers) section.
 
 The attacker code running on the platform has access to some sensor capable of
 picking up information on the physical state of the hardware while Mbed TLS is
-running. This can for example be any analogue to digital converter on the
+running. This could for example be an analogue-to-digital converter on the
 platform that is located unfortunately enough to pick up the CPU noise. (See
 for example the [Leaky Noise
 paper](https://tches.iacr.org/index.php/TCHES/article/view/8297).)
 
-Mbed TLS doesn't offer any security guarantees against local non-timing based
+Mbed TLS doesn't make any security guarantees against local non-timing-based
 side channel attacks. If local non-timing attacks are present in a use case or
-a user application's threat model, it needs to be mitigated by the platform.
+a user application's threat model, they need to be mitigated by the platform.
 
 #### Local fault injection attacks
 
@@ -85,23 +85,23 @@ Software running on the same hardware can affect the physical state of the
 device and introduce faults. (See for example the [Row Hammer
 paper](https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf).)
 
-Mbed TLS doesn't offer any security guarantees against local fault injection
+Mbed TLS doesn't make any security guarantees against local fault injection
 attacks. If local fault injection attacks are present in a use case or a user
-application's threat model, it needs to be mitigated by the platform.
+application's threat model, they need to be mitigated by the platform.
 
 ### Physical attacks
 
 The attacker has access to physical information about the hardware Mbed TLS is
-running on and/or can alter the physical state of the hardware (eg. power
+running on and/or can alter the physical state of the hardware (e.g. power
 analysis, radio emissions or fault injection).
 
-Mbed TLS doesn't offer any security guarantees against physical attacks. If
+Mbed TLS doesn't make any security guarantees against physical attacks. If
 physical attacks are present in a use case or a user application's threat
-model, it needs to be mitigated by physical countermeasures.
+model, they need to be mitigated by physical countermeasures.
 
 ### Caveats
 
-#### Out of scope countermeasures
+#### Out-of-scope countermeasures
 
 Mbed TLS has evolved organically and a well defined threat model hasn't always
 been present. Therefore, Mbed TLS might have countermeasures against attacks

From 08094b831382066eb8111adbe5545ffb2a0a07f7 Mon Sep 17 00:00:00 2001
From: Janos Follath <janos.follath@arm.com>
Date: Tue, 14 Mar 2023 14:49:34 +0000
Subject: [PATCH 11/45] Threat Model: clarify stance on timing attacks

Signed-off-by: Janos Follath <janos.follath@arm.com>
---
 SECURITY.md | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/SECURITY.md b/SECURITY.md
index dcffa1d9be..97fe0e7475 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -48,15 +48,20 @@ files.
 
 #### Timing attacks
 
-The attacker is able to observe the timing of instructions executed by Mbed
-TLS.(See for example the [Flush+Reload
-paper](https://eprint.iacr.org/2013/448.pdf).)
+The attacker is able to observe the timing of instructions executed by Mbed TLS
+by leveraging shared hardware that both Mbed TLS and the attacker have access
+to. Typical attack vectors include cache timings, memory bus contention and
+branch prediction.
 
 Mbed TLS provides limited protection against timing attacks. The cost of
 protecting against timing attacks widely varies depending on the granularity of
 the measurements and the noise present. Therefore the protection in Mbed TLS is
 limited. We are only aiming to provide protection against **publicly
-documented** attacks.
+documented attack techniques**.
+
+As attacks keep improving, so does Mbed TLS's protection. Mbed TLS is moving
+towards a model of fully timing-invariant code, but has not reached this point
+yet.
 
 **Remark:** Timing information can be observed over the network or through
 physical side channels as well. Remote and physical timing attacks are covered

From e3d677c6aa2b246c16ae8b2bf824f61e110d9a26 Mon Sep 17 00:00:00 2001
From: Janos Follath <janos.follath@arm.com>
Date: Tue, 14 Mar 2023 14:54:44 +0000
Subject: [PATCH 12/45] Threat Model: remove references

Remove references to scientific papers as they are too specific and
might be misleading.

Signed-off-by: Janos Follath <janos.follath@arm.com>
---
 SECURITY.md | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/SECURITY.md b/SECURITY.md
index 97fe0e7475..8d2337111c 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -76,9 +76,7 @@ details and workarounds see the [Block Ciphers](#block-ciphers) section.
 The attacker code running on the platform has access to some sensor capable of
 picking up information on the physical state of the hardware while Mbed TLS is
 running. This could for example be an analogue-to-digital converter on the
-platform that is located unfortunately enough to pick up the CPU noise. (See
-for example the [Leaky Noise
-paper](https://tches.iacr.org/index.php/TCHES/article/view/8297).)
+platform that is located unfortunately enough to pick up the CPU noise.
 
 Mbed TLS doesn't make any security guarantees against local non-timing-based
 side channel attacks. If local non-timing attacks are present in a use case or
@@ -87,8 +85,7 @@ a user application's threat model, they need to be mitigated by the platform.
 #### Local fault injection attacks
 
 Software running on the same hardware can affect the physical state of the
-device and introduce faults. (See for example the [Row Hammer
-paper](https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf).)
+device and introduce faults.
 
 Mbed TLS doesn't make any security guarantees against local fault injection
 attacks. If local fault injection attacks are present in a use case or a user

From 6cd045905fa83de7e4c450774d260ba663094e97 Mon Sep 17 00:00:00 2001
From: Janos Follath <janos.follath@arm.com>
Date: Tue, 14 Mar 2023 15:43:24 +0000
Subject: [PATCH 13/45] Threat Model: adjust modality

Signed-off-by: Janos Follath <janos.follath@arm.com>
---
 SECURITY.md | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/SECURITY.md b/SECURITY.md
index 8d2337111c..8d3678a5ee 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -25,9 +25,10 @@ We classify attacks based on the capabilities of the attacker.
 
 ### Remote attacks
 
-The attacker can observe and modify data sent over the network. This includes
-observing the content and timing of individual packets, as well as suppressing
-or delaying legitimate messages, and injecting messages.
+In this section, we consider an attacker who can observe and modify data sent
+over the network. This includes observing the content and timing of individual
+packets, as well as suppressing or delaying legitimate messages, and injecting
+messages.
 
 Mbed TLS aims to fully protect against remote attacks and to enable the user
 application in providing full protection against remote attacks. Said
@@ -42,9 +43,9 @@ and workarounds see the [Block Ciphers](#block-ciphers) section.
 
 ### Local attacks
 
-The attacker can run software on the same machine. The attacker has
-insufficient privileges to directly access Mbed TLS assets such as memory and
-files.
+In this section, we consider an attacker who can run software on the same
+machine. The attacker has insufficient privileges to directly access Mbed TLS
+assets such as memory and files.
 
 #### Timing attacks
 
@@ -93,9 +94,10 @@ application's threat model, they need to be mitigated by the platform.
 
 ### Physical attacks
 
-The attacker has access to physical information about the hardware Mbed TLS is
-running on and/or can alter the physical state of the hardware (e.g. power
-analysis, radio emissions or fault injection).
+In this section, we consider an attacker who can attacker has access to
+physical information about the hardware Mbed TLS is running on and/or can alter
+the physical state of the hardware (e.g. power analysis, radio emissions or
+fault injection).
 
 Mbed TLS doesn't make any security guarantees against physical attacks. If
 physical attacks are present in a use case or a user application's threat

From 35f5ef01f21da18111ce9e59a7e194e1bf55b149 Mon Sep 17 00:00:00 2001
From: Janos Follath <janos.follath@arm.com>
Date: Wed, 15 Mar 2023 15:43:08 +0000
Subject: [PATCH 14/45] Threat Model: adjust to 2.28

MBEDTLS_AESCE_C is not available in 2.28, so remove it from workarounds.

Signed-off-by: Janos Follath <janos.follath@arm.com>
---
 SECURITY.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/SECURITY.md b/SECURITY.md
index 8d3678a5ee..e25601bcd5 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -129,7 +129,7 @@ even a remote. The attacks can result in key recovery.
 
 - Turn on hardware acceleration for AES. This is supported only on selected
   architectures and currently only available for AES. See configuration options
-  `MBEDTLS_AESCE_C`, `MBEDTLS_AESNI_C` and `MBEDTLS_PADLOCK_C` for details.
+  `MBEDTLS_AESNI_C` and `MBEDTLS_PADLOCK_C` for details.
 - Add a secure alternative implementation (typically hardware acceleration) for
   the vulnerable cipher. See the [Alternative Implementations
 Guide](docs/architecture/alternative-implementations.md) for more information.

From 83050519a7bf27078f3421bf1e6abc6d8e2c7376 Mon Sep 17 00:00:00 2001
From: Janos Follath <janos.follath@arm.com>
Date: Thu, 16 Mar 2023 15:00:03 +0000
Subject: [PATCH 15/45] Threat Model: fix copy paste

Signed-off-by: Janos Follath <janos.follath@arm.com>
---
 SECURITY.md | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/SECURITY.md b/SECURITY.md
index e25601bcd5..732335b233 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -94,10 +94,9 @@ application's threat model, they need to be mitigated by the platform.
 
 ### Physical attacks
 
-In this section, we consider an attacker who can attacker has access to
-physical information about the hardware Mbed TLS is running on and/or can alter
-the physical state of the hardware (e.g. power analysis, radio emissions or
-fault injection).
+In this section, we consider an attacker who has access to physical information
+about the hardware Mbed TLS is running on and/or can alter the physical state
+of the hardware (e.g. power analysis, radio emissions or fault injection).
 
 Mbed TLS doesn't make any security guarantees against physical attacks. If
 physical attacks are present in a use case or a user application's threat

From 6055b783285ae767a681c9a19ff58f053c37e8d7 Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Fri, 10 Mar 2023 22:21:47 +0100
Subject: [PATCH 16/45] Update bibliographic references

There are new versions of the Intel whitepapers and they've moved.

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 library/aesni.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/library/aesni.c b/library/aesni.c
index 2a44b0ea32..624df8cb67 100644
--- a/library/aesni.c
+++ b/library/aesni.c
@@ -18,8 +18,8 @@
  */
 
 /*
- * [AES-WP] http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-aes-instructions-set
- * [CLMUL-WP] http://software.intel.com/en-us/articles/intel-carry-less-multiplication-instruction-and-its-usage-for-computing-the-gcm-mode/
+ * [AES-WP] https://www.intel.com/content/www/us/en/developer/articles/tool/intel-advanced-encryption-standard-aes-instructions-set.html
+ * [CLMUL-WP] https://www.intel.com/content/www/us/en/develop/download/intel-carry-less-multiplication-instruction-and-its-usage-for-computing-the-gcm-mode.html
  */
 
 #include "common.h"
@@ -158,7 +158,7 @@ void mbedtls_aesni_gcm_mult(unsigned char c[16],
 
          /*
           * Caryless multiplication xmm2:xmm1 = xmm0 * xmm1
-          * using [CLMUL-WP] algorithm 1 (p. 13).
+          * using [CLMUL-WP] algorithm 1 (p. 12).
           */
          "movdqa %%xmm1, %%xmm2             \n\t" // copy of b1:b0
          "movdqa %%xmm1, %%xmm3             \n\t" // same
@@ -176,7 +176,7 @@ void mbedtls_aesni_gcm_mult(unsigned char c[16],
 
          /*
           * Now shift the result one bit to the left,
-          * taking advantage of [CLMUL-WP] eq 27 (p. 20)
+          * taking advantage of [CLMUL-WP] eq 27 (p. 18)
           */
                              "movdqa %%xmm1, %%xmm3             \n\t" // r1:r0
                              "movdqa %%xmm2, %%xmm4             \n\t" // r3:r2
@@ -194,7 +194,7 @@ void mbedtls_aesni_gcm_mult(unsigned char c[16],
 
          /*
           * Now reduce modulo the GCM polynomial x^128 + x^7 + x^2 + x + 1
-          * using [CLMUL-WP] algorithm 5 (p. 20).
+          * using [CLMUL-WP] algorithm 5 (p. 18).
           * Currently xmm2:xmm1 holds x3:x2:x1:x0 (already shifted).
           */
          /* Step 2 (1) */

From 18d521a57d66a9414665f309f237390dfd9d2397 Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Fri, 10 Mar 2023 22:25:13 +0100
Subject: [PATCH 17/45] Don't warn about Msan/Valgrind if AESNI isn't actually
 built

The warning is only correct if the assembly code for AESNI is built, not if
MBEDTLS_AESNI_C is activated but MBEDTLS_HAVE_ASM is disabled or the target
architecture isn't x86_64.

This is a partial fix for #7236.

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 library/aesni.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/library/aesni.c b/library/aesni.c
index 624df8cb67..9ade1bf08a 100644
--- a/library/aesni.c
+++ b/library/aesni.c
@@ -26,13 +26,6 @@
 
 #if defined(MBEDTLS_AESNI_C)
 
-#if defined(__has_feature)
-#if __has_feature(memory_sanitizer)
-#warning \
-    "MBEDTLS_AESNI_C is known to cause spurious error reports with some memory sanitizers as they do not understand the assembly code."
-#endif
-#endif
-
 #include "mbedtls/aesni.h"
 
 #include <string.h>
@@ -65,6 +58,13 @@ int mbedtls_aesni_has_support(unsigned int what)
     return (c & what) != 0;
 }
 
+#if defined(__has_feature)
+#if __has_feature(memory_sanitizer)
+#warning \
+    "MBEDTLS_AESNI_C is known to cause spurious error reports with some memory sanitizers as they do not understand the assembly code."
+#endif
+#endif
+
 /*
  * Binutils needs to be at least 2.19 to support AES-NI instructions.
  * Unfortunately, a lot of users have a lower version now (2014-04).

From 2808a6047cc5ff0fc8fe48502d59ee113c7c89d8 Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Wed, 15 Mar 2023 19:36:03 +0100
Subject: [PATCH 18/45] Improve the presentation of assembly blocks

Uncrustify indents
```
    asm("foo"
        HELLO "bar"
              "wibble");
```
but we would like
```
    asm("foo"
        HELLO "bar"
        "wibble");
```
Make "bar" an argument of the macro HELLO, which makes the indentation from
uncrustify match the semantics (everything should be aligned to the same
column).
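
As a minimal sketch of why this works: adjacent string literals are
concatenated by the compiler, so a macro that takes its operand as an
argument still expands to one assembly line. The helper `demo_round_text`
is hypothetical and exists only for illustration; the `AESENC` and
`xmm0_xmm1` definitions are copied from this patch.

```c
#define AESENC(regs)      ".byte 0x66,0x0F,0x38,0xDC," regs "\n\t"
#define xmm0_xmm1   "0xC8"

/* Returns the text that would be handed to the assembler.
 * Every line starts in the same column, which is what uncrustify
 * produces once the operand is a macro argument. */
static const char *demo_round_text(void)
{
    return "movdqu    (%1), %%xmm1    \n\t"
           AESENC(xmm0_xmm1)
           "add       $16, %1         \n\t";
}
```

The macro now supplies the trailing "\n\t" itself, so call sites no longer
need to append it.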

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 library/aesni.c | 236 ++++++++++++++++++++++++------------------------
 1 file changed, 118 insertions(+), 118 deletions(-)

diff --git a/library/aesni.c b/library/aesni.c
index 9ade1bf08a..2756194da8 100644
--- a/library/aesni.c
+++ b/library/aesni.c
@@ -75,13 +75,13 @@ int mbedtls_aesni_has_support(unsigned int what)
  * Operand macros are in gas order (src, dst) as opposed to Intel order
  * (dst, src) in order to blend better into the surrounding assembly code.
  */
-#define AESDEC      ".byte 0x66,0x0F,0x38,0xDE,"
-#define AESDECLAST  ".byte 0x66,0x0F,0x38,0xDF,"
-#define AESENC      ".byte 0x66,0x0F,0x38,0xDC,"
-#define AESENCLAST  ".byte 0x66,0x0F,0x38,0xDD,"
-#define AESIMC      ".byte 0x66,0x0F,0x38,0xDB,"
-#define AESKEYGENA  ".byte 0x66,0x0F,0x3A,0xDF,"
-#define PCLMULQDQ   ".byte 0x66,0x0F,0x3A,0x44,"
+#define AESDEC(regs)      ".byte 0x66,0x0F,0x38,0xDE," regs "\n\t"
+#define AESDECLAST(regs)  ".byte 0x66,0x0F,0x38,0xDF," regs "\n\t"
+#define AESENC(regs)      ".byte 0x66,0x0F,0x38,0xDC," regs "\n\t"
+#define AESENCLAST(regs)  ".byte 0x66,0x0F,0x38,0xDD," regs "\n\t"
+#define AESIMC(regs)      ".byte 0x66,0x0F,0x38,0xDB," regs "\n\t"
+#define AESKEYGENA(regs, imm)  ".byte 0x66,0x0F,0x3A,0xDF," regs "," imm "\n\t"
+#define PCLMULQDQ(regs, imm)   ".byte 0x66,0x0F,0x3A,0x44," regs "," imm "\n\t"
 
 #define xmm0_xmm0   "0xC0"
 #define xmm0_xmm1   "0xC8"
@@ -109,25 +109,25 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
 
          "1:                        \n\t" // encryption loop
          "movdqu    (%1), %%xmm1    \n\t" // load round key
-         AESENC     xmm1_xmm0      "\n\t" // do round
-                                   "add       $16, %1         \n\t" // point to next round key
-                                   "subl      $1, %0          \n\t" // loop
-                                   "jnz       1b              \n\t"
-                                   "movdqu    (%1), %%xmm1    \n\t" // load round key
-         AESENCLAST xmm1_xmm0      "\n\t" // last round
-                                   "jmp       3f              \n\t"
+         AESENC(xmm1_xmm0)                // do round
+         "add       $16, %1         \n\t" // point to next round key
+         "subl      $1, %0          \n\t" // loop
+         "jnz       1b              \n\t"
+         "movdqu    (%1), %%xmm1    \n\t" // load round key
+         AESENCLAST(xmm1_xmm0)            // last round
+         "jmp       3f              \n\t"
 
-                                   "2:                        \n\t" // decryption loop
-                                   "movdqu    (%1), %%xmm1    \n\t"
-         AESDEC     xmm1_xmm0      "\n\t" // do round
-                                   "add       $16, %1         \n\t"
-                                   "subl      $1, %0          \n\t"
-                                   "jnz       2b              \n\t"
-                                   "movdqu    (%1), %%xmm1    \n\t" // load round key
-         AESDECLAST xmm1_xmm0      "\n\t" // last round
+         "2:                        \n\t" // decryption loop
+         "movdqu    (%1), %%xmm1    \n\t"
+         AESDEC(xmm1_xmm0)                // do round
+         "add       $16, %1         \n\t"
+         "subl      $1, %0          \n\t"
+         "jnz       2b              \n\t"
+         "movdqu    (%1), %%xmm1    \n\t" // load round key
+         AESDECLAST(xmm1_xmm0)            // last round
 
-                                   "3:                        \n\t"
-                                   "movdqu    %%xmm0, (%4)    \n\t" // export output
+         "3:                        \n\t"
+         "movdqu    %%xmm0, (%4)    \n\t" // export output
          :
          : "r" (ctx->nr), "r" (ctx->rk), "r" (mode), "r" (input), "r" (output)
          : "memory", "cc", "xmm0", "xmm1");
@@ -163,34 +163,34 @@ void mbedtls_aesni_gcm_mult(unsigned char c[16],
          "movdqa %%xmm1, %%xmm2             \n\t" // copy of b1:b0
          "movdqa %%xmm1, %%xmm3             \n\t" // same
          "movdqa %%xmm1, %%xmm4             \n\t" // same
-         PCLMULQDQ xmm0_xmm1 ",0x00         \n\t" // a0*b0 = c1:c0
-         PCLMULQDQ xmm0_xmm2 ",0x11         \n\t" // a1*b1 = d1:d0
-         PCLMULQDQ xmm0_xmm3 ",0x10         \n\t" // a0*b1 = e1:e0
-         PCLMULQDQ xmm0_xmm4 ",0x01         \n\t" // a1*b0 = f1:f0
-                             "pxor %%xmm3, %%xmm4               \n\t" // e1+f1:e0+f0
-                             "movdqa %%xmm4, %%xmm3             \n\t" // same
-                             "psrldq $8, %%xmm4                 \n\t" // 0:e1+f1
-                             "pslldq $8, %%xmm3                 \n\t" // e0+f0:0
-                             "pxor %%xmm4, %%xmm2               \n\t" // d1:d0+e1+f1
-                             "pxor %%xmm3, %%xmm1               \n\t" // c1+e0+f1:c0
+         PCLMULQDQ(xmm0_xmm1, "0x00")             // a0*b0 = c1:c0
+         PCLMULQDQ(xmm0_xmm2, "0x11")             // a1*b1 = d1:d0
+         PCLMULQDQ(xmm0_xmm3, "0x10")             // a0*b1 = e1:e0
+         PCLMULQDQ(xmm0_xmm4, "0x01")             // a1*b0 = f1:f0
+         "pxor %%xmm3, %%xmm4               \n\t" // e1+f1:e0+f0
+         "movdqa %%xmm4, %%xmm3             \n\t" // same
+         "psrldq $8, %%xmm4                 \n\t" // 0:e1+f1
+         "pslldq $8, %%xmm3                 \n\t" // e0+f0:0
+         "pxor %%xmm4, %%xmm2               \n\t" // d1:d0+e1+f1
+         "pxor %%xmm3, %%xmm1               \n\t" // c1+e0+f1:c0
 
          /*
           * Now shift the result one bit to the left,
           * taking advantage of [CLMUL-WP] eq 27 (p. 18)
           */
-                             "movdqa %%xmm1, %%xmm3             \n\t" // r1:r0
-                             "movdqa %%xmm2, %%xmm4             \n\t" // r3:r2
-                             "psllq $1, %%xmm1                  \n\t" // r1<<1:r0<<1
-                             "psllq $1, %%xmm2                  \n\t" // r3<<1:r2<<1
-                             "psrlq $63, %%xmm3                 \n\t" // r1>>63:r0>>63
-                             "psrlq $63, %%xmm4                 \n\t" // r3>>63:r2>>63
-                             "movdqa %%xmm3, %%xmm5             \n\t" // r1>>63:r0>>63
-                             "pslldq $8, %%xmm3                 \n\t" // r0>>63:0
-                             "pslldq $8, %%xmm4                 \n\t" // r2>>63:0
-                             "psrldq $8, %%xmm5                 \n\t" // 0:r1>>63
-                             "por %%xmm3, %%xmm1                \n\t" // r1<<1|r0>>63:r0<<1
-                             "por %%xmm4, %%xmm2                \n\t" // r3<<1|r2>>62:r2<<1
-                             "por %%xmm5, %%xmm2                \n\t" // r3<<1|r2>>62:r2<<1|r1>>63
+         "movdqa %%xmm1, %%xmm3             \n\t" // r1:r0
+         "movdqa %%xmm2, %%xmm4             \n\t" // r3:r2
+         "psllq $1, %%xmm1                  \n\t" // r1<<1:r0<<1
+         "psllq $1, %%xmm2                  \n\t" // r3<<1:r2<<1
+         "psrlq $63, %%xmm3                 \n\t" // r1>>63:r0>>63
+         "psrlq $63, %%xmm4                 \n\t" // r3>>63:r2>>63
+         "movdqa %%xmm3, %%xmm5             \n\t" // r1>>63:r0>>63
+         "pslldq $8, %%xmm3                 \n\t" // r0>>63:0
+         "pslldq $8, %%xmm4                 \n\t" // r2>>63:0
+         "psrldq $8, %%xmm5                 \n\t" // 0:r1>>63
+         "por %%xmm3, %%xmm1                \n\t" // r1<<1|r0>>63:r0<<1
+         "por %%xmm4, %%xmm2                \n\t" // r3<<1|r2>>62:r2<<1
+         "por %%xmm5, %%xmm2                \n\t" // r3<<1|r2>>62:r2<<1|r1>>63
 
          /*
           * Now reduce modulo the GCM polynomial x^128 + x^7 + x^2 + x + 1
@@ -198,44 +198,44 @@ void mbedtls_aesni_gcm_mult(unsigned char c[16],
           * Currently xmm2:xmm1 holds x3:x2:x1:x0 (already shifted).
           */
          /* Step 2 (1) */
-                             "movdqa %%xmm1, %%xmm3             \n\t" // x1:x0
-                             "movdqa %%xmm1, %%xmm4             \n\t" // same
-                             "movdqa %%xmm1, %%xmm5             \n\t" // same
-                             "psllq $63, %%xmm3                 \n\t" // x1<<63:x0<<63 = stuff:a
-                             "psllq $62, %%xmm4                 \n\t" // x1<<62:x0<<62 = stuff:b
-                             "psllq $57, %%xmm5                 \n\t" // x1<<57:x0<<57 = stuff:c
+         "movdqa %%xmm1, %%xmm3             \n\t" // x1:x0
+         "movdqa %%xmm1, %%xmm4             \n\t" // same
+         "movdqa %%xmm1, %%xmm5             \n\t" // same
+         "psllq $63, %%xmm3                 \n\t" // x1<<63:x0<<63 = stuff:a
+         "psllq $62, %%xmm4                 \n\t" // x1<<62:x0<<62 = stuff:b
+         "psllq $57, %%xmm5                 \n\t" // x1<<57:x0<<57 = stuff:c
 
          /* Step 2 (2) */
-                             "pxor %%xmm4, %%xmm3               \n\t" // stuff:a+b
-                             "pxor %%xmm5, %%xmm3               \n\t" // stuff:a+b+c
-                             "pslldq $8, %%xmm3                 \n\t" // a+b+c:0
-                             "pxor %%xmm3, %%xmm1               \n\t" // x1+a+b+c:x0 = d:x0
+         "pxor %%xmm4, %%xmm3               \n\t" // stuff:a+b
+         "pxor %%xmm5, %%xmm3               \n\t" // stuff:a+b+c
+         "pslldq $8, %%xmm3                 \n\t" // a+b+c:0
+         "pxor %%xmm3, %%xmm1               \n\t" // x1+a+b+c:x0 = d:x0
 
          /* Steps 3 and 4 */
-                             "movdqa %%xmm1,%%xmm0              \n\t" // d:x0
-                             "movdqa %%xmm1,%%xmm4              \n\t" // same
-                             "movdqa %%xmm1,%%xmm5              \n\t" // same
-                             "psrlq $1, %%xmm0                  \n\t" // e1:x0>>1 = e1:e0'
-                             "psrlq $2, %%xmm4                  \n\t" // f1:x0>>2 = f1:f0'
-                             "psrlq $7, %%xmm5                  \n\t" // g1:x0>>7 = g1:g0'
-                             "pxor %%xmm4, %%xmm0               \n\t" // e1+f1:e0'+f0'
-                             "pxor %%xmm5, %%xmm0               \n\t" // e1+f1+g1:e0'+f0'+g0'
+         "movdqa %%xmm1,%%xmm0              \n\t" // d:x0
+         "movdqa %%xmm1,%%xmm4              \n\t" // same
+         "movdqa %%xmm1,%%xmm5              \n\t" // same
+         "psrlq $1, %%xmm0                  \n\t" // e1:x0>>1 = e1:e0'
+         "psrlq $2, %%xmm4                  \n\t" // f1:x0>>2 = f1:f0'
+         "psrlq $7, %%xmm5                  \n\t" // g1:x0>>7 = g1:g0'
+         "pxor %%xmm4, %%xmm0               \n\t" // e1+f1:e0'+f0'
+         "pxor %%xmm5, %%xmm0               \n\t" // e1+f1+g1:e0'+f0'+g0'
          // e0'+f0'+g0' is almost e0+f0+g0, ex\tcept for some missing
          // bits carried from d. Now get those\t bits back in.
-                             "movdqa %%xmm1,%%xmm3              \n\t" // d:x0
-                             "movdqa %%xmm1,%%xmm4              \n\t" // same
-                             "movdqa %%xmm1,%%xmm5              \n\t" // same
-                             "psllq $63, %%xmm3                 \n\t" // d<<63:stuff
-                             "psllq $62, %%xmm4                 \n\t" // d<<62:stuff
-                             "psllq $57, %%xmm5                 \n\t" // d<<57:stuff
-                             "pxor %%xmm4, %%xmm3               \n\t" // d<<63+d<<62:stuff
-                             "pxor %%xmm5, %%xmm3               \n\t" // missing bits of d:stuff
-                             "psrldq $8, %%xmm3                 \n\t" // 0:missing bits of d
-                             "pxor %%xmm3, %%xmm0               \n\t" // e1+f1+g1:e0+f0+g0
-                             "pxor %%xmm1, %%xmm0               \n\t" // h1:h0
-                             "pxor %%xmm2, %%xmm0               \n\t" // x3+h1:x2+h0
+         "movdqa %%xmm1,%%xmm3              \n\t" // d:x0
+         "movdqa %%xmm1,%%xmm4              \n\t" // same
+         "movdqa %%xmm1,%%xmm5              \n\t" // same
+         "psllq $63, %%xmm3                 \n\t" // d<<63:stuff
+         "psllq $62, %%xmm4                 \n\t" // d<<62:stuff
+         "psllq $57, %%xmm5                 \n\t" // d<<57:stuff
+         "pxor %%xmm4, %%xmm3               \n\t" // d<<63+d<<62:stuff
+         "pxor %%xmm5, %%xmm3               \n\t" // missing bits of d:stuff
+         "psrldq $8, %%xmm3                 \n\t" // 0:missing bits of d
+         "pxor %%xmm3, %%xmm0               \n\t" // e1+f1+g1:e0+f0+g0
+         "pxor %%xmm1, %%xmm0               \n\t" // h1:h0
+         "pxor %%xmm2, %%xmm0               \n\t" // x3+h1:x2+h0
 
-                             "movdqu %%xmm0, (%2)               \n\t" // done
+         "movdqu %%xmm0, (%2)               \n\t" // done
          :
          : "r" (aa), "r" (bb), "r" (cc)
          : "memory", "cc", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5");
@@ -261,8 +261,8 @@ void mbedtls_aesni_inverse_key(unsigned char *invkey,
 
     for (fk -= 16, ik += 16; fk > fwdkey; fk -= 16, ik += 16) {
         asm ("movdqu (%0), %%xmm0       \n\t"
-             AESIMC  xmm0_xmm0         "\n\t"
-                                       "movdqu %%xmm0, (%1)       \n\t"
+             AESIMC(xmm0_xmm0)
+             "movdqu %%xmm0, (%1)       \n\t"
              :
              : "r" (fk), "r" (ik)
              : "memory", "xmm0");
@@ -306,16 +306,16 @@ static void aesni_setkey_enc_128(unsigned char *rk,
 
          /* Main "loop" */
          "2:                                \n\t"
-         AESKEYGENA xmm0_xmm1 ",0x01        \n\tcall 1b \n\t"
-         AESKEYGENA xmm0_xmm1 ",0x02        \n\tcall 1b \n\t"
-         AESKEYGENA xmm0_xmm1 ",0x04        \n\tcall 1b \n\t"
-         AESKEYGENA xmm0_xmm1 ",0x08        \n\tcall 1b \n\t"
-         AESKEYGENA xmm0_xmm1 ",0x10        \n\tcall 1b \n\t"
-         AESKEYGENA xmm0_xmm1 ",0x20        \n\tcall 1b \n\t"
-         AESKEYGENA xmm0_xmm1 ",0x40        \n\tcall 1b \n\t"
-         AESKEYGENA xmm0_xmm1 ",0x80        \n\tcall 1b \n\t"
-         AESKEYGENA xmm0_xmm1 ",0x1B        \n\tcall 1b \n\t"
-         AESKEYGENA xmm0_xmm1 ",0x36        \n\tcall 1b \n\t"
+         AESKEYGENA(xmm0_xmm1, "0x01")      "call 1b \n\t"
+         AESKEYGENA(xmm0_xmm1, "0x02")      "call 1b \n\t"
+         AESKEYGENA(xmm0_xmm1, "0x04")      "call 1b \n\t"
+         AESKEYGENA(xmm0_xmm1, "0x08")      "call 1b \n\t"
+         AESKEYGENA(xmm0_xmm1, "0x10")      "call 1b \n\t"
+         AESKEYGENA(xmm0_xmm1, "0x20")      "call 1b \n\t"
+         AESKEYGENA(xmm0_xmm1, "0x40")      "call 1b \n\t"
+         AESKEYGENA(xmm0_xmm1, "0x80")      "call 1b \n\t"
+         AESKEYGENA(xmm0_xmm1, "0x1B")      "call 1b \n\t"
+         AESKEYGENA(xmm0_xmm1, "0x36")      "call 1b \n\t"
          :
          : "r" (rk), "r" (key)
          : "memory", "cc", "0");
@@ -364,14 +364,14 @@ static void aesni_setkey_enc_192(unsigned char *rk,
          "ret                           \n\t"
 
          "2:                            \n\t"
-         AESKEYGENA xmm1_xmm2 ",0x01    \n\tcall 1b \n\t"
-         AESKEYGENA xmm1_xmm2 ",0x02    \n\tcall 1b \n\t"
-         AESKEYGENA xmm1_xmm2 ",0x04    \n\tcall 1b \n\t"
-         AESKEYGENA xmm1_xmm2 ",0x08    \n\tcall 1b \n\t"
-         AESKEYGENA xmm1_xmm2 ",0x10    \n\tcall 1b \n\t"
-         AESKEYGENA xmm1_xmm2 ",0x20    \n\tcall 1b \n\t"
-         AESKEYGENA xmm1_xmm2 ",0x40    \n\tcall 1b \n\t"
-         AESKEYGENA xmm1_xmm2 ",0x80    \n\tcall 1b \n\t"
+         AESKEYGENA(xmm1_xmm2, "0x01")  "call 1b \n\t"
+         AESKEYGENA(xmm1_xmm2, "0x02")  "call 1b \n\t"
+         AESKEYGENA(xmm1_xmm2, "0x04")  "call 1b \n\t"
+         AESKEYGENA(xmm1_xmm2, "0x08")  "call 1b \n\t"
+         AESKEYGENA(xmm1_xmm2, "0x10")  "call 1b \n\t"
+         AESKEYGENA(xmm1_xmm2, "0x20")  "call 1b \n\t"
+         AESKEYGENA(xmm1_xmm2, "0x40")  "call 1b \n\t"
+         AESKEYGENA(xmm1_xmm2, "0x80")  "call 1b \n\t"
 
          :
          : "r" (rk), "r" (key)
@@ -414,31 +414,31 @@ static void aesni_setkey_enc_256(unsigned char *rk,
 
          /* Set xmm2 to stuff:Y:stuff:stuff with Y = subword( r11 )
           * and proceed to generate next round key from there */
-         AESKEYGENA xmm0_xmm2 ",0x00        \n\t"
-                              "pshufd $0xaa, %%xmm2, %%xmm2      \n\t"
-                              "pxor %%xmm1, %%xmm2               \n\t"
-                              "pslldq $4, %%xmm1                 \n\t"
-                              "pxor %%xmm1, %%xmm2               \n\t"
-                              "pslldq $4, %%xmm1                 \n\t"
-                              "pxor %%xmm1, %%xmm2               \n\t"
-                              "pslldq $4, %%xmm1                 \n\t"
-                              "pxor %%xmm2, %%xmm1               \n\t"
-                              "add $16, %0                       \n\t"
-                              "movdqu %%xmm1, (%0)               \n\t"
-                              "ret                               \n\t"
+         AESKEYGENA(xmm0_xmm2, "0x00")
+         "pshufd $0xaa, %%xmm2, %%xmm2      \n\t"
+         "pxor %%xmm1, %%xmm2               \n\t"
+         "pslldq $4, %%xmm1                 \n\t"
+         "pxor %%xmm1, %%xmm2               \n\t"
+         "pslldq $4, %%xmm1                 \n\t"
+         "pxor %%xmm1, %%xmm2               \n\t"
+         "pslldq $4, %%xmm1                 \n\t"
+         "pxor %%xmm2, %%xmm1               \n\t"
+         "add $16, %0                       \n\t"
+         "movdqu %%xmm1, (%0)               \n\t"
+         "ret                               \n\t"
 
          /*
           * Main "loop" - Generating one more key than necessary,
           * see definition of mbedtls_aes_context.buf
           */
-                              "2:                                \n\t"
-         AESKEYGENA xmm1_xmm2 ",0x01        \n\tcall 1b \n\t"
-         AESKEYGENA xmm1_xmm2 ",0x02        \n\tcall 1b \n\t"
-         AESKEYGENA xmm1_xmm2 ",0x04        \n\tcall 1b \n\t"
-         AESKEYGENA xmm1_xmm2 ",0x08        \n\tcall 1b \n\t"
-         AESKEYGENA xmm1_xmm2 ",0x10        \n\tcall 1b \n\t"
-         AESKEYGENA xmm1_xmm2 ",0x20        \n\tcall 1b \n\t"
-         AESKEYGENA xmm1_xmm2 ",0x40        \n\tcall 1b \n\t"
+         "2:                                \n\t"
+         AESKEYGENA(xmm1_xmm2, "0x01")      "call 1b \n\t"
+         AESKEYGENA(xmm1_xmm2, "0x02")      "call 1b \n\t"
+         AESKEYGENA(xmm1_xmm2, "0x04")      "call 1b \n\t"
+         AESKEYGENA(xmm1_xmm2, "0x08")      "call 1b \n\t"
+         AESKEYGENA(xmm1_xmm2, "0x10")      "call 1b \n\t"
+         AESKEYGENA(xmm1_xmm2, "0x20")      "call 1b \n\t"
+         AESKEYGENA(xmm1_xmm2, "0x40")      "call 1b \n\t"
          :
          : "r" (rk), "r" (key)
          : "memory", "cc", "0");

From 5511a34566aa455512a704604a11120dee426631 Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Fri, 10 Mar 2023 22:29:32 +0100
Subject: [PATCH 19/45] New preprocessor symbol indicating that AESNI support
 is present

The configuration symbol MBEDTLS_AESNI_C requests AESNI support, but it is
ignored if the platform doesn't have AESNI. This allows keeping
MBEDTLS_AESNI_C enabled (as it is in the default build) when building for
platforms other than x86_64, or when MBEDTLS_HAVE_ASM is disabled.

To facilitate maintenance, always use the symbol MBEDTLS_AESNI_HAVE_CODE to
answer the question "can I call mbedtls_aesni_xxx functions?", rather than
repeating the check `defined(MBEDTLS_AESNI_C) && ...`.
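
As an illustration of the intended call-site pattern, here is a minimal
sketch. The macro and function below are hypothetical stand-ins so that the
snippet compiles without Mbed TLS (in the library they come from aesni.h),
and `example_pick_aes_path` is an illustrative name, not a library function:

```c
/* Hypothetical stand-ins: in the library, aesni.h defines
 * MBEDTLS_AESNI_HAVE_CODE and declares mbedtls_aesni_has_support(). */
#define MBEDTLS_AESNI_HAVE_CODE
#define MBEDTLS_AESNI_AES 0x02000000u

static int mbedtls_aesni_has_support(unsigned int what)
{
    (void) what;
    return 0; /* stub: pretend the CPU does not report AESNI */
}

/* Returns 1 if the AESNI path would be taken, 0 for the portable path. */
static int example_pick_aes_path(void)
{
#if defined(MBEDTLS_AESNI_HAVE_CODE)
    /* Compile-time question: "can I call mbedtls_aesni_xxx functions?"
     * Runtime question: "does this CPU actually have the instructions?" */
    if (mbedtls_aesni_has_support(MBEDTLS_AESNI_AES)) {
        return 1;
    }
#endif
    return 0;
}
```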

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 include/mbedtls/aesni.h | 22 ++++++++++++++++++++--
 library/aes.c           |  6 +++---
 library/gcm.c           |  6 +++---
 3 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/include/mbedtls/aesni.h b/include/mbedtls/aesni.h
index 653b146e7f..b3d49e4380 100644
--- a/include/mbedtls/aesni.h
+++ b/include/mbedtls/aesni.h
@@ -36,13 +36,30 @@
 #define MBEDTLS_AESNI_AES      0x02000000u
 #define MBEDTLS_AESNI_CLMUL    0x00000002u
 
-#if defined(MBEDTLS_HAVE_ASM) && defined(__GNUC__) &&  \
+#if defined(MBEDTLS_HAVE_ASM) && defined(__GNUC__) && \
     (defined(__amd64__) || defined(__x86_64__))   &&  \
     !defined(MBEDTLS_HAVE_X86_64)
 #define MBEDTLS_HAVE_X86_64
 #endif
 
+#if defined(MBEDTLS_AESNI_C)
+
 #if defined(MBEDTLS_HAVE_X86_64)
+#define MBEDTLS_AESNI_HAVE_CODE // via assembly
+#endif
+
+#if defined(_MSC_VER)
+#define MBEDTLS_HAVE_AESNI_INTRINSICS
+#endif
+#if defined(__GNUC__) && defined(__AES__)
+#define MBEDTLS_HAVE_AESNI_INTRINSICS
+#endif
+
+#if defined(MBEDTLS_HAVE_AESNI_INTRINSICS)
+#define MBEDTLS_AESNI_HAVE_CODE // via intrinsics
+#endif
+
+#if defined(MBEDTLS_AESNI_HAVE_CODE)
 
 #ifdef __cplusplus
 extern "C" {
@@ -131,6 +148,7 @@ int mbedtls_aesni_setkey_enc(unsigned char *rk,
 }
 #endif
 
-#endif /* MBEDTLS_HAVE_X86_64 */
+#endif /* MBEDTLS_AESNI_HAVE_CODE */
+#endif  /* MBEDTLS_AESNI_C */
 
 #endif /* MBEDTLS_AESNI_H */
diff --git a/library/aes.c b/library/aes.c
index bcdf3c782b..66c697a796 100644
--- a/library/aes.c
+++ b/library/aes.c
@@ -550,7 +550,7 @@ int mbedtls_aes_setkey_enc(mbedtls_aes_context *ctx, const unsigned char *key,
 #endif
     ctx->rk = RK = ctx->buf;
 
-#if defined(MBEDTLS_AESNI_C) && defined(MBEDTLS_HAVE_X86_64)
+#if defined(MBEDTLS_AESNI_HAVE_CODE)
     if (mbedtls_aesni_has_support(MBEDTLS_AESNI_AES)) {
         return mbedtls_aesni_setkey_enc((unsigned char *) ctx->rk, key, keybits);
     }
@@ -658,7 +658,7 @@ int mbedtls_aes_setkey_dec(mbedtls_aes_context *ctx, const unsigned char *key,
 
     ctx->nr = cty.nr;
 
-#if defined(MBEDTLS_AESNI_C) && defined(MBEDTLS_HAVE_X86_64)
+#if defined(MBEDTLS_AESNI_HAVE_CODE)
     if (mbedtls_aesni_has_support(MBEDTLS_AESNI_AES)) {
         mbedtls_aesni_inverse_key((unsigned char *) ctx->rk,
                                   (const unsigned char *) cty.rk, ctx->nr);
@@ -978,7 +978,7 @@ int mbedtls_aes_crypt_ecb(mbedtls_aes_context *ctx,
     AES_VALIDATE_RET(mode == MBEDTLS_AES_ENCRYPT ||
                      mode == MBEDTLS_AES_DECRYPT);
 
-#if defined(MBEDTLS_AESNI_C) && defined(MBEDTLS_HAVE_X86_64)
+#if defined(MBEDTLS_AESNI_HAVE_CODE)
     if (mbedtls_aesni_has_support(MBEDTLS_AESNI_AES)) {
         return mbedtls_aesni_crypt_ecb(ctx, mode, input, output);
     }
diff --git a/library/gcm.c b/library/gcm.c
index f7db0d42df..2778012d2d 100644
--- a/library/gcm.c
+++ b/library/gcm.c
@@ -93,7 +93,7 @@ static int gcm_gen_table(mbedtls_gcm_context *ctx)
     ctx->HL[8] = vl;
     ctx->HH[8] = vh;
 
-#if defined(MBEDTLS_AESNI_C) && defined(MBEDTLS_HAVE_X86_64)
+#if defined(MBEDTLS_AESNI_HAVE_CODE)
     /* With CLMUL support, we need only h, not the rest of the table */
     if (mbedtls_aesni_has_support(MBEDTLS_AESNI_CLMUL)) {
         return 0;
@@ -190,7 +190,7 @@ static void gcm_mult(mbedtls_gcm_context *ctx, const unsigned char x[16],
     unsigned char lo, hi, rem;
     uint64_t zh, zl;
 
-#if defined(MBEDTLS_AESNI_C) && defined(MBEDTLS_HAVE_X86_64)
+#if defined(MBEDTLS_AESNI_HAVE_CODE)
     if (mbedtls_aesni_has_support(MBEDTLS_AESNI_CLMUL)) {
         unsigned char h[16];
 
@@ -202,7 +202,7 @@ static void gcm_mult(mbedtls_gcm_context *ctx, const unsigned char x[16],
         mbedtls_aesni_gcm_mult(output, x, h);
         return;
     }
-#endif /* MBEDTLS_AESNI_C && MBEDTLS_HAVE_X86_64 */
+#endif /* MBEDTLS_AESNI_HAVE_CODE */
 
     lo = x[15] & 0xf;
 

From 2c8ad9400be0d58aebaa968864dea268857b0fb0 Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Fri, 10 Mar 2023 22:35:24 +0100
Subject: [PATCH 20/45] AES, GCM selftest: indicate which implementation is
 used

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 library/aes.c | 23 +++++++++++++++++++++++
 library/gcm.c | 14 ++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/library/aes.c b/library/aes.c
index 66c697a796..a81332d390 100644
--- a/library/aes.c
+++ b/library/aes.c
@@ -1785,6 +1785,29 @@ int mbedtls_aes_self_test(int verbose)
     memset(key, 0, 32);
     mbedtls_aes_init(&ctx);
 
+    if (verbose != 0) {
+#if defined(MBEDTLS_AES_ALT)
+        mbedtls_printf("  AES note: alternative implementation.\n");
+#else /* MBEDTLS_AES_ALT */
+#if defined(MBEDTLS_PADLOCK_C) && defined(MBEDTLS_HAVE_X86)
+        if (mbedtls_padlock_has_support(MBEDTLS_PADLOCK_ACE)) {
+            mbedtls_printf("  AES note: using VIA Padlock.\n");
+        } else
+#endif
+#if defined(MBEDTLS_AESNI_HAVE_CODE)
+        if (mbedtls_aesni_has_support(MBEDTLS_AESNI_AES)) {
+            mbedtls_printf("  AES note: using AESNI.\n");
+        } else
+#endif
+#if defined(MBEDTLS_AESCE_C) && defined(MBEDTLS_HAVE_ARM64)
+        if (mbedtls_aesce_has_support()) {
+            mbedtls_printf("  AES note: using AESCE.\n");
+        } else
+#endif
+        mbedtls_printf("  AES note: built-in implementation.\n");
+#endif /* MBEDTLS_AES_ALT */
+    }
+
     /*
      * ECB mode
      */
diff --git a/library/gcm.c b/library/gcm.c
index 2778012d2d..463ef48fcf 100644
--- a/library/gcm.c
+++ b/library/gcm.c
@@ -754,6 +754,20 @@ int mbedtls_gcm_self_test(int verbose)
     int i, j, ret;
     mbedtls_cipher_id_t cipher = MBEDTLS_CIPHER_ID_AES;
 
+    if (verbose != 0)
+    {
+#if defined(MBEDTLS_GCM_ALT)
+        mbedtls_printf("  GCM note: alternative implementation.\n");
+#else /* MBEDTLS_GCM_ALT */
+#if defined(MBEDTLS_AESNI_HAVE_CODE)
+        if (mbedtls_aesni_has_support(MBEDTLS_AESNI_CLMUL)) {
+            mbedtls_printf("  GCM note: using AESNI.\n");
+        } else
+#endif
+        mbedtls_printf("  GCM note: built-in implementation.\n");
+#endif /* MBEDTLS_GCM_ALT */
+    }
+
     for (j = 0; j < 3; j++) {
         int key_len = 128 + 64 * j;
 

From e7dc21fabbf2093812e8df1adfa1ff1dbf5b0d11 Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Fri, 10 Mar 2023 22:37:11 +0100
Subject: [PATCH 21/45] AESNI: add implementation with intrinsics

As of this commit, to use the intrinsics for MBEDTLS_AESNI_C:

* With MSVC, this should be the default.
* With Clang, build with `clang -maes -mpclmul` or equivalent.
* With GCC, build with `gcc -mpclmul -msse2` or equivalent.
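
To check which path a given compiler invocation would get, a sketch along
the following lines may help. It loosely mirrors the conditions in aesni.h
as of the previous commit, leaving out the MBEDTLS_AESNI_C and
MBEDTLS_HAVE_ASM configuration checks; `example_aesni_build_kind` is a
hypothetical helper, not part of the library:

```c
/* Describes which AESNI code path the compiler's predefined macros would
 * select. Intrinsics are checked first because aesni.c prefers them over
 * assembly when MBEDTLS_HAVE_AESNI_INTRINSICS is defined. */
static const char *example_aesni_build_kind(void)
{
#if defined(_MSC_VER)
    return "intrinsics (MSVC)";
#elif defined(__GNUC__) && defined(__AES__)
    return "intrinsics (GCC-like compiler with __AES__ defined)";
#elif defined(__GNUC__) && (defined(__amd64__) || defined(__x86_64__))
    return "assembly (GCC-like compiler on x86_64)";
#else
    return "none";
#endif
}
```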

In particular, for now, with a GCC-like compiler, when building specifically
for a target that supports both the AES and GCM instructions, the old
implementation using assembly is selected.

This method for platform selection will likely be improved in the future.
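
The aes.c part of this commit offsets the round key pointer so that it
lands on a 16-byte boundary. The arithmetic can be sketched in isolation as
follows, assuming the buffer is an array of uint32_t and therefore already
4-byte aligned; `example_rk_offset` is a hypothetical name used only here:

```c
#include <stddef.h>
#include <stdint.h>

/* Number of uint32_t elements to skip so that buf + offset is aligned on
 * a 16-byte boundary. Assumes buf itself is 4-byte aligned. */
static size_t example_rk_offset(const uint32_t *buf)
{
    unsigned delta = (unsigned) ((uintptr_t) buf & 0x0000000f);
    size_t rk_offset = 0;
    if (delta != 0) {
        rk_offset = 4 - delta / 4; /* 16 bytes = 4 uint32_t */
    }
    return rk_offset;
}
```

For a misalignment of 4, 8 or 12 bytes this skips 3, 2 or 1 words
respectively, so the context buffer needs up to 3 spare words.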

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 library/aes.c   |  19 +++
 library/aesni.c | 339 +++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 357 insertions(+), 1 deletion(-)

diff --git a/library/aes.c b/library/aes.c
index a81332d390..36aa7f2999 100644
--- a/library/aes.c
+++ b/library/aes.c
@@ -552,6 +552,14 @@ int mbedtls_aes_setkey_enc(mbedtls_aes_context *ctx, const unsigned char *key,
 
 #if defined(MBEDTLS_AESNI_HAVE_CODE)
     if (mbedtls_aesni_has_support(MBEDTLS_AESNI_AES)) {
+        /* The intrinsics-based implementation needs 16-byte alignment
+         * for the round key array. */
+        unsigned delta = (uintptr_t) ctx->buf & 0x0000000f;
+        size_t rk_offset = 0;
+        if (delta != 0) {
+            rk_offset = 4 - delta / 4; // 16 bytes = 4 uint32_t
+        }
+        ctx->rk = RK = ctx->buf + rk_offset;
         return mbedtls_aesni_setkey_enc((unsigned char *) ctx->rk, key, keybits);
     }
 #endif
@@ -665,6 +673,17 @@ int mbedtls_aes_setkey_dec(mbedtls_aes_context *ctx, const unsigned char *key,
         goto exit;
     }
 #endif
+#if defined(MBEDTLS_AESNI_HAVE_CODE)
+    if (mbedtls_aesni_has_support(MBEDTLS_AESNI_AES)) {
+        /* The intrinsics-based implementation needs 16-byte alignment
+         * for the round key array. */
+        unsigned delta = (uintptr_t) ctx->buf & 0x0000000f;
+        if (delta != 0) {
+            size_t rk_offset = 4 - delta / 4; // 16 bytes = 4 uint32_t
+            ctx->rk = RK = ctx->buf + rk_offset;
+        }
+    }
+#endif
 
     SK = cty.rk + cty.nr * 4;
 
diff --git a/library/aesni.c b/library/aesni.c
index 2756194da8..0398d8aae4 100644
--- a/library/aesni.c
+++ b/library/aesni.c
@@ -36,7 +36,12 @@
 #endif
 /* *INDENT-ON* */
 
-#if defined(MBEDTLS_HAVE_X86_64)
+#if defined(MBEDTLS_HAVE_AESNI_INTRINSICS) || defined(MBEDTLS_HAVE_X86_64)
+
+#if defined(MBEDTLS_HAVE_AESNI_INTRINSICS)
+#include <cpuid.h>
+#include <immintrin.h>
+#endif
 
 /*
  * AES-NI support detection routine
@@ -47,17 +52,347 @@ int mbedtls_aesni_has_support(unsigned int what)
     static unsigned int c = 0;
 
     if (!done) {
+#if defined(MBEDTLS_HAVE_AESNI_INTRINSICS)
+        static unsigned info[4] = { 0, 0, 0, 0 };
+#if defined(_MSC_VER)
+        __cpuid(info, 1);
+#else
+        __cpuid(1, info[0], info[1], info[2], info[3]);
+#endif
+        c = info[2];
+#else
         asm ("movl  $1, %%eax   \n\t"
              "cpuid             \n\t"
              : "=c" (c)
              :
              : "eax", "ebx", "edx");
+#endif
         done = 1;
     }
 
     return (c & what) != 0;
 }
 
+#if defined(MBEDTLS_HAVE_AESNI_INTRINSICS)
+
+/*
+ * AES-NI AES-ECB block en(de)cryption
+ */
+int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
+                            int mode,
+                            const unsigned char input[16],
+                            unsigned char output[16])
+{
+    const __m128i *rk = (const __m128i *) (ctx->rk);
+    unsigned nr = ctx->nr; // Number of remaining rounds
+    // Load round key 0
+    __m128i xmm0;
+    memcpy(&xmm0, input, 16);
+    xmm0 ^= *rk;
+    ++rk;
+    --nr;
+
+    if (mode == 0) {
+        while (nr != 0) {
+            xmm0 = _mm_aesdec_si128(xmm0, *rk);
+            ++rk;
+            --nr;
+        }
+        xmm0 = _mm_aesdeclast_si128(xmm0, *rk);
+    } else {
+        while (nr != 0) {
+            xmm0 = _mm_aesenc_si128(xmm0, *rk);
+            ++rk;
+            --nr;
+        }
+        xmm0 = _mm_aesenclast_si128(xmm0, *rk);
+    }
+
+    memcpy(output, &xmm0, 16);
+    return 0;
+}
+
+/*
+ * GCM multiplication: c = a times b in GF(2^128)
+ * Based on [CLMUL-WP] algorithms 1 (with equation 27) and 5.
+ */
+
+static void gcm_clmul(const __m128i aa, const __m128i bb,
+                      __m128i *cc, __m128i *dd)
+{
+    /*
+     * Carryless multiplication dd:cc = aa * bb
+     * using [CLMUL-WP] algorithm 1 (p. 12).
+     */
+    *cc = _mm_clmulepi64_si128(aa, bb, 0x00); // a0*b0 = c1:c0
+    *dd = _mm_clmulepi64_si128(aa, bb, 0x11); // a1*b1 = d1:d0
+    __m128i ee = _mm_clmulepi64_si128(aa, bb, 0x10); // a0*b1 = e1:e0
+    __m128i ff = _mm_clmulepi64_si128(aa, bb, 0x01); // a1*b0 = f1:f0
+    ff ^= ee;                                        // e1+f1:e0+f0
+    ee = ff;                                         // e1+f1:e0+f0
+    ff = _mm_srli_si128(ff, 8);                      // 0:e1+f1
+    ee = _mm_slli_si128(ee, 8);                      // e0+f0:0
+    *dd ^= ff;                                       // d1:d0+e1+f1
+    *cc ^= ee;                                       // c1+e0+f1:c0
+}
+
+static void gcm_shift(__m128i *cc, __m128i *dd)
+{
+    /*
+     * Now shift the result one bit to the left,
+     * taking advantage of [CLMUL-WP] eq 27 (p. 18)
+     */
+    //                                       // *cc = r1:r0
+    //                                       // *dd = r3:r2
+    __m128i xmm1 = _mm_slli_epi64(*cc, 1);   // r1<<1:r0<<1
+    __m128i xmm2 = _mm_slli_epi64(*dd, 1);   // r3<<1:r2<<1
+    __m128i xmm3 = _mm_srli_epi64(*cc, 63);  // r1>>63:r0>>63
+    __m128i xmm4 = _mm_srli_epi64(*dd, 63);  // r3>>63:r2>>63
+    __m128i xmm5 = _mm_srli_si128(xmm3, 8);  // 0:r1>>63
+    xmm3 = _mm_slli_si128(xmm3, 8);          // r0>>63:0
+    xmm4 = _mm_slli_si128(xmm4, 8);          // 0:r1>>63
+
+    *cc = xmm1 | xmm3;                       // r1<<1|r0>>63:r0<<1
+    *dd = xmm2 | xmm4 | xmm5;                // r3<<1|r2>>62:r2<<1|r1>>63
+}
+
+static __m128i gcm_reduce1(__m128i xx)
+{
+    //                                            // xx = x1:x0
+    /* [CLMUL-WP] Algorithm 5 Step 2 */
+    __m128i aa = _mm_slli_epi64(xx, 63);          // x1<<63:x0<<63 = stuff:a
+    __m128i bb = _mm_slli_epi64(xx, 62);          // x1<<62:x0<<62 = stuff:b
+    __m128i cc = _mm_slli_epi64(xx, 57);          // x1<<57:x0<<57 = stuff:c
+    __m128i dd = _mm_slli_si128(aa ^ bb ^ cc, 8); // a+b+c:0
+    return dd ^ xx;                               // x1+a+b+c:x0 = d:x0
+}
+
+static __m128i gcm_reduce2(__m128i dx)
+{
+    /* [CLMUL-WP] Algorithm 5 Steps 3 and 4 */
+    __m128i ee = _mm_srli_epi64(dx, 1);           // e1:x0>>1 = e1:e0'
+    __m128i ff = _mm_srli_epi64(dx, 2);           // f1:x0>>2 = f1:f0'
+    __m128i gg = _mm_srli_epi64(dx, 7);           // g1:x0>>7 = g1:g0'
+
+    // e0'+f0'+g0' is almost e0+f0+g0, except for some missing
+    // bits carried from d. Now get those bits back in.
+    __m128i eh = _mm_slli_epi64(dx, 63);          // d<<63:stuff
+    __m128i fh = _mm_slli_epi64(dx, 62);          // d<<62:stuff
+    __m128i gh = _mm_slli_epi64(dx, 57);          // d<<57:stuff
+    __m128i hh = _mm_srli_si128(eh ^ fh ^ gh, 8); // 0:missing bits of d
+
+    return ee ^ ff ^ gg ^ hh ^ dx;
+}
+
+void mbedtls_aesni_gcm_mult(unsigned char c[16],
+                            const unsigned char a[16],
+                            const unsigned char b[16])
+{
+    __m128i aa, bb, cc, dd;
+
+    /* The inputs are in big-endian order, so byte-reverse them */
+    for (size_t i = 0; i < 16; i++) {
+        ((uint8_t *) &aa)[i] = a[15 - i];
+        ((uint8_t *) &bb)[i] = b[15 - i];
+    }
+
+    gcm_clmul(aa, bb, &cc, &dd);
+    gcm_shift(&cc, &dd);
+    /*
+     * Now reduce modulo the GCM polynomial x^128 + x^7 + x^2 + x + 1
+     * using [CLMUL-WP] algorithm 5 (p. 18).
+     * Currently dd:cc holds x3:x2:x1:x0 (already shifted).
+     */
+    __m128i dx = gcm_reduce1(cc);
+    __m128i xh = gcm_reduce2(dx);
+    cc = xh ^ dd; // x3+h1:x2+h0
+
+    /* Now byte-reverse the outputs */
+    for (size_t i = 0; i < 16; i++) {
+        c[i] = ((uint8_t *) &cc)[15 - i];
+    }
+
+    return;
+}
+
+/*
+ * Compute decryption round keys from encryption round keys
+ */
+void mbedtls_aesni_inverse_key(unsigned char *invkey,
+                               const unsigned char *fwdkey, int nr)
+{
+    __m128i *ik = (__m128i *) invkey;
+    const __m128i *fk = (const __m128i *) fwdkey + nr;
+
+    *ik = *fk;
+    for (--fk, ++ik; fk > (const __m128i *) fwdkey; --fk, ++ik) {
+        *ik = _mm_aesimc_si128(*fk);
+    }
+    *ik = *fk;
+}
+
+/*
+ * Key expansion, 128-bit case
+ */
+static __m128i aesni_set_rk_128(__m128i xmm0, __m128i xmm1)
+{
+    /*
+     * Finish generating the next round key.
+     *
+     * On entry xmm0 is r3:r2:r1:r0 and xmm1 is X:stuff:stuff:stuff
+     * with X = rot( sub( r3 ) ) ^ RCON.
+     *
+     * On exit, xmm1 is r7:r6:r5:r4
+     * with r4 = X + r0, r5 = r4 + r1, r6 = r5 + r2, r7 = r6 + r3
+     * and this is returned, to be written to the round key buffer.
+     */
+    xmm1 = _mm_shuffle_epi32(xmm1, 0xff);   // X:X:X:X
+    xmm1 ^= xmm0;                           // X+r3:X+r2:X+r1:r4
+    xmm0 = _mm_slli_si128(xmm0, 4);         // r2:r1:r0:0
+    xmm1 ^= xmm0;                           // X+r3+r2:X+r2+r1:r5:r4
+    xmm0 = _mm_slli_si128(xmm0, 4);         // r1:r0:0:0
+    xmm1 ^= xmm0;                           // X+r3+r2+r1:r6:r5:r4
+    xmm0 = _mm_slli_si128(xmm0, 4);         // r0:0:0:0
+    xmm1 ^= xmm0;                           // r7:r6:r5:r4
+    return xmm1;
+}
+
+static void aesni_setkey_enc_128(unsigned char *rk_bytes,
+                                 const unsigned char *key)
+{
+    __m128i *rk = (__m128i *) rk_bytes;
+
+    memcpy(&rk[0], key, 16);
+    rk[1] = aesni_set_rk_128(rk[0], _mm_aeskeygenassist_si128(rk[0], 0x01));
+    rk[2] = aesni_set_rk_128(rk[1], _mm_aeskeygenassist_si128(rk[1], 0x02));
+    rk[3] = aesni_set_rk_128(rk[2], _mm_aeskeygenassist_si128(rk[2], 0x04));
+    rk[4] = aesni_set_rk_128(rk[3], _mm_aeskeygenassist_si128(rk[3], 0x08));
+    rk[5] = aesni_set_rk_128(rk[4], _mm_aeskeygenassist_si128(rk[4], 0x10));
+    rk[6] = aesni_set_rk_128(rk[5], _mm_aeskeygenassist_si128(rk[5], 0x20));
+    rk[7] = aesni_set_rk_128(rk[6], _mm_aeskeygenassist_si128(rk[6], 0x40));
+    rk[8] = aesni_set_rk_128(rk[7], _mm_aeskeygenassist_si128(rk[7], 0x80));
+    rk[9] = aesni_set_rk_128(rk[8], _mm_aeskeygenassist_si128(rk[8], 0x1B));
+    rk[10] = aesni_set_rk_128(rk[9], _mm_aeskeygenassist_si128(rk[9], 0x36));
+}
+
+/*
+ * Key expansion, 192-bit case
+ */
+static void aesni_set_rk_192(__m128i *xmm0, __m128i *xmm1, __m128i xmm2,
+                             unsigned char *rk)
+{
+    /*
+     * Finish generating the next 6 quarter-keys.
+     *
+     * On entry xmm0 is r3:r2:r1:r0, xmm1 is stuff:stuff:r5:r4
+     * and xmm2 is stuff:stuff:X:stuff with X = rot( sub( r3 ) ) ^ RCON.
+     *
+     * On exit, xmm0 is r9:r8:r7:r6 and xmm1 is stuff:stuff:r11:r10
+     * and those are written to the round key buffer.
+     */
+    xmm2 = _mm_shuffle_epi32(xmm2, 0x55);     // X:X:X:X
+    xmm2 = _mm_xor_si128(xmm2, *xmm0);        // X+r3:X+r2:X+r1:X+r0
+    *xmm0 = _mm_slli_si128(*xmm0, 4);         // r2:r1:r0:0
+    xmm2 = _mm_xor_si128(xmm2, *xmm0);        // X+r3+r2:X+r2+r1:X+r1+r0:X+r0
+    *xmm0 = _mm_slli_si128(*xmm0, 4);         // r1:r0:0:0
+    xmm2 = _mm_xor_si128(xmm2, *xmm0);        // X+r3+r2+r1:X+r2+r1+r0:X+r1+r0:X+r0
+    *xmm0 = _mm_slli_si128(*xmm0, 4);         // r0:0:0:0
+    xmm2 = _mm_xor_si128(xmm2, *xmm0);        // X+r3+r2+r1+r0:X+r2+r1+r0:X+r1+r0:X+r0
+    *xmm0 = xmm2;                             // = r9:r8:r7:r6
+
+    xmm2 = _mm_shuffle_epi32(xmm2, 0xff);     // r9:r9:r9:r9
+    xmm2 = _mm_xor_si128(xmm2, *xmm1);        // stuff:stuff:r9+r5:r9+r4
+    *xmm1 = _mm_slli_si128(*xmm1, 4);         // stuff:stuff:r4:0
+    xmm2 = _mm_xor_si128(xmm2, *xmm1);        // stuff:stuff:r9+r5+r4:r9+r4
+    *xmm1 = xmm2;                             // = stuff:stuff:r11:r10
+
+    /* Store xmm0 and the low half of xmm1 into rk, which is conceptually
+     * an array of 24-byte elements. Since 24 is not a multiple of 16,
+     * rk is not necessarily aligned so just `*rk = *xmm0` doesn't work. */
+    memcpy(rk, xmm0, 16);
+    _mm_storeu_si64(rk + 16, *xmm1);
+}
+
+static void aesni_setkey_enc_192(unsigned char *rk,
+                                 const unsigned char *key)
+{
+    /* First round: use original key */
+    memcpy(rk, key, 24);
+    /* aes.c guarantees that rk is aligned on a 16-byte boundary. */
+    __m128i xmm0 = ((__m128i *) rk)[0];
+    __m128i xmm1 = _mm_loadl_epi64(((__m128i *) rk) + 1);
+
+    aesni_set_rk_192(&xmm0, &xmm1, _mm_aeskeygenassist_si128(xmm1, 0x01), rk + 24 * 1);
+    aesni_set_rk_192(&xmm0, &xmm1, _mm_aeskeygenassist_si128(xmm1, 0x02), rk + 24 * 2);
+    aesni_set_rk_192(&xmm0, &xmm1, _mm_aeskeygenassist_si128(xmm1, 0x04), rk + 24 * 3);
+    aesni_set_rk_192(&xmm0, &xmm1, _mm_aeskeygenassist_si128(xmm1, 0x08), rk + 24 * 4);
+    aesni_set_rk_192(&xmm0, &xmm1, _mm_aeskeygenassist_si128(xmm1, 0x10), rk + 24 * 5);
+    aesni_set_rk_192(&xmm0, &xmm1, _mm_aeskeygenassist_si128(xmm1, 0x20), rk + 24 * 6);
+    aesni_set_rk_192(&xmm0, &xmm1, _mm_aeskeygenassist_si128(xmm1, 0x40), rk + 24 * 7);
+    aesni_set_rk_192(&xmm0, &xmm1, _mm_aeskeygenassist_si128(xmm1, 0x80), rk + 24 * 8);
+}
+
+/*
+ * Key expansion, 256-bit case
+ */
+static void aesni_set_rk_256(__m128i xmm0, __m128i xmm1, __m128i xmm2,
+                             __m128i *rk0, __m128i *rk1)
+{
+    /*
+     * Finish generating the next two round keys.
+     *
+     * On entry xmm0 is r3:r2:r1:r0, xmm1 is r7:r6:r5:r4 and
+     * xmm2 is X:stuff:stuff:stuff with X = rot( sub( r7 )) ^ RCON
+     *
+     * On exit, *rk0 is r11:r10:r9:r8 and *rk1 is r15:r14:r13:r12
+     */
+    xmm2 = _mm_shuffle_epi32(xmm2, 0xff);
+    xmm2 ^= xmm0;
+    xmm0 = _mm_slli_si128(xmm0, 4);
+    xmm2 ^= xmm0;
+    xmm0 = _mm_slli_si128(xmm0, 4);
+    xmm2 ^= xmm0;
+    xmm0 = _mm_slli_si128(xmm0, 4);
+    xmm0 ^= xmm2;
+    *rk0 = xmm0;
+
+    /* Set xmm2 to stuff:Y:stuff:stuff with Y = subword( r11 )
+     * and proceed to generate next round key from there */
+    xmm2 = _mm_aeskeygenassist_si128(xmm0, 0x00);
+    xmm2 = _mm_shuffle_epi32(xmm2, 0xaa);
+    xmm2 ^= xmm1;
+    xmm1 = _mm_slli_si128(xmm1, 4);
+    xmm2 ^= xmm1;
+    xmm1 = _mm_slli_si128(xmm1, 4);
+    xmm2 ^= xmm1;
+    xmm1 = _mm_slli_si128(xmm1, 4);
+    xmm1 ^= xmm2;
+    *rk1 = xmm1;
+}
+
+static void aesni_setkey_enc_256(unsigned char *rk_bytes,
+                                 const unsigned char *key)
+{
+    __m128i *rk = (__m128i *) rk_bytes;
+
+    memcpy(&rk[0], key, 16);
+    memcpy(&rk[1], key + 16, 16);
+
+    /*
+     * Main "loop" - Generating one more key than necessary,
+     * see definition of mbedtls_aes_context.buf
+     */
+    aesni_set_rk_256(rk[0], rk[1], _mm_aeskeygenassist_si128(rk[1], 0x01), &rk[2], &rk[3]);
+    aesni_set_rk_256(rk[2], rk[3], _mm_aeskeygenassist_si128(rk[3], 0x02), &rk[4], &rk[5]);
+    aesni_set_rk_256(rk[4], rk[5], _mm_aeskeygenassist_si128(rk[5], 0x04), &rk[6], &rk[7]);
+    aesni_set_rk_256(rk[6], rk[7], _mm_aeskeygenassist_si128(rk[7], 0x08), &rk[8], &rk[9]);
+    aesni_set_rk_256(rk[8], rk[9], _mm_aeskeygenassist_si128(rk[9], 0x10), &rk[10], &rk[11]);
+    aesni_set_rk_256(rk[10], rk[11], _mm_aeskeygenassist_si128(rk[11], 0x20), &rk[12], &rk[13]);
+    aesni_set_rk_256(rk[12], rk[13], _mm_aeskeygenassist_si128(rk[13], 0x40), &rk[14], &rk[15]);
+}
+
+#else  /* MBEDTLS_HAVE_AESNI_INTRINSICS */
+
 #if defined(__has_feature)
 #if __has_feature(memory_sanitizer)
 #warning \
@@ -444,6 +779,8 @@ static void aesni_setkey_enc_256(unsigned char *rk,
          : "memory", "cc", "0");
 }
 
+#endif  /* MBEDTLS_HAVE_AESNI_INTRINSICS */
+
 /*
  * Key expansion, wrapper
  */

From 790756d4395245dcd2a94f33e5825cdce6f3a46c Mon Sep 17 00:00:00 2001
From: Tom Cosgrove <tom.cosgrove@arm.com>
Date: Mon, 13 Mar 2023 15:32:52 +0000
Subject: [PATCH 22/45] Get aesni.c compiling with Visual Studio

Clang is nice enough to support bitwise operators on __m128i, but MSVC
isn't.

Also, __cpuid() in MSVC comes from <intrin.h> (which is included via
<emmintrin.h>), not <cpuid.h>.
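
For reference, a minimal sketch of the portable form. `example_xor_block`
is a hypothetical helper, and the plain C fallback exists only so that the
snippet also builds on targets without SSE2:

```c
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>
#define EXAMPLE_HAVE_SSE2
#endif

/* XOR two 16-byte blocks into out. */
static void example_xor_block(unsigned char out[16],
                              const unsigned char a[16],
                              const unsigned char b[16])
{
#if defined(EXAMPLE_HAVE_SSE2)
    __m128i x = _mm_loadu_si128((const __m128i *) a);
    __m128i y = _mm_loadu_si128((const __m128i *) b);
    /* `x ^ y` relies on a GCC/Clang vector extension; the intrinsic
     * is accepted by MSVC as well. */
    _mm_storeu_si128((__m128i *) out, _mm_xor_si128(x, y));
#else
    for (int i = 0; i < 16; i++) {
        out[i] = (unsigned char) (a[i] ^ b[i]);
    }
#endif
}
```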

Signed-off-by: Tom Cosgrove <tom.cosgrove@arm.com>
---
 library/aesni.c | 49 ++++++++++++++++++++++++++-----------------------
 1 file changed, 26 insertions(+), 23 deletions(-)

diff --git a/library/aesni.c b/library/aesni.c
index 0398d8aae4..152a2acb58 100644
--- a/library/aesni.c
+++ b/library/aesni.c
@@ -39,7 +39,9 @@
 #if defined(MBEDTLS_HAVE_AESNI_INTRINSICS) || defined(MBEDTLS_HAVE_X86_64)
 
 #if defined(MBEDTLS_HAVE_AESNI_INTRINSICS)
+#if !defined(_WIN32)
 #include <cpuid.h>
+#endif
 #include <immintrin.h>
 #endif
 
@@ -85,10 +87,11 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
 {
     const __m128i *rk = (const __m128i *) (ctx->rk);
     unsigned nr = ctx->nr; // Number of remaining rounds
+
     // Load round key 0
     __m128i xmm0;
     memcpy(&xmm0, input, 16);
-    xmm0 ^= *rk;
+    xmm0 = _mm_xor_si128(xmm0, rk[0]);  // xmm0 ^= *rk;
     ++rk;
     --nr;
 
@@ -128,12 +131,12 @@ static void gcm_clmul(const __m128i aa, const __m128i bb,
     *dd = _mm_clmulepi64_si128(aa, bb, 0x11); // a1*b1 = d1:d0
     __m128i ee = _mm_clmulepi64_si128(aa, bb, 0x10); // a0*b1 = e1:e0
     __m128i ff = _mm_clmulepi64_si128(aa, bb, 0x01); // a1*b0 = f1:f0
-    ff ^= ee;                                        // e1+f1:e0+f0
+    ff = _mm_xor_si128(ff, ee);                      // e1+f1:e0+f0
     ee = ff;                                         // e1+f1:e0+f0
     ff = _mm_srli_si128(ff, 8);                      // 0:e1+f1
     ee = _mm_slli_si128(ee, 8);                      // e0+f0:0
-    *dd ^= ff;                                       // d1:d0+e1+f1
-    *cc ^= ee;                                       // c1+e0+f1:c0
+    *dd = _mm_xor_si128(*dd, ff);                    // d1:d0+e1+f1
+    *cc = _mm_xor_si128(*cc, ee);                    // c1+e0+f0:c0
 }
 
 static void gcm_shift(__m128i *cc, __m128i *dd)
@@ -152,8 +155,8 @@ static void gcm_shift(__m128i *cc, __m128i *dd)
     xmm3 = _mm_slli_si128(xmm3, 8);          // r0>>63:0
     xmm4 = _mm_slli_si128(xmm4, 8);          // 0:r1>>63
 
-    *cc = xmm1 | xmm3;                       // r1<<1|r0>>63:r0<<1
-    *dd = xmm2 | xmm4 | xmm5;                // r3<<1|r2>>62:r2<<1|r1>>63
+    *cc = _mm_or_si128(xmm1, xmm3);          // r1<<1|r0>>63:r0<<1
+    *dd = _mm_or_si128(_mm_or_si128(xmm2, xmm4), xmm5); // r3<<1|r2>>62:r2<<1|r1>>63
 }
 
 static __m128i gcm_reduce1(__m128i xx)
@@ -163,8 +166,8 @@ static __m128i gcm_reduce1(__m128i xx)
     __m128i aa = _mm_slli_epi64(xx, 63);          // x1<<63:x0<<63 = stuff:a
     __m128i bb = _mm_slli_epi64(xx, 62);          // x1<<62:x0<<62 = stuff:b
     __m128i cc = _mm_slli_epi64(xx, 57);          // x1<<57:x0<<57 = stuff:c
-    __m128i dd = _mm_slli_si128(aa ^ bb ^ cc, 8); // a+b+c:0
-    return dd ^ xx;                               // x1+a+b+c:x0 = d:x0
+    __m128i dd = _mm_slli_si128(_mm_xor_si128(_mm_xor_si128(aa, bb), cc), 8); // a+b+c:0
+    return _mm_xor_si128(dd, xx);                 // x1+a+b+c:x0 = d:x0
 }
 
 static __m128i gcm_reduce2(__m128i dx)
@@ -179,9 +182,9 @@ static __m128i gcm_reduce2(__m128i dx)
     __m128i eh = _mm_slli_epi64(dx, 63);          // d<<63:stuff
     __m128i fh = _mm_slli_epi64(dx, 62);          // d<<62:stuff
     __m128i gh = _mm_slli_epi64(dx, 57);          // d<<57:stuff
-    __m128i hh = _mm_srli_si128(eh ^ fh ^ gh, 8); // 0:missing bits of d
+    __m128i hh = _mm_srli_si128(_mm_xor_si128(_mm_xor_si128(eh, fh), gh), 8); // 0:missing bits of d
 
-    return ee ^ ff ^ gg ^ hh ^ dx;
+    return _mm_xor_si128(_mm_xor_si128(_mm_xor_si128(_mm_xor_si128(ee, ff), gg), hh), dx);
 }
 
 void mbedtls_aesni_gcm_mult(unsigned char c[16],
@@ -205,7 +208,7 @@ void mbedtls_aesni_gcm_mult(unsigned char c[16],
      */
     __m128i dx = gcm_reduce1(cc);
     __m128i xh = gcm_reduce2(dx);
-    cc = xh ^ dd; // x3+h1:x2+h0
+    cc = _mm_xor_si128(xh, dd); // x3+h1:x2+h0
 
     /* Now byte-reverse the outputs */
     for (size_t i = 0; i < 16; i++) {
@@ -247,13 +250,13 @@ static __m128i aesni_set_rk_128(__m128i xmm0, __m128i xmm1)
      * and this is returned, to be written to the round key buffer.
      */
     xmm1 = _mm_shuffle_epi32(xmm1, 0xff);   // X:X:X:X
-    xmm1 ^= xmm0;                           // X+r3:X+r2:X+r1:r4
+    xmm1 = _mm_xor_si128(xmm1, xmm0);       // X+r3:X+r2:X+r1:r4
     xmm0 = _mm_slli_si128(xmm0, 4);         // r2:r1:r0:0
-    xmm1 ^= xmm0;                           // X+r3+r2:X+r2+r1:r5:r4
+    xmm1 = _mm_xor_si128(xmm1, xmm0);       // X+r3+r2:X+r2+r1:r5:r4
     xmm0 = _mm_slli_si128(xmm0, 4);         // r1:r0:0:0
-    xmm1 ^= xmm0;                           // X+r3+r2+r1:r6:r5:r4
+    xmm1 = _mm_xor_si128(xmm1, xmm0);       // X+r3+r2+r1:r6:r5:r4
     xmm0 = _mm_slli_si128(xmm0, 4);         // r0:0:0:0
-    xmm1 ^= xmm0;                           // r7:r6:r5:r4
+    xmm1 = _mm_xor_si128(xmm1, xmm0);       // r7:r6:r5:r4
     return xmm1;
 }
 
@@ -347,26 +350,26 @@ static void aesni_set_rk_256(__m128i xmm0, __m128i xmm1, __m128i xmm2,
      * On exit, *rk0 is r11:r10:r9:r8 and *rk1 is r15:r14:r13:r12
      */
     xmm2 = _mm_shuffle_epi32(xmm2, 0xff);
-    xmm2 ^= xmm0;
+    xmm2 = _mm_xor_si128(xmm2, xmm0);
     xmm0 = _mm_slli_si128(xmm0, 4);
-    xmm2 ^= xmm0;
+    xmm2 = _mm_xor_si128(xmm2, xmm0);
     xmm0 = _mm_slli_si128(xmm0, 4);
-    xmm2 ^= xmm0;
+    xmm2 = _mm_xor_si128(xmm2, xmm0);
     xmm0 = _mm_slli_si128(xmm0, 4);
-    xmm0 ^= xmm2;
+    xmm0 = _mm_xor_si128(xmm0, xmm2);
     *rk0 = xmm0;
 
     /* Set xmm2 to stuff:Y:stuff:stuff with Y = subword( r11 )
      * and proceed to generate next round key from there */
     xmm2 = _mm_aeskeygenassist_si128(xmm0, 0x00);
     xmm2 = _mm_shuffle_epi32(xmm2, 0xaa);
-    xmm2 ^= xmm1;
+    xmm2 = _mm_xor_si128(xmm2, xmm1);
     xmm1 = _mm_slli_si128(xmm1, 4);
-    xmm2 ^= xmm1;
+    xmm2 = _mm_xor_si128(xmm2, xmm1);
     xmm1 = _mm_slli_si128(xmm1, 4);
-    xmm2 ^= xmm1;
+    xmm2 = _mm_xor_si128(xmm2, xmm1);
     xmm1 = _mm_slli_si128(xmm1, 4);
-    xmm1 ^= xmm2;
+    xmm1 = _mm_xor_si128(xmm1, xmm2);
     *rk1 = xmm1;
 }
 

From d4a239310b66f557d4be1bb44ad98fd4d55fa52a Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Wed, 15 Mar 2023 20:37:57 +0100
Subject: [PATCH 23/45] Improve variable names

To some extent anyway.

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 library/aesni.c | 190 ++++++++++++++++++++++++------------------------
 1 file changed, 95 insertions(+), 95 deletions(-)

diff --git a/library/aesni.c b/library/aesni.c
index 152a2acb58..410e1c19b6 100644
--- a/library/aesni.c
+++ b/library/aesni.c
@@ -89,29 +89,29 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
     unsigned nr = ctx->nr; // Number of remaining rounds
 
     // Load round key 0
-    __m128i xmm0;
-    memcpy(&xmm0, input, 16);
-    xmm0 = _mm_xor_si128(xmm0, rk[0]);  // xmm0 ^= *rk;
+    __m128i state;
+    memcpy(&state, input, 16);
+    state = _mm_xor_si128(state, rk[0]);  // state ^= *rk;
     ++rk;
     --nr;
 
     if (mode == 0) {
         while (nr != 0) {
-            xmm0 = _mm_aesdec_si128(xmm0, *rk);
+            state = _mm_aesdec_si128(state, *rk);
             ++rk;
             --nr;
         }
-        xmm0 = _mm_aesdeclast_si128(xmm0, *rk);
+        state = _mm_aesdeclast_si128(state, *rk);
     } else {
         while (nr != 0) {
-            xmm0 = _mm_aesenc_si128(xmm0, *rk);
+            state = _mm_aesenc_si128(state, *rk);
             ++rk;
             --nr;
         }
-        xmm0 = _mm_aesenclast_si128(xmm0, *rk);
+        state = _mm_aesenclast_si128(state, *rk);
     }
 
-    memcpy(output, &xmm0, 16);
+    memcpy(output, &state, 16);
     return 0;
 }
 
@@ -141,25 +141,23 @@ static void gcm_clmul(const __m128i aa, const __m128i bb,
 
 static void gcm_shift(__m128i *cc, __m128i *dd)
 {
-    /*
-     * Now shift the result one bit to the left,
-     * taking advantage of [CLMUL-WP] eq 27 (p. 18)
-     */
-    //                                       // *cc = r1:r0
-    //                                       // *dd = r3:r2
-    __m128i xmm1 = _mm_slli_epi64(*cc, 1);   // r1<<1:r0<<1
-    __m128i xmm2 = _mm_slli_epi64(*dd, 1);   // r3<<1:r2<<1
-    __m128i xmm3 = _mm_srli_epi64(*cc, 63);  // r1>>63:r0>>63
-    __m128i xmm4 = _mm_srli_epi64(*dd, 63);  // r3>>63:r2>>63
-    __m128i xmm5 = _mm_srli_si128(xmm3, 8);  // 0:r1>>63
-    xmm3 = _mm_slli_si128(xmm3, 8);          // r0>>63:0
-    xmm4 = _mm_slli_si128(xmm4, 8);          // 0:r1>>63
+    /* [CLMUL-WP] Algorithm 5 Step 1: shift dd:cc one bit to the left,
+     * taking advantage of [CLMUL-WP] eq 27 (p. 18). */
+    //                                        // *cc = r1:r0
+    //                                        // *dd = r3:r2
+    __m128i cc_lo = _mm_slli_epi64(*cc, 1);   // r1<<1:r0<<1
+    __m128i dd_lo = _mm_slli_epi64(*dd, 1);   // r3<<1:r2<<1
+    __m128i cc_hi = _mm_srli_epi64(*cc, 63);  // r1>>63:r0>>63
+    __m128i dd_hi = _mm_srli_epi64(*dd, 63);  // r3>>63:r2>>63
+    __m128i xmm5 = _mm_srli_si128(cc_hi, 8);  // 0:r1>>63
+    cc_hi = _mm_slli_si128(cc_hi, 8);         // r0>>63:0
+    dd_hi = _mm_slli_si128(dd_hi, 8);         // r2>>63:0
 
-    *cc = _mm_or_si128(xmm1, xmm3);          // r1<<1|r0>>63:r0<<1
-    *dd = _mm_or_si128(_mm_or_si128(xmm2, xmm4), xmm5); // r3<<1|r2>>62:r2<<1|r1>>63
+    *cc = _mm_or_si128(cc_lo, cc_hi);         // r1<<1|r0>>63:r0<<1
+    *dd = _mm_or_si128(_mm_or_si128(dd_lo, dd_hi), xmm5); // r3<<1|r2>>63:r2<<1|r1>>63
 }
 
-static __m128i gcm_reduce1(__m128i xx)
+static __m128i gcm_reduce(__m128i xx)
 {
     //                                            // xx = x1:x0
     /* [CLMUL-WP] Algorithm 5 Step 2 */
@@ -170,7 +168,7 @@ static __m128i gcm_reduce1(__m128i xx)
     return _mm_xor_si128(dd, xx);                 // x1+a+b+c:x0 = d:x0
 }
 
-static __m128i gcm_reduce2(__m128i dx)
+static __m128i gcm_mix(__m128i dx)
 {
     /* [CLMUL-WP] Algorithm 5 Steps 3 and 4 */
     __m128i ee = _mm_srli_epi64(dx, 1);           // e1:x0>>1 = e1:e0'
@@ -206,8 +204,8 @@ void mbedtls_aesni_gcm_mult(unsigned char c[16],
      * using [CLMUL-WP] algorithm 5 (p. 18).
      * Currently dd:cc holds x3:x2:x1:x0 (already shifted).
      */
-    __m128i dx = gcm_reduce1(cc);
-    __m128i xh = gcm_reduce2(dx);
+    __m128i dx = gcm_reduce(cc);
+    __m128i xh = gcm_mix(dx);
     cc = _mm_xor_si128(xh, dd); // x3+h1:x2+h0
 
     /* Now byte-reverse the outputs */
@@ -237,27 +235,27 @@ void mbedtls_aesni_inverse_key(unsigned char *invkey,
 /*
  * Key expansion, 128-bit case
  */
-static __m128i aesni_set_rk_128(__m128i xmm0, __m128i xmm1)
+static __m128i aesni_set_rk_128(__m128i state, __m128i xword)
 {
     /*
      * Finish generating the next round key.
      *
-     * On entry xmm0 is r3:r2:r1:r0 and xmm1 is X:stuff:stuff:stuff
-     * with X = rot( sub( r3 ) ) ^ RCON.
+     * On entry state is r3:r2:r1:r0 and xword is X:stuff:stuff:stuff
+     * with X = rot( sub( r3 ) ) ^ RCON (obtained with AESKEYGENASSIST).
      *
-     * On exit, xmm1 is r7:r6:r5:r4
+     * On exit, state is r7:r6:r5:r4
      * with r4 = X + r0, r5 = r4 + r1, r6 = r5 + r2, r7 = r6 + r3
      * and this is returned, to be written to the round key buffer.
      */
-    xmm1 = _mm_shuffle_epi32(xmm1, 0xff);   // X:X:X:X
-    xmm1 = _mm_xor_si128(xmm1, xmm0);       // X+r3:X+r2:X+r1:r4
-    xmm0 = _mm_slli_si128(xmm0, 4);         // r2:r1:r0:0
-    xmm1 = _mm_xor_si128(xmm1, xmm0);       // X+r3+r2:X+r2+r1:r5:r4
-    xmm0 = _mm_slli_si128(xmm0, 4);         // r1:r0:0:0
-    xmm1 = _mm_xor_si128(xmm1, xmm0);       // X+r3+r2+r1:r6:r5:r4
-    xmm0 = _mm_slli_si128(xmm0, 4);         // r0:0:0:0
-    xmm1 = _mm_xor_si128(xmm1, xmm0);       // r7:r6:r5:r4
-    return xmm1;
+    xword = _mm_shuffle_epi32(xword, 0xff);   // X:X:X:X
+    xword = _mm_xor_si128(xword, state);      // X+r3:X+r2:X+r1:r4
+    state = _mm_slli_si128(state, 4);         // r2:r1:r0:0
+    xword = _mm_xor_si128(xword, state);      // X+r3+r2:X+r2+r1:r5:r4
+    state = _mm_slli_si128(state, 4);         // r1:r0:0:0
+    xword = _mm_xor_si128(xword, state);      // X+r3+r2+r1:r6:r5:r4
+    state = _mm_slli_si128(state, 4);         // r0:0:0:0
+    state = _mm_xor_si128(xword, state);      // r7:r6:r5:r4
+    return state;
 }
 
 static void aesni_setkey_enc_128(unsigned char *rk_bytes,
@@ -281,39 +279,40 @@ static void aesni_setkey_enc_128(unsigned char *rk_bytes,
 /*
  * Key expansion, 192-bit case
  */
-static void aesni_set_rk_192(__m128i *xmm0, __m128i *xmm1, __m128i xmm2,
+static void aesni_set_rk_192(__m128i *state0, __m128i *state1, __m128i xword,
                              unsigned char *rk)
 {
     /*
      * Finish generating the next 6 quarter-keys.
      *
-     * On entry xmm0 is r3:r2:r1:r0, xmm1 is stuff:stuff:r5:r4
-     * and xmm2 is stuff:stuff:X:stuff with X = rot( sub( r3 ) ) ^ RCON.
+     * On entry state0 is r3:r2:r1:r0, state1 is stuff:stuff:r5:r4
+     * and xword is stuff:stuff:X:stuff with X = rot( sub( r3 ) ) ^ RCON
+     * (obtained with AESKEYGENASSIST).
      *
-     * On exit, xmm0 is r9:r8:r7:r6 and xmm1 is stuff:stuff:r11:r10
+     * On exit, state0 is r9:r8:r7:r6 and state1 is stuff:stuff:r11:r10
      * and those are written to the round key buffer.
      */
-    xmm2 = _mm_shuffle_epi32(xmm2, 0x55);     // X:X:X:X
-    xmm2 = _mm_xor_si128(xmm2, *xmm0);        // X+r3:X+r2:X+r1:X+r0
-    *xmm0 = _mm_slli_si128(*xmm0, 4);         // r2:r1:r0:0
-    xmm2 = _mm_xor_si128(xmm2, *xmm0);        // X+r3+r2:X+r2+r1:X+r1+r0:X+r0
-    *xmm0 = _mm_slli_si128(*xmm0, 4);         // r1:r0:0:0
-    xmm2 = _mm_xor_si128(xmm2, *xmm0);        // X+r3+r2+r1:X+r2+r1+r0:X+r1+r0:X+r0
-    *xmm0 = _mm_slli_si128(*xmm0, 4);         // r0:0:0:0
-    xmm2 = _mm_xor_si128(xmm2, *xmm0);        // X+r3+r2+r1+r0:X+r2+r1+r0:X+r1+r0:X+r0
-    *xmm0 = xmm2;                             // = r9:r8:r7:r6
+    xword = _mm_shuffle_epi32(xword, 0x55);   // X:X:X:X
+    xword = _mm_xor_si128(xword, *state0);    // X+r3:X+r2:X+r1:X+r0
+    *state0 = _mm_slli_si128(*state0, 4);     // r2:r1:r0:0
+    xword = _mm_xor_si128(xword, *state0);    // X+r3+r2:X+r2+r1:X+r1+r0:X+r0
+    *state0 = _mm_slli_si128(*state0, 4);     // r1:r0:0:0
+    xword = _mm_xor_si128(xword, *state0);    // X+r3+r2+r1:X+r2+r1+r0:X+r1+r0:X+r0
+    *state0 = _mm_slli_si128(*state0, 4);     // r0:0:0:0
+    xword = _mm_xor_si128(xword, *state0);    // X+r3+r2+r1+r0:X+r2+r1+r0:X+r1+r0:X+r0
+    *state0 = xword;                          // = r9:r8:r7:r6
 
-    xmm2 = _mm_shuffle_epi32(xmm2, 0xff);     // r9:r9:r9:r9
-    xmm2 = _mm_xor_si128(xmm2, *xmm1);        // stuff:stuff:r9+r5:r9+r4
-    *xmm1 = _mm_slli_si128(*xmm1, 4);         // stuff:stuff:r4:0
-    xmm2 = _mm_xor_si128(xmm2, *xmm1);        // stuff:stuff:r9+r5+r4:r9+r4
-    *xmm1 = xmm2;                             // = stuff:stuff:r11:r10
+    xword = _mm_shuffle_epi32(xword, 0xff);   // r9:r9:r9:r9
+    xword = _mm_xor_si128(xword, *state1);    // stuff:stuff:r9+r5:r9+r4
+    *state1 = _mm_slli_si128(*state1, 4);     // stuff:stuff:r4:0
+    xword = _mm_xor_si128(xword, *state1);    // stuff:stuff:r9+r5+r4:r9+r4
+    *state1 = xword;                          // = stuff:stuff:r11:r10
 
-    /* Store xmm0 and the low half of xmm1 into rk, which is conceptually
+    /* Store state0 and the low half of state1 into rk, which is conceptually
      * an array of 24-byte elements. Since 24 is not a multiple of 16,
-     * rk is not necessarily aligned so just `*rk = *xmm0` doesn't work. */
-    memcpy(rk, xmm0, 16);
-    _mm_storeu_si64(rk + 16, *xmm1);
+     * rk is not necessarily aligned so just `*rk = *state0` doesn't work. */
+    memcpy(rk, state0, 16);
+    _mm_storeu_si64(rk + 16, *state1);
 }
 
 static void aesni_setkey_enc_192(unsigned char *rk,
@@ -322,55 +321,56 @@ static void aesni_setkey_enc_192(unsigned char *rk,
     /* First round: use original key */
     memcpy(rk, key, 24);
     /* aes.c guarantees that rk is aligned on a 16-byte boundary. */
-    __m128i xmm0 = ((__m128i *) rk)[0];
-    __m128i xmm1 = _mm_loadl_epi64(((__m128i *) rk) + 1);
+    __m128i state0 = ((__m128i *) rk)[0];
+    __m128i state1 = _mm_loadl_epi64(((__m128i *) rk) + 1);
 
-    aesni_set_rk_192(&xmm0, &xmm1, _mm_aeskeygenassist_si128(xmm1, 0x01), rk + 24 * 1);
-    aesni_set_rk_192(&xmm0, &xmm1, _mm_aeskeygenassist_si128(xmm1, 0x02), rk + 24 * 2);
-    aesni_set_rk_192(&xmm0, &xmm1, _mm_aeskeygenassist_si128(xmm1, 0x04), rk + 24 * 3);
-    aesni_set_rk_192(&xmm0, &xmm1, _mm_aeskeygenassist_si128(xmm1, 0x08), rk + 24 * 4);
-    aesni_set_rk_192(&xmm0, &xmm1, _mm_aeskeygenassist_si128(xmm1, 0x10), rk + 24 * 5);
-    aesni_set_rk_192(&xmm0, &xmm1, _mm_aeskeygenassist_si128(xmm1, 0x20), rk + 24 * 6);
-    aesni_set_rk_192(&xmm0, &xmm1, _mm_aeskeygenassist_si128(xmm1, 0x40), rk + 24 * 7);
-    aesni_set_rk_192(&xmm0, &xmm1, _mm_aeskeygenassist_si128(xmm1, 0x80), rk + 24 * 8);
+    aesni_set_rk_192(&state0, &state1, _mm_aeskeygenassist_si128(state1, 0x01), rk + 24 * 1);
+    aesni_set_rk_192(&state0, &state1, _mm_aeskeygenassist_si128(state1, 0x02), rk + 24 * 2);
+    aesni_set_rk_192(&state0, &state1, _mm_aeskeygenassist_si128(state1, 0x04), rk + 24 * 3);
+    aesni_set_rk_192(&state0, &state1, _mm_aeskeygenassist_si128(state1, 0x08), rk + 24 * 4);
+    aesni_set_rk_192(&state0, &state1, _mm_aeskeygenassist_si128(state1, 0x10), rk + 24 * 5);
+    aesni_set_rk_192(&state0, &state1, _mm_aeskeygenassist_si128(state1, 0x20), rk + 24 * 6);
+    aesni_set_rk_192(&state0, &state1, _mm_aeskeygenassist_si128(state1, 0x40), rk + 24 * 7);
+    aesni_set_rk_192(&state0, &state1, _mm_aeskeygenassist_si128(state1, 0x80), rk + 24 * 8);
 }
 
 /*
  * Key expansion, 256-bit case
  */
-static void aesni_set_rk_256(__m128i xmm0, __m128i xmm1, __m128i xmm2,
+static void aesni_set_rk_256(__m128i state0, __m128i state1, __m128i xword,
                              __m128i *rk0, __m128i *rk1)
 {
     /*
      * Finish generating the next two round keys.
      *
-     * On entry xmm0 is r3:r2:r1:r0, xmm1 is r7:r6:r5:r4 and
-     * xmm2 is X:stuff:stuff:stuff with X = rot( sub( r7 )) ^ RCON
+     * On entry state0 is r3:r2:r1:r0, state1 is r7:r6:r5:r4 and
+     * xword is X:stuff:stuff:stuff with X = rot( sub( r7 )) ^ RCON
+     * (obtained with AESKEYGENASSIST).
      *
      * On exit, *rk0 is r11:r10:r9:r8 and *rk1 is r15:r14:r13:r12
      */
-    xmm2 = _mm_shuffle_epi32(xmm2, 0xff);
-    xmm2 = _mm_xor_si128(xmm2, xmm0);
-    xmm0 = _mm_slli_si128(xmm0, 4);
-    xmm2 = _mm_xor_si128(xmm2, xmm0);
-    xmm0 = _mm_slli_si128(xmm0, 4);
-    xmm2 = _mm_xor_si128(xmm2, xmm0);
-    xmm0 = _mm_slli_si128(xmm0, 4);
-    xmm0 = _mm_xor_si128(xmm0, xmm2);
-    *rk0 = xmm0;
+    xword = _mm_shuffle_epi32(xword, 0xff);
+    xword = _mm_xor_si128(xword, state0);
+    state0 = _mm_slli_si128(state0, 4);
+    xword = _mm_xor_si128(xword, state0);
+    state0 = _mm_slli_si128(state0, 4);
+    xword = _mm_xor_si128(xword, state0);
+    state0 = _mm_slli_si128(state0, 4);
+    state0 = _mm_xor_si128(state0, xword);
+    *rk0 = state0;
 
-    /* Set xmm2 to stuff:Y:stuff:stuff with Y = subword( r11 )
+    /* Set xword to stuff:Y:stuff:stuff with Y = subword( r11 )
      * and proceed to generate next round key from there */
-    xmm2 = _mm_aeskeygenassist_si128(xmm0, 0x00);
-    xmm2 = _mm_shuffle_epi32(xmm2, 0xaa);
-    xmm2 = _mm_xor_si128(xmm2, xmm1);
-    xmm1 = _mm_slli_si128(xmm1, 4);
-    xmm2 = _mm_xor_si128(xmm2, xmm1);
-    xmm1 = _mm_slli_si128(xmm1, 4);
-    xmm2 = _mm_xor_si128(xmm2, xmm1);
-    xmm1 = _mm_slli_si128(xmm1, 4);
-    xmm1 = _mm_xor_si128(xmm1, xmm2);
-    *rk1 = xmm1;
+    xword = _mm_aeskeygenassist_si128(state0, 0x00);
+    xword = _mm_shuffle_epi32(xword, 0xaa);
+    xword = _mm_xor_si128(xword, state1);
+    state1 = _mm_slli_si128(state1, 4);
+    xword = _mm_xor_si128(xword, state1);
+    state1 = _mm_slli_si128(state1, 4);
+    xword = _mm_xor_si128(xword, state1);
+    state1 = _mm_slli_si128(state1, 4);
+    state1 = _mm_xor_si128(state1, xword);
+    *rk1 = state1;
 }
 
 static void aesni_setkey_enc_256(unsigned char *rk_bytes,

From 2e8d8d1fd6a7a4c06a0e2292587b7880b830ab08 Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Wed, 15 Mar 2023 23:16:27 +0100
Subject: [PATCH 24/45] Fix MSVC portability

MSVC doesn't have _mm_storeu_si64. Fortunately it isn't really needed here.
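
As an illustration of why the intrinsic is not needed (a sketch, not
code from this patch): on a little-endian x86 target with SSE2, copying
the first 8 bytes of an `__m128i` object stores its low 64 bits. The
helper name `store_low64` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2: __m128i */
#include <string.h>     /* memcpy */

/* Hypothetical helper: write the low 64 bits of *v to dst.
 * dst may be unaligned, since memcpy has no alignment requirement.
 * Assumes a little-endian x86 target, where the low half of the
 * vector occupies the first 8 bytes of its object representation. */
static void store_low64(unsigned char *dst, const __m128i *v)
{
    memcpy(dst, v, 8);
}
```

Compilers typically turn a fixed-size 8-byte `memcpy` like this into a
single 64-bit store.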

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 library/aesni.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/library/aesni.c b/library/aesni.c
index 410e1c19b6..47795ea261 100644
--- a/library/aesni.c
+++ b/library/aesni.c
@@ -312,7 +312,7 @@ static void aesni_set_rk_192(__m128i *state0, __m128i *state1, __m128i xword,
      * an array of 24-byte elements. Since 24 is not a multiple of 16,
      * rk is not necessarily aligned so just `*rk = *state0` doesn't work. */
     memcpy(rk, state0, 16);
-    _mm_storeu_si64(rk + 16, *state1);
+    memcpy(rk + 16, state1, 8);
 }
 
 static void aesni_setkey_enc_192(unsigned char *rk,

From 563c492bf62dbd5cc23a2f82abe5f5330ab0e4f1 Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Wed, 15 Mar 2023 23:20:26 +0100
Subject: [PATCH 25/45] Travis: run selftest on Windows

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 .travis.yml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.travis.yml b/.travis.yml
index eb01a44ab1..ed2910a0d5 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -70,6 +70,7 @@ jobs:
       os: windows
       script:
         - scripts/windows_msbuild.bat v141 # Visual Studio 2017
+        - visualc/VS2013/x64/Release/selftest.exe
 
 after_failure:
 - tests/scripts/travis-log-failure.sh

From de34578353f4f0944a068dd44a66694064c4a25b Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Thu, 16 Mar 2023 13:06:14 +0100
Subject: [PATCH 26/45] Fix code style

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 library/gcm.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/library/gcm.c b/library/gcm.c
index 463ef48fcf..5994cf6e05 100644
--- a/library/gcm.c
+++ b/library/gcm.c
@@ -754,8 +754,7 @@ int mbedtls_gcm_self_test(int verbose)
     int i, j, ret;
     mbedtls_cipher_id_t cipher = MBEDTLS_CIPHER_ID_AES;
 
-    if (verbose != 0)
-    {
+    if (verbose != 0) {
 #if defined(MBEDTLS_GCM_ALT)
         mbedtls_printf("  GCM note: alternative implementation.\n");
 #else /* MBEDTLS_GCM_ALT */

From 5f1677f5820038b7df5dd369cea84f1abe73bf54 Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Thu, 16 Mar 2023 13:08:18 +0100
Subject: [PATCH 27/45] Fix typo in comment

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 library/aesni.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/library/aesni.c b/library/aesni.c
index 47795ea261..75543dfa19 100644
--- a/library/aesni.c
+++ b/library/aesni.c
@@ -136,7 +136,7 @@ static void gcm_clmul(const __m128i aa, const __m128i bb,
     ff = _mm_srli_si128(ff, 8);                      // 0:e1+f1
     ee = _mm_slli_si128(ee, 8);                      // e0+f0:0
     *dd = _mm_xor_si128(*dd, ff);                    // d1:d0+e1+f1
-    *cc = _mm_xor_si128(*cc, ee);                    // c1+e0+f1:c0
+    *cc = _mm_xor_si128(*cc, ee);                    // c1+e0+f0:c0
 }
 
 static void gcm_shift(__m128i *cc, __m128i *dd)

From 6978e739398dd1bbaca465c2b92c195ee463df31 Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Thu, 16 Mar 2023 13:08:42 +0100
Subject: [PATCH 28/45] Fix unaligned access if the context is moved during
 operation

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 library/aes.c | 43 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/library/aes.c b/library/aes.c
index 36aa7f2999..c68eddb012 100644
--- a/library/aes.c
+++ b/library/aes.c
@@ -983,6 +983,39 @@ void mbedtls_aes_decrypt(mbedtls_aes_context *ctx,
 }
 #endif /* !MBEDTLS_DEPRECATED_REMOVED */
 
+#if defined(MBEDTLS_AESNI_HAVE_CODE) || \
+    (defined(MBEDTLS_PADLOCK_C) && defined(MBEDTLS_HAVE_X86))
+/* VIA Padlock and our intrinsics-based implementation of AESNI require
+ * the round keys to be aligned on a 16-byte boundary. We take care of this
+ * before creating them, but the AES context may have moved (this can happen
+ * if the library is called from a language with managed memory), and in later
+ * calls it might have a different alignment with respect to 16-byte memory.
+ * So we may need to realign.
+ * NOTE: In the LTS branch, the context contains a pointer to within itself,
+ * so if it has been moved, things will probably go pear-shaped. We keep this
+ * code for compatibility with the development branch, in case of future changes.
+ */
+static void aes_maybe_realign(mbedtls_aes_context *ctx)
+{
+    /* We want a 16-byte alignment. Note that rk and buf are pointers to uint32_t
+     * and offset is in units of uint32_t words = 4 bytes. We want a
+     * 4-word alignment. */
+    unsigned current_offset = (unsigned)(ctx->rk - ctx->buf);
+    uintptr_t current_address = (uintptr_t)ctx->rk;
+    unsigned current_alignment = (current_address & 0x0000000f) / 4;
+    if (current_alignment != 0) {
+        unsigned new_offset = current_offset + 4 - current_alignment;
+        if (new_offset >= 4) {
+            new_offset -= 4;
+        }
+        memmove(ctx->buf + new_offset,     // new address
+                ctx->buf + current_offset, // current address
+                (ctx->nr + 1) * 16);       // number of round keys * bytes per rk
+        ctx->rk = ctx->buf + new_offset;
+    }
+}
+#endif
+
 /*
  * AES-ECB block encryption/decryption
  */
@@ -999,19 +1032,15 @@ int mbedtls_aes_crypt_ecb(mbedtls_aes_context *ctx,
 
 #if defined(MBEDTLS_AESNI_HAVE_CODE)
     if (mbedtls_aesni_has_support(MBEDTLS_AESNI_AES)) {
+        aes_maybe_realign(ctx);
         return mbedtls_aesni_crypt_ecb(ctx, mode, input, output);
     }
 #endif
 
 #if defined(MBEDTLS_PADLOCK_C) && defined(MBEDTLS_HAVE_X86)
     if (aes_padlock_ace) {
-        if (mbedtls_padlock_xcryptecb(ctx, mode, input, output) == 0) {
-            return 0;
-        }
-
-        // If padlock data misaligned, we just fall back to
-        // unaccelerated mode
-        //
+        aes_maybe_realign(ctx);
+        return mbedtls_padlock_xcryptecb(ctx, mode, input, output);
     }
 #endif
 

From 30c356c540d17d40d67a9fefc91e070f2928de00 Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Thu, 16 Mar 2023 14:58:46 +0100
Subject: [PATCH 29/45] Use consistent guards for padlock code

The padlock feature is enabled if
```
defined(MBEDTLS_PADLOCK_C) && defined(MBEDTLS_HAVE_X86)
```
with the second macro coming from `padlock.h`. The availability of the
macro `MBEDTLS_PADLOCK_ALIGN16` is coincidentally equivalent to
`MBEDTLS_HAVE_X86` but this is not meaningful.

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 library/aes.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/library/aes.c b/library/aes.c
index c68eddb012..d02319e35e 100644
--- a/library/aes.c
+++ b/library/aes.c
@@ -50,8 +50,7 @@
 #define AES_VALIDATE(cond)        \
     MBEDTLS_INTERNAL_VALIDATE(cond)
 
-#if defined(MBEDTLS_PADLOCK_C) &&                      \
-    (defined(MBEDTLS_HAVE_X86) || defined(MBEDTLS_PADLOCK_ALIGN16))
+#if defined(MBEDTLS_PADLOCK_C) && defined(MBEDTLS_HAVE_X86)
 static int aes_padlock_ace = -1;
 #endif
 
@@ -539,7 +538,7 @@ int mbedtls_aes_setkey_enc(mbedtls_aes_context *ctx, const unsigned char *key,
     }
 #endif
 
-#if defined(MBEDTLS_PADLOCK_C) && defined(MBEDTLS_PADLOCK_ALIGN16)
+#if defined(MBEDTLS_PADLOCK_C) && defined(MBEDTLS_HAVE_X86)
     if (aes_padlock_ace == -1) {
         aes_padlock_ace = mbedtls_padlock_has_support(MBEDTLS_PADLOCK_ACE);
     }
@@ -648,7 +647,7 @@ int mbedtls_aes_setkey_dec(mbedtls_aes_context *ctx, const unsigned char *key,
 
     mbedtls_aes_init(&cty);
 
-#if defined(MBEDTLS_PADLOCK_C) && defined(MBEDTLS_PADLOCK_ALIGN16)
+#if defined(MBEDTLS_PADLOCK_C) && defined(MBEDTLS_HAVE_X86)
     if (aes_padlock_ace == -1) {
         aes_padlock_ace = mbedtls_padlock_has_support(MBEDTLS_PADLOCK_ACE);
     }

From 3ba81d321783a961898ee54208bc4cfd94df3fe6 Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Thu, 16 Mar 2023 16:51:40 +0100
Subject: [PATCH 30/45] Remove the dependency of MBEDTLS_AESNI_C on
 MBEDTLS_HAVE_ASM

AESNI can now be implemented with intrinsics.

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 include/mbedtls/check_config.h | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/include/mbedtls/check_config.h b/include/mbedtls/check_config.h
index 2ab99823ed..2cb36e9e17 100644
--- a/include/mbedtls/check_config.h
+++ b/include/mbedtls/check_config.h
@@ -69,10 +69,6 @@
 #error "MBEDTLS_HAVE_TIME_DATE without MBEDTLS_HAVE_TIME does not make sense"
 #endif
 
-#if defined(MBEDTLS_AESNI_C) && !defined(MBEDTLS_HAVE_ASM)
-#error "MBEDTLS_AESNI_C defined, but not all prerequisites"
-#endif
-
 #if defined(MBEDTLS_CTR_DRBG_C) && !defined(MBEDTLS_AES_C)
 #error "MBEDTLS_CTR_DRBG_C defined, but not all prerequisites"
 #endif

From b71d40228d5373964e5e0a7107d529feb553e3ec Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Thu, 16 Mar 2023 17:14:59 +0100
Subject: [PATCH 31/45] Clean up AES context alignment code

Use a single auxiliary function to determine rk_offset, covering both
setkey_enc and setkey_dec, covering both AESNI and PADLOCK. For AESNI, only
build this when using the intrinsics-based implementation, since the
assembly implementation supports unaligned access.
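
The offset calculation can be sketched in isolation as follows. This is
a simplified stand-alone version that leaves out the runtime feature
checks; the name `rk_offset_16` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-alone version of the offset calculation.
 * Return the number of uint32_t words to skip so that (buf + offset)
 * is aligned on a 16-byte boundary. Assumes buf itself is at least
 * 4-byte aligned, as any valid uint32_t pointer normally is. */
static unsigned rk_offset_16(const uint32_t *buf)
{
    /* Position of buf within its 16-byte block, in 4-byte words (0..3) */
    unsigned delta = (unsigned) (((uintptr_t) buf & 0x0fU) / 4);
    return delta == 0 ? 0 : 4 - delta; /* 16 bytes = 4 uint32_t */
}
```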

Simplify "do we need to realign?" to "is the desired offset now equal to
the current offset?".
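
A minimal sketch of that simplified check, using a hypothetical cut-down
context type (`demo_aes_context`) and taking the desired offset as a
parameter rather than computing it. The buffer size of 68 words is an
assumption chosen to leave slack for any offset from 0 to 3.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified context: rk points into buf. */
typedef struct {
    int nr;            /* number of rounds */
    uint32_t *rk;      /* start of the round keys, inside buf */
    uint32_t buf[68];  /* round keys plus slack for alignment */
} demo_aes_context;

/* Move the round keys only if they are not already at the desired
 * offset. Offsets are in units of uint32_t words. */
static void demo_realign(demo_aes_context *ctx, unsigned desired_offset)
{
    unsigned current_offset = (unsigned) (ctx->rk - ctx->buf);
    if (desired_offset != current_offset) {
        /* Source and destination overlap, so memmove, not memcpy */
        memmove(ctx->buf + desired_offset,
                ctx->buf + current_offset,
                (size_t) (ctx->nr + 1) * 16); /* (nr + 1) round keys */
        ctx->rk = ctx->buf + desired_offset;
    }
}
```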

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 library/aes.c | 95 ++++++++++++++++++++++++++++++---------------------
 1 file changed, 56 insertions(+), 39 deletions(-)

diff --git a/library/aes.c b/library/aes.c
index d02319e35e..4eaa76dc61 100644
--- a/library/aes.c
+++ b/library/aes.c
@@ -511,6 +511,53 @@ void mbedtls_aes_xts_free(mbedtls_aes_xts_context *ctx)
 }
 #endif /* MBEDTLS_CIPHER_MODE_XTS */
 
+/* Some implementations need the round keys to be aligned.
+ * Return an offset to be added to buf, such that (buf + offset) is
+ * correctly aligned.
+ * Note that the offset is in units of elements of buf, i.e. 32-bit words,
+ * i.e. an offset of 1 means 4 bytes and so on.
+ */
+#if (defined(MBEDTLS_PADLOCK_C) && defined(MBEDTLS_HAVE_X86)) ||        \
+    defined(MBEDTLS_HAVE_AESNI_INTRINSICS)
+#define MAY_NEED_TO_ALIGN
+#endif
+static unsigned mbedtls_aes_rk_offset(uint32_t *buf)
+{
+#if defined(MAY_NEED_TO_ALIGN)
+    int align_16_bytes = 0;
+
+#if defined(MBEDTLS_PADLOCK_C) && defined(MBEDTLS_HAVE_X86)
+    if (aes_padlock_ace == -1) {
+        aes_padlock_ace = mbedtls_padlock_has_support(MBEDTLS_PADLOCK_ACE);
+    }
+    if (aes_padlock_ace) {
+        align_16_bytes = 1;
+    }
+#endif
+
+#if defined(MBEDTLS_AESNI_C) && defined(MBEDTLS_HAVE_AESNI_INTRINSICS)
+    if (mbedtls_aesni_has_support(MBEDTLS_AESNI_AES)) {
+        align_16_bytes = 1;
+    }
+#endif
+
+    if (align_16_bytes) {
+        /* These implementations need 16-byte alignment
+         * for the round key array. */
+        unsigned delta = ((uintptr_t) buf & 0x0000000fU) / 4;
+        if (delta == 0) {
+            return 0;
+        } else {
+            return 4 - delta; // 16 bytes = 4 uint32_t
+        }
+    }
+#else /* MAY_NEED_TO_ALIGN */
+    (void) buf;
+#endif /* MAY_NEED_TO_ALIGN */
+
+    return 0;
+}
+
 /*
  * AES key schedule (encryption)
  */
@@ -538,27 +585,10 @@ int mbedtls_aes_setkey_enc(mbedtls_aes_context *ctx, const unsigned char *key,
     }
 #endif
 
-#if defined(MBEDTLS_PADLOCK_C) && defined(MBEDTLS_HAVE_X86)
-    if (aes_padlock_ace == -1) {
-        aes_padlock_ace = mbedtls_padlock_has_support(MBEDTLS_PADLOCK_ACE);
-    }
-
-    if (aes_padlock_ace) {
-        ctx->rk = RK = MBEDTLS_PADLOCK_ALIGN16(ctx->buf);
-    } else
-#endif
-    ctx->rk = RK = ctx->buf;
+    ctx->rk = RK = ctx->buf + mbedtls_aes_rk_offset(ctx->buf);
 
 #if defined(MBEDTLS_AESNI_HAVE_CODE)
     if (mbedtls_aesni_has_support(MBEDTLS_AESNI_AES)) {
-        /* The intrinsics-based implementation needs 16-byte alignment
-         * for the round key array. */
-        unsigned delta = (uintptr_t) ctx->buf & 0x0000000f;
-        size_t rk_offset = 0;
-        if (delta != 0) {
-            rk_offset = 4 - delta / 4; // 16 bytes = 4 uint32_t
-        }
-        ctx->rk = RK = ctx->buf + rk_offset;
         return mbedtls_aesni_setkey_enc((unsigned char *) ctx->rk, key, keybits);
     }
 #endif
@@ -647,16 +677,7 @@ int mbedtls_aes_setkey_dec(mbedtls_aes_context *ctx, const unsigned char *key,
 
     mbedtls_aes_init(&cty);
 
-#if defined(MBEDTLS_PADLOCK_C) && defined(MBEDTLS_HAVE_X86)
-    if (aes_padlock_ace == -1) {
-        aes_padlock_ace = mbedtls_padlock_has_support(MBEDTLS_PADLOCK_ACE);
-    }
-
-    if (aes_padlock_ace) {
-        ctx->rk = RK = MBEDTLS_PADLOCK_ALIGN16(ctx->buf);
-    } else
-#endif
-    ctx->rk = RK = ctx->buf;
+    ctx->rk = RK = ctx->buf + mbedtls_aes_rk_offset(ctx->buf);
 
     /* Also checks keybits */
     if ((ret = mbedtls_aes_setkey_enc(&cty, key, keybits)) != 0) {
@@ -982,8 +1003,7 @@ void mbedtls_aes_decrypt(mbedtls_aes_context *ctx,
 }
 #endif /* !MBEDTLS_DEPRECATED_REMOVED */
 
-#if defined(MBEDTLS_AESNI_HAVE_CODE) || \
-    (defined(MBEDTLS_PADLOCK_C) && defined(MBEDTLS_HAVE_X86))
+#if defined(MAY_NEED_TO_ALIGN)
 /* VIA Padlock and our intrinsics-based implementation of AESNI require
  * the round keys to be aligned on a 16-byte boundary. We take care of this
  * before creating them, but the AES context may have moved (this can happen
@@ -1000,13 +1020,8 @@ static void aes_maybe_realign(mbedtls_aes_context *ctx)
      * and offset is in units of uint32_t words = 4 bytes. We want a
      * 4-word alignment. */
     unsigned current_offset = (unsigned)(ctx->rk - ctx->buf);
-    uintptr_t current_address = (uintptr_t)ctx->rk;
-    unsigned current_alignment = (current_address & 0x0000000f) / 4;
-    if (current_alignment != 0) {
-        unsigned new_offset = current_offset + 4 - current_alignment;
-        if (new_offset >= 4) {
-            new_offset -= 4;
-        }
+    unsigned new_offset = mbedtls_aes_rk_offset(ctx->buf);
+    if (new_offset != current_offset) {
         memmove(ctx->buf + new_offset,     // new address
                 ctx->buf + current_offset, // current address
                 (ctx->nr + 1) * 16);       // number of round keys * bytes per rk
@@ -1029,16 +1044,18 @@ int mbedtls_aes_crypt_ecb(mbedtls_aes_context *ctx,
     AES_VALIDATE_RET(mode == MBEDTLS_AES_ENCRYPT ||
                      mode == MBEDTLS_AES_DECRYPT);
 
+#if defined(MAY_NEED_TO_ALIGN)
+    aes_maybe_realign(ctx);
+#endif
+
 #if defined(MBEDTLS_AESNI_HAVE_CODE)
     if (mbedtls_aesni_has_support(MBEDTLS_AESNI_AES)) {
-        aes_maybe_realign(ctx);
         return mbedtls_aesni_crypt_ecb(ctx, mode, input, output);
     }
 #endif
 
 #if defined(MBEDTLS_PADLOCK_C) && defined(MBEDTLS_HAVE_X86)
     if (aes_padlock_ace) {
-        aes_maybe_realign(ctx);
         return mbedtls_padlock_xcryptecb(ctx, mode, input, output);
     }
 #endif

From 6dec541e6871973fb21a186bad899bed43c6c1f1 Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Thu, 16 Mar 2023 17:21:33 +0100
Subject: [PATCH 32/45] AESNI: Overhaul implementation selection

Have clearly separated code to:
* determine whether the assembly-based implementation is available;
* determine whether the intrinsics-based implementation is available;
* select one of the available implementations if any.

Now MBEDTLS_AESNI_HAVE_CODE can be the single interface for aes.c and
aesni.c to determine which AESNI is built.

Change the implementation selection: now, if both implementations are
available, always prefer assembly. Before, the intrinsics were used if
available. This preference is to minimize disruption, and will likely
be revised in a later minor release.
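
One way to express an "assembly first" preference with the preprocessor
is an `#if`/`#elif` chain, so that exactly one value gets defined. The
sketch below illustrates the stated intent and is not the code from this
patch; all `DEMO_*` names are hypothetical, and both availability macros
are defined by hand for the example.

```c
/* Hypothetical availability macros, set by hand for illustration. */
#define DEMO_HAVE_ASM_IMPL
#define DEMO_HAVE_INTRINSICS_IMPL

#define DEMO_IMPL_ASM        1
#define DEMO_IMPL_INTRINSICS 2

/* Select at most one implementation. Assembly wins if both exist. */
#if defined(DEMO_HAVE_ASM_IMPL)
#define DEMO_IMPL DEMO_IMPL_ASM
#elif defined(DEMO_HAVE_INTRINSICS_IMPL)
#define DEMO_IMPL DEMO_IMPL_INTRINSICS
#endif

/* Report which implementation was selected at compile time. */
static const char *demo_impl_name(void)
{
#if !defined(DEMO_IMPL)
    return "none";
#elif DEMO_IMPL == DEMO_IMPL_ASM
    return "assembly";
#else
    return "intrinsics";
#endif
}
```

With both availability macros defined, `demo_impl_name()` returns
"assembly".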

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 include/mbedtls/aesni.h | 35 ++++++++++++++++++++++++++---------
 library/aes.c           |  4 ++--
 library/aesni.c         | 18 +++++++++---------
 3 files changed, 37 insertions(+), 20 deletions(-)

diff --git a/include/mbedtls/aesni.h b/include/mbedtls/aesni.h
index b3d49e4380..20ad6e3843 100644
--- a/include/mbedtls/aesni.h
+++ b/include/mbedtls/aesni.h
@@ -36,6 +36,9 @@
 #define MBEDTLS_AESNI_AES      0x02000000u
 #define MBEDTLS_AESNI_CLMUL    0x00000002u
 
+/* Can we do AESNI with inline assembly?
+ * (Only implemented with gas syntax, only for 64-bit.)
+ */
 #if defined(MBEDTLS_HAVE_ASM) && defined(__GNUC__) && \
     (defined(__amd64__) || defined(__x86_64__))   &&  \
     !defined(MBEDTLS_HAVE_X86_64)
@@ -44,19 +47,33 @@
 
 #if defined(MBEDTLS_AESNI_C)
 
-#if defined(MBEDTLS_HAVE_X86_64)
-#define MBEDTLS_AESNI_HAVE_CODE // via assembly
-#endif
-
+/* Can we do AESNI with intrinsics?
+ * (Only implemented with certain compilers.)
+ */
+#undef MBEDTLS_AESNI_HAVE_INTRINSICS
 #if defined(_MSC_VER)
-#define MBEDTLS_HAVE_AESNI_INTRINSICS
+/* Visual Studio supports AESNI intrinsics since VS 2008 SP1. We only support
+ * VS 2013 and up for other reasons anyway, so no need to check the version. */
+#define MBEDTLS_AESNI_HAVE_INTRINSICS
 #endif
-#if defined(__GNUC__) && defined(__AES__)
-#define MBEDTLS_HAVE_AESNI_INTRINSICS
+/* GCC-like compilers: currently, we only support intrinsics if the requisite
+ * target flags are enabled when building the library (e.g.
+ * `gcc -maes -mpclmul -msse2` or `clang -maes -mpclmul`). */
+#if defined(__GNUC__) && defined(__AES__) && defined(__PCLMUL__)
+#define MBEDTLS_AESNI_HAVE_INTRINSICS
 #endif
 
-#if defined(MBEDTLS_HAVE_AESNI_INTRINSICS)
-#define MBEDTLS_AESNI_HAVE_CODE // via intrinsics
+/* Choose the implementation of AESNI, if one is available. */
+#undef MBEDTLS_AESNI_HAVE_CODE
+/* To minimize disruption when releasing the intrinsics-based implementation,
+ * favor the assembly-based implementation if it's available. We intend to
+ * revise this in a later release of Mbed TLS 3.x. In the long run, we will
+ * likely remove the assembly implementation. */
+#if defined(MBEDTLS_HAVE_X86_64)
+#define MBEDTLS_AESNI_HAVE_CODE 1 // via assembly
+#endif
+#if defined(MBEDTLS_AESNI_HAVE_INTRINSICS)
+#define MBEDTLS_AESNI_HAVE_CODE 2 // via intrinsics
 #endif
 
 #if defined(MBEDTLS_AESNI_HAVE_CODE)
diff --git a/library/aes.c b/library/aes.c
index 4eaa76dc61..d7e4a7ce1b 100644
--- a/library/aes.c
+++ b/library/aes.c
@@ -518,7 +518,7 @@ void mbedtls_aes_xts_free(mbedtls_aes_xts_context *ctx)
  * i.e. an offset of 1 means 4 bytes and so on.
  */
 #if (defined(MBEDTLS_PADLOCK_C) && defined(MBEDTLS_HAVE_X86)) ||        \
-    defined(MBEDTLS_HAVE_AESNI_INTRINSICS)
+    (defined(MBEDTLS_AESNI_C) && MBEDTLS_AESNI_HAVE_CODE == 2)
 #define MAY_NEED_TO_ALIGN
 #endif
 static unsigned mbedtls_aes_rk_offset(uint32_t *buf)
@@ -535,7 +535,7 @@ static unsigned mbedtls_aes_rk_offset(uint32_t *buf)
     }
 #endif
 
-#if defined(MBEDTLS_AESNI_C) && defined(MBEDTLS_HAVE_AESNI_INTRINSICS)
+#if defined(MBEDTLS_AESNI_C) && MBEDTLS_AESNI_HAVE_CODE == 2
     if (mbedtls_aesni_has_support(MBEDTLS_AESNI_AES)) {
         align_16_bytes = 1;
     }
diff --git a/library/aesni.c b/library/aesni.c
index 75543dfa19..c909f654c6 100644
--- a/library/aesni.c
+++ b/library/aesni.c
@@ -36,9 +36,9 @@
 #endif
 /* *INDENT-ON* */
 
-#if defined(MBEDTLS_HAVE_AESNI_INTRINSICS) || defined(MBEDTLS_HAVE_X86_64)
+#if defined(MBEDTLS_AESNI_HAVE_CODE)
 
-#if defined(MBEDTLS_HAVE_AESNI_INTRINSICS)
+#if MBEDTLS_AESNI_HAVE_CODE == 2
 #if !defined(_WIN32)
 #include <cpuid.h>
 #endif
@@ -54,7 +54,7 @@ int mbedtls_aesni_has_support(unsigned int what)
     static unsigned int c = 0;
 
     if (!done) {
-#if defined(MBEDTLS_HAVE_AESNI_INTRINSICS)
+#if MBEDTLS_AESNI_HAVE_CODE == 2
         static unsigned info[4] = { 0, 0, 0, 0 };
 #if defined(_MSC_VER)
         __cpuid(info, 1);
@@ -62,20 +62,20 @@ int mbedtls_aesni_has_support(unsigned int what)
         __cpuid(1, info[0], info[1], info[2], info[3]);
 #endif
         c = info[2];
-#else
+#else /* AESNI using asm */
         asm ("movl  $1, %%eax   \n\t"
              "cpuid             \n\t"
              : "=c" (c)
              :
              : "eax", "ebx", "edx");
-#endif
+#endif /* MBEDTLS_AESNI_HAVE_CODE */
         done = 1;
     }
 
     return (c & what) != 0;
 }
 
-#if defined(MBEDTLS_HAVE_AESNI_INTRINSICS)
+#if MBEDTLS_AESNI_HAVE_CODE == 2
 
 /*
  * AES-NI AES-ECB block en(de)cryption
@@ -394,7 +394,7 @@ static void aesni_setkey_enc_256(unsigned char *rk_bytes,
     aesni_set_rk_256(rk[12], rk[13], _mm_aeskeygenassist_si128(rk[13], 0x40), &rk[14], &rk[15]);
 }
 
-#else  /* MBEDTLS_HAVE_AESNI_INTRINSICS */
+#else /* MBEDTLS_AESNI_HAVE_CODE == 1 */
 
 #if defined(__has_feature)
 #if __has_feature(memory_sanitizer)
@@ -782,7 +782,7 @@ static void aesni_setkey_enc_256(unsigned char *rk,
          : "memory", "cc", "0");
 }
 
-#endif  /* MBEDTLS_HAVE_AESNI_INTRINSICS */
+#endif  /* MBEDTLS_AESNI_HAVE_CODE */
 
 /*
  * Key expansion, wrapper
@@ -801,6 +801,6 @@ int mbedtls_aesni_setkey_enc(unsigned char *rk,
     return 0;
 }
 
-#endif /* MBEDTLS_HAVE_X86_64 */
+#endif /* MBEDTLS_AESNI_HAVE_CODE */
 
 #endif /* MBEDTLS_AESNI_C */

From e5038c666e6057242356854ae3addc340690f0da Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Thu, 16 Mar 2023 17:49:44 +0100
Subject: [PATCH 33/45] Document the new state of AESNI support

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 include/mbedtls/config.h | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)
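
Note: the "target supporting AESNI" case in the documentation below can be
checked at compile time. The following is a minimal sketch, assuming the
compiler predefines `__AES__` and `__PCLMUL__` when invoked with
`-maes -mpclmul` (GCC and Clang do). The `DEMO_*` names are hypothetical
and are not part of the library.

```c
#include <stddef.h>

/* Hypothetical macro: 1 if an intrinsics-based build looks possible. */
#if defined(_MSC_VER)
#define DEMO_HAVE_INTRINSICS 1   /* Visual Studio: intrinsics available */
#elif defined(__GNUC__) && defined(__AES__) && defined(__PCLMUL__)
#define DEMO_HAVE_INTRINSICS 1   /* GCC-like compiler with -maes -mpclmul */
#else
#define DEMO_HAVE_INTRINSICS 0   /* fall back to assembly or plain C */
#endif

/* Describe which build mode the macro above selected. */
static const char *demo_aesni_build_mode(void)
{
#if DEMO_HAVE_INTRINSICS
    return "intrinsics available";
#else
    return "intrinsics not available";
#endif
}
```

Building the same file with and without `-maes -mpclmul` should flip the
string returned by `demo_aesni_build_mode()` on a GCC-like compiler.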

diff --git a/include/mbedtls/config.h b/include/mbedtls/config.h
index acdb7acb36..1381c1fd16 100644
--- a/include/mbedtls/config.h
+++ b/include/mbedtls/config.h
@@ -51,7 +51,7 @@
  *      include/mbedtls/bn_mul.h
  *
  * Required by:
- *      MBEDTLS_AESNI_C
+ *      MBEDTLS_AESNI_C (on some platforms)
  *      MBEDTLS_PADLOCK_C
  *
  * Comment to disable the use of assembly code.
@@ -2344,14 +2344,32 @@
 /**
  * \def MBEDTLS_AESNI_C
  *
- * Enable AES-NI support on x86-64.
+ * Enable AES-NI support on x86-64 or x86-32.
+ *
+ * \note AESNI is only supported with certain compilers and target options:
+ * - Visual Studio 2013: supported.
+ * - GCC, x86-64, target not explicitly supporting AESNI:
+ *   requires MBEDTLS_HAVE_ASM.
+ * - GCC, x86-32, target not explicitly supporting AESNI:
+ *   not supported.
+ * - GCC, x86-64 or x86-32, target supporting AESNI: supported.
+ *   For this assembly-less implementation, you must currently compile
+ *   `library/aesni.c` and `library/aes.c` with machine options to enable
+ *   SSE2 and AESNI instructions: `gcc -msse2 -maes -mpclmul` or
+ *   `clang -maes -mpclmul`.
+ * - Non-x86 targets: this option is silently ignored.
+ * - Other compilers: this option is silently ignored.
+ *
+ * \note
+ * Above, "GCC" includes compatible compilers such as Clang.
+ * The limitations on target support are likely to be relaxed in the future.
  *
  * Module:  library/aesni.c
  * Caller:  library/aes.c
  *
- * Requires: MBEDTLS_HAVE_ASM
+ * Requires: MBEDTLS_HAVE_ASM (on some platforms, see note)
  *
- * This modules adds support for the AES-NI instructions on x86-64
+ * This module adds support for the AES-NI instructions on x86.
  */
 #define MBEDTLS_AESNI_C
 

From 9a8bf9f85d01aa4d1d2cf66f32ea33243ae91a5a Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Thu, 16 Mar 2023 17:50:15 +0100
Subject: [PATCH 34/45] Announce the expanded AESNI support

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 ChangeLog.d/aesni.txt | 7 +++++++
 1 file changed, 7 insertions(+)
 create mode 100644 ChangeLog.d/aesni.txt

diff --git a/ChangeLog.d/aesni.txt b/ChangeLog.d/aesni.txt
new file mode 100644
index 0000000000..2d90a6e1cc
--- /dev/null
+++ b/ChangeLog.d/aesni.txt
@@ -0,0 +1,7 @@
+Features
+   * AES-NI is now supported with Visual Studio.
+   * AES-NI is now supported in 32-bit builds, or when MBEDTLS_HAVE_ASM
+     is disabled, when compiling with GCC or Clang or a compatible compiler
+     for a target CPU that supports the requisite instructions (for example
+     gcc -m32 -msse2 -maes -mpclmul). (Generic x86 builds with GCC-like
+     compilers still require MBEDTLS_HAVE_ASM and a 64-bit target.)

From 3efd3149f80874ed4459e7424c8c0eb900ba9bf5 Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Fri, 17 Mar 2023 17:29:58 +0100
Subject: [PATCH 35/45] Finish sentence in comment

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 include/mbedtls/aesni.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/mbedtls/aesni.h b/include/mbedtls/aesni.h
index 20ad6e3843..ae306f61a6 100644
--- a/include/mbedtls/aesni.h
+++ b/include/mbedtls/aesni.h
@@ -48,7 +48,7 @@
 #if defined(MBEDTLS_AESNI_C)
 
 /* Can we do AESNI with intrinsics?
- * (Only implemented with certain compilers, .)
+ * (Only implemented with certain compilers, only for certain targets.)
  */
 #undef MBEDTLS_AESNI_HAVE_INTRINSICS
 #if defined(_MSC_VER)

From 9494a99c2f231c136f8e4b10150a555f1c029282 Mon Sep 17 00:00:00 2001
From: Gilles Peskine <Gilles.Peskine@arm.com>
Date: Fri, 17 Mar 2023 17:30:29 +0100
Subject: [PATCH 36/45] Fix preprocessor conditional

This was intended as an if-else-if chain. Make it so.

Signed-off-by: Gilles Peskine <Gilles.Peskine@arm.com>
---
 include/mbedtls/aesni.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
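
Note: a minimal sketch of why the chain matters, using hypothetical `DEMO_*`
macros in place of the real ones. When both conditions hold, two independent
`#if` blocks would define the macro twice with different values, which
compilers diagnose as an incompatible redefinition. With `#elif`, only the
first matching branch is taken.

```c
/* Hypothetical stand-ins: pretend both conditions hold on this target. */
#define DEMO_HAVE_X86_64
#define DEMO_HAVE_INTRINSICS

/* if-else-if chain: the first matching branch wins. */
#if defined(DEMO_HAVE_X86_64)
#define DEMO_HAVE_CODE 1 /* via assembly */
#elif defined(DEMO_HAVE_INTRINSICS)
#define DEMO_HAVE_CODE 2 /* via intrinsics */
#endif

/* Report the selected implementation, mirroring the values above. */
static const char *demo_code_path(void)
{
#if DEMO_HAVE_CODE == 1
    return "assembly";
#elif DEMO_HAVE_CODE == 2
    return "intrinsics";
#else
    return "(unknown)";
#endif
}
```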

diff --git a/include/mbedtls/aesni.h b/include/mbedtls/aesni.h
index ae306f61a6..95c47124bc 100644
--- a/include/mbedtls/aesni.h
+++ b/include/mbedtls/aesni.h
@@ -71,8 +71,7 @@
  * likely remove the assembly implementation. */
 #if defined(MBEDTLS_HAVE_X86_64)
 #define MBEDTLS_AESNI_HAVE_CODE 1 // via assembly
-#endif
-#if defined(MBEDTLS_AESNI_HAVE_INTRINSICS)
+#elif defined(MBEDTLS_AESNI_HAVE_INTRINSICS)
 #define MBEDTLS_AESNI_HAVE_CODE 2 // via intrinsics
 #endif
 

From 58550acba0ad599d3892cfeac49994940f1703d0 Mon Sep 17 00:00:00 2001
From: Tom Cosgrove <tom.cosgrove@arm.com>
Date: Fri, 17 Mar 2023 16:54:59 +0000
Subject: [PATCH 37/45] Fix merge errors in backporting

Signed-off-by: Tom Cosgrove <tom.cosgrove@arm.com>
---
 library/aes.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/library/aes.c b/library/aes.c
index d7e4a7ce1b..0ab3ea7f61 100644
--- a/library/aes.c
+++ b/library/aes.c
@@ -693,17 +693,6 @@ int mbedtls_aes_setkey_dec(mbedtls_aes_context *ctx, const unsigned char *key,
         goto exit;
     }
 #endif
-#if defined(MBEDTLS_AESNI_HAVE_CODE)
-    if (mbedtls_aesni_has_support(MBEDTLS_AESNI_AES)) {
-        /* The intrinsics-based implementation needs 16-byte alignment
-         * for the round key array. */
-        unsigned delta = (uintptr_t) ctx->buf & 0x0000000f;
-        if (delta != 0) {
-            size_t rk_offset = 4 - delta / 4; // 16 bytes = 4 uint32_t
-            ctx->rk = RK = ctx->buf + rk_offset;
-        }
-    }
-#endif
 
     SK = cty.rk + cty.nr * 4;
 
@@ -1016,9 +1005,6 @@ void mbedtls_aes_decrypt(mbedtls_aes_context *ctx,
  */
 static void aes_maybe_realign(mbedtls_aes_context *ctx)
 {
-    /* We want a 16-byte alignment. Note that rk and buf are pointers to uint32_t
-     * and offset is in units of uint32_t words = 4 bytes. We want a
-     * 4-word alignment. */
     unsigned current_offset = (unsigned)(ctx->rk - ctx->buf);
     unsigned new_offset = mbedtls_aes_rk_offset(ctx->buf);
     if (new_offset != current_offset) {

From 779199faac8c1c0d88bee91f8c7512f5c67d1e44 Mon Sep 17 00:00:00 2001
From: Tom Cosgrove <tom.cosgrove@arm.com>
Date: Fri, 17 Mar 2023 17:16:53 +0000
Subject: [PATCH 38/45] Document that MBEDTLS_AESNI_HAVE_INTRINSICS and
 MBEDTLS_AESNI_HAVE_CODE are internal macros, despite appearing in a public
 header file.

Signed-off-by: Tom Cosgrove <tom.cosgrove@arm.com>
---
 include/mbedtls/aesni.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/mbedtls/aesni.h b/include/mbedtls/aesni.h
index 95c47124bc..6741dead05 100644
--- a/include/mbedtls/aesni.h
+++ b/include/mbedtls/aesni.h
@@ -49,6 +49,9 @@
 
 /* Can we do AESNI with intrinsics?
  * (Only implemented with certain compilers, only for certain targets.)
+ *
+ * NOTE: MBEDTLS_AESNI_HAVE_INTRINSICS and MBEDTLS_AESNI_HAVE_CODE are internal
+ *       macros that may change in future releases.
  */
 #undef MBEDTLS_AESNI_HAVE_INTRINSICS
 #if defined(_MSC_VER)

From 3b53caed9f3ed0c55836ff22ca6a339f37146812 Mon Sep 17 00:00:00 2001
From: Tom Cosgrove <tom.cosgrove@arm.com>
Date: Fri, 17 Mar 2023 18:25:36 +0000
Subject: [PATCH 39/45] Remove references to MBEDTLS_AESCE_C and
 MBEDTLS_HAVE_ARM64 that aren't needed in this backport

Signed-off-by: Tom Cosgrove <tom.cosgrove@arm.com>
---
 library/aes.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/library/aes.c b/library/aes.c
index 0ab3ea7f61..f199270535 100644
--- a/library/aes.c
+++ b/library/aes.c
@@ -1848,11 +1848,6 @@ int mbedtls_aes_self_test(int verbose)
         if (mbedtls_aesni_has_support(MBEDTLS_AESNI_AES)) {
             mbedtls_printf("  AES note: using AESNI.\n");
         } else
-#endif
-#if defined(MBEDTLS_AESCE_C) && defined(MBEDTLS_HAVE_ARM64)
-        if (mbedtls_aesce_has_support()) {
-            mbedtls_printf("  AES note: using AESCE.\n");
-        } else
 #endif
         mbedtls_printf("  AES note: built-in implementation.\n");
 #endif /* MBEDTLS_AES_ALT */

From e0c75342fcc3deb453bb187d9ebfc3c16ef2e0d0 Mon Sep 17 00:00:00 2001
From: Tom Cosgrove <tom.cosgrove@arm.com>
Date: Sat, 18 Mar 2023 13:54:26 +0000
Subject: [PATCH 40/45] Fix another backport issue: it's VS2010/ not VS2013/

Signed-off-by: Tom Cosgrove <tom.cosgrove@arm.com>
---
 .travis.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.travis.yml b/.travis.yml
index ed2910a0d5..7871fe9cda 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -70,7 +70,7 @@ jobs:
       os: windows
       script:
         - scripts/windows_msbuild.bat v141 # Visual Studio 2017
-        - visualc/VS2013/x64/Release/selftest.exe
+        - visualc/VS2010/x64/Release/selftest.exe
 
 after_failure:
 - tests/scripts/travis-log-failure.sh

From 20458c0963bd3e72cd54fb029859871193360683 Mon Sep 17 00:00:00 2001
From: Tom Cosgrove <tom.cosgrove@arm.com>
Date: Sat, 18 Mar 2023 14:48:49 +0000
Subject: [PATCH 41/45] Have selftest print more information about the AESNI
 build

Signed-off-by: Tom Cosgrove <tom.cosgrove@arm.com>
---
 library/aes.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/library/aes.c b/library/aes.c
index f199270535..414c42c1db 100644
--- a/library/aes.c
+++ b/library/aes.c
@@ -1846,7 +1846,15 @@ int mbedtls_aes_self_test(int verbose)
 #endif
 #if defined(MBEDTLS_AESNI_HAVE_CODE)
         if (mbedtls_aesni_has_support(MBEDTLS_AESNI_AES)) {
-            mbedtls_printf("  AES note: using AESNI.\n");
+            mbedtls_printf("  AES note: using AESNI via ");
+#if MBEDTLS_AESNI_HAVE_CODE == 1
+            mbedtls_printf("assembly");
+#elif MBEDTLS_AESNI_HAVE_CODE == 2
+            mbedtls_printf("intrinsics");
+#else
+            mbedtls_printf("(unknown)");
+#endif
+            mbedtls_printf(".\n");
         } else
 #endif
         mbedtls_printf("  AES note: built-in implementation.\n");

From 9149e12767515aaa9390779db27802aa2b537b8a Mon Sep 17 00:00:00 2001
From: Tom Cosgrove <tom.cosgrove@arm.com>
Date: Sat, 18 Mar 2023 14:49:07 +0000
Subject: [PATCH 42/45] Stop selftest hanging when run on CI

Signed-off-by: Tom Cosgrove <tom.cosgrove@arm.com>
---
 .travis.yml              |  2 +-
 programs/test/selftest.c | 13 +++++++++++--
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/.travis.yml b/.travis.yml
index 7871fe9cda..ada8fc5c67 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -70,7 +70,7 @@ jobs:
       os: windows
       script:
         - scripts/windows_msbuild.bat v141 # Visual Studio 2017
-        - visualc/VS2010/x64/Release/selftest.exe
+        - visualc/VS2010/x64/Release/selftest.exe --ci
 
 after_failure:
 - tests/scripts/travis-log-failure.sh
diff --git a/programs/test/selftest.c b/programs/test/selftest.c
index 598c66e144..229f0d80a9 100644
--- a/programs/test/selftest.c
+++ b/programs/test/selftest.c
@@ -353,6 +353,9 @@ int main(int argc, char *argv[])
     unsigned char buf[1000000];
 #endif
     void *pointer;
+#if defined(_WIN32)
+    int ci = 0; /* ci = 1 => running in CI, so don't wait for a key press */
+#endif
 
     /*
      * The C standard doesn't guarantee that all-bits-0 is the representation
@@ -380,6 +383,10 @@ int main(int argc, char *argv[])
         } else if (strcmp(*argp, "--exclude") == 0 ||
                    strcmp(*argp, "-x") == 0) {
             exclude_mode = 1;
+#if defined(_WIN32)
+        } else if (strcmp(*argp, "--ci") == 0) {
+            ci = 1;
+#endif
         } else {
             break;
         }
@@ -450,8 +457,10 @@ int main(int argc, char *argv[])
             mbedtls_printf("  [ All tests PASS ]\n\n");
         }
 #if defined(_WIN32)
-        mbedtls_printf("  Press Enter to exit this program.\n");
-        fflush(stdout); getchar();
+        if (!ci) {
+            mbedtls_printf("  Press Enter to exit this program.\n");
+            fflush(stdout); getchar();
+        }
 #endif
     }
 

From 2c942a35ff7bf5d8be58cbf5f3321dd938b96b85 Mon Sep 17 00:00:00 2001
From: Tom Cosgrove <tom.cosgrove@arm.com>
Date: Sun, 19 Mar 2023 14:04:04 +0000
Subject: [PATCH 43/45] Fix code style nit

Signed-off-by: Tom Cosgrove <tom.cosgrove@arm.com>
---
 library/aes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/library/aes.c b/library/aes.c
index 414c42c1db..f08a21f595 100644
--- a/library/aes.c
+++ b/library/aes.c
@@ -1005,7 +1005,7 @@ void mbedtls_aes_decrypt(mbedtls_aes_context *ctx,
  */
 static void aes_maybe_realign(mbedtls_aes_context *ctx)
 {
-    unsigned current_offset = (unsigned)(ctx->rk - ctx->buf);
+    unsigned current_offset = (unsigned) (ctx->rk - ctx->buf);
     unsigned new_offset = mbedtls_aes_rk_offset(ctx->buf);
     if (new_offset != current_offset) {
         memmove(ctx->buf + new_offset,     // new address

From 640b761e49ab20a7ad6347262fb70dfde6b689e3 Mon Sep 17 00:00:00 2001
From: Tom Cosgrove <tom.cosgrove@arm.com>
Date: Sun, 19 Mar 2023 15:07:06 +0000
Subject: [PATCH 44/45] Print out AESNI mechanism used by GCM in self-test

Signed-off-by: Tom Cosgrove <tom.cosgrove@arm.com>
---
 library/gcm.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/library/gcm.c b/library/gcm.c
index 5994cf6e05..0c958c729a 100644
--- a/library/gcm.c
+++ b/library/gcm.c
@@ -760,7 +760,15 @@ int mbedtls_gcm_self_test(int verbose)
 #else /* MBEDTLS_GCM_ALT */
 #if defined(MBEDTLS_AESNI_HAVE_CODE)
         if (mbedtls_aesni_has_support(MBEDTLS_AESNI_CLMUL)) {
-            mbedtls_printf("  GCM note: using AESNI.\n");
+            mbedtls_printf("  GCM note: using AESNI via ");
+#if MBEDTLS_AESNI_HAVE_CODE == 1
+            mbedtls_printf("assembly");
+#elif MBEDTLS_AESNI_HAVE_CODE == 2
+            mbedtls_printf("intrinsics");
+#else
+            mbedtls_printf("(unknown)");
+#endif
+            mbedtls_printf(".\n");
         } else
 #endif
         mbedtls_printf("  GCM note: built-in implementation.\n");

From b5eb8318035fcba9cd17b4fc02663e2a2bafc9b2 Mon Sep 17 00:00:00 2001
From: Tom Cosgrove <tom.cosgrove@arm.com>
Date: Mon, 20 Mar 2023 10:57:42 +0000
Subject: [PATCH 45/45] Add tests for unaligned AES contexts

Signed-off-by: Tom Cosgrove <tom.cosgrove@arm.com>
---
 tests/suites/test_suite_aes.ecb.data |   9 ++
 tests/suites/test_suite_aes.function | 118 +++++++++++++++++++++++++++
 2 files changed, 127 insertions(+)
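
Note: the tests below shift a context by placing a `char` member in front of
it, so the round key buffer can start away from a 16-byte boundary. As a
rough sketch of the arithmetic involved (not the library's code), the helper
below computes how many `uint32_t` words must be skipped to reach the next
16-byte boundary. The name `demo_rk_offset` is hypothetical, and the helper
assumes the address is already 4-byte aligned.

```c
#include <stdint.h>

/* Number of uint32_t words to skip from buf_addr to reach the next
 * 16-byte boundary. Assumes buf_addr is a multiple of 4. */
static unsigned demo_rk_offset(uintptr_t buf_addr)
{
    unsigned delta = (unsigned) (buf_addr & 0x0f); /* bytes past boundary */
    if (delta == 0) {
        return 0;                /* already 16-byte aligned */
    }
    return (16 - delta) / 4;     /* 16 bytes = 4 uint32_t words */
}
```

For example, an address ending in `0x8` needs an offset of 2 words, and one
ending in `0x4` needs 3.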

diff --git a/tests/suites/test_suite_aes.ecb.data b/tests/suites/test_suite_aes.ecb.data
index 6349034a69..faf69c04dc 100644
--- a/tests/suites/test_suite_aes.ecb.data
+++ b/tests/suites/test_suite_aes.ecb.data
@@ -228,3 +228,12 @@ aes_decrypt_ecb:"000000000000000000000000000000000000000000000000000000000000000
 
 AES-256-ECB Decrypt NIST KAT #12
 aes_decrypt_ecb:"0000000000000000000000000000000000000000000000000000000000000000":"9b80eefb7ebe2d2b16247aa0efc72f5d":"e0000000000000000000000000000000":0
+
+AES-128-ECB context alignment
+aes_ecb_context_alignment:"000102030405060708090a0b0c0d0e0f"
+
+AES-192-ECB context alignment
+aes_ecb_context_alignment:"000102030405060708090a0b0c0d0e0f1011121314151617"
+
+AES-256-ECB context alignment
+aes_ecb_context_alignment:"000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f"
diff --git a/tests/suites/test_suite_aes.function b/tests/suites/test_suite_aes.function
index 6b92b870b1..e96e40790d 100644
--- a/tests/suites/test_suite_aes.function
+++ b/tests/suites/test_suite_aes.function
@@ -1,5 +1,52 @@
 /* BEGIN_HEADER */
 #include "mbedtls/aes.h"
+
+/* Test AES with a copied context.
+ *
+ * enc and dec must be AES context objects. They don't need to
+ * be initialized, and are left freed.
+ */
+static int test_ctx_alignment(const data_t *key,
+                              mbedtls_aes_context *enc,
+                              mbedtls_aes_context *dec)
+{
+    unsigned char plaintext[16] = {
+        0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+        0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
+    };
+    unsigned char ciphertext[16];
+    unsigned char output[16];
+
+    // Set key and encrypt with original context
+    mbedtls_aes_init(enc);
+    TEST_ASSERT(mbedtls_aes_setkey_enc(enc, key->x, key->len * 8) == 0);
+    TEST_ASSERT(mbedtls_aes_crypt_ecb(enc, MBEDTLS_AES_ENCRYPT,
+                                      plaintext, ciphertext) == 0);
+
+    // Set key for decryption with original context
+    mbedtls_aes_init(dec);
+    TEST_ASSERT(mbedtls_aes_setkey_dec(dec, key->x, key->len * 8) == 0);
+
+    // Wipe the original context to make sure nothing from it is used
+    memset(enc, 0, sizeof(*enc));
+    mbedtls_aes_free(enc);
+
+    // Decrypt
+    TEST_ASSERT(mbedtls_aes_crypt_ecb(dec, MBEDTLS_AES_DECRYPT,
+                                      ciphertext, output) == 0);
+    ASSERT_COMPARE(plaintext, 16, output, 16);
+
+    mbedtls_aes_free(dec);
+
+    return 1;
+
+exit:
+    /* Bug: we may be leaving something unfreed. This is harmless
+     * in our built-in implementations, but might cause a memory leak
+     * with alternative implementations. */
+    return 0;
+}
+
 /* END_HEADER */
 
 /* BEGIN_DEPENDENCIES
@@ -621,6 +668,77 @@ void aes_misc_params()
 }
 /* END_CASE */
 
+/* BEGIN_CASE */
+void aes_ecb_context_alignment(data_t *key)
+{
+    /* We test alignment multiple times, with different alignments
+     * of the context and of the plaintext/ciphertext. */
+
+    struct align0 {
+        mbedtls_aes_context ctx;
+    };
+    struct align0 *enc0 = NULL;
+    struct align0 *dec0 = NULL;
+
+    struct align1 {
+        char bump;
+        mbedtls_aes_context ctx;
+    };
+    struct align1 *enc1 = NULL;
+    struct align1 *dec1 = NULL;
+
+    /* Both contexts at peak alignment */
+    ASSERT_ALLOC(enc0, 1);
+    ASSERT_ALLOC(dec0, 1);
+    if (!test_ctx_alignment(key, &enc0->ctx, &dec0->ctx)) {
+        goto exit;
+    }
+    mbedtls_free(enc0);
+    enc0 = NULL;
+    mbedtls_free(dec0);
+    dec0 = NULL;
+
+    /* Enc aligned, dec not */
+    ASSERT_ALLOC(enc0, 1);
+    ASSERT_ALLOC(dec1, 1);
+    if (!test_ctx_alignment(key, &enc0->ctx, &dec1->ctx)) {
+        goto exit;
+    }
+    mbedtls_free(enc0);
+    enc0 = NULL;
+    mbedtls_free(dec1);
+    dec1 = NULL;
+
+    /* Dec aligned, enc not */
+    ASSERT_ALLOC(enc1, 1);
+    ASSERT_ALLOC(dec0, 1);
+    if (!test_ctx_alignment(key, &enc1->ctx, &dec0->ctx)) {
+        goto exit;
+    }
+    mbedtls_free(enc1);
+    enc1 = NULL;
+    mbedtls_free(dec0);
+    dec0 = NULL;
+
+    /* Both shifted */
+    ASSERT_ALLOC(enc1, 1);
+    ASSERT_ALLOC(dec1, 1);
+    if (!test_ctx_alignment(key, &enc1->ctx, &dec1->ctx)) {
+        goto exit;
+    }
+    mbedtls_free(enc1);
+    enc1 = NULL;
+    mbedtls_free(dec1);
+    dec1 = NULL;
+
+exit:
+    mbedtls_free(enc0);
+    mbedtls_free(dec0);
+    mbedtls_free(enc1);
+    mbedtls_free(dec1);
+}
+/* END_CASE */
+
 /* BEGIN_CASE depends_on:MBEDTLS_SELF_TEST */
 void aes_selftest()
 {