|capture hardware = CW-Lite, CW-Lite 2-Part, CW-Pro
|Target Device =
|Target Architecture = XMEGA/Arm/Other
|Hardware Crypto = No
|Purchase Hardware =
}}
This tutorial will introduce you to measuring the power consumption of a device under attack. It will demonstrate how you can view the difference between assembly instructions, such as an 'add' instruction and a 'mul' instruction.
== Prerequisites ==
== Setting Up the Example ==
In this tutorial, we will once again be working from the <code>simpleserial-base</code> firmware. As with the previous tutorial, you'll need a copy of the firmware you want to modify, and you'll need to be able to build it for your platform. The instructions are repeated in the drop-down menus below, but if you're comfortable with the previous example, feel free to skip them and just build your new firmware. Alternatively, if you're not too attached to your code, you can modify your firmware from [[Tutorial B1 Building a SimpleSerial Project]] and rebuild it from the same directory.
{{CollapsibleSection
|intro = === Building for CWLite with XMEGA Target ===
<li>The ''ADC Freq'' should show 4x the clock speed of your device (typically 29.5MHz), and the ''DCM Locked'' checkbox __MUST__ be checked. If the ''DCM Locked'' checkbox is NOT checked, try hitting the ''Reset ADC DCM'' button again.</li>
<li><p>At this point you can hit the ''Capture 1'' button, and see if the system works! You should end up with a window looking like this:</p>
<p>[[File:05_Low_Gain.PNG|image|1250px1083x1083px]]</p>
<p>Whilst there is a waveform, you need to adjust the capture settings. There are two main settings of importance, the analog gain and number of samples to capture.</p></li>
[[File:06_high_gain.PNG|image|1250px1083x1083px]]</ol>
<ol start="16" style="list-style-type: decimal;">
== Modifying the Target ==
=== Background on Setup (XMEGA) ===
While this tutorial can be performed on any supported target, results will vary between targets. Arm targets, for example, have pipelining, which complicates how long instructions take and when they happen. The same concepts apply, but the specifics will be different.
The rest of this section will focus on the ATxmega128D4 (the CW303 XMEGA target), since correlating instructions to power consumption is typically simpler on it. We are comparing the power consumption of two different instructions, the <code>MUL</code> (multiply) instruction and the <code>NOP</code> (no operation) instruction. Some information on these two instructions:
Note that the capture clock is running at 4x the device clock, so it takes 4 samples to cover a complete clock cycle. A single <code>mul</code> instruction, which takes 2 clock cycles, should therefore span 8 samples on our output graph, while a single-cycle <code>nop</code> should span 4.
==== Initial Code ====
The initial code has a power signature something like this (yours will vary based on various physical considerations, and depending on whether you are using an XMEGA or AVR device):
Note that the 10 <code>mul</code> instructions would be expected to take 80 samples to complete, and the 10 <code>nop</code> instructions should take 40 samples to complete. By modifying the code we can determine exactly which portion of the trace corresponds to which operations.
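The expected window lengths above are just the instruction count multiplied by the cycles per instruction and by the 4 samples captured per clock cycle. A minimal sketch of that arithmetic follows. The function name <code>expected_samples</code> is hypothetical (it is not part of the ChipWhisperer firmware), and the cycle counts passed to it (2 for <code>mul</code>, 1 for <code>nop</code>) are the ones implied by the sample counts in this section.

```c
#include <assert.h>

/*
 * Hypothetical helper (illustration only): number of captured samples
 * spanned by `count` back-to-back instructions that each take
 * `cycles_per_insn` device clock cycles, when the capture clock runs at
 * `samples_per_cycle` times the device clock.
 */
unsigned expected_samples(unsigned count, unsigned cycles_per_insn,
                          unsigned samples_per_cycle)
{
    /* A capture clock slower than the device clock makes no sense here. */
    assert(samples_per_cycle > 0);
    return count * cycles_per_insn * samples_per_cycle;
}
```

Calling <code>expected_samples(10, 2, 4)</code> gives the 80 samples quoted for the ten <code>mul</code> instructions, and <code>expected_samples(10, 1, 4)</code> gives the 40 samples quoted for the ten <code>nop</code> instructions. This ignores any pipeline or memory effects, so treat the result as a first estimate of where to look in the trace.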
==== Increase number of NOPs ====
We will then modify the code to have twenty NOP operations in a row instead of ten. The modified code looks like this:
Pay particular attention to the section between sample number 0 and sample number 180. This is where we can compare the two power graphs to see the effect of the modified code. We can actually 'see' the change in operation of the device! It would appear the <code>nop</code> instructions are occurring from approximately samples 10 to 90, and the <code>mul</code> instructions from 90 to 170.
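If you export two traces as arrays of samples, the visual comparison described above can also be approximated in code. The following is only a rough sketch under stated assumptions: the function name <code>first_divergence</code> and the threshold parameter are hypothetical, the traces are assumed to be aligned to the same trigger, and real captures contain noise, so a threshold that works for one setup may not work for another.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (illustration only): return the index of the first
 * sample at which two trigger-aligned traces differ by more than
 * `threshold`. Returns `n` if the traces never differ by that much.
 */
size_t first_divergence(const double *a, const double *b, size_t n,
                        double threshold)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        double diff = a[i] - b[i];
        if (diff < 0.0) {
            diff = -diff; /* absolute difference without needing libm */
        }
        if (diff > threshold) {
            return i;
        }
    }
    return n;
}
```

Applied to the original trace and the twenty-NOP trace, the returned index would be expected to fall near the sample where the first block of <code>nop</code> instructions used to end, since that is where the two programs start doing different things.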
==== Add NOP loop after MUL ====
Finally, we will add 10 more NOPs after the 10 MULs. The code should look something like this:
</blockquote>
==== Comparison of All Three ====
The following graph lines the three options up. One can see where adding loops of different operations shows up in the power consumption.
<blockquote>[[File:nop_mul_comparison.png|image]]
</blockquote>
=== Background on Setup (Arm) ===
For the rest of this tutorial, we'll be focusing on the STM32F3, which is the microcontroller on the CW303 Arm target (though other targets should demonstrate the same principles). Since the STM32F3 is an Arm Cortex M4 device, we'll need to refer to the [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0553a/CHDJJGFB.html Cortex M4 Instruction Set] and the [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/CHDDIGAC.html Cortex M4 Instruction Set Summary].
The first thing we'll do is replace the <code>nop</code> instructions, since from its documentation page we can see the processor may not execute them. Instead, let's add some <code>add.w</code> instructions (the 32-bit-wide version of the add instruction). We do this because the <code>mul</code> instruction is always 32 bits wide, and a 16-bit Thumb instruction has a different power profile than a 32-bit instruction. From the earlier links, we can see that both add and mul take 1 cycle each to complete.
Now we should have 10 <code>add.w</code> instructions and 10 <code>mul</code> instructions:<syntaxhighlight lang="c">
trigger_high();
asm volatile(
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
::
);
asm volatile(
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
::
);
trigger_low();
</syntaxhighlight>Now hit the ''Run 1'' [[File:Capture One Button.PNG|image]] button and capture a single trace. You should now have something that looks like this:
[[File:B2 STM Addmul.PNG|frameless|1155x1155px]]
We can see the <code>add.w</code> and <code>mul</code> instructions near the beginning, starting about 10 samples in and ending about 90 samples in. There's not really any visible difference between the two, but we can see that together they take up about 80 samples (20 microcontroller clock cycles), as we expect.
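Writing out ten copies of each instruction by hand gets tedious once you start changing the counts. One possible alternative, sketched below, is to let the GNU assembler repeat the instruction with its <code>.rept</code> / <code>.endr</code> directives. The macro names <code>REPEAT_ASM_TEXT</code> and <code>REPEAT_ASM</code>, and the functions <code>add_then_mul</code> and <code>check_repeat_text</code>, are hypothetical and not part of the ChipWhisperer firmware. The sketch also assumes a GCC-compatible toolchain, and it adds an <code>r0</code> clobber that the listing above does not have. On a non-Arm host the macro expands to nothing so the file still compiles.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical macro: assembler text that repeats `insn` `count` times. */
#define REPEAT_ASM_TEXT(count, insn) \
    ".rept " #count "\n\t" insn "\n\t" ".endr" "\n\t"

#if defined(__arm__)
/* On Arm targets, emit the repeated instruction; r0 is overwritten. */
#define REPEAT_ASM(count, insn) \
    __asm__ volatile(REPEAT_ASM_TEXT(count, insn) ::: "r0")
#else
/* On other hosts, do nothing so the sketch can still be compiled. */
#define REPEAT_ASM(count, insn) ((void)0)
#endif

/* Ten add.w instructions followed by ten mul instructions. In the
 * firmware this would sit between trigger_high() and trigger_low(). */
void add_then_mul(void)
{
    REPEAT_ASM(10, "add.w r0, r0");
    REPEAT_ASM(10, "mul r0, r0");
}

/* Host-side sanity check that the macro builds the expected text. */
void check_repeat_text(void)
{
    assert(strcmp(REPEAT_ASM_TEXT(2, "nop"),
                  ".rept 2\n\tnop\n\t.endr\n\t") == 0);
}
```

With this approach, changing the number of instructions in a block is a one-character edit, which makes it quicker to run the "change the count and see what moves" experiments in this tutorial. The explicit listing has the advantage of showing exactly what is executed, so use whichever you find clearer.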
Next, let's insert some <code>udiv</code> instructions. From the Cortex M4 Instruction Set Summary, we can see that <code>udiv</code> (unsigned divide) instructions take between 2 and 12 cycles to complete (effectively depending on how big the numbers we're dividing are). We'll be dividing <code>r0</code> by <code>r0</code>, meaning we expect that every instruction after the first should take 2 cycles. It should have higher power consumption too, since dividing is typically a fairly complex operation:<syntaxhighlight lang="c">
trigger_high();
asm volatile(
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
"add.w r0, r0" "\n\t"
::
);
asm volatile(
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
"mul r0, r0" "\n\t"
::
);
asm volatile(
"udiv r0, r0" "\n\t"
"udiv r0, r0" "\n\t"
"udiv r0, r0" "\n\t"
"udiv r0, r0" "\n\t"
"udiv r0, r0" "\n\t"
"udiv r0, r0" "\n\t"
"udiv r0, r0" "\n\t"
"udiv r0, r0" "\n\t"
"udiv r0, r0" "\n\t"
"udiv r0, r0" "\n\t"
::
);
trigger_low();
</syntaxhighlight>Capture another trace and you should get something like:
[[File:B2 STM Addmuldiv.PNG|frameless|1155x1155px]]
As we expected, we can see periods of high power consumption measuring about 80 samples in total right after the <code>add.w</code> and <code>mul</code> instructions. Interestingly, the <code>udiv</code> instructions seem to be split into 2 sets of operations. As a final check, we can add some more <code>mul</code> instructions and see the <code>udiv</code> instructions move down (and also break into more sections):
[[File:B2 STM Addmulmuldiv.PNG|frameless|1155x1155px]]
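The roughly 80-sample <code>udiv</code> window seen in these captures can be checked against the documented latency range. Since <code>udiv</code> takes between 2 and 12 cycles, ten of them can span a range of sample counts rather than a single value. The sketch below computes both bounds. The struct and function names are hypothetical, and the 4 samples per clock cycle is the same capture ratio used throughout this tutorial.

```c
#include <assert.h>

/* Hypothetical type: inclusive bounds on the number of captured samples. */
struct sample_range {
    unsigned min_samples;
    unsigned max_samples;
};

/*
 * Hypothetical helper (illustration only): bounds on the samples spanned
 * by `count` instructions whose latency lies between `min_cycles` and
 * `max_cycles` device clock cycles each.
 */
struct sample_range latency_to_samples(unsigned count, unsigned min_cycles,
                                       unsigned max_cycles,
                                       unsigned samples_per_cycle)
{
    struct sample_range r;
    assert(min_cycles <= max_cycles);
    r.min_samples = count * min_cycles * samples_per_cycle;
    r.max_samples = count * max_cycles * samples_per_cycle;
    return r;
}
```

For ten <code>udiv</code> instructions, <code>latency_to_samples(10, 2, 12, 4)</code> gives a lower bound of 80 samples and an upper bound of 480. The observed window sits at the lower bound, which fits the expectation that dividing <code>r0</code> by itself is close to the fastest case.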
== Clock Phase Adjustment ==
== Conclusion ==
In this tutorial you have learned how power analysis can tell you the operations being performed on a microcontroller. In future work we will move towards using this for breaking various forms of security on devices. In particular, [[Tutorial B3-1 Timing Analysis with Power for Password Bypass]] will examine how we can use this information to exploit a password check.
== Links ==