Prerequisite
This article assumes that the reader has installed MASM32. If you have not, it is available from http://www.masm32.com/.
Introduction
In the last article, you saw how to set up Visual Studio to compile an Assembler file with the Microsoft Assembler. In this article, I will begin to describe the language itself, and some of the instructions that it contains.
Variables
There are no variables in Assembler, or at least not in the C++ sense. In Assembler you have registers and memory addresses. Bear in mind that you are now speaking the same language as the processor.
For instance, the processor doesn’t know that you have an integer called ‘nMyInteger’. It doesn’t know you have a class called “CMyClass”. All it knows about is the ‘registers’ and that it can access memory given an address.
So what are registers? And how do I access memory?
Registers
Put simply, a register is like a variable, but just for the processor’s use. When I said that there are no variables in the sense of higher level languages, I meant what I said. There are only a set number of registers that exist on the processor chip. Think of it this way: A register is a hard-coded variable for the processor; it exists physically on the chip.
These registers can represent numbers the same size as the ‘bit’ count of the processor. In other words, in a 32-bit processor these numbers are 32 bits in size. In C++ terms, they are DWORDs.
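The processor does not mark a register as signed or unsigned; it simply holds 32 bits, which are treated here as an unsigned number. Negative numbers (if required) are represented in two’s complement form, as 0x100000000 + (negative number). So, -1 would be represented by 0xFFFFFFFF, -2 by 0xFFFFFFFE, and so forth.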
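To see the same rule from the C++ side, here is a minimal sketch. The function names register_bits and register_bits_by_rule are my own, made up for illustration; they are not part of MASM32 or of any library.

```cpp
#include <cassert>
#include <cstdint>

// Bit pattern a 32-bit register would hold for a signed value.
// Conversion to an unsigned type is defined modulo 2^32, which for negative
// inputs is the same as the "0x100000000 + (negative number)" rule.
std::uint32_t register_bits(std::int32_t value) {
    return static_cast<std::uint32_t>(value);
}

// The rule from the text written out literally, using 64-bit arithmetic so
// that 0x100000000 fits. Only meaningful for negative inputs.
std::uint32_t register_bits_by_rule(std::int32_t negative) {
    assert(negative < 0);
    return static_cast<std::uint32_t>(0x100000000LL + negative);
}
```

Both functions should agree for every negative input; the second one exists only to make the arithmetic in the text visible.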
There are quite a few registers in modern Intel processors, but there are only six you should be using in your applications:
eax - Accumulator Register
ebx - Base Register
ecx - Counter Register
edx - Data Register
esi - Source (for memory operations) register
edi - Destination (for memory operations) register
The registers eax, ebx, ecx, and edx can be split into their constituent bytes by changing the way that they are referred to. For instance, for the accumulator (in other words, the a register):
al  : First (lower) byte of the low word in the eax register
ah  : Second (higher) byte of the low word in the eax register
ax  : Lower word (i.e. 2-bytes) of the eax register ; (i.e. (ah << 8) + al)
eax : The whole register (4-bytes)
The same naming convention goes for ebx, ecx, and edx (not esi or edi) as shown in the following image.
The names come from the origins of the processor. The ‘e’ notation means ‘extended’ register; in other words, the 32-bit flavours of each of the registers when 16-bit processors gave way to 32-bit processors. 16-bit processors only had ax, bx, cx, dx, and so forth; so, when the 32-bit processors came along, the extra 16 bits available were denoted by a preceding ‘e’. There is no way of accessing the top 16 bits of the registers directly.
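If you think of eax as a C++ unsigned 32-bit value, the sub-register names are just fixed masks and shifts. The following is a rough sketch of that idea; the helper functions are named after the register parts purely for illustration, and upper_word is a hypothetical helper showing that the top 16 bits can only be reached by shifting.

```cpp
#include <cassert>
#include <cstdint>

// Read-only views of a 32-bit value, named after the parts of eax.
std::uint8_t  al(std::uint32_t eax) { return static_cast<std::uint8_t>(eax & 0xFFu); }
std::uint8_t  ah(std::uint32_t eax) { return static_cast<std::uint8_t>((eax >> 8) & 0xFFu); }
std::uint16_t ax(std::uint32_t eax) { return static_cast<std::uint16_t>(eax & 0xFFFFu); }

// The top 16 bits have no register name of their own, so they have to be
// shifted down before they can be read.
std::uint16_t upper_word(std::uint32_t eax) { return static_cast<std::uint16_t>(eax >> 16); }

// Usage: split 0x12345678 into its parts.
void sub_register_example() {
    const std::uint32_t eax = 0x12345678u;
    assert(al(eax) == 0x78);
    assert(ah(eax) == 0x56);
    assert(ax(eax) == 0x5678);
    assert(ax(eax) == (ah(eax) << 8) + al(eax));
    assert(upper_word(eax) == 0x1234);
}
```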
The Mov (Move) Instruction
Now, start with the simplest instruction: the mov (move) instruction. The mov instruction is how you ‘move’ values about inside of the processor. For instance:
mov eax, 100
This ‘moves’ 100 into the eax register. It’s the same as saying eax=100. To define the move instruction, think of it as this:
mov (destination), (source)
The source and destination have to be the same size (in bits). Here are some examples of ‘mov’ instructions:
mov al, bl       ; move the lower byte of ebx into the lower byte
                 ; of eax
mov al, 0ffh     ; move 0xFF into the lower byte of eax
mov ah, 0ffh     ; move 0xFF into the high byte of the low word
                 ; (2-bytes) of eax
mov ax, 0ffffh   ; move 0xFFFF into the low word of eax
mov eax, 0ffffh  ; move 0xFFFF into eax
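One detail worth spelling out is that a mov into al, ah, or ax changes only those bits and leaves the rest of eax alone. Here is a small C++ model of that behaviour, offered as a sketch; the mov_al, mov_ah, and mov_ax functions are hypothetical names of my own and simply return the new value of eax.

```cpp
#include <cassert>
#include <cstdint>

// mov al, value : replaces bits 0-7 of eax and keeps bits 8-31.
std::uint32_t mov_al(std::uint32_t eax, std::uint8_t value) {
    return (eax & 0xFFFFFF00u) | value;
}

// mov ah, value : replaces bits 8-15 of eax and keeps everything else.
std::uint32_t mov_ah(std::uint32_t eax, std::uint8_t value) {
    return (eax & 0xFFFF00FFu) | (static_cast<std::uint32_t>(value) << 8);
}

// mov ax, value : replaces bits 0-15 of eax and keeps bits 16-31.
std::uint32_t mov_ax(std::uint32_t eax, std::uint16_t value) {
    return (eax & 0xFFFF0000u) | value;
}

// Usage: start from 0x12345678 and overwrite one part at a time.
void partial_mov_example() {
    const std::uint32_t eax = 0x12345678u;
    assert(mov_al(eax, 0xFF) == 0x123456FFu);
    assert(mov_ah(eax, 0xFF) == 0x1234FF78u);
    assert(mov_ax(eax, 0xFFFF) == 0x1234FFFFu);
}
```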
We can move the contents of memory into a register and vice-versa by using square brackets to indicate ‘contents of’. The number of bytes moved is determined by the register name:
mov al, [esi]    ; move the byte contained in the memory address
                 ; in register esi into the lower byte of eax
mov [edi], bl    ; move the byte value in the lowest byte of ebx
                 ; into the memory address in register edi
mov cx, [esi]    ; move the word (2-byte) value contained in the
                 ; memory address of register esi into the lower
                 ; word of ecx
mov [edi], edx   ; move the dword (4-byte) value contained in edx
                 ; into the memory address contained in register edi
You also can include an offset when using the ‘contents of’ (square brackets) operator:
mov al, [esi + 3]   ; move the byte contained in the memory address
                    ; in register esi + 3 into the lower byte of eax
mov [edi + 2], dx   ; move the lower word (2-bytes) contained in
                    ; edx into the memory address contained in the
                    ; register edi + 2
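For readers who prefer to think in C++, the square brackets behave much like dereferencing a byte pointer, with the offset added first. The sketch below models this with hypothetical helpers (load_byte, load_word, store_word). It also makes explicit one fact the listings above do not state: Intel processors store multi-byte values low byte first (little-endian).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// mov al, [esi + offset] : load one byte.
std::uint8_t load_byte(const std::uint8_t* esi, std::size_t offset) {
    assert(esi != nullptr);
    return esi[offset];
}

// mov cx, [esi + offset] : load two bytes, low byte first.
std::uint16_t load_word(const std::uint8_t* esi, std::size_t offset) {
    assert(esi != nullptr);
    return static_cast<std::uint16_t>(esi[offset] | (esi[offset + 1] << 8));
}

// mov [edi + offset], dx : store two bytes, low byte first.
void store_word(std::uint8_t* edi, std::size_t offset, std::uint16_t dx) {
    assert(edi != nullptr);
    edi[offset]     = static_cast<std::uint8_t>(dx & 0xFFu);
    edi[offset + 1] = static_cast<std::uint8_t>(dx >> 8);
}
```

The caller is responsible for making sure the buffer is large enough, exactly as in the assembler version, where nothing checks the address for you.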
Functions
A function is declared in the following form:
TestProc proc dwValue1:DWORD, wValue2:WORD, bValue3:BYTE
    ret
TestProc endp
The preceding code is an example of a blank function, but it shows the basics. The name of the function is given first, followed by proc. The parameters to the function are defined in the subsequent list in the form <name>:<type>. Some of the basic types available are DWORD, WORD, and BYTE.
The end of the function is marked by a line containing the name of the function followed by endp.
The ret statement is the return statement; in other words, it marks the places where the function is to be exited. A ret statement MUST be included at the end of the function.
If the code is called from C++, the registers ebx, esi, and edi must be restored to their original values before returning from the function. The usual way of doing this is by using push and pop, which will be covered later.
The return value of the function is in the eax register. The function parameters can be accessed by name in most instructions; for example:
TestProc proc dwValue1:DWORD, dwValue2:DWORD
    mov eax, dwValue1
    add eax, dwValue2
    ret
TestProc endp
This function adds dwValue1 to dwValue2 and returns the result.
To access functions in C++, you must declare a function with the same name and parameters. The size of each parameter in C++ must equal the size of the corresponding parameter defined in the Assembler code. The function must also be declared as extern "C" and use the stdcall calling convention. For example, the C++ declaration for the above assembler function is:
extern "C" unsigned int __stdcall TestProc(unsigned int dwValue1, unsigned int dwValue2);
If a pointer is to be passed in, it is declared as a DWORD parameter in the assembler function as pointers (in 32-bit operating systems) are 32 bits in size. Similarly, a ‘char’ would be passed as a BYTE, a ‘WCHAR’ as a WORD, and so on.
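To make the size matching concrete, here is a sketch that pins the C++ types to the sizes the assembler side expects. The aliases AsmDword, AsmWord, and AsmByte and the function TestProcModel are hypothetical names I have invented for this example; TestProcModel is a pure C++ stand-in for the assembler TestProc so that the behaviour can be checked without linking any assembler code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical aliases for this sketch. On Windows, <windows.h> defines the
// real DWORD, WORD and BYTE types.
using AsmDword = std::uint32_t;
using AsmWord  = std::uint16_t;
using AsmByte  = std::uint8_t;

static_assert(sizeof(AsmDword) == 4, "DWORD parameters are 4 bytes");
static_assert(sizeof(AsmWord)  == 2, "WORD parameters are 2 bytes");
static_assert(sizeof(AsmByte)  == 1, "BYTE parameters are 1 byte");
// A pointer only fits a DWORD in a 32-bit build, so no assertion is made
// about sizeof(void*) here.

// C++ stand-in for: mov eax, dwValue1 / add eax, dwValue2 / ret.
// Unsigned addition wraps modulo 2^32, as the 32-bit add instruction does.
AsmDword TestProcModel(AsmDword dwValue1, AsmDword dwValue2) {
    return dwValue1 + dwValue2;
}

// Usage: the second call overflows 32 bits and wraps to zero.
void test_proc_example() {
    assert(TestProcModel(2u, 3u) == 5u);
    assert(TestProcModel(0xFFFFFFFFu, 1u) == 0u);
}
```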
If an assembler function is designed to be exported from a static DLL, you do not need to define it; just include its name in the .def file of the DLL and then it can be used like any other C++ function declared this way.
Push, Pop, and the Stack
The processor maintains a stack in memory, onto which registers, constants, and the contents of memory can be pushed, and from which they can be popped, by using the push and pop instructions.
The stack is intended to overcome the small number of registers available. It gives an effective, quick way of saving and restoring the contents of registers.
The stack is a last-in-first-out (LIFO) structure. The push instruction adds a value to the top of the stack and the pop instruction removes the value from the top of the stack and places it in a register or memory address. For example:
TestFunction proc
    mov eax, 100
    push eax        ; Stack now contains { 100 }
    mov eax, 200
    push eax        ; Stack now contains { 200, 100 }
    mov eax, 300
    push eax        ; Stack now contains { 300, 200, 100 }
    pop eax         ; eax = 300, stack = { 200, 100 }
    pop eax         ; eax = 200, stack = { 100 }
    pop eax         ; eax = 100, stack = { }
    ret
TestFunction endp
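The same sequence can be modelled in C++ with a std::vector, which may help if the ordering is not yet intuitive. This is only a sketch: StackModel and run_test_function are hypothetical names, and the real stack lives in memory rather than in a container.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A toy model of the stack: the back of the vector is the top of the stack.
struct StackModel {
    std::vector<std::uint32_t> values;

    void push(std::uint32_t value) { values.push_back(value); }

    std::uint32_t pop() {
        assert(!values.empty());  // popping an empty stack is a bug
        const std::uint32_t top = values.back();
        values.pop_back();
        return top;
    }
};

// Mirrors TestFunction above and returns the values in the order eax
// receives them from the three pop instructions.
std::vector<std::uint32_t> run_test_function() {
    StackModel stack;
    stack.push(100);
    stack.push(200);
    stack.push(300);

    std::vector<std::uint32_t> popped;
    popped.push_back(stack.pop());
    popped.push_back(stack.pop());
    popped.push_back(stack.pop());
    assert(stack.values.empty());  // every push has been matched by a pop
    return popped;
}
```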
A typical use of the stack is to save the values of the registers ebx, esi, and edi on entry to a function and restore them before exiting. For example:
TestFunction proc
    push ebx
    push esi
    push edi
    ; code goes in here
    pop edi
    pop esi
    pop ebx
    ret
TestFunction endp
Obviously, only the values of the registers that are being used need to be saved, but this does demonstrate the use of the push and pop instructions.
An important point to note is that, when exiting a function, the stack should always be in the same state that it was in when entering a function. Another way of saying this is that for every push statement there needs to be a corresponding pop statement before the function returns.
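Matching every push with a pop is not quite enough on its own: the pops also have to come in the reverse order of the pushes, as they do in the function above. The following C++ sketch shows what happens otherwise. Registers and save_and_restore are hypothetical names for illustration, and the pop_in_reverse switch exists only so that both outcomes can be compared.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Registers {
    std::uint32_t ebx;
    std::uint32_t esi;
    std::uint32_t edi;
};

// Models: push ebx / push esi / push edi, a body that overwrites all three,
// then three pops. Returns the register values seen by the caller afterwards.
Registers save_and_restore(Registers regs, bool pop_in_reverse) {
    std::vector<std::uint32_t> stack;
    stack.push_back(regs.ebx);
    stack.push_back(regs.esi);
    stack.push_back(regs.edi);

    regs = Registers{0u, 0u, 0u};  // the function body clobbers the registers

    auto pop = [&stack]() {
        const std::uint32_t top = stack.back();
        stack.pop_back();
        return top;
    };

    if (pop_in_reverse) {
        regs.edi = pop();  // correct: last pushed, first popped
        regs.esi = pop();
        regs.ebx = pop();
    } else {
        regs.ebx = pop();  // wrong order: ebx receives the saved edi
        regs.esi = pop();
        regs.edi = pop();
    }
    assert(stack.empty());  // the stack is balanced in both cases
    return regs;
}
```

With the wrong order the stack is still balanced, so the function returns normally, but the caller finds ebx and edi swapped. That kind of bug tends to show up far away from its cause.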
Flags and the Instructions that Affect and Use Them
A flag is a setting in the processor that can either be true or false. The processor contains a set of flags to indicate end states after operations. There are a number of flags, but the one that I’m going to be dealing with in this article is the ‘Zero’ flag. This flag is set by certain operations to indicate that their result is zero. Other operations set this flag to indicate equality.
Consider the decrement (dec) instruction. This decrements the register or value specified. If the result is zero, the zero flag is set. For example:
TestFunction proc
    mov eax, 2
    dec eax         ; eax == 1
    dec eax         ; eax == 0, zero flag is set
    ret
TestFunction endp
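As a rough C++ picture of what dec does, imagine the register and the flag living side by side in a struct. CpuModel and dec_eax are hypothetical names for this sketch, and only the zero flag is modelled; the real instruction updates several other flags as well.

```cpp
#include <cassert>
#include <cstdint>

// A toy processor state: one register and the zero flag.
struct CpuModel {
    std::uint32_t eax;
    bool zero_flag;
};

// dec eax : subtract one, then record whether the result is zero.
// The flag is cleared again if the result is non-zero.
void dec_eax(CpuModel& cpu) {
    cpu.eax -= 1u;
    cpu.zero_flag = (cpu.eax == 0u);
}

// Usage: mirrors the two dec instructions in TestFunction above.
void dec_example() {
    CpuModel cpu{2u, false};
    dec_eax(cpu);
    assert(cpu.eax == 1u && !cpu.zero_flag);
    dec_eax(cpu);
    assert(cpu.eax == 0u && cpu.zero_flag);
}
```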
There are other operations that behave differently depending on the states of a particular flag. One of these operations is the jump. Its raw form is jmp, which jumps the program execution to a location in memory (usually specified with a label, the same way as a goto statement in C++). It has various forms, two of which are jz (jump if zero) and jnz (jump if not zero).
By using these instructions and the knowledge of the zero flag, you now can write loops:
LoopFunction proc
    xor eax, eax    ; efficient way of saying eax = 0
    mov ecx, 5      ; ecx is the register generally used for counters
LoopStart:          ; this is a label, used for labelling code positions
    inc eax
    dec ecx
    jnz LoopStart
    ; eax now equals 5
    ret
LoopFunction endp
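In C++ terms, this loop is a do-while rather than a for loop, because the test comes after the body. Here is an approximate equivalent; loop_function is a hypothetical name, and the count parameter is my own generalisation of the hard-coded 5.

```cpp
#include <cassert>
#include <cstdint>

// C++ counterpart of LoopFunction: counts eax up while ecx counts down.
std::uint32_t loop_function(std::uint32_t count) {
    // Starting at 0 would wrap ecx to 0xFFFFFFFF on the first decrement and
    // run the loop 2^32 times, in the assembler version too.
    assert(count != 0u);

    std::uint32_t eax = 0u;     // xor eax, eax
    std::uint32_t ecx = count;  // mov ecx, 5
    do {
        ++eax;                  // inc eax
        --ecx;                  // dec ecx (sets the zero flag when ecx hits 0)
    } while (ecx != 0u);        // jnz LoopStart
    return eax;
}
```

Because the body always runs at least once, a caller that might pass zero needs to check for it before entering the loop.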
Conclusion
I have covered some of the basic instructions involved in Assembler and demonstrated their use. I have also explained what a register is and which registers are available, and shown how to define functions with parameters in Assembler and write declarations in C++ for them.
In the next installment of this tutorial, I will cover arithmetic operations, and some of the macros that MASM provides to ease the development of Assembler code.