Reference:
http://www.cs.virginia.edu/~evans/cs216/guides/x86.html
http://x86.renejeschke.de/
http://programminggroundup.blogspot.com/
registers:
General-purpose registers are where the main action happens. Addition, subtraction, multiplication,
comparisons, and other operations generally use general-purpose registers for
processing.
On x86 processors, there are several general-purpose registers:
- %eax // System call number or Return value
- %ebx
- %ecx
- %edx
- %edi
- %esi
Accessing part of a register (using %eax as the example):
- %eax # the full 32-bit register
- %ax # least-significant half (low 16 bits) of %eax
- %al # least-significant byte of %ax
- %ah # most-significant byte of %ax
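As a rough illustration of the sub-register names above, here is a minimal sketch that loads a 32-bit value and hands one of its bytes back as the exit status. The file name subreg.s and the build commands in the comments are assumptions (a 64-bit Linux host with GNU as and ld), not something from the original notes.

```asm
# subreg.s (hypothetical file name)
# Assumed build on a 64-bit host:
#   as --32 subreg.s -o subreg.o
#   ld -m elf_i386 subreg.o -o subreg
.section .text
.globl _start
_start:
    movl $0x12345678, %eax   # %eax = 0x12345678
                             # %ax  = 0x5678 (low 16 bits of %eax)
                             # %al  = 0x78   (low byte of %ax)
                             # %ah  = 0x56   (high byte of %ax)
    movl $0, %ebx            # clear the whole status register first
    movb %ah, %bl            # copy only the byte 0x56 into the low byte of %ebx
    movl $1, %eax            # exit system call number
    int $0x80                # exit status is 0x56, so echo $? prints 86
```

Using movb with %bl only touches one byte, which is why %ebx is cleared first.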
special-purpose registers, including:
- %ebp # base pointer register, used for accessing function parameters and local variables.
- %esp # stack pointer register, always contains a pointer to the current top of the stack, wherever it is.
- %eip # instruction pointer, holds the address of the next instruction to execute.
- %eflags
Jump:
- je #Jump if the values were equal
- jg #Jump if the second value was greater than the first value
- jge #Jump if the second value was greater than or equal to the first value
- jl #Jump if the second value was less than the first value
- jle #Jump if the second value was less than or equal to the first value
- jmp #Jump no matter what. This does not need to be preceded by a comparison.
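The wording "second value" and "first value" above refers to the operand order of cmpl. A small sketch, assuming the constants 7 and 9 purely for illustration, that leaves the larger of two numbers in %ebx:

```asm
# Assumed build on a 64-bit host:
#   as --32 jump.s -o jump.o && ld -m elf_i386 jump.o -o jump
.section .text
.globl _start
_start:
    movl $7, %eax        # first value
    movl $9, %ebx        # second value
    cmpl %eax, %ebx      # compare: first operand %eax, second operand %ebx
    jge done             # second (9) >= first (7), so the jump is taken
    movl %eax, %ebx      # only reached when %eax holds the larger value
done:
    movl $1, %eax        # exit system call number
    int $0x80            # exit status is the larger value, here 9
```

Swapping the two constants makes the jge fall through, and the movl then copies the larger value into %ebx.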
There are several different types of memory locations:
- .byte
- .int
- .long
- .ascii
Characters each take up one storage location (they are converted into bytes internally). In a string such as "Hello there\0", the last character is represented by \0, which is the terminating character.
Data Accessing Methods / Addressing Modes:
The general form of memory address references:
# All of the fields are optional.
# All of the addressing modes can be represented in this fashion except immediate-mode
ADDRESS_OR_OFFSET (%BASE_OR_OFFSET, %INDEX, MULTIPLIER)
(If any field is left out, it is substituted with zero in the equation, except MULTIPLIER, which defaults to 1.)
FINAL ADDRESS = ADDRESS_OR_OFFSET + %BASE_OR_OFFSET + MULTIPLIER * %INDEX
ADDRESS_OR_OFFSET (const)
MULTIPLIER (const)
BASE_OR_OFFSET (register)
INDEX (register)
Modes:
Every mode except immediate mode can be used as either the source or destination operand.
immediate mode (access a constant value)
Immediate mode can only be a source operand.
The data to access is embedded in the instruction itself.
Immediate mode is used to load direct values into registers or memory locations.
Example:
movl $12, %eax # move number 12 into %eax
Notice that to indicate immediate mode, we used a dollar sign in front of the
number.
If we did not, it would be direct addressing mode, in which case the
value located at memory location 12 would be loaded into %eax rather than
the number 12 itself.
register addressing mode (access register)
contains a register to access.
Register mode simply moves data in or out of a register.
direct addressing mode (access memory)
contains the memory address to access.
Done by only using the ADDRESS_OR_OFFSET portion.
Example:
movl ADDRESS, %eax
This loads %eax with the value at memory address ADDRESS.
indexed addressing mode (access memory)
contains a memory address to access, and also specifies an index register to offset that address.
On x86 processors, you can also specify a multiplier for the index.
Done by using the ADDRESS_OR_OFFSET and the %INDEX portion.
Can use any general-purpose register as the index register.
Can also have a constant multiplier of 1, 2, 4, or 8 for the index register.
Example:
movl string_start(,%ecx,1), %eax
This starts at string_start, adds 1 * %ecx to that address, and loads the value at the resulting address into %eax.
indirect addressing mode (access data)
contains a register that contains a pointer to where the data should be accessed.
Indirect addressing mode loads a value from the address indicated by a
register.
Example:
movl (%eax), %ebx
base pointer addressing mode (access data)
similar to indirect addressing, but you also include a number called the offset to add to the register's value before using it for lookup.
Base-pointer addressing is similar to indirect addressing, except that it adds a
constant value to the address in the register.
Example:
movl 4(%eax), %ebx
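To see the modes side by side, the following sketch uses each of them once on a small table. The label values and its contents are made up for this example, and the build line is an assumption about the toolchain.

```asm
# Assumed build on a 64-bit host:
#   as --32 modes.s -o modes.o && ld -m elf_i386 modes.o -o modes
.section .data
values:                            # hypothetical data for the example
    .long 10, 20, 30, 40

.section .text
.globl _start
_start:
    movl $2, %edi                  # immediate mode: the constant 2
    movl values, %eax              # direct mode: %eax = 10
    movl values(,%edi,4), %ebx     # indexed mode: values + 2*4, %ebx = 30
    movl $values, %ecx             # immediate mode: the address of values
    movl (%ecx), %edx              # indirect mode: %edx = 10
    addl 4(%ecx), %edx             # base pointer mode: adds 20, %edx = 30
    addl %eax, %ebx                # register mode: %ebx = 30 + 10 = 40
    addl %edx, %ebx                # register mode: %ebx = 40 + 30 = 70
    movl $1, %eax                  # exit system call number
    int $0x80                      # exit status is 70
```

Note the difference between values (the contents of memory at that address) and $values (the address itself).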
Code:
.section .data

.section .text
.globl _start
_start:
    movl $1, %eax   # this is the linux kernel command number (system call) for exiting a program
    movl $0, %ebx   # this is the status number we will return to the operating system.
                    # Change this around and it will return different things to echo $?
    int $0x80       # this wakes up the kernel to run the exit command
https://stackoverflow.com/questions/14361248/whats-the-difference-of-section-and-segment-in-elf-file-format
Compiled from right to left:
cmd:
# assembler
as exit.s -o exit.o
# linker
ld exit.o -o exit
Code dissect:
.section .data
Anything starting with a period isn't directly translated into a machine instruction.
Instead, it's an instruction to the assembler itself. These are called assembler
directives or pseudo-operations because they are handled by the assembler and are
not actually run by the computer.
The .section command breaks your program up into sections.
This command starts the data section, where you list any memory
storage you will need for data.
.section .text
Starts the text section. The text section of a program is where the program
instructions live.
.globl _start
_start is a symbol, which means that it is going to be replaced by something else either
during assembly or linking.
Symbols are generally used to mark locations of programs or data,
so you can refer to them by name instead of by their location number.
Symbols are used so that the assembler and linker can take care of keeping track of addresses.
.globl means that the assembler shouldn't discard this symbol after assembly,
because the linker will need it.
_start is a special symbol that always needs to be
marked with .globl because it marks the location of the start of the program.
_start:
defines the value of the _start label.
A label is a symbol followed by a colon.
Labels define a symbol's value. When the assembler is assembling the program, it
has to assign each data value and instruction an address.
Labels tell the assembler to make the symbol's value be wherever the next
instruction or data element will be.
This way, if the actual physical location of the data or instruction changes, you
don't have to rewrite any references to it - the symbol automatically gets the new
value.
movl $1, %eax
Assign %eax with value 1.
src: $1 ($ means using immediate mode. Without the dollar-sign it would do direct addressing,
loading whatever number is at address 1.)
dest: register
We move the number 1 into %eax because we are preparing to call the Linux kernel.
The number 1 is the number of the exit system call.
The system call number always has to be loaded into %eax.
Depending on the system call, other registers may have to have values in them as well
(for exit, the status code goes in %ebx).
int $0x80
int stands for interrupt.
Code:
.section .data
data_items:                         # These are the data items
    .long 3,67,34,222,45,75,54,34,44,33,22,11,66,0

.section .text
.globl _start
_start:
    movl $0, %edi                   # move 0 into the index register
    movl data_items(,%edi,4), %eax  # load the first item of data
    movl %eax, %ebx                 # since this is the first item, %eax is
                                    # the biggest

start_loop:                         # start loop
    cmpl $0, %eax                   # check to see if we've hit the end
    je loop_exit
    incl %edi                       # load next value
    movl data_items(,%edi,4), %eax
    cmpl %ebx, %eax                 # compare values
    jle start_loop                  # jump to loop beginning if the new
                                    # one isn't bigger
    movl %eax, %ebx                 # move the value as the largest
    jmp start_loop                  # jump to loop beginning

loop_exit:
    # %ebx is the status code for the exit system call
    # and it already has the maximum number
    movl $1, %eax                   # 1 is the exit() syscall
    int $0x80
data_items: #These are the data items
.long 3,67,34,222,45,75,54,34,44,33,22,11,66,0
data_items is a label that refers to the location of the first number, so
movl data_items, %eax
would move the value 3 into %eax.
There is no .globl declaration for data_items because we only refer to these locations within the program.
No other file or program needs to know where they are located.
It's not an error to write .globl data_items, it's just not necessary.
A variable is a dedicated storage location used for a specific purpose, usually
given a distinct name by the programmer.
movl $0, %edi
Since we are using %edi as our index, and we want to start looking at the first
item, we load %edi with 0.
The l in movl stands for move long.
movl data_items(,%edi,4), %eax
This has the form movl BEGINNINGADDRESS(,%INDEXREGISTER,WORDSIZE), %eax, which is indexed addressing mode.
movl %eax, %ebx
Even though movl stands for move, it actually copies the value,
so %eax and %ebx both contain the starting value.
cmpl $0, %eax
The %eflags register stores the result of the comparison.
je loop_exit
Jumps to loop_exit if the comparison recorded in %eflags found the values equal.
incl %edi
incl increments the value of %edi by one.
Function:
- function name
- function parameters
- local variables
- Static vs. global variables:
The only difference between global and static
variables is that static variables are only used by one function,
while global variables are used by many functions.
Assembly language treats them exactly the same,
although most other languages distinguish them.
- global variables
- return address
The return address is a parameter which tells the function where to resume executing after the function is completed.
This is needed because functions can be called to do processing from many different parts of your program, and the function needs to be able to get back to wherever it was called from.
In most programming languages, this parameter is passed automatically when the function is called.
In assembly language, the call instruction handles passing the return address for you,
and ret handles using that address to return back to where you called the function from.
Calling convention:
The way that the variables are stored and the parameters and return values are transferred by the computer.
In the C language calling convention, the stack is the key element for
implementing a function's local variables, parameters, and return address.
How is it implemented for return large data structure in assembly?
cdecl (c calling convention)
Application_binary_interface
stack:
pushl #pushes either a register or memory value onto the top of the stack.
popl #pops values off the top.
%esp #stack register, current "top" of the stack.
Every time we push something onto the stack with pushl, 4 is subtracted from %esp
so that it points to the new top of the stack (the stack grows toward lower addresses).
The popl instruction adds 4 to %esp and puts the previous top value in whatever register you specified.
movl (%esp), %eax # put stack top's value into %eax
movl %esp, %eax #put stack top's address into %eax
movl 4(%esp), %eax #base pointer addressing mode. adds 4 to %esp before looking up the value being pointed to.
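A minimal sketch of these stack operations, with the constants 5 and 7 chosen only for illustration. It peeks at the stack through %esp before popping, so the two ways of reading the stack can be compared.

```asm
# Assumed build on a 64-bit host:
#   as --32 stack.s -o stack.o && ld -m elf_i386 stack.o -o stack
.section .text
.globl _start
_start:
    pushl $5             # %esp decreases by 4, top of stack holds 5
    pushl $7             # %esp decreases by 4, top of stack holds 7
    movl (%esp), %eax    # peek at the top without popping: %eax = 7
    movl 4(%esp), %ebx   # peek one slot below the top: %ebx = 5
    popl %ecx            # %ecx = 7, %esp increases by 4
    popl %edx            # %edx = 5, %esp is back where it started
    subl %ebx, %eax      # %eax = 7 - 5 = 2
    movl %eax, %ebx      # exit status goes in %ebx
    movl $1, %eax        # exit system call number
    int $0x80            # exit status is 2
```

The value pushed last is the one at (%esp), and older values sit at positive offsets from %esp.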
Function call process:
Before executing a function, a program
- pushes all of the parameters for the function onto the stack in the reverse order that they are documented.
- issues a call instruction indicating which function it wishes to start.
The call instruction does two things.
- First it pushes the address of the next instruction, which is the return address, onto the stack. Later, ret pops this address into %eip to supply the next instruction to execute.
- Then it modifies the instruction pointer (%eip) to point to the start of the function.
At the time the function starts, the stack looks like this (the top of the stack is at the bottom of the diagram):
Parameter #N
...
Parameter 2
Parameter 1
Return Address <--- (%esp)
Each of the parameters of the function has been pushed onto the stack, and
finally the return address is there.
Now the function itself has some work to do. First, it
- saves the current base pointer register, %ebp, by doing
pushl %ebp
- copies the stack pointer to %ebp by doing
movl %esp, %ebp
This allows us to access the function parameters
as fixed offsets from the base pointer.
Parameter #N <--- N*4+4(%ebp)
...
Parameter 2 <---12(%ebp)
Parameter 1 <--- 8(%ebp)
Return Address <--- 4(%ebp)
Old %ebp <--- (%esp) and (%ebp)
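- Local variables: the function reserves space by subtracting from %esp, e.g. for two 4-byte local variables:
subl $8, %esp
The stack now looks like this:
...
Parameter 2 <--- 12(%ebp)
Parameter 1 <--- 8(%ebp)
Return Address <--- 4(%ebp)
Old %ebp <--- (%ebp)
Local Variable 1 <--- -4(%ebp)
Local Variable 2 <--- -8(%ebp) and (%esp)
Note the Old %ebp slot: the original %ebp is saved here and is restored into %ebp by popl at the end. During the function call, %ebp is a snapshot of the value %esp had right after the old %ebp was pushed.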
When a function is done executing, it does three things:
- Stores its return value in %eax
- Resets the stack to what it was when it was called
- Returns control back to wherever it was called from by using the ret instruction.
ret pops whatever value is at the top of the stack,
and sets the instruction pointer, %eip, to that value.
Before a function returns control to the code that called it,
it must restore the previous stack frame.
i.e:
From:
Parameter #N <--- N*4+4(%ebp)
...
Parameter 2 <--- 12(%ebp)
Parameter 1 <--- 8(%ebp)
Return Address <--- 4(%ebp)
Old %ebp <--- (%ebp) (note this slot: it serves as the anchor)
Local Variable 1 <--- -4(%ebp)
Local Variable 2 <--- -8(%ebp) and (%esp)
To:
Parameter #N <--- N*4+4(%ebp)
...
Parameter 2 <---12(%ebp)
Parameter 1 <--- 8(%ebp)
Return Address <--- 4(%ebp)
Old %ebp <--- (%esp) and (%ebp)
Without doing this, ret wouldn't work,
because in our current stack frame, the return address is not at the top of the stack.
Thus we do:
movl %ebp, %esp
popl %ebp
ret
Now we can examine %eax for the return value.
The calling code also needs to pop off all of the parameters it pushed onto the stack in order to get the stack pointer back where it was
(it can simply add 4 * number of parameters to %esp using the addl
instruction, if we don't need the values of the parameters anymore).
After a function call, the only register that is guaranteed to be left with the value it started with is %ebp.
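The whole sequence (push parameters, call, set up the frame, reserve a local, return, clean up) can be sketched with a deliberately tiny function. The name add_two and its behavior are hypothetical and exist only to show the frame layout described above.

```asm
# Assumed build on a 64-bit host:
#   as --32 frame.s -o frame.o && ld -m elf_i386 frame.o -o frame
.section .text
.globl _start
_start:
    pushl $4              # second parameter (pushed first: reverse order)
    pushl $3              # first parameter
    call add_two          # pushes the return address, then jumps
    addl $8, %esp         # caller removes its two 4-byte parameters
    movl %eax, %ebx       # return value becomes the exit status
    movl $1, %eax         # exit system call number
    int $0x80             # exit status is 3 + 4 = 7

.type add_two, @function
add_two:                  # hypothetical function: returns the sum of its two parameters
    pushl %ebp            # save the caller's base pointer
    movl %esp, %ebp       # %ebp now anchors this stack frame
    subl $4, %esp         # reserve one 4-byte local variable
    movl 8(%ebp), %eax    # first parameter
    addl 12(%ebp), %eax   # add the second parameter
    movl %eax, -4(%ebp)   # store the result in the local variable
    movl -4(%ebp), %eax   # return value goes in %eax
    movl %ebp, %esp       # discard the local variable
    popl %ebp             # restore the caller's base pointer
    ret                   # pop the return address into %eip
```

The local variable is not strictly needed here; it is included so that the -4(%ebp) slot from the diagrams appears in real code.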
Code:
.section .data

.section .text
.globl _start
_start:
    pushl $3                  #push second argument
    pushl $2                  #push first argument
    call power                #call the function
    addl $8, %esp             #move the stack pointer back
    pushl %eax                #save the first answer before
                              #calling the next function
    pushl $2                  #push second argument
    pushl $5                  #push first argument
    call power                #call the function
    addl $8, %esp             #move the stack pointer back
    popl %ebx                 #The second answer is already
                              #in %eax. We saved the
                              #first answer onto the stack,
                              #so now we can just pop it
                              #out into %ebx
    addl %eax, %ebx           #add them together
                              #the result is in %ebx
    movl $1, %eax             #exit (%ebx is returned)
    int $0x80

.type power, @function
power:
    pushl %ebp                #save old base pointer
    movl %esp, %ebp           #make stack pointer the base pointer
    subl $4, %esp             #get room for our local storage
    movl 8(%ebp), %ebx        #put first argument in %ebx
    movl 12(%ebp), %ecx       #put second argument in %ecx
    movl %ebx, -4(%ebp)       #store current result

power_loop_start:
    cmpl $1, %ecx             #if the power is 1, we are done
    je end_power
    movl -4(%ebp), %eax       #move the current result into %eax
    imull %ebx, %eax          #multiply the current result by
                              #the base number
    movl %eax, -4(%ebp)       #store the current result
    decl %ecx                 #decrease the power
    jmp power_loop_start      #run for the next power

end_power:
    movl -4(%ebp), %eax       #return value goes in %eax
    movl %ebp, %esp           #restore the stack pointer
    popl %ebp                 #restore the base pointer
    ret
.equ #Assign names to numbers.
.equ LINUX_SYSCALL, 0x80 # then
int $LINUX_SYSCALL
# not using lib
.include "linux.s"

.section .data
helloworld:
    .ascii "hello world\n"
helloworld_end:
    .equ helloworld_len, helloworld_end - helloworld

.section .text
.globl _start
_start:
    movl $STDOUT, %ebx
    movl $helloworld, %ecx
    movl $helloworld_len, %edx
    movl $SYS_WRITE, %eax
    int $LINUX_SYSCALL
    movl $0, %ebx
    movl $SYS_EXIT, %eax
    int $LINUX_SYSCALL

# use a lib
.section .data
helloworld:
    .ascii "hello world\n\0"

.section .text
.globl _start
_start:
    pushl $helloworld
    call printf     # referred to by name within the program.
    pushl $0
    call exit       # referred to by name within the program.
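The first program includes a file linux.s that is not shown in these notes. The following is only a guess at a minimal version of it, built with .equ and covering just the four names the program uses. The values are the usual ones for 32-bit Linux system calls made through int $0x80, but the file layout itself is an assumption.

```asm
# linux.s (assumed contents; the original file is not shown in these notes)
# Only the names used by the hello world program are defined here.

# System call numbers for 32-bit Linux
.equ SYS_EXIT, 1
.equ SYS_WRITE, 4

# Standard file descriptor for output
.equ STDOUT, 1

# Interrupt number used to enter the kernel
.equ LINUX_SYSCALL, 0x80
```

Since .equ only defines names for the assembler, including this file adds no bytes to the program.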
cmd:
for not using lib:
as helloworld-nolib.s -o helloworld-nolib.o
ld helloworld-nolib.o -o helloworld-nolib
for using lib:
as helloworld-lib.s -o helloworld-lib.o
ld -dynamic-linker /lib/ld-linux.so.2 -o helloworld-lib helloworld-lib.o -lc
-dynamic-linker /lib/ld-linux.so.2
# Before executing elf, the operating system will load the program /lib/ld-linux.so.2
#to load in external libraries and link them with the program.
#This program is known as a dynamic linker.
-lc #link to the c library
ldd ./helloworld-nolib # not a dynamic executable.
ldd ./helloworld-lib
# libc.so.6 => /lib/libc.so.6 (0x4001d000)
# /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x400000000)
build shared lib:
as write-record.s -o write-record.o
as read-record.s -o read-record.o
ld -shared write-record.o read-record.o -o librecord.so
as write-records.s -o write-records.o
ld -L . -dynamic-linker /lib/ld-linux.so.2 -o write-records -lrecord write-records.o
Linking:
During static linking, names like 'printf' and 'exit' are resolved to memory addresses,
and the names are thrown away.
With dynamic linking, the name itself resides within the executable, and is resolved by
the dynamic linker when the program is run.
When the program is run by the user, the dynamic linker(i.e /lib/ld-linux.so.2)
loads the shared libraries listed in our link statement,
and then finds all of the function and variable names that
were named by our program but not found at link time,
and matches them up with corresponding entries in the shared libraries it loads.
It then replaces all of the names with the addresses which they are loaded at.
This sounds time-consuming, but it only happens once - at program startup time.
Endianness:
x86 is a little-endian processor: it stores the least-significant byte of a word first, so in memory the bytes appear in the reverse of the order in which they are written out as a number.
This difference is not normally a problem. Because the bytes are reversed again (or not, if it is a big-endian processor) when being read back into a register, the programmer usually never notices what order the bytes are in.
The byte-switching magic happens automatically behind the scenes during register-to-memory transfers.
However, the byte order can cause problems in several instances:
- If you read in several bytes at a time using movl but deal with them on a
byte-by-byte basis using the least-significant byte
(i.e. by using %al and/or shifting of the register),
they will be in a different order than they appear in memory.
- If you read or write files written for different architectures,
you may have to account for whatever order they write their bytes in.
- If you read or write to network sockets, you may have to account for a different
byte order in the protocol.
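One way to observe the byte order directly is to store a 4-byte value and then read back only the byte at its lowest address. This is a small sketch; the label value and the constant 0x11223344 are chosen just for the demonstration.

```asm
# Assumed build on a 64-bit host:
#   as --32 endian.s -o endian.o && ld -m elf_i386 endian.o -o endian
.section .data
value:                   # hypothetical label for the example
    .long 0x11223344     # stored in memory as the bytes 44 33 22 11

.section .text
.globl _start
_start:
    movl $0, %ebx        # clear the status register
    movb value, %bl      # read only the byte at the lowest address: 0x44
    movl $1, %eax        # exit system call number
    int $0x80            # exit status is 0x44, so echo $? prints 68
```

On a big-endian machine the byte at the lowest address would be 0x11 instead, which is the kind of mismatch the file and network cases above have to account for.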