Nov 23, 2017

[assembly] note

Reference:
http://www.cs.virginia.edu/~evans/cs216/guides/x86.html
http://x86.renejeschke.de/
http://programminggroundup.blogspot.com/


registers:

general registers
    Where the main action happens. Addition, subtraction, multiplication,
comparisons, and other operations generally use general-purpose registers for
processing.

On x86 processors, there are several general-purpose registers:
  • %eax  // System call number or Return value
  • %ebx
  • %ecx
  • %edx
  • %edi
  • %esi


Working with parts of a register:

  • %eax (all 32 bits)
  • least-significant half (16 bits):
    • %ax
      • %al least-significant byte of %ax
      • %ah most-significant byte of %ax
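
The relationship between %eax, %ax, %ah, and %al can be sketched with bit masks. This is only a Python model of the register layout, not assembly; sub_registers is a hypothetical helper name.

```python
def sub_registers(eax):
    """Split a 32-bit %eax value into the parts named %ax, %ah, and %al."""
    eax &= 0xFFFFFFFF        # keep only 32 bits, like the real register
    ax = eax & 0xFFFF        # least-significant 16 bits of %eax
    al = ax & 0xFF           # least-significant byte of %ax
    ah = (ax >> 8) & 0xFF    # most-significant byte of %ax
    return ax, ah, al


ax, ah, al = sub_registers(0x12345678)
print(hex(ax), hex(ah), hex(al))  # 0x5678 0x56 0x78
```

Note that the upper 16 bits of %eax (0x1234 here) have no name of their own.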


special-purpose registers, including:
  • %ebp # base pointer register, used for accessing function parameters and local variables.
  • %esp # stack pointer register, always contains a pointer to the current top of the stack, wherever it is.
  • %eip # instruction pointer, holds the address of the next instruction to execute.
  • %eflags

Jump (the conditions refer to a preceding comparison such as cmpl FIRST, SECOND):

  • je  #Jump if the values were equal
  • jg  #Jump if the second value was greater than the first value
  • jge #Jump if the second value was greater than or equal to the first value
  • jl  #Jump if the second value was less than the first value
  • jle #Jump if the second value was less than or equal to the first value
  • jmp #Jump no matter what. This does not need to be preceded by a comparison.
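
The operand order is easy to get backwards, so here is a small Python sketch of the conditions listed above. It is a model only: jump_taken is a hypothetical helper, and it compares plain signed integers instead of reading %eflags.

```python
def jump_taken(mnemonic, first, second):
    """Model `cmpl first, second` followed by the given jump instruction."""
    conditions = {
        "je": second == first,
        "jg": second > first,
        "jge": second >= first,
        "jl": second < first,
        "jle": second <= first,
        "jmp": True,  # unconditional, the comparison is irrelevant
    }
    return conditions[mnemonic]


# cmpl $0, %eax ; je ...   with %eax = 0
print(jump_taken("je", 0, 0))     # True
# cmpl %ebx, %eax ; jle ... with %ebx = 67, %eax = 34
print(jump_taken("jle", 67, 34))  # True
```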

There are several different directives for reserving and initializing memory locations:
  • .byte
  • .int
  • .long
  • .ascii
    Characters each take up one storage location (they are converted into bytes internally).
    .ascii does not add a terminator on its own; to end the string with a null byte, write \0 as the last character.


Data Accessing Methods / Addressing Modes:

The general form of memory address references:

# All of the fields are optional. 
# All of the addressing modes can be represented in this fashion except immediate-mode 
ADDRESS_OR_OFFSET (%BASE_OR_OFFSET, %INDEX, MULTIPLIER)

(If a field is left out, it is treated as zero in the equation, except that a missing MULTIPLIER defaults to 1.)
FINAL ADDRESS = ADDRESS_OR_OFFSET + %BASE_OR_OFFSET + MULTIPLIER * %INDEX

ADDRESS_OR_OFFSET (const)
MULTIPLIER (const)
BASE_OR_OFFSET (register)
INDEX (register)
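
The address formula can be checked with a few lines of Python. This is a sketch under assumptions: effective_address is a hypothetical helper, and the addresses 0x1000 and 0x2000 are made up for illustration.

```python
def effective_address(offset=0, base=0, index=0, multiplier=1):
    """FINAL ADDRESS = ADDRESS_OR_OFFSET + %BASE_OR_OFFSET + MULTIPLIER * %INDEX"""
    if multiplier not in (1, 2, 4, 8):
        raise ValueError("x86 only allows a multiplier of 1, 2, 4, or 8")
    # Addresses wrap around at 32 bits on a 32-bit x86 processor.
    return (offset + base + multiplier * index) & 0xFFFFFFFF


# data_items(,%edi,4) with data_items assumed at 0x1000 and %edi = 3
print(hex(effective_address(offset=0x1000, index=3, multiplier=4)))  # 0x100c
# 4(%eax) with %eax = 0x2000
print(hex(effective_address(offset=4, base=0x2000)))                 # 0x2004
# (%eax) with %eax = 0x2000
print(hex(effective_address(base=0x2000)))                           # 0x2000
```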

Modes:

Every mode except immediate mode can be used as either the source or destination operand.
Immediate mode can only be a source operand.


immediate mode
    The data to access is embedded in the instruction itself.
    Immediate mode is used to load direct values into registers or memory locations.
    Example:
    movl $12, %eax   # move number 12 into %eax

    Notice that to indicate immediate mode, we used a dollar sign in front of the
    number.
    If we did not, it would be direct addressing mode, in which case the
    value located at memory location 12 would be loaded into %eax rather than
    the number 12 itself.


register addressing mode (access register)
    The instruction contains a register to access.
    Register mode simply moves data in or out of a register.
    Example:
    movl %eax, %ebx
 

direct addressing mode (access memory)
     contains the memory address to access.
     Done by only using the ADDRESS_OR_OFFSET portion.
     Example:
     movl ADDRESS, %eax
     This loads %eax with the value at memory address ADDRESS.
   

indexed addressing mode (access memory)
    The instruction contains a memory address to access, and also specifies an index register to offset that address.
    On x86 processors, you can also specify a multiplier for the index.
    Done by using the ADDRESS_OR_OFFSET and the %INDEX portion.
    Any general-purpose register can be used as the index register.
    The index register can also have a constant multiplier of 1, 2, 4, or 8.
    Example:
    movl string_start(,%ecx,1), %eax
    This starts at string_start, adds 1 * %ecx to that address, and loads the value at the resulting address into %eax.

 
indirect addressing mode (access memory)
     The instruction contains a register that holds a pointer to where the data should be accessed.
     Indirect addressing mode loads a value from the address indicated by a
     register.
     Example:
     movl (%eax), %ebx
     This loads %ebx with the value at the address held in %eax.

   
base pointer addressing mode (access memory)
     Similar to indirect addressing, except that a constant called the offset is added to the address in the register before it is used for the lookup.
     Example:
     movl 4(%eax), %ebx

Code:

.section .data

.section .text

.globl _start

_start:
movl $1, %eax # this is the linux kernel command number (system call) for exiting a program
movl $0, %ebx # this is the status number we will return to the operating system. 
              # Change this around and it will return different things to echo $?

int $0x80   # this wakes up the kernel to run the exit command

https://stackoverflow.com/questions/14361248/whats-the-difference-of-section-and-segment-in-elf-file-format

Assemble and link:

cmd:

# assembler 
as exit.s -o exit.o
# linker 
ld exit.o -o exit

Code dissect:

.section .data
Anything starting with a period isn't directly translated into a machine instruction.
Instead, it's an instruction to the assembler itself. These are called assembler
directives or pseudo-operations because they are handled by the assembler and are
not actually run by the computer.

The .section command breaks your program up into sections.
This command starts the data section, where you list any memory
storage you will need for data.


.section .text
Starts the text section. The text section of a program is where the program
instructions live.

.globl _start
_start is a symbol, which means that it is going to be replaced by something else either
during assembly or linking.
Symbols are generally used to mark locations of programs or data,
so you can refer to them by name instead of by their location number.
Symbols are used so that the assembler and linker can take care of keeping track of addresses.

.globl means that the assembler shouldn't discard this symbol after assembly,
because the linker will need it.

 _start is a special symbol that always needs to be
marked with .globl because it marks the location of the start of the program.


_start:
defines the value of the _start label.
A label is a symbol followed by a colon.
Labels define a symbol's value. When the assembler is assembling the program, it
has to assign each data value and instruction an address.

Labels tell the assembler to make the symbol's value be wherever the next
instruction or data element will be.
This way, if the actual physical location of the data or instruction changes, you
don't have to rewrite any references to it - the symbol automatically gets the new
value.

movl $1, %eax

Loads the value 1 into %eax.
src: $1 ($ means immediate mode. Without the dollar sign it would be direct addressing,
loading whatever number is at address 1.)
dest: register

We move the number 1 into %eax because we are preparing to
call the Linux kernel.
The number 1 is the number of the exit system call.

%eax must always be loaded with the system call number.
Depending on the system call, other registers may have to have values in them as well.

int $0x80

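int stands for interrupt. 0x80 is the interrupt number; it hands control to the Linux kernel, which runs the system call selected by %eax.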

Code:

.section .data
data_items:
#These are the data items
.long 3,67,34,222,45,75,54,34,44,33,22,11,66,0

.section .text

.globl _start
_start:
movl $0, %edi # move 0 into the index register
movl data_items(,%edi,4), %eax # load the first item of data (a 4-byte long)
movl %eax, %ebx  # since this is the first item, %eax is
                        # the biggest

start_loop:  # start loop
cmpl $0, %eax # check to see if we've hit the end
je loop_exit
incl %edi  # advance the index to the next value
movl data_items(,%edi,4), %eax
cmpl %ebx, %eax # compare values
jle start_loop # jump to loop beginning if the new
                  # one isn't bigger
movl %eax, %ebx  # move the value as the largest

jmp start_loop # jump to loop beginning
loop_exit:
# %ebx is the status code for the exit system call
# and it already has the maximum number
movl $1, %eax #1 is the exit() syscall
int $0x80
data_items:   #These are the data items
    .long 3,67,34,222,45,75,54,34,44,33,22,11,66,0
data_items is a label for the address of the first number, so
movl data_items, %eax
would move the value 3 into %eax.
The final 0 is not data; it marks the end of the list.

There is no
.globl
declaration for data_items because we only refer to this location within the program.
No other file or program needs to know where it is located.
It's not an error to write .globl data_items, it's just not necessary.

A variable is a dedicated storage location used for a specific purpose, usually
given a distinct name by the programmer.

movl $0, %edi
We use %edi as our index, and since we want to start with the first
item, we load %edi with 0.
The l in movl stands for long (a 4-byte move).

movl data_items(,%edi,4), %eax
movl BEGINNINGADDRESS(,%INDEXREGISTER,WORDSIZE)  # indexed addressing mode
Each .long takes 4 bytes, so the index is multiplied by 4.

movl %eax, %ebx
even though movl stands for move, it actually copies the value,
so %eax and %ebx both contain the starting value.

cmpl $0, %eax
Compares the two values and stores the result of the comparison in the %eflags register.

je loop_exit
Jumps to loop_exit if the comparison recorded in %eflags found the values equal.

incl %edi
incl increments the value of %edi by one.
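
The whole loop dissected above can be mirrored step by step in Python. This is a sketch, not a translation tool: find_maximum is a hypothetical name, the Python variables only stand in for the registers, and struct is used to imitate the 4-byte .long layout in memory.

```python
import struct


def find_maximum(values):
    """Mirror the loop: values must end with 0, the end-of-list marker."""
    # .long stores each value as 4 bytes; x86 is little-endian ("<").
    data_items = struct.pack("<%di" % len(values), *values)

    edi = 0                                                  # movl $0, %edi
    eax = struct.unpack_from("<i", data_items, 4 * edi)[0]   # movl data_items(,%edi,4), %eax
    ebx = eax                                                # movl %eax, %ebx

    while eax != 0:                                          # cmpl $0, %eax ; je loop_exit
        edi += 1                                             # incl %edi
        eax = struct.unpack_from("<i", data_items, 4 * edi)[0]
        if eax <= ebx:                                       # cmpl %ebx, %eax ; jle start_loop
            continue
        ebx = eax                                            # movl %eax, %ebx
    return ebx                                               # exit status in %ebx


print(find_maximum([3, 67, 34, 222, 45, 75, 54, 34, 44, 33, 22, 11, 66, 0]))  # 222
```

As in the assembly, nothing stops the loop if the terminating 0 is missing; the Python model fails with a struct.error when it reads past the end, whereas the assembly would simply keep reading memory.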

Function:

  • function name
  • function parameters
  • local variables
    • Static vs. Global variable:
      • The only difference between the global and static
        variables is that static variables are only used by one function,
        while global variables are used by many functions.
        Assembly language treats them exactly the same,
        although most other languages distinguish them.
  • global variables
  • return address
    return address is a parameter which tells the function where to resume executing after the function is completed.
    This is needed because functions can be called to do processing from many different parts of your program, and the function needs to be able to get back to wherever it was called from.
    In most programming languages, this parameter is passed automatically when the function is called.
    In assembly language, the call instruction handles passing the return address for you,
    and ret handles using that address to return back to where you called the function from.

Calling convention:

The way that the variables are stored and the parameters and return values are
transferred by the computer.
In the C language calling convention, the stack is the key element for
implementing a function's local variables, parameters, and return address.

Further reading:
  • How is returning a large data structure implemented in assembly?
  • cdecl (C calling convention)
  • Application binary interface

stack:

pushl #pushes either a register or memory value onto the top of the stack.
popl #pop values off the top.
%esp #stack register, current "top" of the stack.

Every time we push something onto the stack with pushl, %esp is decremented
by 4 so that it points to the new top of the stack (the stack grows toward lower addresses).

The popl instruction adds 4 to %esp and puts the previous top value in whatever register you specified.

movl (%esp), %eax      # put stack top's value into %eax
movl %esp, %eax        # put stack top's address into %eax
movl 4(%esp), %eax     # base pointer addressing mode: loads the value at address %esp + 4 (the item just below the top); %esp itself is unchanged
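
The effect of pushl and popl on %esp can be traced with a small Python model. It is a sketch under assumptions: the Stack class is hypothetical, memory is a plain dictionary, and the starting address 0x1000 is made up.

```python
class Stack:
    """Toy model of the x86 stack: it grows toward lower addresses."""

    def __init__(self, top=0x1000):
        self.memory = {}   # address -> 32-bit value
        self.esp = top     # current top of the stack

    def pushl(self, value):
        self.esp -= 4                              # make room first
        self.memory[self.esp] = value & 0xFFFFFFFF

    def popl(self):
        value = self.memory[self.esp]              # read the current top
        self.esp += 4                              # then release it
        return value

    def peek(self, offset=0):
        """Like movl OFFSET(%esp), %eax: read without changing %esp."""
        return self.memory[self.esp + offset]


stack = Stack()
stack.pushl(10)
stack.pushl(20)
print(hex(stack.esp))   # 0xff8
print(stack.peek())     # 20
print(stack.peek(4))    # 10
print(stack.popl())     # 20
print(hex(stack.esp))   # 0xffc
```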


Function call process:

Before executing a function, a program
  1. pushes all of the parameters for the function onto the stack in the reverse order that they are documented.
  2. issues a call instruction indicating which function it wishes to start.
  3. The call instruction does two things. 
    1. First it pushes the address of the next instruction, which is the return address, onto the stack. Later, ret loads this address
      into %eip, which supplies the next instruction to execute.
    2. Then it modifies the instruction pointer (%eip) to point to the start of the function. 

Parameter #N
...
Parameter 2
Parameter 1
Return Address <--- (%esp)

Thus
  • Each of the parameters of the function has been pushed onto the stack, and
  • finally the return address is there.

Now the function itself:
  • saves the current base pointer register, %ebp, by doing
    pushl %ebp
  • copies the stack pointer to %ebp by doing
    movl %esp, %ebp
    This allows us to access the function parameters
    as fixed offsets from the base pointer.

Parameter #N <--- N*4+4(%ebp)
...
Parameter 2 <---12(%ebp)
Parameter 1 <---  8(%ebp)
Return Address <---  4(%ebp)
Old %ebp <--- (%esp) and (%ebp)
  • Local variables: subtract from %esp to reserve space, e.g. for two 4-byte variables:
    subl $8, %esp
Parameter #N <--- N*4+4(%ebp)
...
Parameter 2 <--- 12(%ebp)
Parameter 1 <--- 8(%ebp)
Return Address <--- 4(%ebp)
Old %ebp <--- (%ebp) Note: the original %ebp is saved here and popped back into %ebp at the end; during the function call, %ebp is a snapshot of %esp.
Local Variable 1 <--- -4(%ebp)
Local Variable 2 <--- -8(%ebp) and (%esp)


When a function is done executing, it does three things:
  • Stores return value in %eax
  • Resets the stack to what it was when it was called
  • Returns control back to wherever it was called from by using
    ret
    instruction.
    ret pops whatever value is at the top of the stack,
    and sets the instruction pointer, %eip, to that value.

Before a function returns control to the code that called it,
it must restore the previous stack frame. 
i.e:
From:
Parameter #N <--- N*4+4(%ebp)
...
Parameter 2 <--- 12(%ebp)
Parameter 1 <--- 8(%ebp)
Return Address <--- 4(%ebp)
Old %ebp <--- (%ebp) Note this line: it serves as the anchor.
Local Variable 1 <--- -4(%ebp)
Local Variable 2 <--- -8(%ebp) and (%esp)

To:
Parameter #N <--- N*4+4(%ebp)
...
Parameter 2 <---12(%ebp)
Parameter 1 <---  8(%ebp)
Return Address <---  4(%ebp)
Old %ebp <--- (%esp) and (%ebp)

Without doing this, ret wouldn't work,
because in our current stack frame, the return address is not at the top of the stack.

Thus we do:
movl %ebp, %esp

popl %ebp

ret

Now we can examine %eax for the return value.

The calling code also needs to pop off all of the parameters it pushed onto the stack in order to get the stack pointer back where it was.
(It can simply add 4 * number of parameters to %esp using the addl
instruction, if we don't need the values of the parameters anymore.)

After a function call, the only register that is guaranteed to be left with the value it started with is
%ebp
(for the functions in these notes; the standard C calling convention also expects %ebx, %esi, and %edi to be preserved).
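
The full sequence described above (push parameters, call, prologue, local storage, epilogue, ret, caller cleanup) can be traced with a small Python model. It is a sketch only: simulate_call is a hypothetical helper, memory is a dictionary, and the return address and starting register values are made up.

```python
def simulate_call(params, return_address=0x8048100, esp=0x1000, ebp=0x2000):
    """Trace %esp and %ebp through one function call with a stack frame."""
    memory = {}
    start_esp, start_ebp = esp, ebp

    def push(value):
        nonlocal esp
        esp -= 4
        memory[esp] = value

    def pop():
        nonlocal esp
        value = memory[esp]
        esp += 4
        return value

    # Caller: push the parameters in reverse order, then `call`.
    for value in reversed(params):
        push(value)
    push(return_address)              # done by the call instruction

    # Callee prologue.
    push(ebp)                         # pushl %ebp
    ebp = esp                         # movl %esp, %ebp
    esp -= 8                          # subl $8, %esp (two local variables)

    # Parameter i (counting from 1) lives at 4 + 4*i from %ebp.
    seen_params = [memory[ebp + 8 + 4 * i] for i in range(len(params))]
    seen_return = memory[ebp + 4]

    # Callee epilogue.
    esp = ebp                         # movl %ebp, %esp
    ebp = pop()                       # popl %ebp
    eip = pop()                       # ret

    # Caller cleanup.
    esp += 4 * len(params)            # addl $(4 * N), %esp

    return {
        "params": seen_params,
        "return_address_in_frame": seen_return,
        "eip": eip,
        "esp_restored": esp == start_esp,
        "ebp_restored": ebp == start_ebp,
    }


result = simulate_call([2, 3])
print(result["params"])        # [2, 3]
print(result["esp_restored"])  # True
print(result["ebp_restored"])  # True
```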



Code:

.section .data
.section .text

.globl _start
_start:
pushl $3    #push second argument
pushl $2    #push first argument

call power        #call the function

addl $8, %esp   #move the stack pointer back

pushl %eax  #save the first answer before
                 #calling the next function
                 
pushl $2    #push second argument
pushl $5    #push first argument
call power  #call the function
addl $8, %esp   #move the stack pointer back

popl %ebx   #The second answer is already 
                #in %eax. We saved the
                #first answer onto the stack,
                #so now we can just pop it
                #out into %ebx

addl %eax, %ebx     #add them together
                          #the result is in %ebx

movl $1, %eax   #exit (%ebx is returned)

int $0x80


.type power, @function
power:
pushl %ebp #save old base pointer
movl %esp, %ebp #make stack pointer the base pointer
subl $4, %esp #get room for our local storage
movl 8(%ebp), %ebx #put first argument in %ebx
movl 12(%ebp), %ecx #put second argument in %ecx
movl %ebx, -4(%ebp) #store current result

power_loop_start:
cmpl $1, %ecx #if the power is 1, we are done
je end_power
movl -4(%ebp), %eax #move the current result into %eax
imull %ebx, %eax  #multiply the current result by
                        #the base number

movl %eax, -4(%ebp) #store the current result
decl %ecx #decrease the power
jmp power_loop_start #run for the next power

end_power:
movl -4(%ebp), %eax #return value goes in %eax
movl %ebp, %esp #restore the stack pointer
popl %ebp #restore the base pointer
ret
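
To check what the program above returns, the power function can be mirrored in Python. This is a sketch: the variable names only stand in for the registers and the local stack slot, and the guard at the top is an addition of this model, since the assembly assumes the power is at least 1.

```python
def power(base, exponent):
    """Mirror of the power function: multiply base by itself exponent times."""
    if exponent < 1:
        # Assumption of this model; the assembly has no such check.
        raise ValueError("the assembly version assumes a power of at least 1")

    ebx = base                             # movl 8(%ebp), %ebx
    ecx = exponent                         # movl 12(%ebp), %ecx
    local = ebx                            # movl %ebx, -4(%ebp)

    while ecx != 1:                        # cmpl $1, %ecx ; je end_power
        eax = local                        # movl -4(%ebp), %eax
        eax = (eax * ebx) & 0xFFFFFFFF     # imull %ebx, %eax (low 32 bits kept)
        local = eax                        # movl %eax, -4(%ebp)
        ecx -= 1                           # decl %ecx
    return local                           # movl -4(%ebp), %eax


# _start computes 2^3 + 5^2 and exits with that value as the status code.
print(power(2, 3) + power(5, 2))  # 33
```

After running the assembled program, echo $? should therefore print 33.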


.equ  #Assign names to numbers.
.equ LINUX_SYSCALL, 0x80 # after this definition we can write
int $LINUX_SYSCALL



# not using lib
.include "linux.s"

.section .data

helloworld:
.ascii "hello world\n"
helloworld_end:


.equ helloworld_len, helloworld_end - helloworld
.section .text
.globl _start
_start:

movl $STDOUT, %ebx
movl $helloworld, %ecx
movl $helloworld_len, %edx
movl $SYS_WRITE, %eax
int $LINUX_SYSCALL

movl $0, %ebx
movl $SYS_EXIT, %eax
int $LINUX_SYSCALL


# use a lib
.section .data

helloworld:
.ascii "hello world\n\0"

.section .text
.globl _start
_start:
pushl $helloworld
call printf #  referred to by name within the program.
pushl $0
call exit #  referred to by name within the program.

cmd:

for not using lib:
as helloworld-nolib.s -o helloworld-nolib.o
ld helloworld-nolib.o -o helloworld-nolib

for using lib:
as helloworld-lib.s -o helloworld-lib.o
ld -dynamic-linker /lib/ld-linux.so.2 -o helloworld-lib helloworld-lib.o -lc

-dynamic-linker /lib/ld-linux.so.2
# Before executing the ELF program, the operating system loads the program /lib/ld-linux.so.2,
# which loads the external libraries and links them with the program.
# This program is known as a dynamic linker.

-lc #link to the c library

ldd ./helloworld-nolib # not a dynamic executable.

ldd ./helloworld-lib
# libc.so.6 => /lib/libc.so.6 (0x4001d000)
# /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x400000000)

build shared lib:
as write-record.s -o write-record.o
as read-record.s -o read-record.o
ld -shared write-record.o read-record.o -o librecord.so

as write-records.s -o write-records.o
ld -L . -dynamic-linker /lib/ld-linux.so.2 -o write-records -lrecord write-records.o

Linking:

During static linking, names like 'printf' and 'exit' are resolved to memory addresses,
and the names are thrown away.

With dynamic linking, the name itself resides within the executable, and is resolved by
the dynamic linker when the program is run.

When the program is run by the user, the dynamic linker (i.e. /lib/ld-linux.so.2)
loads the shared libraries listed in our link statement,
and then finds all of the function and variable names that
were named by our program but not found at link time,
and matches them up with corresponding entries in the shared libraries it loads.
It then replaces all of the names with the addresses which they are loaded at.
This sounds time-consuming, but it only happens once - at program startup time.

Endianness:

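x86 is a little-endian architecture: when a multi-byte value is written from a register to memory, the least-significant byte is stored first, so the bytes in memory appear in the reverse of the order we normally write them. Big-endian processors store the most-significant byte first.

This difference is not normally a problem: because the bytes are reversed again (or not, on a big-endian processor) when being read back into a register, the programmer usually never notices what order the bytes are in.

The byte-switching magic happens automatically behind the scenes during register-to-memory transfers.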

However, the byte order can cause problems in several instances:
  • If you read in several bytes at a time using movl but deal with them on a
    byte-by-byte basis using the least significant byte
    (i.e. by using %al and/or shifting of the register),
    they will be in a different order than they appear in memory.
  • If you read or write files written for different architectures,
    you may have to account for whatever order they write their bytes in.
  • If you read or write to network sockets, you may have to account for a different
    byte order in the protocol.
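
Byte order is easy to see in Python, which can convert an integer to bytes in either order. This is a sketch: store_long and load_long are hypothetical helpers that imitate a 4-byte register-to-memory transfer and the reverse.

```python
def store_long(value, byteorder="little"):
    """Bytes that a 4-byte store would place in memory, lowest address first."""
    return (value & 0xFFFFFFFF).to_bytes(4, byteorder)


def load_long(data, byteorder="little"):
    """Value a register would hold after loading 4 bytes from memory."""
    return int.from_bytes(data[:4], byteorder)


value = 0x12345678
print(store_long(value, "little").hex())  # 78563412 (x86 order)
print(store_long(value, "big").hex())     # 12345678

# Round trip: the reversal is invisible as long as both sides agree.
print(hex(load_long(store_long(value))))  # 0x12345678

# Loading the four characters "abcd" on a little-endian machine:
eax = load_long(b"abcd")
print(hex(eax))          # 0x64636261
print(chr(eax & 0xFF))   # a  (the byte that %al would hold)
```

Reading data with the wrong byte order silently produces a different number, for example load_long(store_long(1, "big")) gives 0x01000000 instead of 1.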
