Ataraxia through Epoché: [assembly] note

Reference:
http://www.cs.virginia.edu/~evans/cs216/guides/x86.html
http://x86.renejeschke.de/
http://programminggroundup.blogspot.com/

registers:

general-purpose registers
Where the main action happens. Addition, subtraction, multiplication,
comparisons, and other operations generally use general-purpose registers for
processing.

On x86 processors, there are several general-purpose registers:

%eax # holds the system call number before int $0x80, and a function's return value
%ebx
%ecx
%edx
%edi
%esi

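Working with parts of a register:

%eax # the full 32-bit (4-byte) register

%ax # least-significant half (the low 2 bytes) of %eax

%al # least-significant byte of %ax

%ah # most-significant byte of %ax (bits 8-15 of %eax)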
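
A minimal sketch of how these sub-registers overlap, written in the same style as the other programs in these notes. The value 0x11223344 is an arbitrary choice for illustration. If the program is assembled as 32-bit code, echo $? should report 51 (0x33), the byte held in %ah.

```asm
.section .text
.globl _start
_start:
movl $0x11223344, %eax  # %eax = 0x11223344
                        # %ax  = 0x3344 (low half of %eax)
                        # %al  = 0x44, %ah = 0x33
movl $0, %ebx           # clear the exit status register
movb %ah, %bl           # copy the high byte of %ax into the low byte of %ebx
movl $1, %eax           # exit system call (this overwrites %eax)
int $0x80               # exit status is 0x33 = 51
```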

special-purpose registers, including:

%ebp # base pointer register, used for accessing function parameters and local variables.
%esp # stack pointer register, always contains a pointer to the current top of the stack, wherever it is.
%eip # instruction pointer, holds the address of the next instruction to execute.
%eflags # status flags, set by comparisons and arithmetic and tested by the conditional jumps.

Jump:

A conditional jump tests the result of the preceding comparison, for example cmpl FIRST, SECOND.

je #Jump if the values were equal

jg #Jump if the second value was greater than the first value

jge #Jump if the second value was greater than or equal to the first value

jl #Jump if the second value was less than the first value

jle #Jump if the second value was less than or equal to the first value

jmp #Jump no matter what. This does not need to be preceded by a comparison.
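
The "second value compared to first value" wording can be sketched with a short program. The numbers 5 and 7 and the label names are illustrative assumptions. Since the second operand (7) is greater than the first (5), jg is taken and the exit status should be 1.

```asm
.section .text
.globl _start
_start:
movl $7, %eax
movl $0, %ebx         # exit status used if the jump is not taken
cmpl $5, %eax         # first value is 5, second value is %eax (7)
jg second_is_greater  # taken, because 7 > 5
jmp finish
second_is_greater:
movl $1, %ebx         # exit status used if the jump is taken
finish:
movl $1, %eax         # exit system call
int $0x80             # exit status 1
```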

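Several directives reserve and initialize memory locations:

.byte # each number takes up one byte (0 to 255)

.int # each number takes up four bytes with the GNU assembler on x86 (same as .long)

.long # each number takes up four bytes

.ascii # character data; each character takes up one byte

Characters each take up one storage location (they are converted into bytes internally).

The .ascii directive does not add a terminator. To mark the end of a string, write \0 yourself as the last character.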
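
As a small sketch of these directives in use, the program below defines three data items and exits with the first character of the string. The label names and values are made up for illustration. The exit status should be 72, the ASCII code of 'H'.

```asm
.section .data
small_number:
.byte 200            # one byte
big_number:
.long 70000          # four bytes
greeting:
.ascii "Hi\0"        # three bytes: 'H', 'i', and the terminating zero

.section .text
.globl _start
_start:
movl $0, %ebx
movb greeting, %bl   # first character of the string, 'H' (ASCII 72)
movl $1, %eax        # exit system call
int $0x80            # exit status 72
```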

Data Accessing Methods / Addressing Modes:

The general form of memory address references:

# All of the fields are optional.

# All of the addressing modes can be represented in this fashion except immediate-mode

ADDRESS_OR_OFFSET (%BASE_OR_OFFSET, %INDEX, MULTIPLIER)

(If any field is left out, it is substituted with zero in the equation. An omitted MULTIPLIER is treated as 1.)

FINAL ADDRESS = ADDRESS_OR_OFFSET + %BASE_OR_OFFSET + MULTIPLIER * %INDEX

ADDRESS_OR_OFFSET (const)

MULTIPLIER (const)

BASE_OR_OFFSET (register)

INDEX (register)
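
The formula can be checked with concrete numbers. This sketch uses the leal instruction, which is not covered in these notes: it computes the final address of a memory reference and stores the address itself, without reading memory. The values 8, 100, and 3 are arbitrary. The exit status should be 120.

```asm
.section .text
.globl _start
_start:
movl $100, %ebx            # BASE_OR_OFFSET register
movl $3, %ecx              # INDEX register
leal 8(%ebx,%ecx,4), %eax  # %eax = 8 + 100 + 4 * 3 = 120
movl %eax, %ebx            # use the computed address as the exit status
movl $1, %eax              # exit system call
int $0x80                  # exit status 120
```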

Modes:

Every mode except immediate mode can be used as either the source or destination operand.
Immediate mode can only be a source operand.

immediate mode
The data to access is embedded in the instruction itself.
Immediate mode is used to load direct values into registers or memory locations.
Example:
movl $12, %eax # move number 12 into %eax

Notice that to indicate immediate mode, we used a dollar sign in front of the
number.
If we did not, it would be direct addressing mode, in which case the
value located at memory location 12 would be loaded into %eax rather than
the number 12 itself.

register addressing mode (access register)
contains a register to access.
Register mode simply moves data in or out of a register.

direct addressing mode (access memory)
contains the memory address to access.
Done by only using the ADDRESS_OR_OFFSET portion.
Example:
movl ADDRESS, %eax
This loads %eax with the value at memory address ADDRESS.

indexed addressing mode (access memory)
contains a memory address to access, and also specifies an index register to offset that address.
On x86 processors, you can also specify a multiplier for the index.
Done by using the ADDRESS_OR_OFFSET and the %INDEX portion.
Can use any general-purpose register as the index register.
Can also have a constant multiplier of 1, 2, 4, or 8 for the index register.
Example:
movl string_start(,%ecx,1), %eax
This starts at string_start, adds 1 * %ecx to that address, and loads the value at the resulting address into %eax.

indirect addressing mode (access data)
contains a register that holds a pointer to where the data should be accessed.
Indirect addressing mode loads a value from the address indicated by that register.
Example:
movl (%eax), %ebx
This loads %ebx with the value at the address held in %eax.

base pointer addressing mode (access data)
similar to indirect addressing, except that you also include a constant called the offset, which is added to the register's value before it is used for the lookup.
Example:
movl 4(%eax), %ebx
This loads %ebx with the value at address %eax + 4.
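
A sketch that exercises each mode once in a single program. The values label and its contents are assumptions made for illustration. Adding up the loaded values gives an exit status of 70.

```asm
.section .data
values:
.long 10, 20, 30

.section .text
.globl _start
_start:
movl $2, %ecx               # immediate mode: the number 2
movl values, %eax           # direct mode: %eax = 10
movl values(,%ecx,4), %ebx  # indexed mode: item at values + 4 * 2, %ebx = 30
movl $values, %edx          # immediate mode: the address of values
addl (%edx), %ebx           # indirect mode: adds 10, %ebx = 40
addl 4(%edx), %ebx          # base pointer mode: adds 20, %ebx = 60
addl %eax, %ebx             # register mode: adds 10, %ebx = 70
movl $1, %eax               # exit system call
int $0x80                   # exit status 70
```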

Code:

.section .data

.section .text

.globl _start

_start:
movl $1, %eax # this is the linux kernel command number (system call) for exiting a program
movl $0, %ebx # this is the status number we will return to the operating system. 
              # Change this around and it will return different things to echo $?

int $0x80   # this wakes up the kernel to run the exit command

https://stackoverflow.com/questions/14361248/whats-the-difference-of-section-and-segment-in-elf-file-format

Compiled from right to left:

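cmd:

# assembler

as exit.s -o exit.o

# linker

ld exit.o -o exit

# on a 64-bit Linux host, assemble and link as 32-bit code instead:
# as --32 exit.s -o exit.o
# ld -m elf_i386 exit.o -o exit

# run it and print the exit status

./exit; echo $?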

Code dissect:

.section .data
Anything starting with a period isn't directly translated into a machine instruction.
Instead, it's an instruction to the assembler itself. These are called assembler
directives or pseudo-operations because they are handled by the assembler and are
not actually run by the computer.

The .section command breaks your program up into sections.
This command starts the data section, where you list any memory
storage you will need for data.

.section .text
Starts the text section. The text section of a program is where the program
instructions live.

.globl _start
_start is a symbol, which means that it is going to be replaced by something else either
during assembly or linking.
Symbols are generally used to mark locations of programs or data,
so you can refer to them by name instead of by their location number.
Symbols are used so that the assembler and linker can take care of keeping track of addresses.

.globl means that the assembler shouldn't discard this symbol after assembly,
because the linker will need it.

_start is a special symbol that always needs to be
marked with .globl because it marks the location of the start of the program.

_start:
defines the value of the _start label.
A label is a symbol followed by a colon.
Labels define a symbol's value. When the assembler is assembling the program, it
has to assign each data value and instruction an address.

Labels tell the assembler to make the symbol's value be wherever the next
instruction or data element will be.
This way, if the actual physical location of the data or instruction changes, you
don't have to rewrite any references to it - the symbol automatically gets the new
value.

movl $1, %eax

Assign the value 1 to %eax.
src: $1 ($ means immediate mode. Without the dollar sign it would be direct addressing,
loading whatever number is at address 1.)
dest: register

We move the number 1 into %eax because we are preparing to
call the Linux kernel.
The number 1 is the number of the exit system call.

%eax is always required to be loaded with the system call number.
Depending on the system call, other registers may have to have values in them as well.
For exit, %ebx holds the status code returned to the operating system.

int $0x80

int stands for interrupt. 0x80 is the interrupt number. The interrupt transfers control to the
Linux kernel, which runs the system call selected by %eax.

Code:

.section .data
data_items:
#These are the data items
.long 3,67,34,222,45,75,54,34,44,33,22,11,66,0

.section .text

.globl _start
_start:
movl $0, %edi # move 0 into the index register
movl data_items(,%edi,4), %eax # load the first 4-byte item of data
movl %eax, %ebx  # since this is the first item, %eax is
                        # the biggest

start_loop:  # start loop
cmpl $0, %eax # check to see if we've hit the end
je loop_exit
incl %edi  # load next value
movl data_items(,%edi,4), %eax
cmpl %ebx, %eax # compare values
jle start_loop # jump to loop beginning if the new
                  # one isn't bigger
movl %eax, %ebx  # move the value as the largest

jmp start_loop # jump to loop beginning
loop_exit:
# %ebx is the status code for the exit system call
# and it already has the maximum number
movl $1, %eax #1 is the exit() syscall
int $0x80

data_items:   #These are the data items
    .long 3,67,34,222,45,75,54,34,44,33,22,11,66,0

movl data_items, %eax

would move the value 3 into %eax.

data_items has no .globl declaration because we only refer to this location within the program.
No other file or program needs to know where it is located.
It's not an error to write .globl data_items, it's just not necessary.

A variable is a dedicated storage location used for a specific purpose, usually
given a distinct name by the programmer.

movl $0, %edi

We use %edi as our index, and since we want to start looking at the first
item, we load %edi with 0.
The l in movl stands for long: the instruction moves a 4-byte value.

movl data_items(,%edi,4), %eax

movl BEGINNINGADDRESS(,%INDEXREGISTER,WORDSIZE) # indexed addressing mode

movl %eax, %ebx

even though movl stands for move, it actually copies the value,
so %eax and %ebx both contain the starting value.

cmpl $0, %eax

The %eflags register stores the result of the comparison.

je loop_exit

Jumps to loop_exit if the comparison recorded in %eflags found the two values equal.

incl %edi

incl increments the value of %edi by one.

Function:

function name
function parameters
local variables

Static vs. Global variable:

The only difference between the global and static
variables is that static variables are only used by one function,
while global variables are used by many functions.
Assembly language treats them exactly the same,
although most other languages distinguish them.

global variables
return address
return address is a parameter which tells the function where to resume executing after the function is completed.
This is needed because functions can be called to do processing from many different parts of your program, and the function needs to be able to get back to wherever it was called from.
In most programming languages, this parameter is passed automatically when the function is called.
In assembly language, the call instruction handles passing the return address for you,
and ret handles using that address to return back to where you called the function from.

Calling convention:

The way that the variables are stored and the parameters and return values are
transferred by the computer.
In the C language calling convention, the stack is the key element for
implementing a function's local variables, parameters, and return address.

How are large data structures returned in assembly? See:

cdecl (C calling convention)

Application binary interface (ABI)

stack:

pushl #pushes either a register or memory value onto the top of the stack.
popl #pop values off the top.
%esp #stack register, current "top" of the stack.

Every time we push something onto the stack with pushl, 4 is subtracted from %esp
so that it points to the new top of the stack.

The popl instruction does the reverse: it puts the previous top value in whatever register you specified and adds 4 to %esp.

movl (%esp), %eax # put stack top's value into %eax
movl %esp, %eax #put stack top's address into %eax
movl 4(%esp), %eax #base pointer addressing mode. looks up the value at address %esp + 4 (the item just below the top); %esp itself is not changed.
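
A short sketch of these stack operations together, using the arbitrary values 10 and 20. The exit status should be 30.

```asm
.section .text
.globl _start
_start:
pushl $10           # %esp decreases by 4; the top of the stack is 10
pushl $20           # %esp decreases by 4 again; the top of the stack is 20
movl (%esp), %eax   # %eax = 20 (the top)
movl 4(%esp), %ebx  # %ebx = 10 (one item below the top)
addl %eax, %ebx     # %ebx = 30
popl %ecx           # %ecx = 20, %esp increases by 4
popl %ecx           # %ecx = 10, %esp is back where it started
movl $1, %eax       # exit system call
int $0x80           # exit status 30
```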

Function call process:

Before executing a function, a program

pushes all of the parameters for the function onto the stack in the reverse order that they are documented.
issues a call instruction indicating which function it wishes to start.

The call instruction does two things.

First it pushes the address of the next instruction, which is the return address, onto the stack. Later, ret loads this address into
%eip, so that execution continues with the instruction after the call.
Then it modifies the instruction pointer (%eip) to point to the start of the function.

Parameter #N
...
Parameter 2
Parameter 1
Return Address <--- (%esp)

Thus, when the function starts, each of its parameters has been pushed onto the stack, and
finally the return address is there.

The function then sets up its own stack frame. It
saves the current base pointer register, %ebp, by doing
pushl %ebp
and copies the stack pointer to %ebp by doing
movl %esp, %ebp
This allows us to access the function parameters
as fixed offsets from the base pointer.

Parameter #N <--- N*4+4(%ebp)
...
Parameter 2 <---12(%ebp)
Parameter 1 <--- 8(%ebp)
Return Address <--- 4(%ebp)
Old %ebp <--- (%esp) and (%ebp)

Local variables: reserve space by subtracting from %esp, e.g. for two 4-byte variables:
subl $8, %esp

Parameter #N <--- N*4+4(%ebp)
...
Parameter 2 <--- 12(%ebp)
Parameter 1 <--- 8(%ebp)
Return Address <--- 4(%ebp)
Old %ebp <--- (%ebp) Note: the caller's %ebp is saved here and restored with popl at the end. During the function, %ebp is a snapshot of %esp taken at function entry.
Local Variable 1 <--- -4(%ebp)
Local Variable 2 <--- -8(%ebp) and (%esp)

When a function is done executing, it does three things:

Stores its return value in %eax.
Resets the stack to what it was when it was called.
Returns control back to wherever it was called from by using the ret instruction.
ret pops whatever value is at the top of the stack
and sets the instruction pointer, %eip, to that value.

Before a function returns control to the code that called it,
it must restore the previous stack frame, i.e.:

From:

Parameter #N <--- N*4+4(%ebp)
...
Parameter 2 <--- 12(%ebp)
Parameter 1 <--- 8(%ebp)
Return Address <--- 4(%ebp)
Old %ebp <--- (%ebp) Note: this slot serves as the anchor.
Local Variable 1 <--- -4(%ebp)
Local Variable 2 <--- -8(%ebp) and (%esp)

To:
Parameter #N <--- N*4+4(%ebp)
...
Parameter 2 <---12(%ebp)
Parameter 1 <--- 8(%ebp)
Return Address <--- 4(%ebp)
Old %ebp <--- (%esp) and (%ebp)

Without doing this, ret wouldn't work,
because in our current stack frame, the return address is not at the top of the stack.

Thus we do:

movl %ebp, %esp

popl %ebp

ret

Now we can examine %eax for the return value.

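The calling code also needs to pop off all of the parameters it pushed onto the stack in order to get the stack pointer back where it was.
(It can simply add 4 * number of parameters to %esp using the addl
instruction, if it doesn't need the values of the parameters anymore.)

After a function call, assume that most registers have been overwritten. In these notes the only register treated as guaranteed to keep the value it started with is
%ebp
(the function restores it, together with %esp, before returning). The standard cdecl convention additionally expects functions to preserve %ebx, %esi, and %edi.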
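
The whole sequence (push parameters, call, set up the frame, use a local variable, tear down the frame, ret, caller cleanup) can be sketched in one small program. The function add_two is hypothetical and exists only for this illustration. It touches no register other than %eax, and the exit status should be 10.

```asm
.section .text
.globl _start
_start:
pushl $6             # push second argument
pushl $4             # push first argument
call add_two         # pushes the return address and jumps to add_two
addl $8, %esp        # caller removes its two parameters
movl %eax, %ebx      # the return value becomes the exit status
movl $1, %eax        # exit system call
int $0x80            # exit status 10

.type add_two, @function
add_two:
pushl %ebp           # save the caller's base pointer
movl %esp, %ebp      # %ebp now anchors this stack frame
subl $4, %esp        # room for one local variable at -4(%ebp)
movl 8(%ebp), %eax   # first argument
movl %eax, -4(%ebp)  # keep it in the local variable
movl 12(%ebp), %eax  # second argument
addl -4(%ebp), %eax  # return value = second + first
movl %ebp, %esp      # discard the local variable
popl %ebp            # restore the caller's base pointer
ret                  # pop the return address into %eip
```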

Code:

.section .data
.section .text

.globl _start
_start:
pushl $3    #push second argument
pushl $2    #push first argument

call power        #call the function

addl $8, %esp   #move the stack pointer back

pushl %eax  #save the first answer before
                 #calling the next function
                 
pushl $2    #push second argument
pushl $5    #push first argument
call power  #call the function
addl $8, %esp   #move the stack pointer back

popl %ebx   #The second answer is already 
                #in %eax. We saved the
                #first answer onto the stack,
                #so now we can just pop it
                #out into %ebx

addl %eax, %ebx     #add them together
                          #the result is in %ebx

movl $1, %eax   #exit (%ebx is returned)

int $0x80


.type power, @function
power:
pushl %ebp #save old base pointer
movl %esp, %ebp #make stack pointer the base pointer
subl $4, %esp #get room for our local storage
movl 8(%ebp), %ebx #put first argument in %ebx
movl 12(%ebp), %ecx #put second argument in %ecx
movl %ebx, -4(%ebp) #store current result

power_loop_start:
cmpl $1, %ecx #if the power is 1, we are done
je end_power
movl -4(%ebp), %eax #move the current result into %eax
imull %ebx, %eax  #multiply the current result by
                        #the base number

movl %eax, -4(%ebp) #store the current result
decl %ecx #decrease the power
jmp power_loop_start #run for the next power

end_power:
movl -4(%ebp), %eax #return value goes in %eax
movl %ebp, %esp #restore the stack pointer
popl %ebp #restore the base pointer
ret

.equ #Assign names to numbers, e.g.
.equ LINUX_SYSCALL, 0x80
int $LINUX_SYSCALL # same as int $0x80

# not using lib
.include "linux.s"

.section .data

helloworld:
.ascii "hello world\n"
helloworld_end:


.equ helloworld_len, helloworld_end - helloworld
.section .text
.globl _start
_start:

movl $STDOUT, %ebx
movl $helloworld, %ecx
movl $helloworld_len, %edx
movl $SYS_WRITE, %eax
int $LINUX_SYSCALL

movl $0, %ebx
movl $SYS_EXIT, %eax
int $LINUX_SYSCALL
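
The program above relies on names defined in linux.s, which is not shown in these notes. A minimal sketch of what that file would need to contain for this program, assuming the 32-bit Linux system call numbers (exit is 1, write is 4) and file descriptor 1 for standard output:

```asm
# linux.s (hypothetical contents, only the names used above)
.equ SYS_EXIT, 1           # system call number for exit
.equ SYS_WRITE, 4          # system call number for write
.equ STDOUT, 1             # file descriptor of standard output
.equ LINUX_SYSCALL, 0x80   # interrupt number used to call the kernel
```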


# use a lib
.section .data

helloworld:
.ascii "hello world\n\0"

.section .text
.globl _start
_start:
pushl $helloworld
call printf #  referred to by name within the program.
pushl $0
call exit #  referred to by name within the program.

cmd:

for not using lib:
as helloworld-nolib.s -o helloworld-nolib.o
ld helloworld-nolib.o -o helloworld-nolib

for using lib:
as helloworld-lib.s -o helloworld-lib.o
ld -dynamic-linker /lib/ld-linux.so.2 -o helloworld-lib helloworld-lib.o -lc

-dynamic-linker /lib/ld-linux.so.2
# Before running the executable, the operating system loads the program /lib/ld-linux.so.2,
# which loads the external libraries and links them with the program.
# This program is known as a dynamic linker.

-lc #link to the c library

ldd ./helloworld-nolib # not a dynamic executable.

ldd ./helloworld-lib
# libc.so.6 => /lib/libc.so.6 (0x4001d000)
# /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x400000000)

build shared lib:
as write-record.s -o write-record.o
as read-record.s -o read-record.o
ld -shared write-record.o read-record.o -o librecord.so

as write-records.s -o write-records.o
ld -L . -dynamic-linker /lib/ld-linux.so.2 -o write-records -lrecord write-records.o

Linking:

further reference:

http://vsdmars.blogspot.com/2015/09/linking-notes.html

During static linking, names like 'printf' and 'exit' are resolved to memory addresses,
and the names are thrown away.

With dynamic linking, the name itself resides within the executable and is resolved by
the dynamic linker when the program is run.

When the program is run by the user, the dynamic linker (i.e. /lib/ld-linux.so.2)
loads the shared libraries listed in our link statement,
then finds all of the function and variable names that
were named by our program but not found at link time,
and matches them up with corresponding entries in the shared libraries it loads.
It then replaces all of the names with the addresses at which they are loaded.
This sounds time-consuming, but it only happens once, at program startup time.

Endianness:

reference:

https://en.wikipedia.org/wiki/Endianness

x86 is a little-endian processor: when it writes a multi-byte value from a register to memory, the least significant byte is stored at the lowest address, so the bytes sit in memory in the reverse of the order in which the number is normally written.

This difference is not normally a problem. Because the bytes are reversed again (or not, if it is a big-endian processor) when being read back into a register, the programmer usually never notices what order the bytes are in.

The byte-switching magic happens automatically behind the scenes during register-to-memory transfers.

However, the byte order can cause problems in several instances:

If you read in several bytes at a time using movl but deal with them on a
byte-by-byte basis using the least significant byte
(i.e. by using %al and/or shifting of the register),
they will be in a different order than they appear in memory.

If you read or write files written for different architectures,
you may have to account for whatever order they write their bytes in.

If you read or write to network sockets, you may have to account for a different
byte order in the protocol.
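
A minimal sketch of the byte order on x86, using an arbitrary value and a made-up label. The program reads only the byte at the lowest address of a 4-byte number. On a little-endian processor that byte is the least significant one, so the exit status should be 68 (0x44).

```asm
.section .data
value:
.long 0x11223344    # stored in memory as the bytes 0x44, 0x33, 0x22, 0x11

.section .text
.globl _start
_start:
movl $0, %ebx
movb value, %bl     # the byte at the lowest address: 0x44
movl $1, %eax       # exit system call
int $0x80           # exit status 0x44 = 68
```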

Nov 23, 2017
