A custom assembler with macro support and a two-pass assembly process, implementing efficient code translation, error handling, and file management in C.

talfig/Assembler

🔧 Assembler: The Code Converter

Welcome to the ultimate assembler built in C, designed to transform your assembly code into machine language with precision and elegance.
🎨 Custom Assembly Language | 🌟 Macro Magic | 💡 Detailed Error Reporting

🌌 Creator: Tal Figenblat 🌌

Table of Contents
  1. Project Background and Objectives
  2. Hardware
  3. Features
  4. Supported Opcodes
  5. Opcodes Specification
  6. Assembly Language Syntax
  7. Addressing Methods (Modes)
  8. Addressing Methods for Operations
  9. First Word Encoding
  10. Addressing Methods Encoding
  11. Types of Statements in Assembly Language
  12. Directive Statements
  13. Instruction Fields
  14. Instruction Statement Formats
  15. The A,R,E Field Encoding
  16. Macro Handling
  17. Assembler with Two Passes
  18. Object File Format
  19. Example Program
  20. Installation
  21. Usage
  22. Error Handling
  23. Directory Structure
  24. Appendix
  25. License
  26. Contact

Project Background and Objectives

As is known, there are many programming languages, and a large number of programs written in different languages can run on the same computer. How does the computer recognize so many languages? The answer is simple: the computer actually understands only one language: machine code, which is stored in memory as a sequence of binary digits. This code is divided by the Central Processing Unit (CPU) into small segments with meaning: instructions, addresses, and data.

In fact, computer memory is an array of bits, usually grouped into fixed-size units (bytes, words). There is no physical difference, visible to an unskilled eye, between the part of memory where a program is located and the rest of the memory.

The CPU can perform a variety of simple operations called machine instructions, using the registers within the CPU and the computer memory. Examples include transferring a number from memory to a register or back, adding 1 to a number in a register, checking if a number in a register is zero, and performing addition and subtraction between two registers.

Machine instructions and their combinations make up the program as it is loaded into memory. Each source program (the program as written by the programmer) will eventually be translated into this final form by a special software tool.

The CPU executes code in the format of machine language. This is a sequence of bits representing a binary encoding of a series of machine instructions making up the program. This code is not readable to users and thus it is not convenient to write or read programs directly in machine language. Assembly language allows representing machine instructions in a more symbolic and user-friendly manner. However, it still needs to be translated into machine code for the program to run on the computer. This translation is done by a tool called an assembler.

Each CPU model (i.e., each computer architecture) has its own specific machine language and, correspondingly, its own specific assembly language. Therefore, each assembler (translation tool) is dedicated and unique to each CPU.

The task of the assembler is to create a file containing machine code from a given source file written in assembly language. This is the first step in the process of getting a program ready to run on computer hardware. The subsequent steps are linking and loading, but these are not covered in this context.

The goal of this project is to write an assembler (i.e., a program that translates assembly source into machine code) for an assembly language defined specifically for this project.

Executable Code Process

🖥️ Hardware

  • The computer in this project consists of a CPU (Central Processing Unit), registers, and RAM. Part of the memory also serves as a stack.

  • The CPU contains 8 general registers: r0, r1, r2, r3, r4, r5, r6, r7. Each register is 15 bits in size, with the least significant bit labeled as bit 0 and the most significant bit as bit 14. The content of the registers is initialized to 0. Register names are always written with a lowercase 'r'.

  • Additionally, the CPU contains a register named PSW (program status word), which contains several flags representing the status of the program at any given moment. These flags are typically used to indicate conditions like carry or overflow after arithmetic operations. For more details on the PSW, please refer to the Program Status Word (PSW) section in the appendix.

  • The memory holds 4096 cells, addressed 0-4095, and each cell is 15 bits wide. A memory cell is called a "word" and has the same size as a register.

  • The machine works with a big-endian system, meaning the most significant byte is stored at the smallest memory address. The numbers are stored using 2's complement (negative numbers), and characters are encoded in ASCII.

🚀 Features

  • Custom Assembly Language: Dive into our unique assembly language with a defined set of opcodes.
  • Macro Efficiency: Create and use powerful macros to streamline your assembly code.
  • Smart Error Handling: Get detailed, compiler-like error messages to troubleshoot with ease.

🕹️ Supported Opcodes

Our assembler brings to life a variety of opcodes for your coding pleasure:

  • (0) mov – Move data
  • (1) cmp – Compare values
  • (2) add – Addition
  • (3) sub – Subtraction
  • (4) lea – Load effective address
  • (5) clr – Clear data
  • (6) not – Bitwise NOT
  • (7) inc – Increment
  • (8) dec – Decrement
  • (9) jmp – Jump to address
  • (10) bne – Branch if not equal
  • (11) red – Read input
  • (12) prn – Print output
  • (13) jsr – Jump to subroutine
  • (14) rts – Return from subroutine
  • (15) stop – Halt execution

🧩 Opcodes Specification

The CPU in this project includes a Program Counter (PC), an internal register (not a general-purpose register) that contains the memory address of the current instruction being executed. Instructions are divided into three groups based on the number of operands they require.

☝️ First Group: Two-Operand Opcodes

These instructions use two operands:

  • mov: Copies the content of the source operand to the destination operand.
    • Example: mov A, r1 – Copies the value from memory address A to register r1.
  • cmp: Compares the value of the source operand with the destination operand. The comparison results affect the flags in the PSW.
    • Example: cmp A, r1 – Compares the value at memory address A with the value in register r1. The Z flag in the PSW is set if they are equal.
  • add: Adds the value of the source operand to the destination operand and stores the result in the destination.
    • Example: add A, r0 – Adds the value from memory address A to register r0 and stores the result in r0.
  • sub: Subtracts the value of the source operand from the destination operand and stores the result in the destination.
    • Example: sub #2, r1 – Subtracts 2 from the value in register r1 and stores the result in r1.
  • lea: Loads the effective address of the source operand into the destination register.
    • Example: lea HELLO, r1 – Loads the address of HELLO into register r1.

✌️ Second Group: One-Operand Opcodes

These instructions require only one operand:

  • clr: Clears the content of the operand, setting it to zero.
    • Example: clr r1 – Clears the content of register r1.
  • not: Inverts all bits of the operand.
    • Example: not r1 – Inverts all bits in register r1.
  • inc: Increments the content of the operand by one.
    • Example: inc r1 – Increments the value in register r1 by 1.
  • dec: Decrements the content of the operand by one.
    • Example: dec C – Decrements the value stored at label C by 1.
  • jmp: Unconditionally jumps to the address specified by the operand.
    • Example: jmp LINE – Jumps to the address labeled LINE.
  • bne: Branches to the specified address if the Z flag in the PSW is not set.
    • Example: bne LINE – Branches to LINE if the Z flag is clear.
  • red: Reads a character from standard input and stores it in the operand.
    • Example: red r1 – Reads a character from stdin and stores it in register r1.
  • prn: Prints the content of the operand to standard output.
    • Example: prn r1 – Prints the value in register r1 to stdout.
  • jsr: Jumps to a subroutine, saving the return address on the stack.
    • Example: jsr FUNC – Calls subroutine FUNC and saves the return address.

🤟 Third Group: No-Operand Opcodes

These instructions do not require any operands:

  • rts: Returns from a subroutine by popping the return address from the stack into the Program Counter (PC).
    • Example: rts – Returns from the current subroutine.
  • stop: Stops program execution.
    • Example: stop – Halts the execution of the program.
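The three opcode groups above map naturally onto a lookup table. The following is a minimal sketch, not the repository's actual code: the names `OpInfo`, `OP_TABLE`, and `find_op` are hypothetical, and only the mnemonics, opcode numbers, and operand counts come from the lists above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row per operation: mnemonic, numeric opcode, operand count.
   The struct and table names are illustrative assumptions. */
typedef struct {
    const char *name;
    int opcode;
    int operands;
} OpInfo;

static const OpInfo OP_TABLE[] = {
    {"mov", 0, 2},  {"cmp", 1, 2},  {"add", 2, 2},  {"sub", 3, 2},
    {"lea", 4, 2},  {"clr", 5, 1},  {"not", 6, 1},  {"inc", 7, 1},
    {"dec", 8, 1},  {"jmp", 9, 1},  {"bne", 10, 1}, {"red", 11, 1},
    {"prn", 12, 1}, {"jsr", 13, 1}, {"rts", 14, 0}, {"stop", 15, 0}
};

/* Returns the table row for a mnemonic, or NULL if the name is not an
   operation. The comparison is case-sensitive. */
const OpInfo *find_op(const char *name)
{
    size_t i;
    assert(name != NULL);
    for (i = 0; i < sizeof OP_TABLE / sizeof OP_TABLE[0]; i++) {
        if (strcmp(OP_TABLE[i].name, name) == 0)
            return &OP_TABLE[i];
    }
    return NULL;
}
```

A caller would look up the first token of an instruction line with `find_op` and then check that the number of operands on the line matches `operands`.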

✍️ Assembly Language Syntax

Write your assembly code with these cool features:

🗃️ Macros

Define macros with macr <macro_name>, include a series of instructions, and close with endmacr. This helps to simplify repetitive code.

Example:

macr myMacro
    mov r1, r2
    add r3, r4
endmacr

🗒️ Directives

Use special commands like .entry, .extern, and more to manage your code's structure.

Example:

.extern start
.entry end
.string "abcd"
.data 1, -5, 0

🗺️ Instructions

Use our supported opcodes to perform operations in your assembly code. Each opcode corresponds to a specific machine instruction.

Example:

mov r1, r2
add r3, r4
jmp start

❓ Variables (Labels)

Define and use variables in your code.

Example:

var: .data 3

Each instruction and operation is carefully designed to give you complete control over your assembly code, allowing you to write efficient and functional programs.

📌 Addressing Methods (Modes)

Understanding the addressing methods used in our assembler is key to writing effective assembly code. Here's a breakdown of the supported addressing methods:

Our assembler supports four addressing methods, labeled as 0, 1, 2, and 3.

  • (0) Immediate Addressing:
    • Format: #number
    • In this mode, the operand is a constant value. For example, mov #5, r1 loads the value 5 directly into register r1.
  • (1) Direct Addressing:
    • Format: label
    • This mode uses a direct reference to a memory location. For example, mov label, r2 moves the value stored at label into register r2.
  • (2) Indirect Register Addressing:
    • Format: *register
    • This method accesses memory through a register that holds an address. For example, mov *r3, r4 copies the word at the address held in r3 into r4.
  • (3) Direct Register Addressing:
    • Format: register
    • The operand is the register itself. For instance, mov r5, r6 copies the content of r5 into r6.

Each addressing method allows for flexible data manipulation, enabling you to write efficient and powerful assembly code.

⚙️ Addressing Methods for Operations

Our assembler supports the following operations and the corresponding addressing methods:

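| Operation | Source Operand Addressing Methods | Destination Operand Addressing Methods |
|-----------|-----------------------------------|----------------------------------------|
| mov       | 0,1,2,3                           | 1,2,3                                  |
| cmp       | 0,1,2,3                           | 0,1,2,3                                |
| add       | 0,1,2,3                           | 1,2,3                                  |
| sub       | 0,1,2,3                           | 1,2,3                                  |
| lea       | 1                                 | 1,2,3                                  |
| clr       | -                                 | 1,2,3                                  |
| not       | -                                 | 1,2,3                                  |
| inc       | -                                 | 1,2,3                                  |
| dec       | -                                 | 1,2,3                                  |
| jmp       | -                                 | 1,2                                    |
| bne       | -                                 | 1,2                                    |
| red       | -                                 | 1,2,3                                  |
| prn       | -                                 | 0,1,2,3                                |
| jsr       | -                                 | 1,2                                    |
| rts       | -                                 | -                                      |
| stop      | -                                 | -                                      |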
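One way to enforce the table above is to store the allowed methods of each operation as bit masks and test an operand's method against them. This is only a sketch under that assumption. `ModeRule`, `MODE_RULES`, and `modes_are_legal` are hypothetical names, and an absent operand is represented here by -1.

```c
#include <assert.h>

/* Bit n set means addressing method n is allowed. */
#define M0 1u
#define M1 2u
#define M2 4u
#define M3 8u

typedef struct {
    unsigned src; /* allowed source methods, 0 if there is no source operand */
    unsigned dst; /* allowed destination methods, 0 if there is none */
} ModeRule;

/* Indexed by opcode 0..15, transcribed from the table above. */
static const ModeRule MODE_RULES[16] = {
    /* mov  */ {M0 | M1 | M2 | M3, M1 | M2 | M3},
    /* cmp  */ {M0 | M1 | M2 | M3, M0 | M1 | M2 | M3},
    /* add  */ {M0 | M1 | M2 | M3, M1 | M2 | M3},
    /* sub  */ {M0 | M1 | M2 | M3, M1 | M2 | M3},
    /* lea  */ {M1, M1 | M2 | M3},
    /* clr  */ {0, M1 | M2 | M3},
    /* not  */ {0, M1 | M2 | M3},
    /* inc  */ {0, M1 | M2 | M3},
    /* dec  */ {0, M1 | M2 | M3},
    /* jmp  */ {0, M1 | M2},
    /* bne  */ {0, M1 | M2},
    /* red  */ {0, M1 | M2 | M3},
    /* prn  */ {0, M0 | M1 | M2 | M3},
    /* jsr  */ {0, M1 | M2},
    /* rts  */ {0, 0},
    /* stop */ {0, 0}
};

/* method is 0..3, or -1 when the operand is absent. */
static int method_allowed(unsigned allowed, int method)
{
    if (method < 0)
        return allowed == 0; /* absent is legal only if none is expected */
    return (allowed & (1u << method)) != 0;
}

/* Returns 1 if both operands use a method the operation accepts. */
int modes_are_legal(int opcode, int src_method, int dst_method)
{
    assert(opcode >= 0 && opcode <= 15);
    assert(src_method >= -1 && src_method <= 3);
    assert(dst_method >= -1 && dst_method <= 3);
    return method_allowed(MODE_RULES[opcode].src, src_method)
        && method_allowed(MODE_RULES[opcode].dst, dst_method);
}
```

With this sketch, `mov #5, r1` (opcode 0, methods 0 and 3) is accepted, while `lea #1, r1` is rejected because lea only accepts method 1 as its source.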

👨‍💻 First Word Encoding

In our assembler project, instruction encoding is done in the first word of the machine instruction. Here's a breakdown of how the encoding works:

➕ Opcodes (Bits 14-11)

  • The opcode is represented by bits 14-11 of the first word in the instruction. Each opcode corresponds symbolically to an assembly operation name, and these operation names are always written in lowercase.

🏁 Source Operand (Bits 10-7)

  • These bits encode the addressing method for the source operand. Each addressing method has a dedicated bit:
    • If the source operand is provided in this method, the corresponding bit is set to 1.
    • Otherwise, the bit is set to 0.
  • If the instruction does not have a source operand, all four bits are cleared to 0.

🚩 Destination Operand (Bits 6-3)

  • Similar to the source operand, these bits encode the addressing method for the destination operand:
    • A bit is set to 1 if the destination operand is provided in this method.
    • Otherwise, the bit is set to 0.
  • If the instruction does not have a destination operand, all four bits are cleared to 0.

✈️ A,R,E Field (Bits 2-0)

  • This field characterizes the role of the A, R, E bits in the machine code:
    • The A bit is always set to 1 in the first word of every instruction.
    • The R and E bits are set to 0.
  • This field is added to each word in the instruction's encoding.

The A,R,E field is described in detail in The A,R,E Field Encoding section.

For a detailed breakdown of how the bits are allocated, see the table below:

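| Bits  | Field               | Contents (high bit to low bit)         |
|-------|---------------------|----------------------------------------|
| 14-11 | Opcode              | MSB, third bit, second bit, LSB        |
| 10-7  | Source operand      | Method 3, Method 2, Method 1, Method 0 |
| 6-3   | Destination operand | Method 3, Method 2, Method 1, Method 0 |
| 2-0   | The field A,R,E     | A, R, E                                |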
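The bit layout above can be sketched as a small packing function. The name `encode_first_word` and the use of -1 for a missing operand are assumptions made for illustration, while the bit positions follow the description in this section.

```c
#include <assert.h>

#define ARE_A 4u /* the A bit is bit 2 of every word */

/* Builds the first word of an instruction.
   opcode is 0..15; each method is 0..3, or -1 if the operand is absent. */
unsigned encode_first_word(int opcode, int src_method, int dst_method)
{
    unsigned word;
    assert(opcode >= 0 && opcode <= 15);
    assert(src_method >= -1 && src_method <= 3);
    assert(dst_method >= -1 && dst_method <= 3);

    word = (unsigned)opcode << 11;          /* bits 14-11 */
    if (src_method >= 0)
        word |= 1u << (7 + src_method);     /* one bit in bits 10-7 */
    if (dst_method >= 0)
        word |= 1u << (3 + dst_method);     /* one bit in bits 6-3 */
    return word | ARE_A;                    /* A=1, R=0, E=0 */
}
```

For `add r3, LIST` (opcode 2, source method 3, destination method 1) this yields binary 001010000010100, which is 12024 in octal.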

🧠 Addressing Methods Encoding

Addressing methods often require additional words in the machine code for each operand.

When an instruction contains two operands, the additional word for the source operand is encoded first, followed by the word for the destination operand. There is a special case where both operands are encoded using a single shared word.

⚡ Immediate Addressing

  • Operand Representation: The operand itself, which is a 12-bit two's complement integer, is contained in bits 14-3 of the word.
  • A,R,E Bits: In immediate addressing, the A bit is set to 1, and the other two bits (R, E) are set to 0.

🏹 Direct Addressing

  • Operand Representation: The operand is a memory address, with the word at this address in memory being the operand. The address is represented as a 12-bit unsigned number in bits 14-3 of the word.
  • A,R,E Bits:
    • If the address is internal (i.e., within the current source file), the R bit is set to 1, and the A and E bits are set to 0.
    • If the address is external (i.e., from another source file), the E bit is set to 1, and the A and R bits are set to 0.

📲 Indirect Register Addressing

  • Operand Representation: Accesses memory through a pointer in a register. The content of the register is a memory address, and the word at this address is the operand. The address is represented as a 15-bit unsigned number in the register.
  • A,R,E Bits: In indirect register addressing, the A bit is set to 1, and the other two bits (R, E) are set to 0.
  • Register Coding:
    • If the operand is a destination, bits 5-3 of the word contain the register number used as a pointer.
    • If the operand is a source, bits 8-6 of the word contain the register number used as a pointer.
    • If there are two operands using indirect register addressing, both registers share the same word, with bits 5-3 containing the destination register and bits 8-6 containing the source register.

🚪 Direct Register Addressing

  • Operand Representation: The operand is a direct register.
  • A,R,E Bits: In direct register addressing, the A bit is set to 1, and the other two bits (R, E) are set to 0.
  • Register Coding:
    • If the register is a destination, bits 5-3 of the word contain the register number.
    • If the register is a source, bits 8-6 of the word contain the register number.
    • If there are two operands using either direct register or indirect register addressing, both registers share the same word, with bits 5-3 containing the destination register and bits 8-6 containing the source register.

🗑️ Unused Bits

  • Any bits in the instruction word that are not used should be set to 0.
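The additional operand words described above could be built as follows. This is a sketch with hypothetical function names. Leaving the address field at 0 for external symbols is an assumption based on the listing in the Example Program section, where every external reference is encoded as 000000000000001.

```c
#include <assert.h>

#define ARE_A 4u
#define ARE_R 2u
#define ARE_E 1u
#define WORD_MASK 0x7FFFu /* a machine word is 15 bits */

/* Immediate operand: 12-bit two's complement value in bits 14-3, A=1. */
unsigned encode_immediate(int value)
{
    assert(value >= -2048 && value <= 2047);
    return ((((unsigned)value & 0xFFFu) << 3) | ARE_A) & WORD_MASK;
}

/* Direct operand: 12-bit address in bits 14-3.
   Internal symbols get R=1; external symbols get E=1 and a zero address
   (an assumption taken from the example listing). */
unsigned encode_direct(unsigned address, int is_external)
{
    if (is_external)
        return ARE_E;
    assert(address <= 4095u);
    return (address << 3) | ARE_R;
}

/* Register word for methods 2 and 3: source in bits 8-6, destination in
   bits 5-3, A=1. Pass -1 for a register that is not present. */
unsigned encode_registers(int src_reg, int dst_reg)
{
    unsigned word = ARE_A;
    assert(src_reg >= -1 && src_reg <= 7);
    assert(dst_reg >= -1 && dst_reg <= 7);
    if (src_reg >= 0)
        word |= (unsigned)src_reg << 6;
    if (dst_reg >= 0)
        word |= (unsigned)dst_reg << 3;
    return word;
}
```

When both operands are registers (direct or indirect), a single call such as `encode_registers(1, 4)` produces the one shared word.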

📚 Types of Statements in Assembly Language

The maximum length of a line in the source file is 80 characters (excluding the newline character \n).

Assembly language typically includes four types of statements:

| Statement Type        | Explanation |
|-----------------------|-------------|
| Empty Statement       | A line that contains only whitespace (spaces, tabs, etc.). It does nothing and is ignored by the assembler. |
| Comment Statement     | A line beginning with a semicolon ; which is used to comment out code or add explanations in the assembly source. These are also ignored by the assembler. |
| Instruction Statement | A line containing valid assembly code that the assembler will translate into machine language. These lines typically contain an opcode and operands. |
| Directive Statement   | A line that starts with a dot . followed by a directive keyword (e.g., .entry, .extern). These lines instruct the assembler on how to process the subsequent code. |

🧰 Directive Statements

📊 ".data" Directive

  • The .data directive allocates space in the data image to store the specified integer values.
  • Parameters: One or more legal integers separated by commas.

Example:

.data 7, -57, +17, 9

In this example, the assembler allocates four consecutive words in the data image for the numbers. If a label is defined, it will be associated with the address of the first value.

XYZ: .data 7, -57, +17, 9

Here, XYZ is a label associated with the address of the first value (7). This label can be referenced in the program.

".string" Directive

  • The .string directive allocates space in the data image to store a string.
  • Parameters: A single legal string enclosed in double quotes.

Example:

STR: .string "abcdef"

The string "abcdef" is stored in the data image with each character in a separate word, followed by a 0 to indicate the end of the string. The label STR refers to the address of the first character.
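How .data and .string values land in the data image can be sketched as below. The function names are hypothetical, and the accepted range for .data integers is an assumption derived from the 15-bit word size. Data words carry no A,R,E field, so the value fills the whole word.

```c
#include <assert.h>
#include <stddef.h>

#define WORD_MASK 0x7FFFu /* a machine word is 15 bits */

/* Encodes one .data integer as a 15-bit two's complement word.
   The accepted range is an assumption based on the word size. */
unsigned encode_data_word(int value)
{
    assert(value >= -16384 && value <= 16383);
    return (unsigned)value & WORD_MASK;
}

/* Stores a .string: one word per character followed by a 0 word.
   Returns the number of words written to out. */
size_t encode_string(const char *s, unsigned *out, size_t capacity)
{
    size_t n = 0;
    assert(s != NULL && out != NULL);
    while (s[n] != '\0') {
        assert(n < capacity);
        out[n] = (unsigned char)s[n]; /* ASCII code of the character */
        n++;
    }
    assert(n < capacity);
    out[n++] = 0;                     /* end-of-string marker */
    return n;
}
```

For instance, `.data -9` becomes the word 77767 (octal), and `.string "abcd"` occupies five words.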

📥 ".entry" Directive

  • The .entry directive identifies a label that can be referenced from other assembly source files.
  • Parameters: A single label name defined in the current source file.

Example:

.entry HELLO

This directive marks the label HELLO as available for reference from other source files.

🌐 ".extern" Directive

  • The .extern directive indicates that a label is defined in another source file.
  • Parameters: A single label name that is defined externally.

Example:

.extern HELLO

This indicates that the label HELLO is defined in another source file and will be linked accordingly.

Instruction Fields

🏷️ Labels

  • A label is a symbolic representation of an address in memory.
  • Syntax:
    • Maximum length: 31 characters.
    • Format: Ends with a colon :, which must be directly attached to the label name without spaces.
    • Usage: After a label, there must be an instruction or directive. Labels cannot stand alone.

Example:

Hello: .data 1, 2
X: mov r1, r2
He3: .string "abcd"

Labels are case-sensitive and must be unique within the same file.

Note: A label defined at the beginning of a line with .entry or .extern directives is meaningless and will be ignored by the assembler.

🔢 Numbers

  • Legal numbers are decimal integers that can be positive or negative.

Example:

123, -57, +17

🔠 Strings

  • A legal string is a sequence of printable ASCII characters enclosed in double quotes.

Example:

"hello world"

📋 Instruction Statement Formats

2️⃣ Two-Operand Instruction

  • Format: label: opcode source-operand, target-operand

Example:

HELLO: add r7, B

1️⃣ One-Operand Instruction

  • Format: label: opcode target-operand

Example:

HELLO: bne XYZ

0️⃣ No-Operand Instruction

  • Format: label: opcode

Example:

END: stop

The A,R,E Field Encoding

In every machine code instruction (not data), the assembler inserts specific information into the A, R, E field to facilitate the linking and loading process. This field contains three bits: A, R, and E, which indicate how the word should be treated when the program is loaded into memory for execution. The assembler initially generates code as if it were to be loaded at a start address. The information in these bits allows the code to be relocated to any address in memory without requiring reassembly.

The A,R,E Bits

  • A (Absolute):
    • If the A bit is set to 1, it means the word's content is independent of the memory location where the program is loaded during execution (e.g., an immediate operand).
  • R (Relocatable):
    • If the R bit is set to 1, it indicates that the word's content depends on the actual memory location where the program will be loaded (e.g., an internal label address).
  • E (External):
    • If the E bit is set to 1, the word's content is dependent on the value of an external symbol (e.g., a label defined in another source file).

These bits are set according to the addressing modes used and the location of the symbols within the program, ensuring that the final machine code is adaptable for different memory layouts during execution.

For comprehensive details on the linking and loading stage, please see the Linking and Loading section in the appendix.

Macro Handling

When the assembler receives an assembly program, it first expands all macros before proceeding with the assembly process. If there are macros, the assembler generates an expanded program, which is then assembled into machine code. Here's how a sample program looks before and after macro expansion:

Before Macro Expansion:

MAIN: add r3, LIST
LOOP: prn #48
macr m_macr
cmp r3, #-6
bne END
endmacr
lea STR, r6
inc r6
mov *r6, K
sub r1, r4
m_macr
dec K
jmp LOOP
END: stop
STR: .string "abcd"
LIST: .data 6, -9
K: .data 31

After Macro Expansion:

MAIN: add r3, LIST
LOOP: prn #48
lea STR, r6
inc r6
mov *r6, K
sub r1, r4
cmp r3, #-6
bne END
dec K
jmp LOOP
END: stop
STR: .string "abcd"
LIST: .data 6, -9
K: .data 31
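The expansion shown above can be sketched in C as a single pass over the source lines. This is a simplified illustration, not the repository's pre-assembler. `expand_macros`, `Macro`, and the size limits are hypothetical, and the sketch assumes a macro call is the first token on its line and that definitions are not nested.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_MACROS 16
#define MAX_NAME 32

typedef struct {
    char name[MAX_NAME];
    size_t first; /* index of the first body line in the input */
    size_t count; /* number of body lines */
} Macro;

/* Copies lines[0..n-1] to out[], dropping macr/endmacr blocks and replacing
   each macro call with the macro's body. Returns the number of lines written. */
size_t expand_macros(const char *lines[], size_t n,
                     const char *out[], size_t cap)
{
    Macro macros[MAX_MACROS];
    size_t nmacros = 0, nout = 0, i, j, k;
    int in_def = 0;
    char tok[MAX_NAME], name[MAX_NAME];

    for (i = 0; i < n; i++) {
        tok[0] = name[0] = '\0';
        sscanf(lines[i], "%31s %31s", tok, name); /* first two tokens */

        if (in_def) {                     /* inside a macro definition */
            if (strcmp(tok, "endmacr") == 0)
                in_def = 0;
            else
                macros[nmacros - 1].count++;
            continue;
        }
        if (strcmp(tok, "macr") == 0) {   /* start of a definition */
            assert(nmacros < MAX_MACROS && name[0] != '\0');
            strcpy(macros[nmacros].name, name);
            macros[nmacros].first = i + 1;
            macros[nmacros].count = 0;
            nmacros++;
            in_def = 1;
            continue;
        }
        for (j = 0; j < nmacros; j++)     /* is this line a macro call? */
            if (strcmp(tok, macros[j].name) == 0)
                break;
        if (j < nmacros) {
            for (k = 0; k < macros[j].count; k++) {
                assert(nout < cap);
                out[nout++] = lines[macros[j].first + k];
            }
        } else {
            assert(nout < cap);
            out[nout++] = lines[i];
        }
    }
    return nout;
}
```

The output array holds pointers into the input, so a caller would write each `out[i]` to the .am file in order.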

Assembler with Two Passes

In the first pass of the assembler, the program reads the assembly code to determine the symbols (labels) appearing in the program, assigns an address to each symbol, and builds the symbol table. In the second pass, using the symbol table, the assembler generates the actual machine code.

When translating the program above, the assembler replaces each operation name (for example mov, jmp, prn, sub, cmp, inc, bne, stop) with its binary opcode and encodes each operand according to its addressing method.

It also replaces the symbols K, STR, LIST, MAIN, LOOP, END with the memory addresses where the corresponding data or instruction is located.

📈 First Pass

In the first pass, rules are required to determine the address to be assigned to each symbol. The basic principle is to count the memory locations occupied by the instructions. If each instruction is loaded into memory at the location following the previous instruction, such counting will indicate the address of the next instruction. This counting is performed by the assembler and is maintained in the instruction counter (IC). The initial value of IC is 100 (decimal), so the machine code of the first instruction is constructed to load into memory starting from address 100. The IC is updated with each instruction line that allocates space in memory. After the assembler determines the length of the instruction, the IC is increased by the number of cells (words) occupied by the instruction, and thus it points to the next available cell.
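The amount by which the IC advances follows from the addressing methods of the operands. Below is a minimal sketch of that count, based on the rules in the Addressing Methods Encoding section. `instruction_length` and `IC_START` are hypothetical names, and -1 stands for a missing operand.

```c
#include <assert.h>

#define IC_START 100 /* initial value of the instruction counter */

/* Methods 2 (indirect register) and 3 (direct register) name a register. */
static int uses_register(int method)
{
    return method == 2 || method == 3;
}

/* Returns the number of words an instruction occupies.
   Each method is 0..3, or -1 if the operand is absent. */
int instruction_length(int src_method, int dst_method)
{
    int words = 1; /* the first word is always present */
    assert(src_method >= -1 && src_method <= 3);
    assert(dst_method >= -1 && dst_method <= 3);

    /* Two register operands share one additional word. */
    if (uses_register(src_method) && uses_register(dst_method))
        return words + 1;
    if (src_method >= 0)
        words++;
    if (dst_method >= 0)
        words++;
    return words;
}
```

For example, `add r3, LIST` (methods 3 and 1) occupies three words, while `sub r1, r4` (methods 3 and 3) occupies two, so after these two instructions the IC would have advanced from 100 to 105.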

As mentioned, to encode instructions in machine language, the assembler maintains a table that contains a corresponding code for each operation name. During translation, the assembler replaces each operation name with its code, and each operand is replaced with its corresponding encoding. However, this replacement process is not so simple. The instructions use various addressing methods for operands. The same operation can have different meanings in each addressing method, and therefore different encodings are applied depending on the addressing methods. For example, the mov operation can refer to copying the content of a memory cell to a register or copying the content of one register to another, and so on. Each such possibility of mov might have a different encoding.

The assembler needs to scan the entire instruction line and decide on the encoding based on the operands. Typically, the encoding is divided into a field for the operation name and additional fields containing information about the addressing methods. All fields together require one or more words in the machine code.

When the assembler encounters a label at the beginning of the line, it recognizes it as a label definition and assigns it an address, which is the current content of the IC. Thus, all labels receive their addresses at the time of definition. These labels are entered into the symbol table, which, in addition to the label name, contains the address and additional attributes. When a label is referred to in the operand of any instruction, the assembler can retrieve the corresponding address from the symbol table.

An instruction can also refer to a symbol that has not yet been defined in the program but will be defined later. For example, consider a branch instruction to an address defined by the label A that appears later in the code:

bne A
...
A: ...

📉 Second Pass

As seen in the first pass, the assembler cannot construct the machine code of operands using symbols that have not yet been defined. Only after the assembler has scanned the entire program, so that all symbols have already been entered into the symbol table, can the assembler complete the machine code of all operands.

To achieve this, the assembler performs a second pass over the source code. During this pass, it updates the machine code for operands by substituting the symbols with their corresponding values from the symbol table. By the end of this second pass, the entire program is fully translated into machine code.

🖨️ Input and Output Files of the Assembler

When running the assembler, command line arguments should include a list of source file names (one or more). These are text files containing programs written in the assembly language defined for this project.

The assembler processes each source file separately and generates the following output files:

  • .am file: Contains the source file after the pre-assembler stage (after macro expansion).
  • .ob file: Contains the machine code.
  • .ext file: Includes details of all locations (addresses) in the machine code where an external symbol (declared with the .extern directive) is used.
  • .ent file: Includes details of all symbols declared as entry points (declared with the .entry directive).

If the source file does not contain any .extern directives, the assembler will not create an .ext file. Similarly, if there are no .entry directives, an .ent file will not be generated.

Source file names must have the .as extension. For example, hello.as, x.as, and y.as are valid names. When passing these file names as arguments to the assembler, the extension is omitted.

Example:

./assembler x y hello

🖼️ Object File Format

The assembler constructs a memory image where the encoding of the first instruction from the assembly file is placed at address 100 (in decimal) in memory. The encoding of the second instruction is placed at the address following the first instruction (depending on the number of words in the first instruction), and so on until the last instruction.

Immediately after the encoding of the last instruction, the assembler places the encoding of the data created by .data, .string, and other data directives into the memory image. The data will be placed in the order it appears in the source file. An operand of an instruction referring to a symbol defined in the same file will be encoded to point to the appropriate location in the memory image created by the assembler.

Note that variables appear in the memory image after the instructions. This is why it is necessary to update the symbol table, at the end of the first pass, with the values of symbols defining data (symbols of type .data).
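The symbol table update mentioned above could look like the sketch below. It assumes that data symbols are first recorded with an offset counted from 0 inside the data image and that `final_ic` is the IC value after the last instruction. The `Symbol` type and function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *name;
    unsigned value; /* address, or data-image offset before relocation */
    int is_data;    /* nonzero for symbols defined on .data/.string lines */
} Symbol;

/* At the end of the first pass, shifts every data symbol by the final IC
   so that the data image follows the instruction image in memory. */
void relocate_data_symbols(Symbol *table, size_t count, unsigned final_ic)
{
    size_t i;
    assert(table != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (table[i].is_data)
            table[i].value += final_ic;
    }
}
```

In the Example Program section the last instruction ends at address 131, so with a final IC of 132 the symbols STR, LIST, and K (offsets 0, 5, and 8) end up at 132, 137, and 140.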

An object file fundamentally contains the described memory image. An object file is composed of text lines as follows:

  • The first line of the object file is the "header," which contains two decimal numbers: the total length of the instruction section (in memory words) followed by the total length of the data section (in memory words). There is one space between the two numbers.

  • The following lines in the file contain the memory image. Each line contains two values: the address of a memory word and the content of that word. The address is written in decimal, padded to four digits (including leading zeros), and the content is written in octal, padded to five digits (including leading zeros). There is one space between the two values on each line.
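Formatting one memory-image line as described above can be done with a single `snprintf` call. This is a sketch, and `format_ob_line` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>

/* Writes "AAAA WWWWW" into buf: the address in decimal padded to four
   digits, one space, and the word in octal padded to five digits.
   Returns the value returned by snprintf. */
int format_ob_line(char *buf, size_t size, unsigned address, unsigned word)
{
    assert(buf != NULL);
    assert(address <= 4095u && word <= 0x7FFFu);
    return snprintf(buf, size, "%04u %05o", address, word);
}
```

For address 100 and the word 001010000010100 (binary), the resulting line is `0100 12024`.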

🏠 Entries File Format

The entries file is composed of text lines. Each line contains the name of a symbol defined as an entry and its value, as found in the symbol table. The values are represented in decimal format.

The order of the labels in the file does not matter.

🌍 Externals File Format

The externals file is also composed of text lines. Each line contains the name of a symbol defined as external and an address in machine code where an operand referring to this symbol is encoded. It is possible that there are multiple addresses in the machine code referring to the same external symbol. Each such reference will have a separate line in the externals file. The addresses are represented in decimal format.

Like in the Entries file, the order of the labels in the file does not matter.

📜 Example Program

Here's a quick demo of an assembly program in action:

Pre-assembler program:

; file example.as 
.entry LIST 
.extern fn1 
MAIN: add r3, LIST 
jsr fn1 
LOOP: prn #48 
 lea STR, r6 
 inc r6 
 mov *r6, L3 
 sub r1, r4 
 cmp r3, #-6 
 bne END 
 add r7, *r6 
 clr K 
 sub L3, L3 
.entry MAIN 
 jmp LOOP 
END: stop 
STR: .string "abcd" 
LIST: .data 6, -9 
 .data -100 
K: .data 31 
.extern L3

Below is the full binary encoding table obtained from the source file, followed by the output file formats.

| Decimal Address | Source Code | Explanation | Binary Machine Code |
|---|---|---|---|
| 0100 | MAIN: add r3, LIST | First word of instruction | 001010000010100 |
| 0101 | | Source register 3 | 000000011000100 |
| 0102 | | Address of label LIST | 000010001001010 |
| 0103 | jsr fn1 | | 110100000010100 |
| 0104 | | Address of label fn1 (external) | 000000000000001 |
| 0105 | LOOP: prn #48 | | 110000000001100 |
| 0106 | | Immediate value 48 | 000000110000100 |
| 0107 | lea STR, r6 | | 010000101000100 |
| 0108 | | Address of label STR | 000010000100010 |
| 0109 | | Target register 6 | 000000000110100 |
| 0110 | inc r6 | | 011100001000100 |
| 0111 | | Target register 6 | 000000000110100 |
| 0112 | mov *r6, L3 | | 000001000010100 |
| 0113 | | Source register 6 | 000000110000100 |
| 0114 | | Address of label L3 (external) | 000000000000001 |
| 0115 | sub r1, r4 | | 001110001000100 |
| 0116 | | Source register 1 and target register 4 | 000000001100100 |
| 0117 | cmp r3, #-6 | | 000110000001100 |
| 0118 | | Source register 3 | 000000011000100 |
| 0119 | | Immediate value -6 | 111111111010100 |
| 0120 | bne END | | 101000000010100 |
| 0121 | | Address of label END | 000010000011010 |
| 0122 | add r7, *r6 | | 001010000100100 |
| 0123 | | Source register 7 and target register 6 | 000000111110100 |
| 0124 | clr K | | 010100000010100 |
| 0125 | | Address of label K | 000010001100010 |
| 0126 | sub L3, L3 | | 001100100010100 |
| 0127 | | Address of label L3 (external) | 000000000000001 |
| 0128 | | Address of label L3 (external) | 000000000000001 |
| 0129 | jmp LOOP | | 100100000010100 |
| 0130 | | Address of label LOOP | 000001101001010 |
| 0131 | END: stop | | 111100000000100 |
| 0132 | STR: .string "abcd" | ASCII code 'a' | 000000001100001 |
| 0133 | | ASCII code 'b' | 000000001100010 |
| 0134 | | ASCII code 'c' | 000000001100011 |
| 0135 | | ASCII code 'd' | 000000001100100 |
| 0136 | | ASCII code '\0' (end of string) | 000000000000000 |
| 0137 | LIST: .data 6, -9 | Integer 6 | 000000000000110 |
| 0138 | | Integer -9 | 111111111110111 |
| 0139 | .data -100 | Integer -100 | 111111110011100 |
| 0140 | K: .data 31 | Integer 31 | 000000000011111 |
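
The field layout can be read back off the table above: the opcode appears to sit in bits 14-11, the source and target addressing methods in bits 10-7 and 6-3 (one bit per method), and the A,R,E field in bits 2-0. The C sketch below rebuilds a few rows under that reading. The function and constant names are illustrative assumptions and are not taken from the project's source files.

```c
#include <assert.h>
#include <stdint.h>

/* Addressing-method bits, one bit per method inside a 4-bit field
 * (layout inferred from the encoding table, names are hypothetical). */
enum { MODE_NONE = 0x0, MODE_IMMEDIATE = 0x1, MODE_DIRECT = 0x2,
       MODE_REG_INDIRECT = 0x4, MODE_REG_DIRECT = 0x8 };

/* A,R,E bits occupy the three low bits of every instruction word. */
enum { ARE_EXTERNAL = 0x1, ARE_RELOCATABLE = 0x2, ARE_ABSOLUTE = 0x4 };

/* First word: opcode in bits 14-11, source mode in 10-7, target mode in 6-3. */
uint16_t encode_first_word(unsigned opcode, unsigned src_mode, unsigned dst_mode)
{
    assert(opcode < 16 && src_mode < 16 && dst_mode < 16);
    return (uint16_t)((opcode << 11) | (src_mode << 7) | (dst_mode << 3) | ARE_ABSOLUTE);
}

/* Register word: source register in bits 8-6, target register in bits 5-3.
 * Pass 0 for a register that is not used by the operand pair. */
uint16_t encode_register_word(unsigned src_reg, unsigned dst_reg)
{
    assert(src_reg < 8 && dst_reg < 8);
    return (uint16_t)((src_reg << 6) | (dst_reg << 3) | ARE_ABSOLUTE);
}

/* Immediate word: 12-bit two's-complement value in bits 14-3. */
uint16_t encode_immediate_word(int value)
{
    assert(value >= -2048 && value <= 2047);
    return (uint16_t)((((unsigned)value & 0xFFFu) << 3) | ARE_ABSOLUTE);
}

/* Address word for a label defined in this file: 12-bit address, tagged relocatable. */
uint16_t encode_address_word(unsigned address)
{
    assert(address < 4096);
    return (uint16_t)((address << 3) | ARE_RELOCATABLE);
}
```

For instance, `encode_first_word(2, MODE_REG_DIRECT, MODE_DIRECT)` corresponds to the first word of `add r3, LIST` at address 0100, and `encode_immediate_word(-6)` to the word at address 0119.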

Object file:

  32 9
0100 12024
0101 00304
0102 02112
0103 64024
0104 00001
0105 60014
0106 00604
0107 20504
0108 02042
0109 00064
0110 34104
0111 00064
0112 01024
0113 00604
0114 00001
0115 16104
0116 00144
0117 06014
0118 00304
0119 77724
0120 50024
0121 02032
0122 12044
0123 00764
0124 24024
0125 02142
0126 14424
0127 00001
0128 00001
0129 44024
0130 01512
0131 74004
0132 00141
0133 00142
0134 00143
0135 00144
0136 00000
0137 00006
0138 77767
0139 77634
0140 00037
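
Each line of the object file above pairs a four-digit decimal address with the 15-bit word printed as five octal digits, and the header line matches the example's 32 instruction words and 9 data words. A minimal sketch of that formatting follows. The function names are hypothetical, and the data-word helper assumes that `.data` values are stored as plain 15-bit two's-complement numbers, as the table rows for -9 and -100 suggest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Data word: a full 15-bit two's-complement value with no A,R,E bits. */
uint16_t encode_data_word(int value)
{
    assert(value >= -16384 && value <= 16383);
    return (uint16_t)((unsigned)value & 0x7FFFu);
}

/* Writes "AAAA OOOOO" (decimal address, octal word) into buf.
 * Returns the number of characters snprintf reports for the formatted line. */
int format_object_line(char *buf, size_t size, unsigned address, uint16_t word)
{
    assert(buf != NULL && size >= 11);  /* 10 characters plus the terminator */
    return snprintf(buf, size, "%04u %05o", address, (unsigned)(word & 0x7FFFu));
}
```

Calling `format_object_line(buf, sizeof buf, 138, encode_data_word(-9))` reproduces the line for address 0138 in the listing.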

Entries file:

LIST 0137
MAIN 0100

Externals file:

fn1 0104
L3 0114
L3 0127
L3 0128

🛠️ Installation

Ready to build? Follow the appropriate steps according to your operating system:

🐧 Linux

1. Open your terminal.

2. Clone this repository with:

git clone https://github.com/talfig/Assembler.git

3. Ensure you have a C compiler installed.

Run the following command to check if GCC is installed:

gcc --version

If you prefer Clang, you can check its installation with:

clang --version

☁️ GitHub Codespaces

1. Go to your GitHub repository.

2. Click on the Code button and select Create codespace on main.

3. Once the Codespace is ready, open the terminal within the Codespace.

4. (Optional) Customize the Codespace environment for Ubuntu by adding a .devcontainer/devcontainer.json file with the following content:

{
  "name": "Ubuntu Development Environment",
  "image": "mcr.microsoft.com/vscode/devcontainers/base:ubuntu",
  "features": {
    "docker-from-docker": "latest"
  },
  "extensions": [
    "ms-vscode.cpptools",
    "golang.go"
  ]
}

This configuration sets up an Ubuntu-based environment with the necessary tools installed.


After setting up your environment, navigate to the project directory where the Makefile is located using:

cd Assembler

Replace Assembler with the actual name of the directory if it differs.

Move the Makefile from the Build directory to the Assembler directory:

mv Build/Makefile .

Finally, run the make command to compile the project and build the assembler:

make

The make command will read the Makefile and execute the specified build instructions to compile the code and link the object files into an executable.

After the build process is complete, you should see the assembler executable in the project directory.

🎯 Usage

To run the assembler on a single file:

./assembler <source_file>

Replace <source_file> with the path to your assembly code file. To process multiple files, list each file separated by spaces:

./assembler <source_file1> <source_file2> ...

⚠️ Error Handling

Bumped into issues? No worries! Our assembler offers descriptive error messages such as:

  • Memory allocation failure: Reported when the assembler cannot allocate the memory it needs.
  • Unrecognized commands: Reported when the assembler encounters a command it doesn’t understand.
  • Syntax errors: Reported when a line of your assembly code violates the language syntax.
  • File errors: Reported when a file cannot be found or accessed. Check the file path and make sure you have the necessary permissions.

📁 Directory Structure

The project is organized as follows:

  • 📁 Build

    Contains build scripts and configuration files.

    • Makefile
    • CMakeLists.txt
  • 📁 HeaderFiles

    Contains header files for the project.

    • first_pass.h
    • second_pass.h
    • ...
  • 📁 InvalidInputs

    Contains input files that are expected to cause errors.

    • broken_program1.as
    • broken_program2.as
    • ...
  • 📁 InvalidOutputs

    Contains output files corresponding to invalid inputs.

    • 📁 broken_program1
      • broken_program1.txt
    • 📁 broken_program2
      • broken_program2.txt
    • ...
  • 📁 SourceFiles

    Contains the source code files.

    • assembler.c
    • first_pass.c
    • ...
  • 📁 ValidInputs

    Contains input files that should be processed correctly.

    • fibonacci.as
    • text_encryption.as
    • ...
  • 📁 ValidOutputs

    Contains output files corresponding to valid inputs.

    • 📁 fibonacci
      • fibonacci.am
      • fibonacci.ob
      • fibonacci.ent
      • fibonacci.ext
    • 📁 text_encryption
      • text_encryption.am
      • text_encryption.ob
      • text_encryption.ent
      • text_encryption.ext
    • ...

📘 Appendix

🚦 Program Status Word (PSW)

The Program Status Word (PSW) is a special register in the CPU that contains flags and control bits reflecting the state of the processor. These flags are typically affected by the execution of arithmetic and logic instructions. The PSW is used to determine the outcome of conditional operations and to control the flow of the program.

Common Flags in the PSW:

  • Z (Zero) Flag:

    • Set to 1 if the result of an operation is zero.
    • Cleared (0) if the result is non-zero.
  • N (Negative) Flag:

    • Set if the result of an operation is negative.
  • C (Carry) Flag:

    • Set if an arithmetic operation generates a carry out of or a borrow into the high-order bit.
  • V (Overflow) Flag:

    • Set if an arithmetic operation results in an overflow, meaning the result is too large to be represented in the destination operand.

The flags in the PSW are often used by conditional branch instructions to make decisions based on the result of a previous operation. For example:

  • bne (branch if not equal): This instruction checks the Z flag; if the Z flag is 0 (indicating that the previous operation did not result in zero), the program will branch to the address specified by the operand.
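
To make the flag definitions concrete, here is a small C sketch that derives Z, N, C and V from a 15-bit subtraction and then applies the bne rule. It assumes a compare-style instruction sets the flags the way a subtraction would. The struct and function names are hypothetical and do not come from the project.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_MASK 0x7FFFu   /* 15-bit machine word */
#define SIGN_BIT  0x4000u   /* bit 14 is the sign bit of a 15-bit word */

struct psw { unsigned z, n, c, v; };

/* Computes a - b on 15-bit words and derives the four flags from the result. */
struct psw flags_after_sub(uint16_t a, uint16_t b)
{
    struct psw f;
    unsigned ua = a & WORD_MASK;
    unsigned ub = b & WORD_MASK;
    unsigned result = (ua - ub) & WORD_MASK;

    assert(a <= WORD_MASK && b <= WORD_MASK);
    f.z = (result == 0);              /* result is zero */
    f.n = (result & SIGN_BIT) != 0;   /* result is negative */
    f.c = (ua < ub);                  /* a borrow was needed */
    /* Overflow: the operands differ in sign and the result's sign differs from a's. */
    f.v = (((ua ^ ub) & (ua ^ result)) & SIGN_BIT) != 0;
    return f;
}

/* bne branches when the Z flag is 0. */
int bne_taken(struct psw f)
{
    return f.z == 0;
}
```

Comparing two equal values sets Z, so `bne_taken` returns 0 and execution falls through to the next instruction. Any other pair clears Z and the branch is taken.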

🧲 Linking and Loading

When working with assembly language, the process of transforming your code into a fully functioning executable program involves more than just the work of the assembler. Two additional critical steps, linking and loading, ensure that your program is correctly integrated with other modules and loaded into memory for execution.

🔗 Linking

Linking is the process of combining multiple object files generated by the assembler into a single executable file. Here's how it works:

  • Object Files: When you assemble your source code, the assembler generates object files. These files contain the machine code for each module of your program, along with necessary metadata such as symbol tables and relocation information.

  • Symbol Resolution: During linking, the linker resolves references between object files. For instance, if one module references a function or variable defined in another, the linker connects these references by identifying their memory addresses.

  • Relocation: The linker adjusts memory addresses within object files to create a single continuous memory space. This step ensures that all modules work together seamlessly, with correct address references.

  • Executable File Creation: After resolving symbols and performing relocation, the linker combines all object files into a single executable file. This file can then be loaded into memory for execution.
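
The symbol resolution and relocation steps can be sketched in C using the A,R,E bits that this assembler attaches to each instruction word. This is an illustration under stated assumptions and not the behavior of any particular linker: the function name is hypothetical, resolved external addresses are supplied in a parallel array in place of a real symbol-table lookup, and resolved words are re-tagged as relocatable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ARE_EXTERNAL = 0x1, ARE_RELOCATABLE = 0x2, ARE_ABSOLUTE = 0x4 };

/* Patches one module's instruction words in place.
 * offset:        distance the module was moved from its assembled base address.
 * ext_addresses: resolved address for each word tagged external (ignored elsewhere).
 * Only instruction words should be passed in; data words carry no A,R,E bits. */
void link_words(uint16_t *words, const unsigned *ext_addresses,
                size_t count, unsigned offset)
{
    size_t i;

    assert(words != NULL && ext_addresses != NULL);
    for (i = 0; i < count; i++) {
        unsigned are = words[i] & 0x7u;
        unsigned field = words[i] >> 3;

        if (are == ARE_RELOCATABLE) {
            /* Relocation: shift the address by the module's new offset. */
            field = (field + offset) & 0xFFFu;
            words[i] = (uint16_t)((field << 3) | ARE_RELOCATABLE);
        } else if (are == ARE_EXTERNAL) {
            /* Symbol resolution: fill in the address found in another module. */
            field = ext_addresses[i] & 0xFFFu;
            words[i] = (uint16_t)((field << 3) | ARE_RELOCATABLE);
        }
        /* Absolute words are left untouched. */
    }
}
```

The externals file gives a linker the information this sketch takes as input: it lists every address whose word needs an external symbol's final address.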

⏳ Loading

Loading is the final step before your program runs on the CPU. The loader is responsible for placing the executable file into the computer's memory. Here's what happens:

  • Memory Allocation: The loader allocates memory space for the program based on the executable file's specifications. This includes allocating space for code, data, and stack segments.

  • Loading into Memory: The loader copies the executable's code and data into the allocated memory space. It also initializes registers and the program counter (PC) to point to the start of the program.

  • Dynamic Linking (Optional): In some systems, dynamic linking occurs at this stage. This involves linking to shared libraries that are not included in the executable file but are available in the system. The loader resolves references to these external libraries and loads them into memory as needed.

  • Program Execution: Once loading is complete, the CPU begins executing the program by reading instructions from memory, starting at the program counter's address.
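
As a rough illustration of the loading steps, the C sketch below copies a program image into a simulated memory and points the program counter at its first word. The 4096-word memory size is an assumption (it matches a 12-bit address field), the load address of 100 follows the example listing, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MEMORY_WORDS 4096   /* assumed memory size */
#define LOAD_ADDRESS 100    /* first address used in the example listing */

struct machine {
    uint16_t memory[MEMORY_WORDS];
    unsigned pc;
};

/* Copies the image into memory at LOAD_ADDRESS and sets the PC to its first word.
 * Returns 0 on success, or -1 if the image does not fit in memory. */
int load_image(struct machine *m, const uint16_t *image, size_t count)
{
    assert(m != NULL && image != NULL);
    if (count > MEMORY_WORDS - LOAD_ADDRESS)
        return -1;                              /* memory allocation check */
    memset(m->memory, 0, sizeof m->memory);     /* start from cleared memory */
    memcpy(&m->memory[LOAD_ADDRESS], image, count * sizeof image[0]);
    m->pc = LOAD_ADDRESS;                       /* execution starts here */
    return 0;
}
```

Dynamic linking is left out of the sketch. A loader that supported it would resolve the remaining external references before handing control to the program.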

© License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL) - see the LICENSE file for details.

📬 Contact

