Initial Commit

2025-10-14 12:37:06 +00:00 · 2024-09-07 18:00:09 +06:00
commit 0f9a53f75a
3352 changed files with 1563708 additions and 0 deletions
--- a/thirdparty/capstone/docs/ARCHITECTURE.md
+++ b/thirdparty/capstone/docs/ARCHITECTURE.md
@@ -0,0 +1,118 @@
+# Capstone Architecture overview
+
+## Architecture of Capstone
+
+TODO
+
+## Architecture of a Module
+
+An architecture module is split into two components.
+
+1. The disassembler logic, which decodes bytes to instructions.
+2. The mapping logic, which maps the result from component 1 to
+a Capstone internal representation and adds additional detail.
+
+### Component 1 - Disassembler logic
+
+The disassembler logic consists exclusively of code from LLVM.
+It uses:
+
+- Generated state machines, enums and the like for instruction decoding.
+- Handwritten disassembler logic for decoding instruction operands
+and controlling the decoding procedure.
+
+### Component 2 - Mapping logic
+
+The mapping component has three different task:
+
+1. Serving as programmable interface for the Capstone core to the LLVM code.
+2. Mapping LLVM decoded instructions to a Capstone instruction.
+3. Adding additional detail to the Capstone instructions
+(e.g. operand `read/write` attributes etc.).
+
+### Instruction representation
+
+There exist two structs which represent an instruction:
+
+- `MCInst`: The LLVM representation of an instruction.
+- `cs_insn`: The Capstone representation of an instruction.
+
+The `MCInst` is used by the disassembler component for storing the decoded instruction.
+The mapping component on the other hand, uses the `MCInst` to populate the `cs_insn`.
+
+The `cs_insn` is meant to be used by the Capstone core.
+It is distinct from the `MCInst`. It uses different instruction identifiers, other operand representation
+and holds more details about an instruction.
+
+### Disassembling process
+
+There are two steps in disassembling an instruction.
+
+1. Decoding bytes to a `MCInst`.
+2. Decoding the assembler string for the `MCInst` AND mapping it to a `cs_insn` in the same step.
+
+Here is a boiled down explanation about these steps.
+
+**Step 1**
+
+```
+                                                                 ARCH_LLVM_getInstr(
+                                     ARCH_getInstr(bytes)   ┌───┐   bytes)           ┌─────────┐            ┌──────────┐
+                                    ┌──────────────────────►│ A ├──────────────────► │         ├───────────►│          ├────┐
+                                    │                       │ R │                    │ LLVM    │            │ LLVM     │    │ Decode
+                                    │                       │ C │                    │         │            │          │    │ Instr.
+                                    │                       │ H │                    │         │decode(Op0) │          │◄───┘
+┌────────┐ disasm(bytes) ┌──────────┴──┐                    │   │                    │ Disass- │ ◄──────────┤ Decoder  │
+│CS Core ├──────────────►│ ARCH Module │                    │   │                    │ embler  ├──────────► │ State    │
+└────────┘               └─────────────┘                    │ M │                    │         │            │ Machine  │
+                                    ▲                       │ A │                    │         │decode(Op1) │          │
+                                    │                       │ P │                    │         │ ◄──────────┤          │
+                                    │                       │ P │                    │         ├──────────► │          │
+                                    │                       │ I │                    │         │            │          │
+                                    │                       │ N │                    │         │            │          │
+                                    └───────────────────────┤ G │◄───────────────────┤         │◄───────────┤          │
+                                                            └───┘                    └─────────┘            └──────────┘
+```
+
+In the first decoding step the instruction bytes get forwarded to the
+decoder state machine.
+After the instruction was identified, the state machine calls decoder functions
+for each operand to extract the operand values from the bytes.
+
+The disassembler and the state machine are equivalent to what `llvm-objdump` uses
+(in fact they use the same files, except we translated them from C++ to C).
+
+**Step 2**
+
+```
+                                      ARCH_printInst(        ARCH_LLVM_printInst(
+                                         MCInst,                MCInst,
+                                         asm_buf)      ┌───┐    asm_buf)         ┌────────┐            ┌──────────┐
+                                      ┌───────────────►│ A ├───────────────────► │        ├───────────►│          ├──────┐
+                                      │                │ R │                     │ LLVM   │            │ LLVM     │      │ Decode
+                                      │                │ C │                     │        │            │          │      │ Mnemonic
+                                      │                │ H │ add_cs_detail(Op0)  │        │ print(Op0) │          │◄─────┘
+                                      │                │   │ ◄───────────────────┤        │ ◄──────────┤          │
+           printer(MCInst,            │                │   ├───────────────────► │        ├──────────► │ Asm-     │
+┌────────┐         asm_buf)┌──────────┴──┐             │   │                     │ Inst   │            │ Writer   │
+│CS Core ├────────────────►│ ARCH Module │             │   │                     │ Printer│            │ State    │
+└────────┘                 └─────────────┘             │ M │                     │        │            │ Machine  │
+                                      ▲                │ A │ add_cs_detail(Op1)  │        │ print(Op1) │          │
+                                      │                │ P │ ◄───────────────────┤        │ ◄──────────┤          │
+                                      │                │ P ├───────────────────► │        ├──────────► │          │
+                                      │                │ I │                     │        │            │          │
+                                      │                │ N │                     │        │            │          │
+                                      └────────────────┤ G │◄────────────────────┤        │◄───────────┤          │
+                                                       └───┘                     └────────┘            └──────────┘
+```
+
+The second decoding step passes the `MCInst` and a buffer to the printer.
+
+After determining the mnemonic, each operand is printed by using
+functions defined in the `InstPrinter`.
+
+Each time an operand is printed, the mapping component is called
+to populate the `cs_insn` with the operand information and details.
+
+Again the `InstPrinter` and `AsmWriter` are translated code from LLVM,
+so they mirror the behavior of `llvm-objdump`.