neuviemeporte/mzretools
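A set of reverse engineering tools for MS-DOS MZ executables, supporting routine mapping, instruction-level code comparison, and reconstruction verification.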
# mzretools
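This is a collection of utilities to help with reverse engineering MS-DOS games. They can analyze MZ executables containing 8086 opcodes and provide information useful for reverse engineering. `.com` files and newer Intel CPUs are not supported at this time.
The tools were written mainly to complement IDA, but they may fit other workflows as well. The idea is to support creating an instruction-level identical reconstruction while allowing the internal executable layout to differ (routine and data offsets do not need to match).

[Figure: example of how all the tools can be used together]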

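A detailed blog post on how I use these tools in my own project can be found [here](https://neuviemeporte.github.io/f15-se2/2024/05/25/sausage.html).
The project includes a unit test suite written with GTest, which runs on every build by default. The build system is CMake.
Patches and suggestions for improvement are welcome. I hope this is useful to someone.
# Building
Clone the repository, then run: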
```
# download the google test framework used by the unit tests
$ git submodule init
$ git submodule update
$ mkdir build && cd build
$ cmake ..
# this builds the tool executables and runs the test suite
$ make
```
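After building, the test suite can also be run on its own by executing the `runtest` binary. It also supports some undocumented flags for debug output; see `test_main.cpp`.
# Tools
This is still under development, and I add features as my own reverse engineering needs arise. For now, the useful tools are the following: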
## mzhdr
Displays the information from the MZ header of an .exe file:
```
ninja@dell:debug$ ./mzhdr bin/hello.exe
--- bin/hello.exe MZ header (28 bytes)
[0x0] signature = 0x5a4d ('MZ')
[0x2] last_page_size = 0x43 (67 bytes)
[0x4] pages_in_file = 15 (7680 bytes)
[0x6] num_relocs = 4
[0x8] header_paragraphs = 32 (512 bytes)
[0xa] min_extra_paragraphs = 227 (3632 bytes)
[0xc] max_extra_paragraphs = 65535
[0xe] ss:sp = 207:800
[0x12] checksum = 0x156e
[0x16] cs:ip = 0:20
[0x18] reloc_table_offset = 0x1e
[0x1a] overlay_number = 0
--- extra data (overlay info?)
0x01, 0x00
--- relocations:
[0]: 173:1a, linear: 0x174a, file offset: 0x194a, file value = 0x16f
[1]: 0:2b, linear: 0x2b, file offset: 0x22b, file value = 0x16f
[2]: 0:b5, linear: 0xb5, file offset: 0x2b5, file value = 0x16f
[3]: 173:ae, linear: 0x17de, file offset: 0x19de, file value = 0x16f
--- load module @ 0x200, size = 0x1a43 / 6723 bytes
```
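There are a few extra switches that make it handy in scripts, such as printing only the load module offset or size (`-l`, `-s`). With `-p`, it can patch the executable's relocations as if it were loaded at the provided segment and write the result to a file. This is useful for comparing against memory dumps taken at runtime to detect corruption or other problems; see also `exeimgdump.sh`.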
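The load module extents and relocation offsets in the dump above follow from a few header fields. Here is a minimal Python sketch of that arithmetic, based on the standard MZ header layout; the function names are made up for illustration and this is not the code `mzhdr` uses:

```python
import struct

PAGE = 512       # bytes per MZ "page"
PARAGRAPH = 16   # bytes per paragraph

def load_module_extents(header: bytes) -> tuple[int, int]:
    """Return (file offset, size) of the load module from the first 10 header bytes."""
    signature, last_page_size, pages_in_file, _num_relocs, header_paragraphs = \
        struct.unpack_from("<5H", header, 0)
    if signature != 0x5A4D:
        raise ValueError("not an MZ executable")
    offset = header_paragraphs * PARAGRAPH
    total = pages_in_file * PAGE
    if last_page_size:
        # the last page is only partially used
        total -= PAGE - last_page_size
    return offset, total - offset

def relocation_location(segment: int, offset: int, load_offset: int) -> tuple[int, int]:
    """Return (linear address within the load module, offset within the file)."""
    linear = segment * PARAGRAPH + offset
    return linear, linear + load_offset

# Values taken from the hello.exe dump above.
header = struct.pack("<5H", 0x5A4D, 0x43, 15, 4, 32)
print(load_module_extents(header))                         # (512, 6723)
print([hex(v) for v in relocation_location(0x173, 0x1A, 0x200)])  # ['0x174a', '0x194a']
```

The results agree with the `load module @ 0x200, size = 0x1a43 / 6723 bytes` line and with relocation `[0]` in the dump.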
## mzmap
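Scans and interprets the instructions in an executable, following jump/call targets and return instructions in an attempt to determine the boundaries of subroutines and the offsets of potential variables. It can do limited register/stack value tracking to resolve register-dependent calls and jumps. Reachable blocks are either attributed to a subroutine's body or marked as disconnected chunks. The map is saved to a file in a text format.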
```
mzmap v1.0.0
usage: mzmap [options] [file.exe[:entrypoint]] file.map
Scans a DOS MZ executable trying to find routines and variables, saves output into an editable map file
Without an exe file, prints a summary of an existing map file
There is limited support for using an IDA .lst file as a map file,
and printing a .lst file will also convert and save it as a .map file
Options:
--verbose: show more detailed information, including compared instructions
--debug: show additional debug information
--overwrite: overwrite output file.map if already exists
--brief: only show uncompleted and unclaimed areas in map summary
--format: format printed routines in a way that's directly writable back to the map file
--nocpu: omit CPU-related information like instruction decoding from debug output
--noanal: omit analysis-related information from debug output
--linkmap file use a linker map from Microsoft C to seed initial location of routines
--load segment: override default load segment (0x0)
ninja@dell:debug$ ./mzmap bin/hello.exe hello.map --verbose
Loading executable bin/hello.exe at segment 0x1000
Analyzing code within extents: 1000:0000-11a4:0003/001a44
Done analyzing code
Dumping visited map of size 0x1a43 starting at 0x10000 to routines.visited
Dumping visited map of size 0x1a43 starting at 0x10000 to search_queue.dump
Building routine map from search queue contents (39 routines)
--- Routine map containing 39 routines
STACK/0x1207
CODE/0x1000
DATA/0x116f
1000:0000-1000:000f/000010: unclaimed_1
1000:0010-1000:001f/000010: routine_7
1000:0020-1000:00d1/0000b2: start
1000:00d2-1000:0195/0000c4: routine_4
1000:00d2-1000:0195/0000c4: main
1000:0262-1000:0267/000006: chunk
1000:0196-1000:01ac/000017: routine_8
1000:01ad-1000:01f1/000045: routine_9
1000:01f2-1000:021e/00002d: routine_13
1000:021f-1000:022d/00000f: routine_10
[...]
1000:1574-1000:15e1/00006e: routine_34
1000:15e2-1000:1637/000056: routine_35
1000:1638-1000:165d/000026: unclaimed_13
1000:165e-1000:1680/000023: routine_19
1000:1681-1000:1a42/0003c2: unclaimed_14
Saving routine map (size = 39) to hello.map
```
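The routine lines printed in the summary above have the shape `segment:begin-segment:end/size: name`, with all numbers in hex. Below is a small Python sketch that parses such printed lines and checks the size against the extents. It assumes the printed summary format only; the on-disk map file format may differ, and the `Routine` type and function names are hypothetical:

```python
import re
from typing import NamedTuple, Optional

class Routine(NamedTuple):
    name: str
    segment: int
    begin: int
    end: int
    size: int

# e.g. "1000:0020-1000:00d1/0000b2: start"
LINE = re.compile(
    r"^([0-9a-f]{4}):([0-9a-f]{4})-([0-9a-f]{4}):([0-9a-f]{4})/([0-9a-f]{6}): (\S+)$")

def parse_summary_line(line: str) -> Optional[Routine]:
    """Parse one routine line of the printed summary; return None for other lines."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    seg, begin, seg2, end, size = (int(g, 16) for g in m.groups()[:5])
    if seg != seg2 or end - begin + 1 != size:
        raise ValueError(f"inconsistent extents: {line!r}")
    return Routine(m.group(6), seg, begin, end, size)

def linear(segment: int, offset: int) -> int:
    """Convert a real-mode segment:offset pair to a linear address."""
    return segment * 16 + offset

r = parse_summary_line("1000:0020-1000:00d1/0000b2: start")
print(r.name, hex(linear(r.segment, r.begin)), r.size)  # start 0x10020 178
```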
## mzdiff
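Takes two executables as input and compares their instructions one by one to verify that they match, which is useful when trying to recreate a game's source code in a high-level language. After compiling the reconstructed version, the tool can immediately check whether the generated code matches the original. It accounts for data layout differences: if one executable accesses a value at one memory offset and the other accesses it at a different offset, the mapping between the two is recorded, and as long as it is used consistently it is not counted as a mismatch. It can optionally take the map generated by mzmap as input, which lets it assign meaningful names to the compared subroutines and exclude some of them from comparison, such as standard library functions, assembly subroutines, or routines that are ineligible for other reasons.
The tool can also compare data segment contents with `--data segname`. You need map files for both executables, although the target's can be a reduced one; it only needs the address of the data segment with the specified name (the `.map.tgt` file produced by an mzdiff code comparison run should serve this purpose). The tool scans and compares the contents of the named data segment in both files byte by byte and stops with an error on a mismatch. In that case it also tries to determine which variable the mismatch falls in (if variable information is present in the reference map) and shows a hex diff of the two executables around the mismatch location. This is not rocket science, but it saves the hassle of extracting the data segments as binary files, getting hex dumps with `xxd`, loading them into WinMerge for visual inspection, and then working out which location in the executable the identified hex dump offset corresponds to.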
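The consistent-offset rule described above for code comparison can be pictured as a lookup table that is filled in as instructions are compared. The Python sketch below is a simplification under stated assumptions: the `OffsetMap` class is hypothetical, and requiring the mapping to be one-to-one in both directions is my assumption, not necessarily what mzdiff enforces:

```python
class OffsetMap:
    """Remembers which reference data offset corresponds to which target offset."""

    def __init__(self) -> None:
        self.ref_to_tgt: dict[int, int] = {}
        self.tgt_to_ref: dict[int, int] = {}

    def match(self, ref_off: int, tgt_off: int) -> bool:
        """Return True if the pair is new or agrees with what was recorded earlier."""
        known_tgt = self.ref_to_tgt.get(ref_off)
        known_ref = self.tgt_to_ref.get(tgt_off)
        if known_tgt is None and known_ref is None:
            # first time either offset is seen: record the pairing
            self.ref_to_tgt[ref_off] = tgt_off
            self.tgt_to_ref[tgt_off] = ref_off
            return True
        return known_tgt == tgt_off and known_ref == ref_off

offsets = OffsetMap()
print(offsets.match(0x628C, 0x418))  # True: new pairing is recorded
print(offsets.match(0x628C, 0x418))  # True: consistent reuse
print(offsets.match(0x628C, 0x416))  # False: same reference offset, different target
```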
```
mzdiff v1.0.8
usage: mzdiff [options] reference.exe[:entrypoint] target.exe[:entrypoint]
Compares two DOS MZ executables instruction by instruction, accounting for differences in code layout
Options:
--map ref.map map file of reference executable (recommended, otherwise functionality limited)
--tmap tgt.map map file of target executable (only used for output and in data comparison)
--verbose show more detailed information, including compared instructions
--debug show additional debug information
--dbgcpu include CPU-related debug information like instruction decoding
--idiff ignore differences completely
--nocall do not follow calls, useful for comparing single functions
--asm descend into routines marked as assembly in the map, normally skipped
--nostat do not display comparison statistics at the end
--rskip count ignore up to 'count' consecutive mismatching instructions in the reference executable
--tskip count ignore up to 'count' consecutive mismatching instructions in the target executable
--ctx count display up to 'count' context instructions after a mismatch (default 10)
--loose non-strict matching, allows e.g for literal argument differences
--variant treat instruction variants that do the same thing as matching
--data segname compare data segment contents instead of code
The optional entrypoint spec tells the tool at which offset to start comparing, and can be different
for both executables if their layout does not match. It can be any of the following:
':0x123' for a hex offset
':0x123-0x567' for a hex range
':[ab12??ea]' for a hexa string to search for and use its offset. The string must consist of an even amount
of hexa characters, and '??' will match any byte value.
In case of a range being specified as the spec, the search will stop with a success on reaching the latter offset.
If the spec is not present, the tool will use the CS:IP address from the MZ header as the entrypoint.
ninja@dell:debug$ ./mzdiff original.exe:10 my.exe:10 --verbose --loose --sdiff 1 --map original.map
Comparing code between reference (entrypoint 1000:0010/010010) and other (entrypoint 1000:0010/010010) executables
Reference location @1000:0010/010010, routine 1000:0010-1000:0482/000473: main, block 010010-010482/000473, other @1000:0010/010010
1000:0010/010010: push bp == 1000:0010/010010: push bp
1000:0011/010011: mov bp, sp == 1000:0011/010011: mov bp, sp
1000:0013/010013: sub sp, 0x1c =~ 1000:0013/010013: sub sp, 0xe 🟡 The 'loose' option ignores the stack layout difference
1000:0016/010016: push si == 1000:0016/010016: push si
1000:0017/010017: mov word [0x628c], 0x0 ~= 1000:0017/010017: mov word [0x418], 0x0 🟡 same value placed at different offsets in the two .exe-s
1000:001d/01001d: mov word [0x628a], 0x4f2 ~= 1000:001d/01001d: mov word [0x416], 0x4f2
1000:0023/010023: mov word [0x45fc], 0x0 ~= 1000:0023/010023: mov word [0x410], 0x0
[...]
1000:0111/010111: or ax, ax == 1000:010e/01010e: or ax, ax
1000:0113/010113: jnz 0x11c (down) ~= 1000:0110/010110: jnz 0x102 (up)
1000:0115/010115: call far 0x16b50c7f ~= 1000:0112/010112: call far 0x10650213
1000:011a/01011a: jmp 0x11e (down) != 1000:0117/010117: cmp byte [0x40c], 0x78 🔴 the 'skip difference = 1' option allows one mismatching instruction
1000:011c/01011c: jmp 0x105 (up) != 1000:0117/010117: cmp byte [0x40c], 0x78 🔴 but the next one still doesn't match
ERROR: Instruction mismatch in routine main at 1000:011c/01011c: jmp 0x105 != 1000:0117/010117: cmp byte [0x40c], 0x78
```
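In the output, `==` marks an exact match, `~=` and `=~` mark a "soft" difference in the first or second operand respectively, and `!=` marks a mismatch. The idea is to keep iterating on the reconstruction for as long as the tool finds differences, until the reconstructed code matches the original perfectly, with leeway for offset values that differ because of the different layout.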
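The `':[ab12??ea]'` entrypoint spec from the usage text searches for a byte string in which `??` matches any byte. A minimal Python sketch of such a wildcard search follows; the function names are hypothetical and this is not how mzdiff implements it:

```python
from typing import List, Optional

def parse_pattern(spec: str) -> List[Optional[int]]:
    """Turn 'ab12??ea' into [0xab, 0x12, None, 0xea]; None matches any byte."""
    if len(spec) % 2:
        raise ValueError("pattern must have an even number of characters")
    return [None if spec[i:i + 2] == "??" else int(spec[i:i + 2], 16)
            for i in range(0, len(spec), 2)]

def find_pattern(data: bytes, spec: str) -> int:
    """Return the offset of the first match of spec in data, or -1 if absent."""
    pattern = parse_pattern(spec)
    for start in range(len(data) - len(pattern) + 1):
        if all(p is None or data[start + i] == p for i, p in enumerate(pattern)):
            return start
    return -1

code = bytes([0x55, 0x8B, 0xEC, 0xAB, 0x12, 0x34, 0xEA])
print(find_pattern(code, "ab12??ea"))  # 3
```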
## mzsig
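Extracts routine signatures from an executable and saves them to a file. A signature is a string of hex values representing a routine's instructions with the addresses and immediate values stripped out: `mov [bx+0x1234], 0xabcd` roughly becomes `mov|mem|immediate`. The idea is to remove any information specific to one executable while retaining enough of the routine's "character" that it can hopefully be recognized in a different executable.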
```
mzsig v1.0.0
Usage:
mzsig [options] exe_file map_file output_file
Extracts routines from exe_file at locations specified by map_file and saves their signatures to output_file
This is useful for finding routine duplicates from exe_file in other executables using mzdup
mzsig [options] signature_file
Displays the contents of the signature file
Options:
--verbose show information about extracted routines
--debug show additional debug information
--overwrite overwrite output file if exists
--min count ignore routines smaller than 'count' instructions (default: 10)
--max count ignore routines larger than 'count' instructions (0: no limit, default: 0)
You can prevent specific routines from having their signatures extracted by annotating them with 'ignore' in the map file,
see the map file format documentation for more information.
ninja@RYZEN:f15se2-re$ mzsig ../ida/egame.exe map/egame.map egame.sig
Loaded map file map/egame.map: 8 segments, 400 routines, 1023 variables
Extracted signatures from 294 routines, saving to egame.sig
```
When passed a single argument, the tool prints the abstracted instructions contained in a previously created signature file:
```
ninja@RYZEN:f15se2-re$ mzsig egame.sig
Loaded signatures from egame.sig, 294 routines
routine_7: 100 instructions
push bp
mov bp, sp
sub sp, i16
mov [bp+off16], i16
mov [bp+off16], i16
les bx, [bp+off16]
mov ax, es:[bx]
mov [off16], ax
[...]
```
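To get a feel for the abstraction, here is a toy Python sketch that works on disassembly text and produces lines resembling the listing above. It is only an illustration: mzsig works on decoded instructions and stores hex values, whereas this sketch uses regular expressions, and it labels every operand as 16-bit without looking at the actual operand width:

```python
import re

# memory operand with a numeric displacement, optionally relative to a register
_DISPLACEMENT = re.compile(r"\[(?:(\w+)([+-]))?0x[0-9a-f]+\]")
# any remaining hex literal is treated as an immediate value
_IMMEDIATE = re.compile(r"\b0x[0-9a-f]+\b")

def abstract(instruction: str) -> str:
    """Replace concrete displacements and immediates with placeholders."""
    def mem(m: re.Match) -> str:
        if m.group(1):
            return f"[{m.group(1)}{m.group(2)}off16]"
        return "[off16]"
    text = _DISPLACEMENT.sub(mem, instruction.lower())
    return _IMMEDIATE.sub("i16", text)

for line in ("push bp", "sub sp, 0x1c", "mov [bx+0x1234], 0xabcd"):
    print(abstract(line))
```

Two routines that differ only in the offsets and constants they use produce the same abstracted lines, which is what makes them comparable across executables.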
## mzdup
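This utility finds potential duplicates of known routines from one executable in another. It computes the edit distance between the fuzzy instruction strings obtained from `mzsig` and uses a threshold to locate potential duplicates, which saves time otherwise spent reversing routines that are already known. Both the minimum number of instructions a routine must have to be considered for the duplicate search and the maximum distance at which a routine is still considered a duplicate can be configured with command-line options.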
```
mzdup v1.0.0
Usage: mzdup [options] signature_file target_exe target_map
Attempts to identify duplicate routines matching the signatures from the input file in the target executable
Updated target_map with duplicates marked will be saved to target_map.dup
Options:
--verbose: show more detailed information about processed routines
--debug: show additional debug information
--minsize count: don't search for duplicates of routines smaller than 'count' instructions (default: 15)
--maxdist ratio: how many instructions relative to its size can a routine differ by to still be reported as a duplicate (default: 10%)
ninja@RYZEN:f15se2-re$ mzdup egame.sig ../ida/start.exe map/start.map
Searching for duplicates of 294 signatures among 255 candidates, minimum instructions: 15, maximum distance ratio: 10%
Processed 294 signatures, ignored 51 as too short
Tried to find matches for 255 target exe routines (7500 instructions, 100%)
Found 37 (unique: 37) matching routines (1415 instructions, 18%)
Unable to find 218 matching routines (82%)
Saving code map (routines = 255) to map/start.map.dup, reversing relocation by 0x1000
```
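In the run above, the tool determined (using an edit distance tolerance of 10% of the routine length) that 37 of the 255 known routines in `start.exe`, accounting for about 18% of its instructions, match some routine from `egame.exe`. The updated map file with the `.dup` suffix can be inspected to see exactly which routines were classified as potential duplicates, and running the tool with `--verbose` also shows the computed edit distances. The search can be made more aggressive by lowering the minimum routine size (`--minsize`) or raising the tolerance (`--maxdist`), at the risk of more false positives.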
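The matching criterion can be sketched in Python as a Levenshtein distance over sequences of abstracted instructions, compared against a fraction of the routine length. This is an illustration under assumptions: the function names are hypothetical, and measuring the ratio against the signature's length (rather than the candidate's) is my guess at the semantics of `--maxdist`:

```python
from typing import Sequence

def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def is_duplicate(signature: Sequence[str], candidate: Sequence[str],
                 min_size: int = 15, max_dist_ratio: float = 0.10) -> bool:
    """Report a duplicate if the distance is within the allowed fraction of the size."""
    if len(signature) < min_size:
        return False  # too short to be characteristic
    return edit_distance(signature, candidate) <= len(signature) * max_dist_ratio

sig = ["push bp", "mov bp, sp"] + ["mov [off16], i16"] * 18
near = sig[:10] + ["nop"] + sig[11:]   # one instruction replaced
print(is_duplicate(sig, near))          # True: distance 1 out of 20 instructions
```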
## mzptr
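This tool takes an executable and its map, then tries to find potential pointers to known variables in the executable's binary image. These matter because numeric values are easy to overlook, treated as plain numbers rather than as pointers to other locations. Brute-forcing the data references saves some effort, but manual work is still needed to review the results and decide whether a found reference is a real pointer.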
```
ninja@RYZEN:f15se2-re$ mzptr
mzptr v0.9.6
Usage: mzptr [options] exe_file exe_map
Searches the executable for locations where offsets to known data objects could potentially be stored.
Results will be written to standard output; these should be reviewed and changed to references to variables during executable reconstruction.
Options:
--verbose: show more detailed information about processed routines
--debug: show additional debug information
ninja@RYZEN:f15se2-re$ mzptr ../ida/start.exe map/start.map
Search complete, found 528 potential references, unique: 132
Printing reference counts per variable, counts higher than 1 or 2 are probably false positives due to a low/non-characteristic offset
word_16BE2/16b5:0092/016be2: 1 reference @ Data1:0xa8
page1Num/16b5:0530/017080: 1 reference @ Data1:0x546
page2Num/16b5:0548/017098: 1 reference @ Data1:0x55e
unk_170B0/16b5:0560/0170b0: 1 reference @ Data1:0x576
aLibya/16b5:00c2/016c12: 1 reference @ Data1:0x578
aVietnam/16b5:00d5/016c25: 1 reference @ Data1:0x57c
[...]
aRookie/16b5:016e/016cbe: 1 reference @ Data1:0x58c
aPilot/16b5:0175/016cc5: 1 reference @ Data1:0x58e
aVeteran/16b5:017b/016ccb: 1 reference @ Data1:0x590
aAce/16b5:0183/016cd3: 1 reference @ Data1:0x592
aDemo/16b5:0187/016cd7: 1 reference @ Data1:0x594
[...]
aMsRunTimeLibra/16b5:0008/016b58: 20 references
unk_16B56/16b5:0006/016b56: 34 references
aOnc_2/16b5:0700/017250: 39 references
unk_16B57/16b5:0007/016b57: 45 references
byte_16B54/16b5:0004/016b54: 87 references
crt0_16B52/16b5:0002/016b52: 118 references
```
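Results are sorted by the number of references found; items with low reference counts are more plausible than those with high counts. For example, the offset of the `aMsRunTimeLibra` string was "found" in 20 locations only because it is `0x8`, a small and common value.
Items with the same reference count are further sorted by the offset at which the match occurred, which helps spot adjacent locations that form pointer arrays, such as the array of difficulty level strings at `0x58c`.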
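The brute-force search and the sorting described above can be sketched in Python as follows. This assumes 16-bit little-endian near offsets and a scan at every byte position; whether mzptr scans exactly this way is an assumption, and the function names are hypothetical:

```python
import struct
from collections import defaultdict

def find_potential_pointers(image: bytes, variables: dict[str, int]) -> dict[str, list[int]]:
    """Map each variable name to the image positions holding its offset as a 16-bit word."""
    by_offset = defaultdict(list)
    for name, offset in variables.items():
        by_offset[offset].append(name)
    hits = defaultdict(list)
    for pos in range(len(image) - 1):
        (value,) = struct.unpack_from("<H", image, pos)
        for name in by_offset.get(value, ()):
            hits[name].append(pos)
    return dict(hits)

def sort_results(hits: dict[str, list[int]]) -> list[tuple[str, list[int]]]:
    """Order by reference count, then by the position of the first match."""
    return sorted(hits.items(), key=lambda kv: (len(kv[1]), kv[1][0]))

# A tiny fake data segment: two adjacent pointers surrounded by 0xff filler.
image = bytearray(b"\xff" * 16)
struct.pack_into("<H", image, 4, 0x016E)   # offset of aRookie
struct.pack_into("<H", image, 6, 0x0175)   # offset of aPilot
hits = find_potential_pointers(bytes(image), {"aRookie": 0x16E, "aPilot": 0x175})
for name, positions in sort_results(hits):
    print(name, [hex(p) for p in positions])
```

Variables whose offsets are small or otherwise common values will collect many hits, which is why high counts in the real output are likely false positives.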
## lst2ch.py
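This Python script parses a listing (`.LST`) file generated by IDA and produces a C header file with routine and data declarations, so that they can be plugged into a C source reconstruction. It saves the manual work of updating the header whenever routine names or arguments change in IDA. It can also output a C source file with data definitions, but that is more of a prototype for now. While iterating over the listing, it verifies the running size of the data segment using two independent methods. It shares a JSON configuration file with the next tool, `lst2asm.py`, which specifies the layout of the listing and the transformations to apply to it. Here is an example configuration file that I use in my reconstruction work: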
```
{
// stuff to put at the beginning of the assembly file generated by lst2asm
"preamble": [
".8086",
".MODEL SMALL",
"DGROUP GROUP startData,startBss",
"ASSUME DS:DGROUP",
"__acrtused = 9876h",
"PUBLIC __acrtused"
],
"include": "lst/start.inc",
"in_segments": [ "startCode1", "startCode2", "startData", "startStack" ],
"out_segments": [
{ "seg": "startCode1", "class": "CODE" },
{ "seg": "startCode2", "class": "CODE" },
{ "seg": "startData", "class": "DATA" },
{ "seg": "startBss", "class": "BSS" },
{ "seg": "startStack", "class": "STACK" }
],
"code_segments": [ "startCode1", "startCode2" ],
"data_segments": [ "startData", "startBss" ],
// the sum of the size of all the items in the data segment, used to verify
// that the assembly file contains all the data
"data_size": "0x7b10",
// locations and contents of input lines that should be replaced in the assembly output
"replace": [
// can modify segment definitions on the fly
{ "seg": "startCode1", "off": "0x0", "from": "startCode1 segment",
"to": "startCode1 segment word public 'CODE'" },
{ "seg": "startData", "off": "0x0", "from": "startData segment",
"to": "startData segment word public 'DATA'" },
// modify instructions into literal bytes to enforce specific encoding
{ "seg": "startCode1", "off": "0x2ff", "to": "db 05h, 48h, 0" },
{ "seg": "startCode1", "off": "0x376a", "to": "db 3Dh, 10h, 0" },
// modify segment termination to force closing of a different segment
{ "seg": "startData", "off": "0x7afc", "from": "ends", "to": "startBss ends" }
],
// locations and contents of lines that should be inserted into the assembly output
"insert": [
// modify location in the data segment to create a BSS segment there
{ "seg": "startData", "off": "0x4585", "from": "db", "to": [
"startData ends",
"startBss segment byte public \"BSS\""
]
}
],
// blocks that should be eliminated from the input, usually because they have been
// already ported to C code
"extract": [
// a bunch of initial zeros at the beginning of the code segment caused by .DOSSEG
{ "seg": "startCode1", "begin": "0x10", "end": "0x482", "from": "main",
"to": "endp" },
{ "seg": "startCode1", "begin": "0x4a0", "end": "0x510", "from": "initGraphics",
"to": "endp" },
{ "seg": "startCode1", "begin": "0x511", "end": "0x543", "from": "cleanup",
"to": "endp" },
{ "seg": "startCode1", "begin": "0x54a", "end": "0x560", "from": "clearKeybuf",
"to": "endp" },
],
// list of procs that should be copied into the output assembly from the input, if this list is empty,
// then all procs will be preserved (copied).
// Otherwise (list not empty), procs not in this list will just have a single "retn" instruction inside,
// so that they are marked as needing porting because the comparison tool will trip on them.
// Use this to mark routines which cannot or are not expected to be ported for some reason.
"preserves" : [
"installCBreakHandler", "setupOverlaySlots", "setTimerIrqHandler", "timerIrqHandler", "loadOverlay",
"restoreTimerIrqHandler", "copyJoystickData", "restoreCbreakHandler"
],
// list of symbols to declare as external in the output assembly (because they were already ported to C code)
"externs": [ "main", "waitMdaCgaStatus", "openFileWrapper", "closeFileWrapper", "cleanup",
"choosePilotPrompt", "pilotToGameData", "clearBriefing", "seedRandom", "gameDataToPilot", "saveHallfame"
],
// list of symbols to declare as public in the output assembly, so that they are visible from C code
"publics": [
"byte_17412", "byte_17422", "word_173DE", "asc_174AC", "byte_1741A", "asc_174AF", "ranks", "pilotNameInputColors",
"aMenterYourName", "copyJoystickData", "installCBreakHandler", "loadOverlay", "restoreCbreakHandler",
"setTimerIrqHandler"
],
// stuff to write at the beginning of the C header file generated by lst2ch
"header_preamble": [
"#ifndef F15_SE2_START",
"#define F15_SE2_START",
"#include \"inttype.h\"",
"#include \"struct.h\"",
"#include \"comm.h\"",
"",
"#include ",
"",
"#if !defined(MSDOS) && !defined(__MSDOS__)",
"#define far",
"#endif",
"",
"#define __int32 long",
"#define __int8 char",
"#define __cdecl",
"#define __far far"
],
// likewise, stuff to write at the end
"header_coda": "#endif // F15_SE2_START"
}
```
It looks arcane at first glance, but you don't need to use every feature for it to be useful.
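Note that the example configuration contains `//` comment lines and a trailing comma, neither of which strict JSON allows. If you want to read such a file from your own scripts, one possible approach is sketched below in Python. How `lst2ch.py` itself loads the file is not shown here, so treat `load_config` as a hypothetical helper; it only handles comments that occupy a whole line:

```python
import json
import re

def load_config(text: str) -> dict:
    """Parse JSON that may contain full-line // comments and trailing commas."""
    # Drop lines that consist only of a // comment; a "//" inside a string
    # value (as in "#endif // F15_SE2_START") is left untouched.
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith("//")]
    # Remove commas that directly precede a closing bracket or brace.
    cleaned = re.sub(r",(\s*[\]}])", r"\1", "\n".join(lines))
    return json.loads(cleaned)

sample = """{
  // the sum of the size of all the items in the data segment
  "data_size": "0x7b10",
  "extract": [
    { "seg": "startCode1", "begin": "0x10", "end": "0x482" },
  ],
  "header_coda": "#endif // F15_SE2_START"
}"""
config = load_config(sample)
print(int(config["data_size"], 16))  # 31504
```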
## lst2asm.py
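Like the previous script, this one parses an IDA listing, but its output is assembly code transformed according to the directives in the configuration file. The output is meant to be assembled with a MASM-compatible assembler and linked with the C source code being reconstructed to form a complete game executable, in which some parts are rewritten in C and others remain in assembly while the reconstruction is in progress.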
## dosbuild.sh
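This bash script wraps running a DOS build toolchain (compiler, assembler, linker) inside headless DosBox for use in Makefiles, so that DOS code can be built easily from the Linux command line without entering a real DOS environment.
## Misc
There is also a handful of bash scripts for miscellaneous tasks.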