My Own Private Binary: An Idiosyncratic Introduction to Linux Kernel Modules

277 points by spudlyo 6 months ago

spudlyo 6 months ago

This is a long essay, and here is my pitch as to why you should read the whole thing if you have any interest in subjects like C programming, binary formats, kernel modules, or assembler.

Breadbox, the author, wants to make smaller binary executables. He explains ELF binaries, a.out binaries, and old MS-DOS .COM binaries, and how the latter had no metadata and could be very small. He then explains how you can dynamically load code that deals with new executable binary formats into the Linux kernel, and how this process works. He walks through some sample C for building a "Hello World" kernel module. He then walks you through ~1 page of code for a kernel module that registers a new binary format, sets up some callbacks, and if conditions are right, will vm_mmap() the code into memory and call start_thread() on it.

Yay, it works! He has a tiny binary. This is where most articles would end, but Breadbox goes deeper. What if you want a stack and a heap? What if you want to access argc, argv, and envp? What if you want to append code at the end that automatically calls the exit syscall? All these details are covered, and I think it's glorious.

While this all may seem like pretty dry stuff, there is humor sprinkled throughout, which makes it more fun to read.

klank 6 months ago

Thanks for sharing, brought back memories of using debug.exe to meticulously type in hex copied from a magazine to generate a .com executable.
Ah, the pre-internet was glorious.
- spudlyo 6 months ago
  
  Very cool! I remember using debug.exe to write directly to 0xB0000, which was mapped to my Hercules Monochrome card's video buffer. Rocking dual monitors in the late 80s was pretty cool. Later, running the CodeView debugger TUI on an external monochrome monitor was quite a luxury.
ryao 6 months ago

I had this itch once in college. I think I got down to a few thousand bytes. He has outdone my college attempt by orders of magnitude.
hoistbypetard 6 months ago

I really enjoyed reading it, and I hadn't seen it before. Thank you for sharing it here.
out-of-ideas 6 months ago

i truly appreciate your tl;dr!
- RaoulP 6 months ago
  
  As did I. Would be great if this catches on!

jmholla 6 months ago

This article is fantastic.

And it pairs well with another article on the front page. [0]

Which I bring up because they disagree on a particular point. And that is how a script without a shebang gets run as a script.

> This is done by registering a set of callback functions, and these callbacks get invoked when the kernel is asked to execute a binary file. The kernel invokes the callbacks on this list, and the first one that claims to recognize the file takes responsibility for getting it properly loaded into memory. If nobody on the list accepts it, then as a last resort the kernel will attempt to treat it as a shell script without a shebang line. And if that doesn't fly, then you'll get that "Exec format error" message described above.

But the article I linked to says the shell actually handles it. And based off of its research (terribly reproduced below), I'm inclined to believe it.

    echo echo Hello world > test.sh
    chmod +x test.sh
    strace ./test.sh
    strace sh -c ./test.sh

You'll see the first one errors with `ENOEXEC`, but the second one does not. Also, in my head, I don't know how the kernel would know what shell to choose, or that it should even expect to have access to a shell.

[0]: https://news.ycombinator.com/item?id=43646698

breadbox 6 months ago

So I read the other article, and I saw that bit that disagreed with my essay. My first thought was, "Oh, of course that's how it works. How did I get that so wrong?" My only excuse was that this essay was originally a tech talk and I was under a deadline. (But I really should have caught it when I wrote it up as an essay.)
So I was going to go edit my essay, when I learned that my essay was also posted on Hacker News. And now I discover that someone has already called out my error before I could fix it. Sigh.
Anyway, I just thought I should acknowledge this before I go to fix it.
- jmholla 6 months ago
  
  Thank you so much for responding! I really appreciate you clearing this up for me.
  And don't beat yourself up too much. This was a phenomenal article and it gave me the courage to dig into the kernel code myself.
  - breadbox 6 months ago
    
    And kudos to you for following up with your own exploration. It's so often the case that the most interesting stuff is hidden in what other people are wrong about.
mukesh610 6 months ago

Both articles are correct, from me reading them. When you invoke a shell script directly, it gets passed to the kernel to try and execve. The kernel returns ENOEXEC when it detects it doesn't have a shebang. The shell catches the error, and then as a last resort, tries opening the file and interpreting its instructions.
I might be wrong, so do correct me if so.
- jmholla 6 months ago
  
  I'll quote the line more explicitly from this article:
  > If nobody on the list accepts it, then as a last resort the kernel will attempt to treat it as a shell script without a shebang line.
  They said that the kernel is responsible for invoking the shell. I honestly think this was just a brain fart and the author meant to put shell and not kernel. With both words flying around in your head, it's an easy mistake to make.
  But then again, the article goes on to talk about how it decides to even try that last step:
  > Interesting side note: The kernel decides whether or not to try to parse a file as a shell script by whether or not it contains a line break in the first few hundred bytes — specifically if it contains a line break before the first zero byte. Thus a data file that just happens to have a "\n" near the top can produce some odd-looking error messages if you try to execute it.
  So I don't know.
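For reference, the rule in that side note can be sketched in a few lines of Python. This only restates the quoted rule (a line break before the first zero byte). It is not taken from kernel or shell source, the name `looks_like_script` is made up for illustration, and the 256-byte window is an assumption standing in for "the first few hundred bytes".

```python
def looks_like_script(head: bytes, window: int = 256) -> bool:
    """Apply the rule quoted above to the start of a file.

    `window` is a hypothetical stand-in for "the first few hundred bytes".
    Returns True when a line break appears before the first zero byte.
    """
    sample = head[:window]
    newline = sample.find(b"\n")
    if newline == -1:
        # No line break in the window at all: not treated as a script.
        return False
    nul = sample.find(b"\0")
    # Either there is no zero byte, or the line break comes first.
    return nul == -1 or newline < nul
```

Under this rule, a data file such as `b"data\nmore\x00"` would count as a script candidate, which matches the "odd-looking error messages" the quote mentions.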
  - jmholla 6 months ago
    
    I decided to do a bit more testing to make sure that the newline in the script wasn't causing the kernel to do anything different. The strace output is identical across the different variations of the invocation, with one difference: with a newline, there's an extra read call, but that's just the shell checking what's left to run.
    I guess my next step is to look at the kernel source itself. I'll probably end up doing that in a bit.
    
    jmholla 6 months ago
    
    So, I've dug into the kernel code. I can't find anywhere that has a fallback mechanism. When it fails, the errors bubble up. I might not be looking in all the correct places, but I believe the shell is responsible for attempting to execute the process.
    I also put together two versions of the same call to a shebangless script in Python, one with `shell=True` and the other without. Only the one that calls into the shell successfully runs the script. The strace outputs corroborate my theory.
    Without shell=True (truncated):
    [pid 961626] execve("./sh.sh", ["./sh.sh"], 0x7fff7bae94a0 /* 66 vars */) = -1 ENOEXEC (Exec format error)
    With shell=True (truncated):
    [pid 961623] execve("/bin/sh", ["/bin/sh", "-c", "./sh.sh"], 0x7ffd75009e50 /* 66 vars */) = 0
    [pid 961624] execve("./sh.sh", ["./sh.sh"], 0x5980a07c70a8 /* 66 vars */) = -1 ENOEXEC (Exec format error)
    [pid 961624] execve("/bin/sh", ["/bin/sh", "./sh.sh"], 0x5980a07c70a8 /* 66 vars */) = 0
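A rough reconstruction of that experiment, for anyone who wants to repeat it. This is a sketch, not the script that produced the traces above: the helper name and the temporary file are made up, and it assumes a POSIX system where `/bin/sh` retries a file itself after `execve` fails with `ENOEXEC`.

```python
import errno
import os
import shlex
import stat
import subprocess
import tempfile


def run_shebangless_script():
    """Run a script with no shebang line directly, then through the shell.

    Returns (direct_errno, shell_stdout). direct_errno stays None if the
    direct exec unexpectedly succeeds.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.sh")
        with open(path, "w") as f:
            f.write("echo Hello world\n")
        os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)

        # Direct exec: subprocess calls execve() on the file itself.
        direct_errno = None
        try:
            subprocess.run([path], check=False)
        except OSError as exc:
            direct_errno = exc.errno

        # Through the shell: this runs `/bin/sh -c <path>`.
        shell = subprocess.run(shlex.quote(path), shell=True,
                               capture_output=True, text=True)
        return direct_errno, shell.stdout


if __name__ == "__main__":
    err, out = run_shebangless_script()
    print("direct exec errno:", errno.errorcode.get(err, err))
    print("via shell stdout:", out.strip())
```

Running it under `strace -f -e trace=execve` should show the same pattern as the traces above, if the assumptions hold on your system.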

Veserv 6 months ago

You can do better than 2 bytes. Use the same epilogue, but store a copy of the "binary" just before the stack pointer and offset the instruction pointer from the start of the binary by 1 byte. If you use the binary consisting of literally a one-byte value, 0x2A (i.e. 42), then your first instruction will be the first instruction of the epilogue which will pop the "binary" into RDI setting RDI to 42. There are maybe some details in the alignment, padding, and instruction choice in the loader to make that work "generically", but that strategy should work and give you a 1-byte solution.

edit: Actually, just define your binary format so that the first byte is copied to the stack and all subsequent bytes are copied to text with the epilogue appended to it.

edit: You could also define it so that the first byte is copied into the first argument register/RDI if you want to shrink loaded RAM usage to just 4 bytes of code and 1 byte of data.

This is of course assuming it is a "generic" binary format that is not literally just encoding the contents of the tiny program. Otherwise you could do 0 bytes and just have the loader pre-fill RAX with 60 and RDI with 42 and insert a one instruction epilogue consisting of syscall. You could technically still call that a "generic" binary format since any actual binary you attempt to load will just blow away those pre-filled GPR values.

ryao 6 months ago

COM files on Windows are always 16-bit. His COM files appear to be the native bit width of the kernel. This means that, unlike on Windows, his COM files cannot execute on both 32-bit and 64-bit versions of the kernel. That one imperfection aside, this is a fantastic achievement.

rep_lodsb 6 months ago

The zero-byte program should work on either :)

It's also possible to detect which mode the CPU is in:

    bits 16
    mov  ax,start16   ;may load EAX instead,
    jmp  ax           ;skipping this 2-byte instruction

    bits 32
    dec  eax          ;REX prefix in long mode,
    mov  eax,start32  ;may load RAX,
    jmp  eax          ;skipping these 4 bytes
    nop
    nop

    bits 64
    jmp  start64

You can even be compatible with CP/M-80 by putting this at the start:

    add  bx,start8    ;8080: ADD C, JMP start8
    nop               ;immediate may be 16 or 32 bits
    nop

breadbox 6 months ago

A valid point! Clearly the correct solution would be for the kernel module to check if the filename contains the substring "32", and if so it should load it as a 32-bit binary.
- ycombinatrix 6 months ago
  
  Have you considered building your program into the binfmt module and only running 0 byte executables?

HeliumHydride 6 months ago

The appendix to this is also good, and goes over things like getting linker scripts to create binaries using objdump and writing C wrappers for syscalls: https://www.muppetlabs.com/~breadbox/txt/mopb-app.html

stmw 6 months ago

This is a very good read and excellent in that we hope everyone knows about these things -- how computers actually work and how efficient and simple things can be -- but some readers probably don't, and this wonderfully accessible write-up is a good way to learn. And for those who know most of these details, it is wonderfully refreshing.

spudlyo 6 months ago

I really like picking up arcane UNIX/Linux knowledge. In the 90s, I was asked in an interview by a wizened old UNIX greybeard what the brk system call did, and what I would think if I saw it pop up frequently in the strace output of a program I was trying to diagnose. I did not then know the answer to that question, and I subsequently bombed the rest of that interview. If I had read this article, I could have told him it stood for "program break", mentioned that it was an end-of-heap marker, and said that I would expect the program to be calling malloc (which was then implemented with brk) a lot. I probably would have still bombed the other interviews, but I could have at least momentarily impressed the crusty old sysadmin.
Nowadays however, interviewers are rarely impressed with what arcane knowledge you may or may not have, regardless of how hard won the experiences were that taught it to you.
I was reminded of this when reading the "Demystifying the shebang" article today on HN, where I saw brk in some strace output, which along with the other similarities got me thinking about this article.
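For the curious, the program break can be watched from user space. The sketch below is only an illustration: it assumes a Unix libc that exports `sbrk`, `malloc`, and `free`, the helper names are made up, and whether malloc actually moves the break depends on the allocator and the request size (many allocators serve large requests with mmap instead).

```python
import ctypes

# Handle to the C library already loaded into this process.
libc = ctypes.CDLL(None, use_errno=True)
libc.sbrk.argtypes = [ctypes.c_ssize_t]
libc.sbrk.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.malloc.restype = ctypes.c_void_p
libc.free.argtypes = [ctypes.c_void_p]


def program_break() -> int:
    """Return the current program break; sbrk(0) reads it without moving it."""
    return libc.sbrk(0) or 0


def break_growth(chunks: int = 256, size: int = 64 * 1024) -> int:
    """Allocate `chunks` blocks with the C malloc and report how far the
    program break moved while they were being allocated."""
    before = program_break()
    blocks = [libc.malloc(size) for _ in range(chunks)]
    after = program_break()
    for block in blocks:
        libc.free(block)
    return after - before


if __name__ == "__main__":
    print("program break:", hex(program_break()))
    print("growth after mallocs:", break_growth(), "bytes")
```

A positive growth figure means the allocator extended the heap with brk during those calls, which is the pattern the interviewer was asking about.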

setheron 6 months ago

This is amazing and I wish I had access to this resource months ago when I explored a new binary format as well.

amstan 6 months ago

> Traditionally, programs will place their code into non-writeable memory, and store variable data in memory that is writeable but not executable. And that's definitely the safer way to do things, but we can't be bothered with all that.

Woah, I have a feeling this does something even more. If the program modifies its own instructions, the kernel will probably save those modifications in the file too.

compressedgas 6 months ago

That would be the behavior with the mmap(2) flag MAP_SHARED. The module built in the article uses MAP_PRIVATE. Any changes to the contents of a private mapping do not affect other processes or the file.
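The difference is easy to see from user space. A minimal sketch using Python's `mmap` module on a scratch file (the helper name is made up, and this shows ordinary file mappings on a Unix system, not the kernel module's `vm_mmap()` call):

```python
import mmap
import os
import tempfile


def write_through_mapping(flags: int) -> bytes:
    """Map a scratch file with the given flags, overwrite its first byte
    through the mapping, and return the file's contents afterwards."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"original")
        with mmap.mmap(fd, 0, flags=flags,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as m:
            m[0:1] = b"X"  # modify the mapped memory only
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print("MAP_PRIVATE:", write_through_mapping(mmap.MAP_PRIVATE))
    print("MAP_SHARED: ", write_through_mapping(mmap.MAP_SHARED))
```

With MAP_PRIVATE the write lands in a copy-on-write page and the file keeps its original bytes. With MAP_SHARED the same write shows up in the file.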

rkagerer 6 months ago

Also interesting - how to make a single, small executable that can run natively on Windows, Linux, Mac, etc:

https://news.ycombinator.com/item?id=32648359

https://github.com/jart/cosmopolitan

https://en.m.wikipedia.org/wiki/Fat_binary

bhawks 6 months ago

> For example, one time while working on my kernel module, I accidentally put --i instead of ++i in the iterator of my for loop. I inserted that module into my kernel to test it, and my mouse cursor disappeared, and my music stopped playing … and then it was time to reboot my computer

I'd recommend using QEMU for the type of work the author is doing. It makes iteration much faster.

breadbox 6 months ago

Not as much fun that way.

nazgulsenpai 6 months ago

The first kernel module I developed was based on a blog post[0] from Oracle of all people.

0: https://blogs.oracle.com/linux/post/introduction-to-netfilte...

p4bl0 6 months ago

Very nice read, thanks for sharing! I will immediately give the link to my systems & networks students. Just a few weeks ago I taught them how to write basic kernel modules. This is a very cool addendum to that class :).

Liftyee 6 months ago

As an amateur Linux user I've long thought of these .ko files and many other binaries as "magic", but no more! This article presents the concepts very naturally so it was easy to absorb.

rurban 6 months ago

I just would name the kernel modules properly. comexec and calmexec. Or crownexec