A Bit of RISC

I recently picked up Hacker's Delight and having a lot of fun with it. It's a collection of bit-manipulation tricks collected by hackers over many years. You can flip open pretty much anywhere in the book and start learn something really cool.

There's something about seeing bit strings and assembly code in a book that really catches my attention, this one goes a bit further by even describing a complete RISC (Reduced Instruction Set Computer) we can implement to play around with the various tricks.

As an exercise and for fun, I'd like to employ some Lisp-fu here and implement a small VM specifically designed for mangling bits.

As a fair warning, I'm not a mathematician and I don't write proofs often. If I get something wrong or there is a better way of doing things, let me know!

You can find most of the code from the book here.

1. Design

1.1. Data Representation

We'll be sticking with a 32-bit word length as recommended in the Preface. We will also represent registers as integers whenever possible, instead of say a bit-vector. Without going into too much detail, it's much more efficient to do bitwise ops on integers instead of bit-vectors in Lisp.

We need a minimum of 16 general purpose registers, typically of word length, with R0 reserved for a constant 0. To address 16 different registers we actually only need 4 bits - 5-bits if we needed 32, 6 for 64, etc.

Floating-point support and special purpose registers are not required.

1.2. Instructions

The Hacker's Delight RISC architecture is described in two tables, denoted basic RISC and full RISC respectively.

Most instructions take two source registers RA and RB with a destination register RT. The actual general-purpose registers are labelled R0 (containg the constant 0) through R15.

A 3-Address machine is assumed and some instructions take 16-bit signed or unsigned immediate values - denoted I and Iu respectively.

Opcode Mnemonic Operands Description
add,sub,mul,div,divu,rem,remu RT,RA,RB RT <- (op RA RB)
addi,muli RT,RA,I RT <- (op RA I), I is a 16-bit signed immediate-value
addis RT,RA,I RT <- (+ RA (|| I 0x0000))
and,or,xor RT,RA,RB RT <- (op RA RB)
andi,ori,xori RT,RA,Iu As above, expect the last operand is a 16-bit unsigned immediate-value
beq,bne,blt,ble,bgt,bge RT,target Branch to target if (op RT)
bt,bf RT,target Branch true/false, same as bne/beq resp
cmpeq,cmpne,cmplt,cmple,cmpgt,cmpge,cmpltu,cmpleu,cmpgtu,cmpgeu RT,RA,RB RT <- (if (op RA RB) 1 0)
cmpieq,cmpine,cmpilt,cmpile,cmpigt,cmpige RT,RA,I Like cmpeq except second comparand is a 16-bit signed immediate-value
cmpiequ,cmpineu,cmpiltu,cmpileu,cmpigtu,cmpigeu RT,RA,I Like cmpltu except second comparand is a 16-bit unsigned immediate-value
ldbu,ldh,ldhu,ldw RT,d(RA) Load an unsigned-byte, signed-halfword, unsigned-halfword, or word into RT from (+ RA d) where d is a 16-bit signed immediate-value
mulhs,mulhu RT,RA,RB RT gets the high-order 32 bits of (* RA RB)
not RT,RA RT <- bitwise one's-complement of RA
shl,shr,shrs RT,RA,RB RT <- RA shifted left or right by rightmost six bits of RB; 0-fill except for shrs, which is sign-fill (shift amount treated modulo 64)
shli,shri,shrsi RT,RA,Iu RT <- RA shifted left or right by 5-bit immediate field
stb,sth,stw RS,d(RA) Store a byte,halfword,word from RS into memory at location (+ RA d) where d is a 16-bit signed immediate-value
Opcode Mnemonic Operands Description
abs,nabs RT,RA RT <- (op RA)
andc,eqv,nand,nor,orc RT,RA,RB RT <- (op RA RB)
extr RT,RA,I,L extract bits I through I+L-1 of RA and place them right-adjusted in RT, with 0-fill
extrs RT,RA,I,L Like extr, but sign-fill
ins RT,RA,I,L Insert bits 0 through L-1 of RA into bits I through I+L-1 of RT
nlz RT,RA RT gets count of leading 0's in RA (0 to 32)
pop RT,RA RT gets the number of 1-bits in RA (0 to 32)
ldb RT,d(RA) Load a signed byte into RT from memory at location (+ RA d) where d is a 16-bit signed immediate value
moveq,movne,movlt,movle,movgt,movge RT,RA,RB RT <- RA rotate-shifted left or right by the rightmost 5-bits of RB
shlr,shrr RT,RA,RB RT <- RA rotate-shifted left or right by the rightmost 5-bits of RB
shlri,shrri RT,RA,Iu RT <- RA rotate-shifted left or right by the 5-bit immediate field
trpeq,trpne,trplt,trple,trpgt,trpge,trpltu,trpleu,trpgtu,trpgeu RA,RB Trap (interrupt) if (op RA RB)
trpieq,trpine,trpilt,trpile,trpigt,trpige RA,I Trap if (op RA I) where I is a 16-bit signed immediate-value
trpiequ,trpineu,trpiltu,trpileu,trpigtu,trpigeu RA,Iu Trap if (op RA Iu) where Iu is a 16-bit unsigned immediate-value

There is also some extensions, which are like macros that usually expand to a single instruction.

Extended Mnemonic Expansion Description
b target beq R0,target Unconditional branch
li RT,I (addi,addis,ori) Load immediate
mov RT,RA ori RT,RA,0 Move register RA to RT
neg RT,RA sub RT,R0,RA Negate (two's-complement)
subi RT,RA,I addi RT,RA,-I Subtract immediate

All of these instructions are available on x86,arm,riscv and the likes so no real surprises. We will implement the basic set in Lisp, mapping instructions directly to Lisp functions using macros.

1.3. Execution Model

We'll build this machine in Lisp and use plenty intrinsics from SBCL. As a starting point I followed Paul Khuong's excellent blog post: SBCL: The ultimate assembly code breadboard.

Some things to keep in mind for our machine:

  • every instruction requires at most two register reads and one register write - good for compilers
  • every instruction counts as a single cycle
  • we pay no attention to instruction-level parallelism

2. The HAKMEM VM

(ql:quickload :prelude)
(in-package :std-user)
(defpackage :hakmem
  (:use :cl :std :std-user)
  (:import-from :sb-assem :inst)
  (:import-from :sb-vm :immediate-constant :registers :zero :ea))
(in-package :hakmem)
;; (in-package :sb-x86-64-asm)
;; (in-readtable :std)
(declaim (optimize (speed 3) (safety 1)))

(eval-always
  (defconstant +word-size+ 32 "default word size and register length."))
(declaim (type (unsigned-byte #.+word-size+) +ro+))
(defconstant +r0+ 0 "constant value for register 0")
(defvar *stack* (make-array 8 :initial-contents (list sb-vm::r8-tn
                                                      sb-vm::r9-tn
                                                      sb-vm::r10-tn
                                                      sb-vm::r11-tn
                                                      sb-vm::r12-tn
                                                      sb-vm::r13-tn
                                                      sb-vm::r14-tn
                                                      sb-vm::r15-tn)))
(defvar *stack-pointer*)

(defvar *rax* sb-vm::rax-tn)
(defvar *rbx* sb-vm::rax-tn)
(defvar *rcx* sb-vm::rax-tn)
(defvar *rdx* sb-vm::rax-tn)

;; (@ 0) returns the (current) register for TOS, (@ 1) returns
;; the one just below, etc.
(defun @ (i)
  (aref *stack* (mod (+ i *stack-pointer*) (length *stack*))))

(defvar *code-base* sb-vm::rsi-tn)
(defvar *virtual-ip* sb-vm::rdi-tn)
(sb-x86-64-asm::get-gpr :qword 4)
;; (sb-vm::immediate-constant-sc 10000)
;; arena vector or list?
(defvar *instructions* (make-hash-table :test #'equal))

(defvar *primitive-code-offset* (* 64 67))

(defstruct code-page
  (alloc 0) ;; next free byte
  (code (make-array *primitive-code-offset* :element-type 'octet)))

(defun emit-code (pages emitter)
  ;; there must be as many code pages as there are stack slots
  (assert (= (length *stack*) (length pages)))
  ;; find the rightmost starting point, and align to 16 bytes
  (let* ((alloc (logandc2 (+ 15 (reduce #'max pages :key #'code-page-alloc))
                          15))
         (bytes (loop for i below (length pages)
                      for page = (elt pages i)
                      collect (let ((segment (sb-assem:make-segment))
                                    (*stack-pointer* i))
                                ;; assemble the variant for this value
                                ;; of *stack-pointer* in a fresh code
                                ;; segment
                                (sb-assem:assemble (segment)
                                  ;; but first, insert padding
                                  (sb-vm::emit-long-nop segment (- alloc (code-page-alloc page)))
                                  (funcall emitter))
                                ;; tidy up any backreference
                                (sb-assem:finalize-segment segment)
                                ;; then get the (position-independent) machine
                                ;; code as a vector of bytes
                                (sb-assem:segment-contents-as-vector segment)))))
    ;; finally, copy each machine code sequence to the right code page
    (map nil (lambda (page bytes)
               (let ((alloc (code-page-alloc page)))
                 (replace (code-page-code page) bytes :start1 alloc)
                 (assert (<= (+ alloc (length bytes)) (length (code-page-code page))))
                 (setf (code-page-alloc page) (+ alloc (length bytes)))))
         pages bytes)
    ;; and return the offset for that code sequence
    alloc))

(defun emit-all-code (&rest emitters)
  (let ((pages (loop repeat (length *stack*)
                     for page = (make-code-page)
                     ;; prefill everything with one-byte NOPs
                     do (fill (code-page-code page) #x90)
                     collect page)))
    (values (mapcar (lambda (emitter)
                      (emit-code pages emitter))
                    emitters)
            pages)))

(defun next (&optional offset)
  (setf offset (or offset 0)) ; accommodate primops that frob IP
  (let ((rotation (mod *stack-pointer* (length *stack*))))
    (inst movzx *rax* (make-ea :dword :base *virtual-ip*
                                      :disp offset))
    (unless (= -4 offset)
      (inst add *virtual-ip* (+ 4 offset)))
    (if (zerop rotation)
        (inst add *rax* *code-base*)
        (inst lea *rax* (make-ea :qword :base *code-base*
                                        :index *rax*
                                        :disp (* rotation *primitive-code-offset*))))
    (inst jmp *rax*)))

(defun swap ()
  (inst xchg (@ 0) (@ 1)) ; exchange top of stack and stack[1]
  (next))

;; todo
(defun %parse-reg3 (rt ra rb))
(defun %parse-reg2i (rt ra i))
(defun %parse-reg2ui (rt ra ui))
(defmacro def-inst (name args &body body)
    ;; todo: compose a function based on regs+args+body
    `(let ((sc *scratch*)
           (r0 +r0+)
           (ra 0)
           (rb 0)
           (rt 0))
       (declare (ignorable sc r0 ra rb rt))
       (setf (gethash ',name *instructions*) (lambda ,args (progn ,@body)))))

(defmacro def-prim (name op)
  `(def-inst ,name () (setf rt (,op ra rb))))
(def-prim add +)
(def-prim sub -)
(def-prim mul *)
(def-prim div /)
;; divu
(def-prim rem mod)
;; remu
(def-prim cmpeq =)
(def-prim cmpne /=)
(def-prim cmplt <)
(def-prim cmple <=)
(def-prim cmpgt >)
(def-prim cmpge >=)
;; ltu leu gtu geu
(def-inst addi (i)
  (setf rt (+ ra i)))
(def-inst muli (i)
  (setf rt (* ra i)))
(def-prim and logand)
(def-prim or logior)
(def-prim xor logxor)

(defun get-inst (i) (gethash i *instructions*))

(defmacro %inst (i &body args)
  `(funcall (get-inst ',i) ,@args))

(defun list-instructions (&optional (tbl *instructions*))
  (hash-table-alist tbl))

(list-instructions)
((XOR . #<FUNCTION (LAMBDA ()) {101CF0367B}>)
 (OR . #<FUNCTION (LAMBDA ()) {101CF036AB}>)
 (AND . #<FUNCTION (LAMBDA ()) {101CF036DB}>)
 (MULI . #<FUNCTION (LAMBDA (I)) {101CF0370B}>)
 (ADDI . #<FUNCTION (LAMBDA (I)) {101CF0373B}>)
 (CMPGE . #<FUNCTION (LAMBDA ()) {101CF0376B}>)
 (CMPGT . #<FUNCTION (LAMBDA ()) {101CF0379B}>)
 (CMPLE . #<FUNCTION (LAMBDA ()) {101CF037CB}>)
 (CMPLT . #<FUNCTION (LAMBDA ()) {101CF037FB}>)
 (CMPNE . #<FUNCTION (LAMBDA ()) {101CF0382B}>)
 (CMPEQ . #<FUNCTION (LAMBDA ()) {101CF0385B}>)
 (REM . #<FUNCTION (LAMBDA ()) {101CF0388B}>)
 (DIV . #<FUNCTION (LAMBDA ()) {101CF038DB}>)
 (MUL . #<FUNCTION (LAMBDA ()) {101CF0391B}>)
 (SUB . #<FUNCTION (LAMBDA ()) {101CF0394B}>)
 (ADD . #<FUNCTION (LAMBDA ()) {101CF0397B}>))