changeset 120:1c1a0150fdda

Add documentation of bytecode and parsing scheme

Instead of the Color Basic style of minimal encoding of the program, lwbasic is going to use a full bytecode encoding and raise syntax errors at parse time. This requires a substantially different approach compared to the more naïve approach previously implemented.
author William Astle <lost@l-w.ca>
date Sun, 31 Dec 2023 17:42:39 -0700
parents a6a53e5c04bd
children 5d5472b11ccd
files README.txt docs/Interpreter Operation.txt
diffstat 2 files changed, 301 insertions(+), 16 deletions(-)
--- a/README.txt	Fri Dec 29 01:40:39 2023 -0700
+++ b/README.txt	Sun Dec 31 17:42:39 2023 -0700
@@ -4,6 +4,8 @@
 intended to mess with the interpreter internals are unlikely to work, and,
 in fact, may be completely detrimental.
 
+Additional detailed documentation can be found under docs/.
+
 There are two versions of LWBasic. One is for the Coco 1 and 2. The other
 is for the Coco 3. The primary differences between the two are in the
 startup code and features that rely on the existence of Coco 3 hardware. The
@@ -48,27 +50,16 @@
 Numbers
 =======
 
-LWBasic has three numeric types: a 32 bit signed integer, stored as two's
-complement, a decimal floating point type stored in packed BCD with
-10 digits of precision and a base 10 exponent range of -63 to +63, and a
-double precision decimal floating point type with 20 decimal digits of
-precision and a base 10 exponent range from -2047 to +2047.
+LWBasic has two numeric types: a 32 bit signed integer, stored as two's
+complement, and a decimal floating point type stored in packed BCD with
+10 digits of precision and a base 10 exponent range of -63 to +63.
 
 The BCD format using 48 bits is stored as follows:
 
-Offset	Size	Content
+Offset	Bits	Content
 0	1	Sign bit - 1 for negative
 1	7	Decimal exponent with a bias of 64; 0 indicates a value of 0
-8	40	10 BCD digits of the significand
-
-*** Planned but not implememted
-The BCD double format using 96 bits is stored as follows:
-
-Offset	Size	Content
-0	1	sign bit - 1 for negative
-1	3	<reserved>
-4	12	Decimal exponent with a bias of 2048; 0 indicates value of 0
-16	80	20 BCD digits of the significand
+8	40	10 packed BCD digits of the significand
 
 It is worth noting the reason for using the BCD format instead of binary
 floating point. Because interactions with the computer are typically in base
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/docs/Interpreter Operation.txt	Sun Dec 31 17:42:39 2023 -0700
@@ -0,0 +1,294 @@
+This document is intended to describe the operation of the interpreter
+including program text management, parsing, and execution.
+
+In general, LWBasic preserves the line oriented nature of Color Basic but
+extends it somewhat to be more flexible and more efficient to interpret. The
+primary way it does this is to split the parsing and execution processes and
+to pre-parse numeric and string constants. By doing this, it removes a lot
+of complexity from the interpretation loop. Parsing is done when a line is
+entered into the program meaning that syntax errors can be detected
+immediately instead of at run time. It also means the interpretation loop
+does not have to do slow processing like finding the end of a statement.
+This will be most noticeable for things like IF statements.
+
+Parsing transforms the program into a byte code. This byte code will often
+end up being larger than the original program text. The byte code consists
+of a sequence of line structures which consist of a pointer to the next line
+followed by a 16 bit binary line number. If the pointer is NULL, the end of
+the program has been reached.
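Annotation (not part of the changeset): the line chain described above can be walked with a simple loop. This is a minimal sketch under stated assumptions: pointers are 16 bit big-endian offsets into a byte buffer (the 6809 is big-endian, but the use of buffer offsets is an illustration choice), and the program ends with a bare NULL pointer in the Color Basic style. The name `iter_lines` is hypothetical.

```python
import struct


def iter_lines(image: bytes, start: int = 0):
    """Yield (line_number, body_offset) for each line structure in image.

    Assumed layout per line: [next pointer: u16 BE][line number: u16 BE][ops...].
    A next pointer of 0 (NULL) is assumed to terminate the program.
    """
    offset = start
    while True:
        (next_ptr,) = struct.unpack_from(">H", image, offset)
        if next_ptr == 0:
            return  # NULL pointer: end of program
        (line_no,) = struct.unpack_from(">H", image, offset + 2)
        yield line_no, offset + 4  # the operations start after the 4 byte header
        offset = next_ptr


# Two lines (10 and 20), each holding only an EOL (00) operation.
image = struct.pack(">HHB", 5, 10, 0) + struct.pack(">HHB", 10, 20, 0) + b"\x00\x00"
lines = list(iter_lines(image))
```

Because each line carries a pointer to its successor, finding a line by number only needs the 4 byte headers and never has to scan the operations themselves.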
+
+Each line consists of a sequence of "operations", each with zero or more
+operands. Each of these is described below. Each section below starts with a
+header line containing the operation code, which may be more than 8 bits, a
+symbolic abbreviation for the operation, and an English short description.
+Following that is a longer description of the operation code and the
+encoding of its parameters.
+
+It should be noted that various syntactic particles do not get encoded in
+the final result even though they are keywords. The list of those is: TAB,
+TO, SUB, THEN, ELSE, STEP, OFF, FN, USING, AS, ERR, ERROR, BRK, BREAK, RGB,
+CMP. Thus they will appear in the keyword tables but not in the encoded
+program.
+
+Note also that some keywords serve as both commands and functions. In those
+cases, there will be separate operation codes.
+
+00 EOL End of Line
+
+This operation signals the end of a program line. Interpretation will
+continue with the next program line, or end if no further lines exist.
+
+
+01 CONST0 Zero constant
+
+Exactly what it says on the tin. This evaluates to an integer constant zero.
+Because zero values are common, having a dedicated code for this is
+beneficial for overall byte code compactness.
+
+02 CONST1 One constant
+
+Exactly what it says on the tin. Because one is a very common constant,
+encoding it specifically in a single byte seems sensible as a means to keep
+the byte code size smaller.
+
+03 INT8 8 bit signed integer constant
+
+This is a signed 8 bit integer constant. Most constants in programs are
+small integers. By encoding these specially, we keep the byte code more
+compact. This saves three bytes over encoding integers at 32 bits.
+
+04 INT16 16 bit signed integer constant
+
+This encoding is present to avoid taking up 32 bits for the integer data
+when 16 bits will do. Again, this is intended to keep the byte code a bit
+more compact. This saves two bytes over encoding integers at 32 bits.
+
+05 INT32 32 bit signed integer constant
+
+Exactly what it says on the tin.
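Annotation (not part of the changeset): a parser could pick the most compact of the five integer encodings above along these lines. The operation codes come from the document; the big-endian operand order and the function name `encode_int` are assumptions.

```python
import struct

# Operation codes as listed in the document.
CONST0, CONST1, INT8, INT16, INT32 = 0x01, 0x02, 0x03, 0x04, 0x05


def encode_int(value: int) -> bytes:
    """Return the shortest encoding for an integer constant.

    Operand bytes are assumed to be big-endian two's complement.
    """
    if value == 0:
        return bytes([CONST0])
    if value == 1:
        return bytes([CONST1])
    if -0x80 <= value <= 0x7F:
        return struct.pack(">Bb", INT8, value)
    if -0x8000 <= value <= 0x7FFF:
        return struct.pack(">Bh", INT16, value)
    if -0x80000000 <= value <= 0x7FFFFFFF:
        return struct.pack(">Bi", INT32, value)
    raise OverflowError("value does not fit in a 32 bit signed integer")
```

Under these assumptions, `encode_int(300)` yields three bytes (04 01 2C) instead of the five that INT32 would need.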
+
+06 BCD48 BCD Floating Point
+
+This is a 48 bit BCD floating point value where the first byte contains the
+sign bit and 7 bit exponent (stored with a bias of 64). The remaining five
+bytes contain the 10 BCD digits of the significand.
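Annotation (not part of the changeset): a sketch of decoding the six operand bytes of BCD48. The document does not state which bit of the first byte is the sign or where the decimal point sits, so this assumes the sign is the most significant bit and the significand is normalized as d.ddddddddd. The name `decode_bcd48` is hypothetical.

```python
from decimal import Decimal


def decode_bcd48(raw: bytes) -> Decimal:
    """Decode a 48 bit BCD float: sign bit, 7 bit exponent (bias 64), 10 digits."""
    if len(raw) != 6:
        raise ValueError("BCD48 values are exactly 6 bytes")
    sign = raw[0] >> 7        # assumed: sign is the top bit of the first byte
    biased = raw[0] & 0x7F
    if biased == 0:           # a stored exponent of 0 indicates the value 0
        return Decimal(0)
    digits = []
    for byte in raw[1:]:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError("invalid BCD digit")
        digits += [high, low]
    # Assumed: decimal point after the first digit, hence the extra -9.
    return Decimal((sign, tuple(digits), (biased - 64) - 9))
```

Using `Decimal` keeps the decoded value exact, which matches the motivation for BCD over binary floating point.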
+
+07 BCD16 BCD Floating Point (2 significant digits)
+08 BCD24 BCD Floating Point (4 significant digits)
+09 BCD32 BCD Floating Point (6 significant digits)
+
+Because many numbers will only need a small number of significant digits,
+encodings for numbers needing only two, four, or six significant digits are
+provided. These are intended to keep the byte code more compact.
+
+0A STRING String constant
+
+This encodes a string constant whose length fits in an 8 bit unsigned byte.
+The first byte is the length, which may be zero, with the remaining bytes
+being the string data. The string data may contain any binary values.
+
+0B LSTRING Long string constant
+
+This is exactly like STRING above but uses a 16 bit length field for
+encoding very long strings. This will not normally occur in programs but is
+included in case it is required.
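Annotation (not part of the changeset): a sketch of writing and reading the two string encodings. The operation codes are from the document; the big-endian 16 bit length and the helper names are assumptions.

```python
STRING, LSTRING = 0x0A, 0x0B  # operation codes from the document


def encode_string(data: bytes) -> bytes:
    """Use STRING when the length fits in 8 bits, otherwise LSTRING."""
    length = len(data)
    if length <= 0xFF:
        return bytes([STRING, length]) + data
    if length <= 0xFFFF:
        # Assumed: the 16 bit length is stored big-endian.
        return bytes([LSTRING]) + length.to_bytes(2, "big") + data
    raise ValueError("string too long for a 16 bit length field")


def decode_string(buf: bytes, pos: int = 0):
    """Return (data, position just past the constant)."""
    op = buf[pos]
    if op == STRING:
        length, start = buf[pos + 1], pos + 2
    elif op == LSTRING:
        length, start = int.from_bytes(buf[pos + 1:pos + 3], "big"), pos + 3
    else:
        raise ValueError("not a string constant")
    return buf[start:start + length], start + length
```

Because the length is explicit, the data may contain any byte value, including quotes and zero bytes, with no escaping.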
+
+1D VARS Scalar variable reference
+
+This is a reference to a scalar variable. It is followed by a single byte
+holding the variable type (integer, floating point, string) in its upper 3
+bits and the name length in its lower 5 bits, followed by the variable name
+*without* a type sigil. Note that this encoding is also used in the DIM
+command. Note that type 0 indicates an unspecified type (no sigil) which
+will be looked up at runtime and defaults to floating point.
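Annotation (not part of the changeset): a sketch of packing and unpacking the type and length byte of VARS. Only type 0 (unspecified) is given by the document; the numbers used here for integer, floating point, and string are hypothetical, as are the helper names and the ASCII name encoding.

```python
VARS = 0x1D            # operation code from the document
TYPE_UNSPECIFIED = 0   # from the document: no sigil, resolved at run time
# Hypothetical numbering; the document does not assign these values.
TYPE_INTEGER, TYPE_FLOAT, TYPE_STRING = 1, 2, 3


def encode_vars(name: str, vtype: int = TYPE_UNSPECIFIED) -> bytes:
    """Encode a scalar reference: opcode, type/length byte, bare name."""
    raw = name.encode("ascii")
    if not 1 <= len(raw) <= 0x1F:
        raise ValueError("name length must fit in 5 bits")
    if not 0 <= vtype <= 7:
        raise ValueError("type must fit in 3 bits")
    return bytes([VARS, (vtype << 5) | len(raw)]) + raw


def decode_vars(buf: bytes, pos: int = 0):
    """Return (type, name, position just past the reference)."""
    if buf[pos] != VARS:
        raise ValueError("not a VARS operation")
    header = buf[pos + 1]
    vtype, length = header >> 5, header & 0x1F
    start = pos + 2
    return vtype, buf[start:start + length].decode("ascii"), start + length
```

A 5 bit length field limits names to 31 characters under this reading.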
+
+1E VARA Array variable reference
+
+This is exactly like VARS except that the variable name string is followed
+by a sequence of expressions specifying the subscript values. The
+sequence of expressions begins with a count (8 bits) followed by the
+expressions. The expression count is required to allow skipping over the
+subscript references without having to know how many dimensions an array
+has. Further, it is not possible to know how many dimensions are required at
+parse time. Note that this encoding is also used in the DIM command. Note
+that type 0 indicates an unspecified type (no sigil) which will be looked up
+at runtime and defaults to floating point.
+
+1F EXPR Expression
+
+This indicates an expression to be evaluated. It is followed by a sequence
+of terms and operators to be evaluated. The expression is stored in postfix
+order and will be evaluated using an expression evaluation stack. Each
+operation will fetch zero or more operands from the evaluation stack, do its
+calculation, and then push its result back onto the evaluation stack. When
+an "end of expression" operator is encountered, the result is popped from
+the stack and left in the result destination. Note that an end of expression
+operator is required because unary operators exist.
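Annotation (not part of the changeset): a sketch of the stack-based postfix evaluation described above, covering only a handful of operations. The operation codes are from the document; the function name and the restriction to integer operands are assumptions made to keep the example short.

```python
import operator

# Operation codes from the document (subset).
CONST0, CONST1, INT8 = 0x01, 0x02, 0x03
EXPR, EOE, NEG, ADD, SUB, MUL = 0x1F, 0x20, 0x21, 0x22, 0x23, 0x24
BINARY = {ADD: operator.add, SUB: operator.sub, MUL: operator.mul}


def eval_expr(code: bytes, pos: int = 0):
    """Evaluate an EXPR sequence; return (result, position after EOE)."""
    if code[pos] != EXPR:
        raise ValueError("expected EXPR")
    pos += 1
    stack = []
    while True:
        op = code[pos]
        pos += 1
        if op == EOE:
            if len(stack) != 1:
                raise ValueError("malformed expression")
            return stack.pop(), pos
        if op == CONST0:
            stack.append(0)
        elif op == CONST1:
            stack.append(1)
        elif op == INT8:
            stack.append(int.from_bytes(code[pos:pos + 1], "big", signed=True))
            pos += 1
        elif op == NEG:
            stack.append(-stack.pop())
        elif op in BINARY:
            right = stack.pop()  # operands come off in reverse order
            left = stack.pop()
            stack.append(BINARY[op](left, right))
        else:
            raise ValueError(f"unsupported operation {op:02X}")


# (2 + 3) * -4 in postfix: 2 3 ADD 4 NEG MUL
code = bytes([EXPR, INT8, 2, INT8, 3, ADD, INT8, 4, NEG, MUL, EOE])
result, end = eval_expr(code)
```

The loop never consults a precedence table: the order of the operations in the byte code already fixes the order of evaluation.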
+
+Note that an expression will be converted back to infix notation when
+listed using parentheses only as required to account for operator
+precedence. This means that an expression entered with parentheses may be
+listed back out without parentheses.
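Annotation (not part of the changeset): a sketch of how a lister could rebuild infix text from postfix with only the parentheses that precedence requires. It works on symbolic tokens rather than byte code, and the precedence table and the name `to_infix` are assumptions.

```python
# Assumed precedence levels; higher binds tighter.
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}
ATOM = 9  # precedence given to plain operands


def to_infix(postfix):
    """Convert a list of postfix tokens to an infix string."""
    stack = []  # entries are (text, precedence of the outermost operator)
    for tok in postfix:
        if tok in PREC:
            prec = PREC[tok]
            right_text, right_prec = stack.pop()
            left_text, left_prec = stack.pop()
            if left_prec < prec:
                left_text = f"({left_text})"
            # Operators are treated as left associative, so the right operand
            # is also wrapped when its precedence is equal.
            if right_prec <= prec:
                right_text = f"({right_text})"
            stack.append((f"{left_text}{tok}{right_text}", prec))
        else:
            stack.append((str(tok), ATOM))
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0][0]
```

For example, an expression entered as `A+(B*C)` is stored as `A B C * +` and lists back as `A+B*C`, which is the behaviour the paragraph above describes.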
+
+Postfix notation is used to store expressions because it avoids having to
+deal with operator precedence at run time.
+
+20 EOE End of expression operator
+
+This signifies the end of an expression and triggers the expression
+evaluator to return its result.
+
+21 NEG Negation
+22 ADD Addition
+23 SUB Subtraction
+24 MUL Multiplication
+25 DIV Division
+26 MOD Modulus
+27 NOT Boolean not
+28 AND Boolean and
+29 OR Boolean or
+2A XOR Boolean exclusive or
+2B COM Bitwise complement
+2C LAND Bitwise and
+2D LOR Bitwise or
+2E LXOR Bitwise exclusive or
+2F CONCAT String concatenation
+30 EQ Equality comparison
+31 NE Inequality comparison
+32 GT Greater than comparison
+33 LT Less than comparison
+34 GE Greater than or equal comparison
+35 LE Less than or equal comparison
+36 EXP Exponentiation
+
+These are the basic arithmetic, boolean, and logical operators.
+
+40...7F: built in functions
+
+40 SGN
+41 INT
+42 ABS
+43 USR
+44 RND
+45 SIN
+46 PEEK
+47 LEN
+48 STR$
+49 VAL
+4A ASC
+4B CHR$
+4C EOF
+4D JOYSTK
+4E LEFT$
+4F RIGHT$
+50 MID$
+51 POINT
+52 INKEY$
+53 MEM
+54 ATN
+55 COS
+56 TAN
+57 EXP
+58 FIX
+59 LOG
+5A POS
+5B SQR
+5C HEX$
+5D VARPTR
+5E INSTR
+5F TIMER
+60 PPOINT
+61 STRING$ 
+62 CVN
+63 FREE
+64 LOC
+65 LOF
+66 MKN$
+67 LPEEK
+68 BUTTON
+69 ERNO/ERRNO
+6A ERLIN/ERRLINE
+6B ATTR
+
+80...DF: commands
+
+80 FOR
+81 GOTO
+82 GOSUB
+83 REM
+84 ' (Separate to REM because of different semantics)
+85 IF
+86 DATA
+87 PRINT
+88 ON
+89 INPUT
+8A END
+8B NEXT
+8C DIM
+8D READ
+8E RUN
+8F RESTORE
+90 RETURN
+91 POP
+92 STOP
+93 POKE
+94 CONT
+95 LIST
+96 CLEAR
+97 NEW
+98 OPEN
+99 CLOSE
+9A LLIST
+9B SET
+9C RESET
+9D CLS
+9E MOTOR
+9F SOUND
+A0 EXEC
+A1 DEL
+A2 EDIT
+A3 TRON
+A4 TROFF
+A5 DEF
+A6 LET
+A7 LINE
+A8 PCLS
+A9 PSET
+AA PRESET
+AB SCREEN
+AC PCLEAR
+AD COLOR
+AE CIRCLE
+AF PAINT
+B0 GET
+B1 PUT
+B2 DRAW
+B3 PCOPY
+B4 PMODE
+B5 PLAY
+B6 RENUM
+B7 DIR
+B8 DRIVE
+B9 FIELD
+BA FILES
+BB KILL
+BC LOAD
+BD LSET
+BE MERGE
+BF RENAME
+C0 RSET
+C1 SAVE
+C2 WRITE
+C3 VERIFY
+C4 UNLOAD
+C5 DSKINI
+C6 BACKUP
+C7 COPY
+C8 DSKI$
+C9 DSKO$
+CA DOS
+CB WIDTH
+CC PALETTE
+CD LPOKE
+CE LOCATE
+CF ATTR
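Annotation (not part of the changeset): a sketch of sorting an operation code into the groups used in this document. The 40...7F and 80...DF ranges are stated in the document; the other boundaries are inferred from the codes listed, and the function name and group labels are hypothetical.

```python
def classify(opcode: int) -> str:
    """Map an operation code to the group it falls in."""
    if opcode == 0x00:
        return "eol"
    if 0x01 <= opcode <= 0x0B:   # inferred: CONST0 through LSTRING
        return "constant"
    if 0x1D <= opcode <= 0x1E:   # inferred: VARS and VARA
        return "variable"
    if opcode == 0x1F:
        return "expression"
    if 0x20 <= opcode <= 0x36:   # inferred: EOE through EXP
        return "operator"
    if 0x40 <= opcode <= 0x7F:   # stated: built in functions
        return "function"
    if 0x80 <= opcode <= 0xDF:   # stated: commands
        return "command"
    return "unassigned"
```

Note that ATTR appears twice (6B and CF), which matches the earlier remark that keywords serving as both functions and commands get separate operation codes.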