programming language; part 1
in which i propose a language concept
over the years i have attempted to make many programming languages. often it’s something very silly and esoteric like “lazers” (weird 2d language), that one regex replacement language (bad /// clone), or No Semicolon C (which isn’t really a language, but it’s language adjacent so i’ll count it here). however, over the past 6 or so years i’ve wanted to make an actually useful programming language, and have thus rewritten it about 22 times (rough estimate). hopefully this time i can actually make a language!
this part 1 will pretty much just be me listing the silly ideas that led up to this language & a status on my current progress (pretty much just lexing & parsing)
general ideas / goals
the language itself is called asyl, i might write about the origin story of the name in a later part of this series¹.
pretty much i want to make a “small” language (as many people probably do), and i really like the idea of not giving primitive types too much priority over user-defined things, that includes things like
- very few shorthands for types (currently only strings and functions)
- lots of fun syntax things to make operations doable on multiple types
- hopefully easily extending existing things / types / functions, &c
- but also very minimal type system? i don’t really wanna program a full type system but i like types
this leads to some sillier ideas such as
- embed some binary (like an image) directly in your source code, and use a fancy enough editor to view it
- far too much metaprogramming
- also what if every op was actually just a function call, and there’s just way too much syntax for calling functions
- that’s a good reason to have it typed, so we could have compile-time function overloading
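to make the "every op is a function call, picked by type" idea a bit more concrete, here is a rough sketch in python (not asyl, and not how the racket implementation does it). all the names (`OVERLOADS`, `define`, `call`) are made up for illustration, and the lookup happens at run time here, whereas the idea above is to do the same kind of lookup at compile time from declared types. the two definitions loosely mirror the two `factorial` functions in the test file later in this post.

```python
from typing import Callable

# hypothetical overload table: (function name, parameter types) -> implementation
OVERLOADS: dict[tuple[str, tuple[type, ...]], Callable] = {}


def define(name: str, *param_types: type):
    """Register the decorated function as one overload of `name`."""
    def register(fn: Callable) -> Callable:
        OVERLOADS[(name, param_types)] = fn
        return fn
    return register


def call(name: str, *args):
    """Pick an overload from the name and the exact types of the arguments."""
    arg_types = tuple(type(a) for a in args)
    try:
        fn = OVERLOADS[(name, arg_types)]
    except KeyError:
        raise TypeError(f"no overload of {name} for {arg_types}") from None
    return fn(*args)


@define("factorial", int, int)
def factorial_acc(n, t):
    # accumulator version: t carries the running product
    return t if n <= 0 else call("factorial", n - 1, call("*", t, n))


@define("factorial", int)
def factorial(n):
    return call("factorial", n, 1)


@define("*", int, int)
def multiply(a, b):
    # even the operator is just another entry in the table
    return a * b
```

with this, `call("factorial", 5)` resolves to the one-argument overload, which in turn resolves to the two-argument one.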
for actually writing the language, i’m using racket this time around because it seems to already implement most of the ideas of compile-time vs. run-time that i want with its idea of phase levels, as well as generally having a lot of language-building conveniences.
lexing
the file itself is parsed with bytes rather than unicode characters, and my tokens are as follows:
- whitespace: 00→32, and the utf-8 encodings³ for 85, A0, 1680, 2000→200B, 2028, 2029, 202F, 205F, 3000, and FEFF
- newline characters: 0A→0D, and the utf-8 encodings for 85, 2028, and 2029
- some basic op characters, ()[]{}:;.,@
- the most terrifying string syntax: 'marker'text'marker', the marker can be any number of bytes that aren’t ', including no bytes for a quick ''string'', and matches any number of bytes until the marker is reached, and no escape sequence parsing of any sort
- line comments: # line comment... terminated by a newline
- block comments: #''block comment'', just comments until the end of the string, the # and ' need to be next to each other, so # ''something like this is still a line comment''
- the one keyword so far (this number will likely change in the future): fn
- vague reader extensions with #@ although currently that just crashes the lexer
- any other text is treated as an identifier⁴, which allow escapes in the forms: \hh, \{u…}, and \c (any literal character)
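since the lexer works on bytes, every non-ascii whitespace code point from the list has to be matched as its utf-8 byte sequence. a small python sketch of what that table could look like (python only for illustration, the real lexer is racket). `WHITESPACE` and `whitespace_length` are hypothetical names, and the sketch assumes "00→32" means bytes 0 through 32 decimal (0x20, the space).

```python
# hypothetical sketch: the byte sequences a byte-oriented lexer would have to
# recognise as whitespace, built from the code points in the list above.

# assumption: "00→32" means bytes 0 through 32 decimal (0x20, the space).
ASCII_WHITESPACE = [bytes([b]) for b in range(0x00, 0x20 + 1)]

# non-ascii whitespace code points from the list (2000→200B is a range).
UNICODE_WHITESPACE = (
    [0x85, 0xA0, 0x1680]
    + list(range(0x2000, 0x200B + 1))
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000, 0xFEFF]
)

WHITESPACE = ASCII_WHITESPACE + [
    chr(cp).encode("utf-8") for cp in UNICODE_WHITESPACE
]


def whitespace_length(data: bytes, pos: int) -> int:
    """Return how many bytes of whitespace start at data[pos], or 0 if none."""
    for seq in WHITESPACE:
        if data.startswith(seq, pos):
            return len(seq)
    return 0
```

the multi-byte entries come out as two or three bytes each (for example A0 becomes C2 A0, and 2028 becomes E2 80 A8), so the lexer has to look ahead a few bytes before deciding whether something is whitespace or the start of an identifier.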
the actual lexer itself is just some handwritten input-port-reading nonsense but works rather well.
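to show how the marker strings and the two comment forms could be read, here is a minimal python sketch over a `bytes` buffer. it is not the racket lexer, just one way the rules above might work. `read_string` and `skip_comment` are hypothetical names, and it assumes a line comment stops at the newline without consuming it (the post only says "terminated by a newline").

```python
# newline byte sequences from the token list: 0A→0D plus utf-8 for 85, 2028, 2029
NEWLINES = [b"\n", b"\x0b", b"\x0c", b"\r",
            "\u0085".encode("utf-8"),
            "\u2028".encode("utf-8"),
            "\u2029".encode("utf-8")]


def read_string(data: bytes, pos: int):
    """Read a 'marker'text'marker' string starting at data[pos].

    Returns (text, position just past the closing delimiter).
    No escape sequences are interpreted.
    """
    if data[pos:pos + 1] != b"'":
        raise ValueError("expected opening quote")
    marker_end = data.find(b"'", pos + 1)
    if marker_end == -1:
        raise ValueError("unterminated string marker")
    # the delimiter is the marker wrapped in quotes; just '' for an empty marker
    delimiter = data[pos:marker_end + 1]
    text_start = marker_end + 1
    text_end = data.find(delimiter, text_start)
    if text_end == -1:
        raise ValueError("unterminated string")
    return data[text_start:text_end], text_end + len(delimiter)


def skip_comment(data: bytes, pos: int) -> int:
    """Return the position just past the comment starting at the '#' in data[pos]."""
    if data[pos:pos + 1] != b"#":
        raise ValueError("expected '#'")
    following = data[pos + 1:pos + 2]
    if following == b"@":
        # reader extensions are not sketched here
        raise NotImplementedError("reader extensions")
    if following == b"'":
        # block comment: a string directly after the '#', contents thrown away
        _, end = read_string(data, pos + 1)
        return end
    # line comment: runs up to (assumed: not including) the first newline
    end = len(data)
    for newline in NEWLINES:
        found = data.find(newline, pos)
        if found != -1:
            end = min(end, found)
    return end
```

for example `read_string(b"'x'it's'x'", 0)` gives back the text `b"it's"`, since a lone `'` inside the text doesn't match the `'x'` delimiter.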
parsing
for parsing i wanted to try out brag, which lets me write a grammar like
#lang brag
block : stmt ';' block | stmt | ∅
stmt : "fn" [s-ident] '(' table ')' stmt
     | '@' expr stmt
     | expr stmt-tail
stmt-tail : expr stmt-tail
          | '.' expr-dot ['(' table ')'] stmt-tail
          | '.' '(' table ')' stmt-tail
          | ∅
expr-dot : '@' expr expr-dot | expr-head
expr : "fn" '(' table ')' expr
     | '@' expr expr
     | expr-head expr-tail*
expr-head : s-ident | s-string | ':' s-ident | '{' block '}'
expr-tail : '(' table ')'
table : table-key ',' table | table-key | ∅
table-key : expr ':' block | block | '.' [ block ]
; to generate specific ast nodes
s-ident : IDENT
s-string : STRING
and it sets up all the parsing for me automatically, so i just give it a list of brag tokens and it returns some vague ast.
also yes, almost all of these combinations of .s, ()s, and @s are function calls, i’ll probably go into more detail once functions are actually being called.
unparsing?
problem is the ast it returns is way too verbose and also just a direct translation of the syntax tree, so i have an “unparsing” step that syntax-parse’s⁵ the strings, for example my current testing file contains⁶
#lang asyl
@public
fn factorial(let Number n, let Number t, ->: Number)
    if {n .<= 0}
        t
        factorial(n .- 1, t .*n);
@public
fn factorial(let Number n, ->: Number)
    factorial(n, 1);
which expands to these tokens
(list (token-struct '@ "@" 11 2 1 1 #f)
      (token-struct 'IDENT #"public" 12 2 2 6 #f)
      (token-struct 'fn "fn" 19 3 1 2 #f)
      (token-struct 'IDENT #"factorial" 22 3 4 9 #f)
      (token-struct '|(| "(" 31 3 13 1 #f)
      (token-struct 'IDENT #"let" 32 3 14 3 #f)
      (token-struct 'IDENT #"Number" 36 3 18 6 #f)
      (token-struct 'IDENT #"n" 43 3 25 1 #f)
      (token-struct '|,| "," 44 3 26 1 #f)
      (token-struct 'IDENT #"let" 46 3 28 3 #f)
      (token-struct 'IDENT #"Number" 50 3 32 6 #f)
      (token-struct 'IDENT #"t" 57 3 39 1 #f)
      (token-struct '|,| "," 58 3 40 1 #f)
      (token-struct 'IDENT #"->" 60 3 42 2 #f)
      (token-struct ': ":" 62 3 44 1 #f)
      (token-struct 'IDENT #"Number" 64 3 46 6 #f)
      (token-struct '|)| ")" 70 3 52 1 #f)
      (token-struct 'IDENT #"if" 73 4 2 2 #f)
      (token-struct '|{| "{" 76 4 5 1 #f)
      (token-struct 'IDENT #"n" 77 4 6 1 #f)
      (token-struct '|.| "." 79 4 8 1 #f)
      (token-struct 'IDENT #"<=" 80 4 9 2 #f)
      (token-struct 'IDENT #"0" 83 4 12 1 #f)
      (token-struct '|}| "}" 84 4 13 1 #f)
      (token-struct 'IDENT #"t" 88 5 3 1 #f)
      (token-struct 'IDENT #"factorial" 92 6 3 9 #f)
      (token-struct '|(| "(" 101 6 12 1 #f)
      (token-struct 'IDENT #"n" 102 6 13 1 #f)
      (token-struct '|.| "." 104 6 15 1 #f)
      (token-struct 'IDENT #"-" 105 6 16 1 #f)
      (token-struct 'IDENT #"1" 107 6 18 1 #f)
      (token-struct '|,| "," 108 6 19 1 #f)
      (token-struct 'IDENT #"t" 110 6 21 1 #f)
      (token-struct '|.| "." 112 6 23 1 #f)
      (token-struct 'IDENT #"*n" 113 6 24 2 #f)
      (token-struct '|)| ")" 115 6 26 1 #f)
      (token-struct '|;| ";" 116 6 27 1 #f)
      (token-struct '@ "@" 118 7 1 1 #f)
      (token-struct 'IDENT #"public" 119 7 2 6 #f)
      (token-struct 'fn "fn" 126 8 1 2 #f)
      (token-struct 'IDENT #"factorial" 129 8 4 9 #f)
      (token-struct '|(| "(" 138 8 13 1 #f)
      (token-struct 'IDENT #"let" 139 8 14 3 #f)
      (token-struct 'IDENT #"Number" 143 8 18 6 #f)
      (token-struct 'IDENT #"n" 150 8 25 1 #f)
      (token-struct '|,| "," 151 8 26 1 #f)
      (token-struct 'IDENT #"->" 153 8 28 2 #f)
      (token-struct ': ":" 155 8 30 1 #f)
      (token-struct 'IDENT #"Number" 157 8 32 6 #f)
      (token-struct '|)| ")" 163 8 38 1 #f)
      (token-struct 'IDENT #"factorial" 166 9 2 9 #f)
      (token-struct '|(| "(" 175 9 11 1 #f)
      (token-struct 'IDENT #"n" 176 9 12 1 #f)
      (token-struct '|,| "," 177 9 13 1 #f)
      (token-struct 'IDENT #"1" 179 9 15 1 #f)
      (token-struct '|)| ")" 180 9 16 1 #f)
      (token-struct '|;| ";" 181 9 17 1 #f))
and this ast
'(block (stmt "@" (expr (expr-head (s-ident #"public"))) (stmt "fn" (s-ident #"factorial") "(" (table (table-key (block (stmt (expr (expr-head (s-ident #"let"))) (stmt-tail (expr (expr-head (s-ident #"Number"))) (stmt-tail (expr (expr-head (s-ident #"n"))) (stmt-tail)))))) "," (table (table-key (block (stmt (expr (expr-head (s-ident #"let"))) (stmt-tail (expr (expr-head (s-ident #"Number"))) (stmt-tail (expr (expr-head (s-ident #"t"))) (stmt-tail)))))) "," (table (table-key (expr (expr-head (s-ident #"->"))) ":" (block (stmt (expr (expr-head (s-ident #"Number"))) (stmt-tail))))))) ")" (stmt (expr (expr-head (s-ident #"if"))) (stmt-tail (expr (expr-head "{" (block (stmt (expr (expr-head (s-ident #"n"))) (stmt-tail "." (expr-dot (expr-head (s-ident #"<="))) (stmt-tail (expr (expr-head (s-ident #"0"))) (stmt-tail))))) "}")) (stmt-tail (expr (expr-head (s-ident #"t"))) (stmt-tail (expr (expr-head (s-ident #"factorial")) (expr-tail "(" (table (table-key (block (stmt (expr (expr-head (s-ident #"n"))) (stmt-tail "." (expr-dot (expr-head (s-ident #"-"))) (stmt-tail (expr (expr-head (s-ident #"1"))) (stmt-tail)))))) "," (table (table-key (block (stmt (expr (expr-head (s-ident #"t"))) (stmt-tail "." (expr-dot (expr-head (s-ident #"*n"))) (stmt-tail))))))) ")")) (stmt-tail))))))) ";" (block (stmt "@" (expr (expr-head (s-ident #"public"))) (stmt "fn" (s-ident #"factorial") "(" (table (table-key (block (stmt (expr (expr-head (s-ident #"let"))) (stmt-tail (expr (expr-head (s-ident #"Number"))) (stmt-tail (expr (expr-head (s-ident #"n"))) (stmt-tail)))))) "," (table (table-key (expr (expr-head (s-ident #"->"))) ":" (block (stmt (expr (expr-head (s-ident #"Number"))) (stmt-tail)))))) ")" (stmt (expr (expr-head (s-ident #"factorial")) (expr-tail "(" (table (table-key (block (stmt (expr (expr-head (s-ident #"n"))) (stmt-tail)))) "," (table (table-key (block (stmt (expr (expr-head (s-ident #"1"))) (stmt-tail)))))) ")")) (stmt-tail)))) ";" (block)))
and finally gets “unparsed” into
'(#%kw-block
  (public
   (#%kw-fn
    (factorial (let Number n) (let Number t) (#%kw-dot -> Number))
    (if (<= n |0|) t (factorial (- n |1|) (*n t)))))
  (public
   (#%kw-fn
    (factorial (let Number n) (#%kw-dot -> Number))
    (factorial n |1|)))
  (#%kw-block))
to be expanded later on. “unparsing” is definitely the wrong name for this, but i don’t care, it sounds funny.
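as a rough illustration of what the "unparsing" does to a single statement, here is a python sketch (the real step uses syntax-parse in racket, this is only a guess at the shape of it). it works on an already-flattened statement instead of the brag tree, and uses python lists where the output above uses s-expressions. `Dot`, `collapse`, and `unparse_stmt` are hypothetical names. the rule that everything before a `.name` becomes that call's first argument is an assumption: the output above only shows it for a single expression before the dot (`n .<= 0` becoming `(<= n |0|)`, and `t .*n` becoming `(*n t)`).

```python
from dataclasses import dataclass


@dataclass
class Dot:
    """A `.name` piece of a statement, e.g. the `.<=` in `n .<= 0`."""
    name: str


def collapse(parts):
    """A one-element statement is just that element, not a call."""
    return parts[0] if len(parts) == 1 else parts


def unparse_stmt(items):
    """Turn a flat statement into a prefix-style nested list."""
    if not items or isinstance(items[0], Dot):
        raise ValueError("a statement has to start with an expression")
    parts = []
    for item in items:
        if isinstance(item, Dot):
            # assumption: everything so far becomes the first argument
            parts = [item.name, collapse(parts)]
        else:
            # plain juxtaposition: one more element of the current call
            parts.append(item)
    return collapse(parts)
```

a few statements from the test file, fed through it:

```python
unparse_stmt(["n", Dot("<="), "0"])      # ['<=', 'n', '0']
unparse_stmt(["t", Dot("*n")])           # ['*n', 't']
unparse_stmt(["let", "Number", "n"])     # ['let', 'Number', 'n']
unparse_stmt(["t"])                      # 't'
```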
for the future
the next step to implement is some macro expansion system. originally i wanted to just use racket’s expander, but it differs from what i want in enough ways that i now think i need to make my own. so far there’s pretty much nothing implemented there except for a giant comment with my ideas in it and a function that crashes.
already being my 22nd attempt⁷, there’s a high chance it’s not my last, but so far it’s going along pretty well and seeming fairly doable as a programming language, although i don’t have as much time or motivation as i’d like to work on it. currently the implementation isn’t published online anywhere, but i might open-source it sometime soon once i can be more confident this language won’t explode anytime soon.
see you next month if i can get an idea for a post by then.
-michael