shithub: mc

Download patch

ref: e467b6f2f715a98732bd401a483af2d45b5cf32d
parent: 38f794994e2b62efdbaebabf18eb0cee84390a2f
author: Ori Bernstein <[email protected]>
date: Tue Aug 7 19:40:24 EDT 2012

New version of language docs.

--- a/doc/lang.txt
+++ b/doc/lang.txt
@@ -1,8 +1,8 @@
                     The Myrddin Programming Language
-                              Jun 2012
+                              Aug 2012
                             Ori Bernstein
 
-Overview:
+1. OVERVIEW:
 
         Myrddin is designed to be a simple, low level programming
         language.  It is designed to provide the programmer with
@@ -16,195 +16,342 @@
         easy to understand language for work that needs to be close
         to the hardware.
 
-Introduction:
+        Myrddin is a computer language influenced strongly by C
+        and ML, with ideas from Rust, Go, C++, and numerous other
+        sources and resources.
 
-    We begin with the archetypical "Hello world" example, deconstructing
-    it as we go:
 
-        use std
+2. LEXICAL CONVENTIONS:
 
-        const main = {
-            /* say hello */
-            std.write(1, "Hello World\n")
-        }
+    The language is composed of several classes of token. There
+    are comments, identifiers, keywords, punctuation, and whitespace.
+    
+    Comments, begin with "/*" and end with "*/". They may nest.
 
-    The first line, `use std`, tells the compiler to import the standard
-    library, which at the time of this writing only barely exists as a
-    copy-paste group of files that works only on Linux, implementing almost
-    no useful functions.  One of the functions that it does provide,
-    however, is the 'write' system call.
+        /* this is a comment /* with another inside */ */
 
-    The next line, 'const main = ...' declares a constant value called
-    'main'. These constant values must be initialized at their declaration
-    to a literal value. In this case, it is intialized to a constant
-    function '{;std.write(1, "Hello World\n");}'
+    Identifiers begin with any alphabetic character or underscore,
+    and continue with any number of alphanumeric characters or
+    underscores. Currently the compiler places a limit of 1024
+    bytes on the length of the identifier.
 
-    In Myrddin, all functions begin with a '{', followed by a list
-    of arguments, which is terminated by a newline (or semicolon. The
-    two are equivalent). This is followed by any number of statements,
-    and closed by a '}'.
+        some_id_234__
 
-    The text '/* say hello */' is a comment. It is ignored by the compiler,
-    and is used to add useful information for the programmer. In Myrddin,
-    unlike many popular languages, comments nest. This makes code like
-    /* outer /* inner coment */ comment */ valid.
+    Keywords are a special class of identifier that is reserved
+    by the language and given a special meaning. The set of
+    keywords in Myrddin are as follows:
 
-    The text 'std.write' refers the 'write' function from the 'std' library.
-    In Myrddin, a name can belong to an imported namespace. The language,
-    for reasons of parsimony, only allows one level of namespace. I saw
-    Java package names and ran screaming in horror, possibly too far to
-    the other extreme. This function is statically typed, taking a single
-    integer argument, and a byte slice to write.
+        castto          match
+        const           pkg
+        default         protect
+        elif            sizeof
+        else            struct
+        export          trait
+        extern          true
+        false           type
+        for             union
+        generic         use
+        goto            var
+        if              while
 
-    The text '(1, "Hello World)' is the function call itself. It takes
-    the literal "1", and the byte slice "Hello World\n", and calls the
-    function 'std.write' with them as arguments.
 
-    It would be useful now to specify that the value '1' is an integer-like
-    constant, but it is not an integer. It is polymorphic, and can be used
-    at any point where a value of any integer type is needed.
+    At the current stage of development, not all of these keywords
+    are implemented within the language.[1]
 
-Declarations:
+    Literals are a direct representation of a data object within the
+    source of the program. There are several literals implemented
+    within the Myrddin language:
 
-    In Myrddin, declarations take the following form:
+        Integers literals are a sequence of digits, beginning with a
+        digit and possibly separated by underscores. They are of a
+        generic type, and can be used where any numeric type is
+        expected. They may be prefixed with "0x" to indicate that the
+        following number is a hexadecimal value, or 0b to indicate a
+        binary value. Decimal values are not prefixed, and octal values
+        are not supported.
 
-        var|const|generic name [: type] [= expr]
+            eg: 0x123_fff, 0b1111, 1234
 
-    To give a few examples:
+        Float literals are also a sequence of digits beginning with a
+        digit and possibly separated by underscores. They are also of a
+        generic type, and may be used whenever a floating point type is
+        expected. Floating point literals are always in decimal, and
+        as of this writing, exponential notation is not supported[2]
 
-        var x
-        var foo : int
-        const c = 123
-        const pi : float32 = 3.1415
-        generic id : (@a -> @a) = {a:@a -> @a; -> a}
+            eg: 123.456
 
-    The first example, 'var x', declares a variable named x. The type is not
-    set explicitly, but it will be determined by the compiler (or the code
-    will fail to compile, saying that the type of the variable could not
-    be determined).
+        String literals represent a byte array describing a string in
+        the compile time character set. Any byte values are allowed in
+        a string literal. There are a number of escape sequences
+        supported:
+            \n          newline
+            \r          carriage return
+            \t          tab
+            \b          backspace
+            \"          double quote
+            \'          single quote
+            \v          vertical tab
+            \\          single slash
+            \0          nul character
+            \xDD        single byte value, where DD are two hex digits.
+        String literals begin with a ", and continue to the next
+        unescaped ".
 
-    The second example, 'var foo : int' explicitly sets the type of a
-    variable named 'foo' to an integer. It does not initialize it. However,
-    it is [FIXME: make this not a lie] a compilation error to use a
-    variable without prior intialization, so this is not dangerous.
+            eg: "foo\"bar"
 
-    The third example, 'cosnt c = 123' declares a constant named c,
-    and initializes it to the value 123. All constants require initializers,
-    as they cannot be assigned to later in the code.
+        Character literals represent a single codepoint in the character
+        set. A character starts with a single quote, contains a single
+        codepoint worth of text, encoded either as an escape sequence
+        or in the input character set for the compiler (generally UTF8).
 
-    The fourth example, 'const pi : float32 = 3.1415', shows the full form
-    of declarations. It includes both the type and initializer components.
+            eg: 'א', '\n', '\u1234'[3]
 
-    The final "overdeclared" example declares a generic function called
-    'id', which takes any type '@a' and returns the same type. It is
-    initialized to a function which specifies these types again, and
-    has a body that returns it's argument. This is not idiomatic code,
-    and is only provided as an example of what is possible. The normal
-    declaration would look something like this:
+        Boolean literals are either the keyword "true" or the keyword
+        "false".
 
-        generic id = {a:@a; -> a}
+            eg: true, false
 
-Control Structures:
+        Funciton literals describe a function. They begin with a '{',
+        followed by a newline-terminated argument list, followed by a
+        body and closing '}'. They will be described in more detail
+        later in this manual.
 
-Types:
+            eg: {a : int, b
+                    -> a + b
+                }
+        
+        Sequence literals describe either an array or a structure
+        literal. They begin with a '[', followed by an initializer
+        sequence and closing ']'. For array literals, the initializer
+        sequence is either an indexed initializer sequence[4], or an
+        unindexed initializer sequence. For struct literals, the
+        initializer sequence is always a named initializer sequence.
 
-    Myrddin comes with a large number of built in types. These are
-    listed below:
+        An unindexed initializer sequence is simply a comma separated
+        list of values. An indexed initializer sequence contains a
+        '#number=value' comma separated sequence, which indicates the
+        index of the array into which the value is inserted. A named
+        initializer sequence contains a comma separated list of
+        '.name=value' pairs.
 
-        void
-            The void type. This type represents an empty value.
-            For reasons of consistency when specializing generics, void
-            values can be created, assigned to, and manipulated like
-            any other value.
+            eg: [1,2,3], [#2=3, #1=2, #0=1], [.a = 42, .b="str"]
 
-        bool
-            A Boolean type. The value of this is either 'true' (equivalent
-            to any non-zero) or 'false', equivalent to a zero value. The
-            size of this type is undefined.
+        A tuple literal is a parentheses separated list of values.
+        A single element tuple contains a trailing comma.
 
-        char
-            A value representing a single code point in the default
-            encoding. The encoding is undefined, and the value of the
-            character is opaque.
+            eg: (1,), (1,'b',"three")
 
+3. SYNTAX OVERVIEW:
 
-        int8 int16 int32 int64 int
-        uint8 uint16 uint32 uint64 uint
-            Integer types. For the above types, the number at the end
-            represents the size of the type. The ones without a number at
-            the end are of undefined type. These values can be assumed to
-            be in two's complement. The semantics of overflowing are yet to
-            be specified.
+    Myrddin syntax will likely have a familiar-but-strange taste
+    to many people. Many of the concepts and constructions will be
+    similar to those present in C, but different.
 
-        float32 float64
-            Floating-point types. The exact semantics are yet to be
-            defined.
+    3.1: Declarations:
 
-        @<name>
-            A generic type. This is only allowed in the scope of 'generic'
-            constants.
+        A declaration consists of a declaration class (ie, one
+        of 'const', 'var', or 'generic'), followed by a declaration
+        name, optionally followed by a type and assignment. One thing
+        you may note is that unlike most other languages, there is no
+        special function declaration syntax. Instead, a function is
+        declared like any other value: By assigning its name to a
+        constant or variable.
 
-    It also allows composite types to be defined. These are listed below:
+            const:      Declares a constant value, which may not be
+                        modified at run time. Constants must have
+                        initializers defined.
+            var:        Declares a variable value. This value may be
+                        assigned to, copied from, and 
+            generic:    Declares a specializable value. This value
+                        has the same restricitions as a const, but
+                        taking its address is not defined. The type
+                        parameters for a generic must be explicitly
+                        named in the declaration in order for their
+                        substitution to be allowed.
 
-        <type>*
+        Examples:
 
-            A pointer to a type This type does not support C-style pointer
-            arithmetic, indexing, or any other such manipulation. However,
-            slices of it can be taken, which subsumes the majority of uses
-            for pointer arithmetic. The pointer is passed by value, but as
-            expected, the pointed to value is not.
+            Declare a constant with a value 123. The type is not defined,
+            and will be inferred.
 
-        <type>[,]
-
-            A slice of a type. Slices point to a number of objects. They
-            can be indexed, sliced, and assigned. They carry their range,
-            and can in principle be bounds-checked (although the compiler
-            currently does not do this, due to the lack of a runtime library
-            that will allow a 'panic' function to be called).
+                const x = 123
+                
+            Declares a variable with no value and no type defined. The 
+            value can be assigned later (and must be assigned before use),
+            and the type will be inferred.
 
-        <type>[size]
+                var y
 
-            An array of <type>. Unlike most languages other than Pascal, the
-            size of the array is a part of it's type, and arrays of
-            different sizes may not be assigned between each other. Arrays
-            are passed by value, and copied when assigned.
+            Declares a generic with type '@a', and assigns it the value
+            'blah'. Every place that 'z' is used, it will be specialized,
+            and the type parameter '@a' will be substituted.
 
-        <type0>,<type1>,...,<typeN>
+                generic z : @a = blah
 
-            A tuple of type t0, t1, t2, ....
+            Declares a function f with and without type inference. Both
+            forms are equivalent. 'f' takes two parameters, both of type
+            int, and returns their sum as an int
 
-    Finally, there are aggregate types that can be defined:
+                const f = {a, b
+                    var c : int = 42
+                    -> a + b + c
+                }
 
-        struct
+                const f : (a : int, b : int -> int) = {a : int, b : int -> int
+                    var c : int  = 42
+                    -> a + b + c
+                }
 
-        union
+    3.2: Data Types:
 
-    Any of these types can be given a name. This naming defines a new
-    type which inherits all the constraints of the previous type, but
-    does not unify with it. Eg:
+        The language defines a number of built in primitive types. These
+        are not keywords, and in fact live in a separate namespace from
+        the variable names. Yes, this does mean that you could, if you want,
+        define a variable named 'int'.
 
-        type t = int
-        var x : t
-        var y : int
-        x = y  // type error
-        x = 42 // sure, why not?
+        There are no implicit conversions within the language. All types
+        must be explicitly cast if you want to convert, and the casts must
+        be of compatible types, as will be described later.
 
-Type Constraints
+            3.2.1. Primitive types:
 
+                    void        
+                    bool            char
+                    int8            uint8
+                    int16           uint16
+                    int32           uint32
+                    int64           uint64
+                    int             uint
+                    long            ulong
+                    float32         float64
 
-Literals:
+                These types are as you would expect. 'void' represents a
+                lack of type, although for the sake of genericity, you can
+                assign between void, return void, and so on. This allows
+                generics to not have to somehow work around void being a
+                toxic type.
 
-    character
-    bool
-    int
-    float
-    func
-    sequence
+                bool is a boolean type, and can only be used for assignment
+                and comparison. 
 
-Symbols
+                char is a 32 bit integer type, and is guaranteed to be able
+                to hold exactly one codepoint. It can be assigned integer
+                literals, tested against, compared, and all the other usual
+                numeric types.
 
-Imports
+                The various [u]intXX types hold, as expected, signed and
+                unsigned integers of the named sizes respectively.
+                Similarly, floats hold floating point types with the
+                indicated precision.
 
-Exports
+                    var x : int         declare x as an int
+                    var y : float32     declare y as a 32 bit float
 
+
+            3.2.2. Composite types:
+
+                    pointer
+                    slice           array
+
+                Pointers are, as expected, values that hold the address of
+                the pointed to value. They are declared by appending a '*'
+                to the type. Pointer arithmetic is not allowed. They are
+                declared by appending a '*' to the base type
+
+                Arrays are a group of N values, where N is part of the type.
+                Arrays of different sizes are incompatible. Arrays in
+                Myrddin, unlike many other languages, are passed by value.
+                They are declared by appending a '[SIZE]' to the base type.
+
+                Slices are similar to arrays in many contemporary languages.
+                They are reference types that store the length of their
+                contents. They are declared by appending a '[,]' to the base
+                type.
+                    
+                    foo*        type: pointer to foo
+                    foo[123]    type: array of 123 foo
+                    foo[,]      type: slice of foo
+
+            3.2.3. Aggregate types:
+
+                    tuple           struct
+                    union
+
+                Tuples are the traditional product type. They are declared
+                by putting the comma separated list of types within square
+                brackets.
+
+                Structs are aggregations of types with named members. They
+                are declared by putting the word 'struct' before a block of
+                declaration cores (ie, declarations without the storage type
+                specifier).
+
+                Unions are the traditional sum type. They consist of a tag
+                (a keyword prefixed with a '`' (backtick)) indicating their
+                current contents, and a type to hold. They are declared by
+                placing the keyword 'union' before a list of tag-type pairs.
+
+                    [int, int, char]            a tuple of 2 ints and a char
+
+                    struct                      a struct containing an int named
+                        a : int                 'a', and a char named 'b'.
+                        b : char                
+                    ;;
+
+                    union                       a union containing one of
+                        `Thing int              int or char. The values are not
+                        `Other float32          named, but they are tagged.
+                    ;;
+
+
+            3.2.4. Magic types:
+
+                    tyvar           typaram
+                    tyname
+
+                A tyname is a named type, similar to a typedef in C, however
+                it genuinely creates a new type, and not an alias. There are
+                no implicit conversions, but a tyname will inherit all
+                constraints of its underlying type.
+
+                A typaram is a parametric type. It is used in generics as
+                a placeholder for a type that will be substituted in later.
+                It is an identifier prefixed with '@'. These are only valid
+                within generic contexts, and may not appear elsewhere.
+
+                A tyvar is an internal implementation detail that currently
+                leaks out during type inference, and is a major cause of
+                confusing error messages. It should not be in this manual,
+                except that the current incarnation of the compiler will
+                make you aware of it. It looks like '@$type', and is a
+                variable that holds an incompletely inferred type.
+
+                    type mine = int             creates a tyname named
+                                                'mine', equivalent to int.
+
+
+                    @foo                        creates a type parameter
+                                                named '@foo'.
+
+            3.2.5. Traits:
+
+    3.3: Control Constructs:
+    3.4: Packages and Uses:
+    3.5: Expressions
+
+4. TYPES:
+
+5. EXAMPLES:
+        
+6. GRAMMAR:
+
+7. FUTURE DIRECTIONS:
+
+BUGS:
+
+[1] TODO: trait, default, protect,
+[2] TODO: exponential notation.
+[3] TODO: \uDDDD escape sequences not yet recognized
+[4] TODO: currently the only sequence literal implemented is the
+          unindexed one