tiny-rustc

编译原理课程设计：简易 rust 编译器

使用 rust 实现的简易 rust 编译器，实现了词法分析、语法分析，支持定义函数、定义变量、运算符、控制流、输出等基本功能。

用法

tiny-rustc <source_file.rs> [-o <output_file>] [-emit=tokens|ast|ir|asm]

输出文件默认在同目录下

emit 参数用于输出编译过程的中间产物

语法

tiny-rust 的语法是 rust 的子集，即 tiny-rust 程序都可以由 rustc 编译。

程序结构

tiny-rust 的程序由若干函数组成，每个函数都以 fn 关键字开头，函数名后跟一对括号，括号内是函数的参数列表，每个参数由变量名、:、类型三部分构成，之后是 -> 和函数的返回类型，函数体由一对大括号包围。

由大括号包围的代码块的结尾如果是表达式，则作为代码块的值。

整个程序以 main 函数作为入口，main 函数的返回类型是 () ，可以省略。

数据类型

tiny-rust 支持以下 6 种数据类型：

i32 32 位有符号整数，如 123
f64 64 位浮点数，如 3.14
bool 布尔类型，值为 true 或 false
&str 字符串，只支持 "" 形式的字符串字面量，不支持 r"" 形式的原始字符串字面量
元组，例如 (i32, f64)
数组，例如 [i32; 10]

由于 tiny-rust 支持的数据类型在 rust 中都实现了 copy trait ，所以赋值时会直接拷贝而不是转移 ownership ，所以 tiny-rustc 并没有实现 rust 中 ownership 的机制。

变量

定义变量的语法为 let [mut] variable_name: variable_type [=value]; ，其中 mut 表示变量可变，变量类型不能省略。

运算符

支持的运算符：

+ 加法
- 减法，取反
* 乘法
/ 除法
% 取模
= 赋值
+= 加法赋值
-= 减法赋值
*= 乘法赋值
/= 除法赋值
%= 取模赋值
== 等于
!= 不等于
< 小于
> 大于
<= 小于等于
>= 大于等于
! 逻辑非
&& 逻辑与
|| 逻辑或
[] 索引构造数组
. 访问成员
() 改变优先级构造元组

io

tiny-rust 只支持如下形式的输出：

println!("output: {}", value);

println!("output: {} {}", v1, v2);

...

不支持格式化选项

控制流

支持的控制流语句：

if 语句，语法为 if condition { block }
if-else 语句，语法为 if condition { block } else { block } ，其中 block 可以是值，但两个分支的类型必须相同
while 语句，语法为 while condition { block }
loop 语句，语法为 loop { block }

支持 break 、 continue 和 return

不支持 for 和 match

注释

只支持单行注释，以 // 开头

示例

fn main() {
  let mut n: i32 = 1;
  while n <= 10 {
    println!("The {}th Fibonacci number is: {}", n, fib(n));
    n += 1;
  }
}
fn fib(n: i32) -> i32 {
  if n <= 1 {
    n
  } else {
    fib(n - 1) + fib(n - 2)
  }
}

编译过程

以这段代码为例

fn main() {
  println!("Hello, world!");
}

词法分析结果：

[
  Ident(
    "fn",
  ),
  Ident(
    "main",
  ),
  OpenParen,
  CloseParen,
  OpenBrace,
  Ident(
    "println",
  ),
  Bang,
  OpenParen,
  Str(
    "Hello, world!",
  ),
  CloseParen,
  Semi,
  CloseBrace,
]

语法分析结果：

Program {
  func_decls: [
    FuncDecl {
      name: "main",
      params: [],
      return_type: Tuple(
        [],
      ),
      code_block: CodeBlock {
        statements: [
          Val(
            Print(
              [
                Lit(
                  Str(
                    "Hello, world!",
                  ),
                ),
              ],
            ),
          ),
        ],
        val: Lit(
          Tuple(
            [],
          ),
        ),
      },
    },
  ],
}

代码实现

项目结构

.
├── Cargo.toml
├── README.md
├── examples
│   ├── hello.rs
│   └── ...
├── src
│   ├── ast.rs
│   ├── lexer.rs
│   ├── lib.rs
│   ├── llvm.rs
│   └── main.rs
└── tests
    └── tests.rs

examples 一些示例程序
lexer 词法分析器
ast 语法分析器
llvm 通过 llvm 的接口生成目标代码（目前只实现了一小部分）
lib 编译器的框架
main 解析命令行参数
tests 测试 examples 中的程序运行结果是否符合预期

词法分析

实现对标识符、字面量的解析，并去除注释和空白字符。

为了方便对 Option<Box<Self>> 形式的处理，双字符运算符会被解析为两个 token 。

#[derive(Logos, Debug, PartialEq, Clone)]
pub enum Token {
  #[regex(r"\/\/.*\n", skip)]
  LineComment,
  #[regex(r"\s+", skip)]
  Whitespace,

  #[regex(r"[a-zA-Z_]\w*", |lex| lex.slice().to_string())]
  Ident(String),
  #[regex(r"[0-9]+", |lex| lex.slice().parse().ok())]
  Int(i32),
  #[regex(r"[0-9]+\.[0-9]+", |lex| lex.slice().parse().ok())]
  Float(f64),
  #[regex(r#""([^"\\]|\\.)*""#, |lex| { 
    let s = lex.slice();
    unescape(&s[1..s.len()-1])
  })]
  Str(String),

  #[token(".")]
  Dot,
  // ...
}

用来表示词法分析器的输出，使用正则表达式来匹配。

语法分析

主要语法的产生式如下：

Program -> FuncDecl*
FuncDecl -> "fn" Ident Params ["->" Type] CodeBlock
Params -> "()" | "(" [Param ("," Param)*] [","] ")"
Param -> Ident ":" Type
Type -> "i32" | "f64" | "bool" | "&str" | ArrayType | TupleType
ArrayType -> "[" Type ";" Val "]"
TupleType -> "()" | "(" [Type ("," Type)*] [","] ")"
CodeBlock -> "{" [Statement*] [Val] "}"
Statement -> Val ";" | LetStatement | WhileStatement | LoopStatement | BreakStatement | ContinueStatement
LetStatement -> "let" ["mut"] Ident ":" Type ["=" Val] ";"
WhileStatement -> "while" Val CodeBlock
LoopStatement -> "loop" CodeBlock
BreakStatement -> "break" ";"
ContinueStatement -> "continue" ";"
Val -> OperandD (BiOp OperandD)*
OperandD -> UnaryOp* Operand ["." FuncCall | "[" Val "]"]*
Operand -> Literal | Ident | Print | CodeBlock | FuncCall | IfExpr
Literal -> Int | Float | Bool | Str | Tuple | Array
Tuple -> "()" | "(" Val ("," Val)* [","] ")"
Array -> "[" [Val ("," Val)* [","]] "]"
Print -> "println" Bang Args
FuncCall -> Ident Args
IfExpr -> "if" Val CodeBlock ["else" CodeBlock]
Args -> "()" | "(" [Val ("," Val)* [","]] ")"

其中 Val 表示由二元运算符连接操作数形成的表达式， OperandD 带有一元运算符、下标、函数链式调用的操作数， Operand 表示操作数的主题部分。

各种类的组成关系如下

classDiagram
    class Program {
        Vec~FuncDecl~ func_decls
    }

    class FuncDecl {
        String name
        Vec~String, Type~ params
        Type return_type
        CodeBlock code_block
    }

    class Type {
        <<enumeration>>
        I32
        F64
        Bool
        Str
        Array(Box~Type~, Value)
        Tuple(Vec~Type~)
    }

    class CodeBlock {
        Vec~Statement~ statements
        Value val
    }

    class Statement {
        <<enumeration>>
        Let(String, Type, bool, Box~Value~)
        Val(Value)
        While(Value, CodeBlock)
        Loop(CodeBlock)
        Break
        Continue
        None
    }

    class BiOp {
        <<enumeration>>
        Plus
        ...
    }

    class BiOperation {
        BiOp op
        Box~Value~[2] operands
    }

    class UnOp {
        <<enumeration>>
        Neg
        Not
    }

    class UnOperation {
        UnOp op
        Box~Value~ operand
    }

    class Literal {
        <<enumeration>>
        I32(i32)
        F64(f64)
        Bool(bool)
        Str(String)
        Array(Vec~Value~)
        Tuple(Vec~Value~)
    }

    class Value {
        <<enumeration>>
        BiOp(BiOperation)
        UnOp(UnOperation)
        Lit(Literal)
        Ident(String)
        FuncCall(String, Vec~Value~)
        Print(Vec~Value~)
        CodeBlock(Box~CodeBlock~)
        IfExpr(Box~Value~, Box~CodeBlock~, Box~CodeBlock~)
    }

    Program --> FuncDecl
    FuncDecl --> Type
    FuncDecl --> CodeBlock
    CodeBlock --> Statement
    CodeBlock --> Value
    Statement --> Value
    Statement --> CodeBlock
    BiOperation --> BiOp
    BiOperation --> Value
    UnOperation --> UnOp
    UnOperation --> Value
    Value --> BiOperation
    Value --> UnOperation
    Value --> Literal
    Value --> CodeBlock
    Value --> Type
    Type --> Value

代码采用递归下降实现，下面是语法分析器的函数头

pub struct Parser<I: Iterator<Item = Token>> {
  tokens: Peekable<I>,
}
impl<I: Iterator<Item = Token>> Parser<I> {
  pub fn parse(iter: I) -> Program;
  fn expect(&mut self, expected: Token);
  fn expect_ident(&mut self) -> String;
  fn parse_program(&mut self) -> Program;
  fn parse_func_decl(&mut self) -> FuncDecl;
  fn parse_param(&mut self) -> (String, Type);
  fn parse_type(&mut self) -> Type;
  fn parse_code_block(&mut self) -> CodeBlock;
  fn parse_statement(&mut self) -> (Statement, bool);
  fn try_parse_biop(&mut self) -> Option<BiOp>;
  fn try_parse_unop(&mut self) -> Option<UnOp>;
  fn parse_expr(&mut self) -> Value;
  fn parse_operand_deco(&mut self) -> Value;
  fn parse_operand(&mut self) -> Value;
  fn parse_tuple_array(&mut self) -> Value;
  fn parse_args(&mut self) -> Vec<Value>;
}

I: Iterator<> 是词法分析器产生的 token stream 。

parse_statement 返回的 bool 值表示是否作为代码块的值

在 parse_operand_deco 中会依次记录遇到的一元运算符，然后调用 parse_operand ，之后解析遇到的下标和函数调用。

parse_expr 中维护了两个栈保存操作数和运算符，遇到运算符时会比较优先级，决定是否合并操作数。

总结

编写过程中遇到了一些平时不太会注意的细节：

rust 支持嵌套注释，如

/* /* nested comment */ */

嵌套的注释无法直接使用正则表达式进行匹配，需要使用记录嵌套层数的状态机进行解析。

在 rust 中，很多语句都可以作为值，即单元类型（空元组），例如

let u: () = ();
let a: () = while false {};
let b: () = loop { break; };
let c: () = if false {};

但值为单元类型的语句和 () 又不完全等价， () 无法被解析为语句

let a;
while false {} // ok
let b;
() // compile error: expected `;`
let c;

且返回值不为 () 的 if 表达式无法作为独立语句，不过这应该在语义分析时检查类型，因此不属于语法错误

let a;
if true { () } else { () } // ok
let b;
if true { 1 } else { 2 } // type mismatch: expected `()` but found `i32`
let c;

这使得对于流程控制语句的解析有些复杂。

对尾逗号和单元素元组的处理

assert_eq!((1, 2), (1, 2,)); // ok
assert_eq!((1), (1,)); // compile error: can't compare integer with (integer,)

一开始对元组的解析和对函数调用参数的解析使用了相同的函数，导致单参数函数调用无法被正确解析。

运算符 + 只能作为二元运算符使用，和 c++ 不同

项目源码 https://github.com/iuy1/tiny-rustc

参考资料

The Rust Programming Language https://doc.rust-lang.org/book/
Rust source code https://github.com/rust-lang/rust
Overview of the Rust compiler https://rustc-dev-guide.rust-lang.org/overview.html
TinyC 编译器 https://pandolia.net/tinyc/index.html
logos handbook https://logos.maciej.codes/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

tiny-rustc

目录

用法

语法

程序结构

数据类型

变量

运算符

io

控制流

注释

示例

编译过程

代码实现

项目结构

词法分析

语法分析

总结

参考资料

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
examples		examples
src		src
.gitattributes		.gitattributes
Cargo.toml		Cargo.toml
README.md		README.md

iuy1/tiny-rustc

Folders and files

Latest commit

History

Repository files navigation

tiny-rustc

目录

用法

语法

程序结构

数据类型

变量

运算符

io

控制流

注释

示例

编译过程

代码实现

项目结构

词法分析

语法分析

总结

参考资料

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages