Building the Scanner

May 26, 2023 · 4 min read

Piglang Creator

"Take big bites. Anything worth doing is worth overdoing."

- Robert A. Heinlein, Time Enough for Love

The commencement of any compiler or interpreter involves scanning, a process where raw source code, depicted as a string of characters, undergoes segmentation into meaningful segments termed as tokens. These tokens represent the essential building blocks that define the structure and grammar of the language.

Referred to interchangeably as "scanning" or "lexing" (short for "lexical analysis") over time, these terms historically carried different interpretations. In the earlier computing era, "scanner" was specifically linked to code handling raw source code characters' retrieval from disk and subsequent buffering in memory. Meanwhile, "lexing" encompassed the phase following this, involving the useful processing of these characters.

Initiating with scanning serves as a favorable entry point for us. The task isn't overly complex; essentially, it involves a match statement with ambitious undertones, serving as a precursor to more intricate elements later on. By chapter's end, our aim is to construct a comprehensive and efficient scanner capable of transforming any Pig source code string into tokens, subsequently utilized in parsing.

Tokens

Tokens represent the primary components in the language's syntax, and our initial focus will be on their implementation. They encompass diverse elements, such as identifiers, literals (like integers, floats, strings), and various operators, meticulously categorized within the Token enum structure.

#[derive(Debug, Clone)]
pub enum Token {
    Illegal,
    Eof,

    // Identifiers + literals
    Ident(String),  // add, foobar, x, y, ...
    Int(String),    // 123456
    Float(String),  // 123.456
    String(String), // "hello"

    // Operators
    Assign,
    Plus,
    Minus,
    Bang,
    Asterisk,
    Slash,

    Lt,
    Gt,

    Eq,
    NotEq,

    Comma,
    Colon,
    Semicolon,

    Lparen,
    Rparen,
    Lbrace,
    Rbrace,
    Lbracket,
    Rbracket,

    Function,
    Let,
    True,
    False,
    If,
    Else,
    Return,
}

Scanner

With the token definitions in place, the subsequent step is the implementation of the scanner. This component traverses through each line of the source code, character by character, discerning and assembling them into respective tokens. Employing a match construct, individual characters are identified and mapped to their corresponding tokens. For instance, characters like '=', ':', ';', '(' are matched to specific tokens like Token::Eq, Token::Colon, Token::Semicolon, Token::Lparen, etc.

Furthermore, the code snippet incorporates conditional logic for complex character combinations (like '==', '!=') and other constructs such as identifiers, integers, floating-point numbers, and strings. These conditions are instrumental in accurately categorizing characters and generating respective tokens.

        match self.ch {
            '=' => {
                if self.peek_char() == '=' {
                    self.read_char();
                    tok = Token::Eq;
                } else {
                    tok = Token::Assign;
                }
            }
            ':' => {
                tok = Token::Colon;
            }
            ';' => {
                tok = Token::Semicolon;
            }
            '(' => {
                tok = Token::Lparen;
            }
            ')' => {
                tok = Token::Rparen;
            }
            ',' => {
                tok = Token::Comma;
            }
            '+' => {
                tok = Token::Plus;
            }
            '-' => {
                tok = Token::Minus;
            }
            '!' => {
                if self.peek_char() == '=' {
                    self.read_char();
                    tok = Token::NotEq;
                } else {
                    tok = Token::Bang;
                }
            }
            '*' => {
                tok = Token::Asterisk;
            }
            '/' => {
                tok = Token::Slash;
            }
            '<' => {
                tok = Token::Lt;
            }
            '>' => {
                tok = Token::Gt;
            }
            '{' => {
                tok = Token::Lbrace;
            }
            '}' => {
                tok = Token::Rbrace;
            }
            '[' => {
                tok = Token::Lbracket;
            }
            ']' => {
                tok = Token::Rbracket;
            }
            '"' => {
                tok = Token::String(self.read_string().to_string());
            }
            '\u{0}' => {
                tok = Token::Eof;
            }
            _ => {
                if is_letter(self.ch) {
                    let ident = self.read_identifier();
                    return token::lookup_ident(ident);
                } else if is_digit(self.ch) {
                    let integer_part = self.read_number().to_string();
                    if self.ch == '.' && is_digit(self.peek_char()) {
                        self.read_char();
                        let fractional_part = self.read_number();
                        return Token::Float(format!("{}.{}", integer_part, fractional_part));
                    } else {
                        return Token::Int(integer_part);
                    }
                } else {
                    tok = Token::Illegal
                }
            }
        }

"Take big bites. Anything worth doing is worth overdoing."​

- Robert A. Heinlein, Time Enough for Love​

Tokens​

Scanner​

"Take big bites. Anything worth doing is worth overdoing."

- Robert A. Heinlein, Time Enough for Love

Tokens

Scanner