
The Journey of a Novice: Building a Programming Language from Scratch

Writer: Hareesh Lakshmi Narayanan

I finally have enough free time to pursue my dream of creating my own programming language.


It will be named PLox, as I am taking inspiration from JLox. I want to keep the implementation simple and clean, so I will use Python. PLox will be dynamically typed, and my initial goal is to create an interpreter that can execute basic code and display the output to the user. Although I plan to add more features later, my priority is getting the interpreter up and running. The code I have written has a comment line and a print statement that displays the output to the user.

//First PLox program
print "Hello, world!"

I have been learning about translating source code to machine code using interpreters and compilers. This process involves several steps, including:


1. Scanning: Also known as lexical analysis, this step uses a scanner to group the stream of characters into meaningful tokens, discarding characters such as whitespace that do not add meaning to the source code.

2. Parsing: The parser takes the sequence of tokens generated in the scanning step and builds a tree-like structure that follows the nested structure of the programming language's grammar. This tree structure is called an Abstract Syntax Tree (AST).


3. Static Analysis: In this step, the compiler determines where the code defines variables and resolves their scope. For statically typed languages, this is also where type checking happens, so type errors are caught before the program runs.


4. Intermediate Representation: The output from the compiler is stored in an intermediate representation that decouples the program's source language from the final architecture it will run on.


5. Optimization: The compiler swaps out parts of the program for an equivalent version that has the same meaning but is implemented more efficiently.


6. Code Generation: The final step is generating the machine code or assembly instructions that the target architecture can run.
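As a purely illustrative sketch of the first two steps, scanning and then parsing the one-line program above might produce data shaped like this. The tuple shapes and names here are my own placeholders, not the real classes I will build:

```python
# Hypothetical shapes for the first two pipeline stages of:
#   print "Hello, world!"

# 1. Scanning: the raw character stream becomes a flat list of
#    (token type, lexeme) pairs.
tokens = [
    ("PRINT", "print"),
    ("STRING", '"Hello, world!"'),
    ("EOF", None),
]

# 2. Parsing: the flat token list becomes a nested tree (the AST),
#    mirroring the grammar: a print statement wrapping a literal.
ast = ("print_stmt", ("string_literal", "Hello, world!"))

node_type, expression = ast
assert node_type == "print_stmt"
assert expression == ("string_literal", "Hello, world!")
```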


Details on language semantics are available here.


I am going to start by implementing the Scanner. For my current goal, I only need to support a String literal and a Statement.


Literal: String

  • A String literal is enclosed in double quotes. For example,

  • "" // Empty string

  • "123" //String, not a number
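The second example is worth making concrete: in Python terms, the scanner should keep the characters of `"123"` as text rather than converting them to a number. A quick standalone check of that distinction:

```python
lexeme = '"123"'          # the raw characters, quotes included
literal = lexeme[1:-1]    # the runtime value the scanner will store
assert literal == "123"   # still a string...
assert literal != 123     # ...not a number
```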

Statement

  • A statement's job is to produce an effect. The print statement evaluates an expression in the above code and displays the result to the user.


The output of the Scanner is a list of tokens. The definition of a Token is below:

"""
A Token is a wrapper that encloses a lexeme and its associated data.
Lexeme: A blob of characters that represents something; a raw substring of the source code.
Type: The kind of lexeme this token represents.
Literal: The runtime object corresponding to the textual representation.
Line: The line number at which the token appears.
"""


class Token(object):
    def __init__(self, type, lexeme, literal, line):
        self.type = type
        self.lexeme = lexeme
        self.literal = literal
        self.line = line

    def __str__(self):
        # type is an Enum and literal may be None, so format rather than concatenate
        return f"{self.type} {self.lexeme} {self.literal}"

A TokenType is defined as

from enum import Enum, auto


class TokenType(Enum):
    #Keywords
    PRINT = auto()

    #Literals
    IDENTIFIER = auto()
    STRING = auto()
    EOF = auto()
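To sanity-check the two classes together, here is a standalone snippet. The definitions are repeated inline (with a format-based `__str__` so the enum value and a possible `None` literal print cleanly) so that it runs on its own:

```python
from enum import Enum, auto


# Minimal inline copies of TokenType and Token so this snippet is standalone
class TokenType(Enum):
    PRINT = auto()
    IDENTIFIER = auto()
    STRING = auto()
    EOF = auto()


class Token:
    def __init__(self, type, lexeme, literal, line):
        self.type = type
        self.lexeme = lexeme
        self.literal = literal
        self.line = line

    def __str__(self):
        return f"{self.type} {self.lexeme} {self.literal}"


# A string token carries both its raw lexeme and the unquoted runtime value
token = Token(TokenType.STRING, '"Hello, world!"', "Hello, world!", 1)
print(token)  # -> TokenType.STRING "Hello, world!" Hello, world!
```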

Initial Scanner implementation to support string literals and the PRINT keyword:

from Token import Token
from TokenType import TokenType

def is_alpha(c):
    return ('a' <= c <= 'z') or ('A' <= c <= 'Z')


class Scanner(object):
    # Container for all keywords in the language
    keywords = {"print": TokenType.PRINT}

    def __init__(self, source: str):
        self.source = source
        self.tokens = list()

        # Start points to the first character in the lexeme being scanned
        self.start = 0

        # Character position currently being considered
        self.current = 0

        # Source line current is on to produce tokens that know their location
        self.line = 1

    # Function to check if end of the source file
    def is_at_end(self):
        return self.current >= len(self.source)

    # Function to Tokenize source file
    def scan_tokens(self):

        # Construct tokens till EOF
        while not self.is_at_end():
            self.start = self.current
            self.scan_token()

        # Append a unique token to denote EOF
        self.tokens.append(Token(TokenType.EOF, None, None, self.line))
        return self.tokens

    # Return char at current and increment by 1
    def advance(self):
        output = self.source[self.current]
        self.current += 1
        return output

    # Add token to the list of Tokens
    def add_token(self, token_type, literal=None):
        text = self.source[self.start: self.current]
        self.tokens.append(Token(token_type, text, literal, self.line))

    # Scan a single token from the input file
    def scan_token(self):
        c = self.advance()

        # Skip whitespace, but keep the line count accurate
        if c in (' ', '\r', '\t'):
            return
        elif c == '\n':
            self.line += 1

        # // starts a comment that runs to the end of the line
        elif c == '/' and self.peek() == '/':
            while self.peek() != '\n' and not self.is_at_end():
                self.advance()

        # " denotes starting of a String literal
        elif c == '"':
            self.string()
        elif is_alpha(c):
            self.identifier()

    def string(self):
        while self.peek() != '"' and not self.is_at_end():
            if self.peek() == '\n':
                self.line += 1
            self.advance()

        # Account for closing "
        self.advance()
        value = self.source[self.start + 1:self.current - 1]
        self.add_token(TokenType.STRING, value)

    def identifier(self):
        while is_alpha(self.peek()):
            self.advance()

        text = self.source[self.start: self.current]
        # Keywords map to their own token type; everything else is an identifier
        token_type = self.keywords.get(text, TokenType.IDENTIFIER)
        self.add_token(token_type)

    def peek(self) -> str:
        if self.is_at_end():
            return '\0'
        return self.source[self.current]

Implementing the Scanner is going to be tedious as I add features to support other operations like arithmetic and boolean expressions.


Next, I will take the tokens and transform them into a richer representation. To be continued...




 
 
 
