I finally have enough free time to pursue my dream of creating my own programming language.
It will be named PLox, as I am taking inspiration from JLox. I want to keep the implementation simple and clean, so I will use Python. PLox will be dynamically typed, and my initial goal is to create an interpreter that can execute basic code and display the output to the user. Although I plan to add more features later, my priority is getting the interpreter up and running. The code below has a comment line and a print statement that displays the output to the user:
//First PLox program
print "Hello, world!"
I have been learning about how interpreters and compilers translate source code into machine code. This process involves several steps (the first two are sketched in code right after this list):
1. Scanning: Also known as lexical analysis, this step uses a scanner to group a stream of characters into meaningful tokens. It also discards characters, such as whitespace and comments, that add no meaning to the source code.
2. Parsing: The parser takes the sequence of tokens produced by the scanner and builds a tree that mirrors the nested structure of the programming language's grammar. This tree is called an Abstract Syntax Tree (AST).
3. Static Analysis: In this step, the compiler determines where the code defines variables and works out their scope. For statically typed languages, this is also where types are checked, so type errors are reported before the program runs.
4. Intermediate Representation: The compiler stores the program in an intermediate representation that decouples the source language from the final architecture the program will run on.
5. Optimization: The compiler swaps out parts of the program for equivalent code that has the same meaning but is implemented more efficiently.
6. Code Generation: The final step converts the program into a form the machine can run, typically machine code or bytecode for a virtual machine.
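To make the first two stages concrete, here is a rough sketch of what scanning and parsing might produce for the hello-world program above. The class names are illustrative placeholders, not PLox's final design:

# Hypothetical illustration only; Literal and PrintStmt are placeholder names.
# 1. Scanning turns: print "Hello, world!"  into a flat list of tokens, roughly:
#    [PRINT 'print', STRING '"Hello, world!"', EOF]

# 2. Parsing turns those tokens into a tree that mirrors the grammar.
class Literal:
    def __init__(self, value):
        self.value = value  # the runtime value, e.g. "Hello, world!"

class PrintStmt:
    def __init__(self, expression):
        self.expression = expression  # expression whose result is displayed

# The entire one-line program as an AST:
ast = PrintStmt(Literal("Hello, world!"))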
I am going to start by implementing the Scanner. For my current goal, I only need to support a String literal and a Statement.
Literal: String
A String literal is enclosed in double quotes. For example,
"" // Empty string
"123" //String, not a number
Statement
A statement's job is to produce an effect. In the code above, the print statement evaluates an expression and displays the result to the user.
The output of the Scanner is a list of tokens. The definition of a Token is below:
"""
The token is a wrapper to enclose a lexeme and its associated data.
Lexeme: Character blob that represents something. Raw substrings from the source code.
Type: Kind of Lexeme This represents
Literal: Runtime object corresponding to the textual representation
Line: Line number at which the Token appears
"""
class Token(object):
def __init__(self, type, lexeme, literal, line):
self.type = type
self.lexeme = lexeme
self.literal = literal
self.line = line
def __str__(self):
return self.type + " " + self.lexeme + " " + self.literal
A TokenType is defined as
from enum import Enum, auto


class TokenType(Enum):
    # Keywords
    PRINT = auto()
    # Literals
    IDENTIFIER = auto()
    STRING = auto()
    EOF = auto()
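As a quick sanity check, a Token can be constructed by hand. This assumes Token and TokenType live in Token.py and TokenType.py, matching the imports the Scanner uses below:

from Token import Token
from TokenType import TokenType

# The lexeme keeps the surrounding quotes; the literal is the runtime string.
token = Token(TokenType.STRING, '"Hello, world!"', "Hello, world!", 1)
print(token)  # TokenType.STRING "Hello, world!" Hello, world!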
Here is the initial Scanner implementation, supporting string literals and the PRINT keyword:
from Token import Token
from TokenType import TokenType


def is_alpha(c):
    return ('a' <= c <= 'z') or ('A' <= c <= 'Z')


class Scanner(object):
    # Container for all keywords in the language
    keywords = {"print": TokenType.PRINT}

    def __init__(self, source: str):
        self.source = source
        self.tokens = list()
        # Start points to the first character of the lexeme being scanned
        self.start = 0
        # Current is the position currently being considered
        self.current = 0
        # Source line current is on, so tokens know their location
        self.line = 1

    # Check whether we have consumed the entire source
    def is_at_end(self):
        return self.current >= len(self.source)

    # Tokenize the source file
    def scan_tokens(self):
        # Construct tokens until EOF
        while not self.is_at_end():
            self.start = self.current
            self.scan_token()
        # Append a unique token to denote EOF
        self.tokens.append(Token(TokenType.EOF, None, None, self.line))
        return self.tokens

    # Return the char at current, then increment current by 1
    def advance(self):
        output = self.source[self.current]
        self.current += 1
        return output

    # Add a token for the current lexeme to the list of tokens
    def add_token(self, token_type, literal=None):
        text = self.source[self.start:self.current]
        self.tokens.append(Token(token_type, text, literal, self.line))

    # Scan a single token from the input
    def scan_token(self):
        c = self.advance()
        # " denotes the start of a string literal
        if c == '"':
            self.string()
        elif is_alpha(c):
            self.identifier()
        elif c == '\n':
            self.line += 1
        # Everything else (spaces, unsupported characters) is skipped for now

    def string(self):
        while self.peek() != '"' and not self.is_at_end():
            if self.peek() == '\n':
                self.line += 1
            self.advance()
        if self.is_at_end():
            # Unterminated string; stop instead of reading past the source
            return
        # Consume the closing "
        self.advance()
        # Trim the surrounding quotes to get the runtime value
        value = self.source[self.start + 1:self.current - 1]
        self.add_token(TokenType.STRING, value)

    def identifier(self):
        while is_alpha(self.peek()):
            self.advance()
        text = self.source[self.start:self.current]
        # Keywords win over plain identifiers
        type = self.keywords.get(text)
        if not type:
            type = TokenType.IDENTIFIER
        self.add_token(type)

    # Look at the current char without consuming it
    def peek(self) -> str:
        if self.is_at_end():
            return '\0'
        return self.source[self.current]
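To try it out, here is a small driver (a sketch, assuming the Scanner class above lives in Scanner.py) that scans a one-line PLox program and prints each token:

from Scanner import Scanner

source = 'print "Hello, world!"'
for token in Scanner(source).scan_tokens():
    print(token)

# Expected output, roughly:
# TokenType.PRINT print None
# TokenType.STRING "Hello, world!" Hello, world!
# TokenType.EOF None None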
Implementing the Scanner is going to get tedious as I add features to support other operations like arithmetic and boolean expressions.
Next, I will take the tokens and transform them into a richer representation. To be continued...