Chapter 4: Tokens

Tokens are the smallest meaningful units of the source code (the high-level language).

So if your language contains an if keyword, it should have a token "IF" that allows the compiler to identify that stream of characters in the source code and send it further for processing.

How do we generate these tokens?

Well, for this purpose we have what is called a lexer.

A lexer simply takes in the source file (or wherever you have stored the code) and matches the characters against a set of predefined tokens.

It then stores these tokens in a list, which is passed on for further processing.

Since I am using C++ as my go-to language, I am using enums to group all of these tokens in a single place inside the tokens.hpp header.

What are enums?

It is a user-defined data type in programming that allows you to define a set of named integral constants. This makes your code more readable and manageable by grouping related values under a single type. For example, you might use an enum to represent days of the week, colors, or states in a finite state machine.

enum Color {
    RED,
    GREEN,
    BLUE
};

Color favoriteColor = GREEN;

Now the variable favoriteColor cannot be assigned anything except the enumerators defined inside enum Color (at least not without an explicit cast).

This is how a typical enum would look in our case:

enum TokenType {
    TOKEN_IDENTIFIER,
    TOKEN_NUMBER,
    TOKEN_OPERATOR,
    TOKEN_KEYWORD,
    TOKEN_END_OF_FILE
};

Every token will have this basic information associated with it, which we will implement using a class:

  • the token type (taken from enum)

  • the actual word

  • line number (useful for providing debugging messages)

I have provided a basic layout of how the class will look:


class Token {
public:
    // The enum TYPE has been enclosed within a namespace TOKEN.
    // This prevents naming conflicts that might occur later as the code base grows.
    // You can skip the namespace if you are implementing a simple language.
    TOKEN::TYPE type;
    std::string lexeme;
    int line = -1;

    // Constructor
    Token(TOKEN::TYPE type, std::string lexeme, int line)
        : type(type), lexeme(std::move(lexeme)), line(line) {}

    // Convert the token's details to a string (handy for debugging messages)
    std::string toString() const {
        return "tokenType: " + std::to_string(static_cast<int>(type)) +
               " lexeme: " + lexeme +
               " lineNo.: " + std::to_string(line);
    }
};

That's it! This tokens file will come in handy when we discuss the lexing phase.

Any suggestions are highly appreciated!!

Next up, we will have a look at the grammar of the language.

Till then, if you are using C++, you could read up on namespaces and how we can use them with enums to prevent naming conflicts.