lundi 28 juillet 2014

Writing a lexer in C++ using re2c

Hello,

For the past 2 weeks, during my holidays, I wanted to learn how to write a good parser in C++ but not using flex/bison (they were not generating c++ last time I did look at it, with global variables etc. It might have changed since then).

Anyway, let's go first with the re2c part. The fact that PHP is using it is giving confidence. However as it was not updated since a long time, I just sent an email to the mailing list and I just got a nice answer from a maintainer. 1 week later, the new release went out (0.13.7.3) with UTF-8 and UTF-16 support, just excellent!

I will just focus on using re2c on a string (C null terminated), so I won't show the buffer refill method. It will be probably a topic for another article.

let's now try to make a lexer that identify a number, and some token like +, - and *.
(yes, the famous calculator! but it is just this article, I will go deeper into PL/SQL in the next ones)

so just below our class definition:

#ifndef LEXER_HPP
#define LEXER_HPP

#include <string>

class Lexer {
public:

    enum token_t {
        TOKEN_END = 0,
        TOKEN_MUL,
        TOKEN_PLUS,
        TOKEN_MINUS,
        TOKEN_NUMBER
    };

    Lexer( const char *s );

    token_t scan();
    
    std::string getTokenValue() const;

private:
    const char *m_content;
    
    const char *m_start;
    const char *m_cursor;
    const char *m_limit;
    const char *m_marker;
    const char *m_ctxmarker;
};


#endif

then our "re" file, that will generate our .cpp file:

#include "lexer.hpp"

using namespace std;

Lexer::Lexer( const char *s ) : m_content(s)
{
    m_start = m_cursor = m_content;
    m_limit = m_content+strlen(m_content);
}

std::string Lexer::getTokenValue() const
{
    return string(m_start,m_cursor-m_start);
}

/*!max:re2c */

Lexer::token_t Lexer::scan()
{
    m_start = m_cursor;

    #define YYCTYPE char
    #define YYCURSOR m_cursor
    #define YYLIMIT m_limit
    #define YYMARKER m_marker
    #define YYCTXMARKER     m_ctxmarker
    #define YYFILL(n)   

    /*!re2c
        re2c:indent:top      = 1;
        re2c:yyfill:enable   = 0;
        
        '\000'              { return TOKEN_END; }
        [0-9]+              { return TOKEN_NUMBER; }
        "+"                 { return TOKEN_PLUS; }
        "-"                 { return TOKEN_MINUS; }
        "*"                 { return TOKEN_MUL; }
    */
}

some explanations:
- YYLIMIT is the end of the buffer, for us this is the end of our string
- YYCURSOR is the character position being read
- YYMARKER and YYCTXMARKER is used when re2c has to backtrack (like for instance when one token has to be followed by something)

What is important, very important even, is to guard the end of your string, like the null character by the token '\000', identified as TOKEN_END

Then you just have to run re2c like below, to generate the lexer.cpp file:
re2c -i -o lexer.cpp lexer.re

-i     --no-debug-info  Do not generate '#line' info (usefull for versioning).
-o of  --output=of      Specify the output file (of) instead of stdout

and now we will use the lexer in our main.cpp:

#include "lexer.hpp"
#include <iostream>

using namespace std;

int main() {
    const char * content = "1+4*2-1+10*2+";

    Lexer lexer(content);
    
    Lexer::token_t token;
    
    do {

        token = lexer.scan();
        
        switch (token) {
            case Lexer::TOKEN_NUMBER:
                cout << "Num: " << lexer.getTokenValue() << endl;
                break;
            case Lexer::TOKEN_PLUS:
                cout << "+: " << lexer.getTokenValue() << endl;
                break;
            case Lexer::TOKEN_MINUS:
                cout << "-: " << lexer.getTokenValue() << endl;
                break;
            case Lexer::TOKEN_MUL:
                cout << "*: " << lexer.getTokenValue() << endl;
                break;
            case Lexer::TOKEN_END:
                cout << "end reached" << endl;
                break;
        };
    }
    while( token != Lexer::TOKEN_END );
    
    cout << "finished!" << endl;

    return 0;
}

giving the output:

Num: 1
+: +
Num: 4
*: *
Num: 2
-: -
Num: 1
+: +
Num: 10
*: *
Num: 2
+: +
end reached

finished!

Isn't that simple?

Next article will be covering Lemon... again a fantastic tool to generate a LALR(1) parser (done by the creator of sqlite and used in sqlite).

See you soon!