jeudi 7 août 2014

Create a parser using Lemon with re2c for the lexer

Hello,

Last time I explained how to use re2c to create a lexer; now I will show how to combine it with Lemon to build a parser.

At first I wanted to demonstrate it with a more concrete example than the famous calculator. I swear I tried, but the result was too big and it was not easy to keep a simple article focused on Lemon.

So I decided to keep presenting Lemon with a calculator, and to write another article later focusing on how to use it for more complex grammars like PL/SQL.

I encourage you first to read the documentation of Lemon (it is short enough)... and read it again, and again... It is actually well written, but I personally had to read it several times. In fact, you need to try a concrete example to understand it (no, I didn't say fully :-)).

http://www.hwaci.com/sw/lemon/lemon.html

Lemon is just a single C file to compile. Very simple: just download lemon.c and lempar.c from
http://www.hwaci.com/sw/lemon/

Compile Lemon simply with:
> clang lemon.c -o lemon

I personally put lemon and lempar.c in /usr/local/bin, but you can put them wherever you want.

Here is the grammar of the calculator (calc_parser.y):

%name calcParse
%token_prefix CALC_TOKEN_

%token_type {int}  

%left PLUS MINUS. 
   
%include {
#include <assert.h>
#include <iostream>
}
   
%syntax_error {
    std::cout << "syntax error - ";
    int n = sizeof(yyTokenName) / sizeof(yyTokenName[0]);
    for (int i = 0; i < n; ++i) {
        int a = yy_find_shift_action(yypParser, (YYCODETYPE)i);
        if (a < YYNSTATE + YYNRULE) {
            std::cout << "expected " << yyTokenName[i] << std::endl;
        }
    }
}

calc ::= exp(e) END.                { std::cout << "= " << e << std::endl; }
   
exp(lhs) ::= NUMBER(e).             { lhs = e; }
exp(lhs) ::= LPAR exp(e) RPAR.      { lhs = e; }
exp(lhs) ::= exp(le) MINUS exp(re). { lhs = le-re; }

exp(lhs) ::= exp(le) PLUS exp(re).  { lhs = le+re; }
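If you later want multiplication and division, you would add a second precedence level. The fragment below is a hypothetical extension, not part of this grammar: TIMES and DIVIDE are token names I made up, and your lexer would also have to return them (the zero check on division is a simplistic guard):

```
%left PLUS MINUS.
%left TIMES DIVIDE.

exp(lhs) ::= exp(le) TIMES exp(re).  { lhs = le*re; }
exp(lhs) ::= exp(le) DIVIDE exp(re). { lhs = (re != 0) ? le/re : 0; }
```

Because TIMES and DIVIDE are declared on a later %left line, they bind tighter than PLUS and MINUS, which is how Lemon resolves the ambiguity in expressions like 1+2*3.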

One thing Lemon does much better than bison or yacc is how it handles rule values: instead of positional references like $1, you use named aliases, which I really like.

Look at the %syntax_error block: it prints which tokens were expected when parsing fails.
(I would have liked Lemon to provide a default handler like that.)

Now, we have to generate the .c and the .h files.
lemon -m -l calc_parser.y

  -m           Output a makeheaders compatible file.
  -l           Do not print #line statements.

As you can see, I generate the .c file compatible with "makeheaders".
makeheaders is written by the same author as Lemon and SQLite.
By default Lemon only exports the tokens in the .h file, not the functions;
those are generated when makeheaders is used:

makeheaders calc_parser.c

Now calc_parser.c has to be compiled as C++ (the %include block and the rule actions contain C++ code).
In Xcode this is simple: just select the file and change its source type in the inspector on the right.


calc_parser.h:

/* This file was automatically generated.  Do not edit! */
#define calcParseTOKENTYPE int
#define calcParseARG_PDECL
void calcParse(void *yyp,int yymajor,calcParseTOKENTYPE yyminor calcParseARG_PDECL);
#if defined(YYTRACKMAXSTACKDEPTH)
int calcParseStackPeak(void *p);
#endif
void calcParseFree(void *p,void(*freeProc)(void *));
void *calcParseAlloc(void *(*mallocProc)(size_t));
#if !defined(NDEBUG)
void calcParseTrace(FILE *TraceFILE,char *zTracePrompt);
#endif
#define calcParseARG_STORE
#define calcParseARG_FETCH
#define calcParseARG_SDECL
#define CALC_TOKEN_RPAR                            6
#define CALC_TOKEN_LPAR                            5
#define CALC_TOKEN_NUMBER                          4
#define CALC_TOKEN_END                             3
#define CALC_TOKEN_MINUS                           2
#define CALC_TOKEN_PLUS                            1
#define INTERFACE 0

One more thing to know about Lemon: it is a push parser, so it will never call the lexer itself.
We therefore need some code that reads tokens from the lexer and feeds them to the parser (main.cpp):

#include "lexer.hpp"
#include "calc_parser.h"
#include <cstdlib>   // atoi, malloc, free
#include <iostream>
#include <string>

using namespace std;

int main() {
    const char * content = "1+4+1+(5+3)";

    void* parser = calcParseAlloc( malloc );

    Lexer lexer(content);
    
    int token;
    
    do {

        token = lexer.scan();
        
        switch (token) {
            case CALC_TOKEN_NUMBER:
            {
                string num = lexer.getTokenValue();
                calcParse(parser, CALC_TOKEN_NUMBER, atoi(num.c_str()));
                break;
            }
            default:
                calcParse(parser, token, 0);
                break;
        };
    }
    while( token != CALC_TOKEN_END );
    
    calcParse(parser, 0, 0);
    
    cout << "finished!" << endl;

    calcParseFree( parser, free );
    
    return 0;
}

I modified the lexer to use the tokens from the parser header.

#ifndef LEXER_HPP
#define LEXER_HPP

#include <string>

class Lexer {
public:

    Lexer( const char *s );

    int scan();
    
    std::string getTokenValue() const;

private:
    const char *m_content;
    
    const char *m_start;
    const char *m_cursor;
    const char *m_limit;
    const char *m_marker;
    const char *m_ctxmarker;
};

#endif

and the lexer implementation:

#include "lexer.hpp"
#include "calc_parser.h"
#include <cstring>   // strlen
#include <iostream>

using namespace std;

Lexer::Lexer( const char *s ) : m_content(s)
{
    m_start = m_cursor = m_content;
    m_limit = m_content+strlen(m_content);
}

std::string Lexer::getTokenValue() const
{
    return string(m_start,m_cursor-m_start);
}

/*!max:re2c */

int Lexer::scan()
{
    m_start = m_cursor;

    #define YYCTYPE char
    #define YYCURSOR m_cursor
    #define YYLIMIT m_limit
    #define YYMARKER m_marker
    #define YYCTXMARKER     m_ctxmarker
    #define YYFILL(n)   

    /*!re2c
        re2c:indent:top      = 1;
        re2c:yyfill:enable   = 0;
        
        '\000'              { return CALC_TOKEN_END; }
        [0-9]+              { return CALC_TOKEN_NUMBER; }
        "+"                 { return CALC_TOKEN_PLUS; }
        "-"                 { return CALC_TOKEN_MINUS; }
        "("                 { return CALC_TOKEN_LPAR; }
        ")"                 { return CALC_TOKEN_RPAR; }
    */
}

Please refer to my previous article for how to compile the .re file with re2c.

Everything should now work and produce the output below:
= 14
finished!

Please let me know if this has been useful for you.






lundi 28 juillet 2014

Writing a lexer in C++ using re2c

Hello,

For the past two weeks, during my holidays, I wanted to learn how to write a good parser in C++, but without using flex/bison (last time I looked, they were not generating C++ and relied on global variables, etc. That might have changed since then).

Anyway, let's start with the re2c part. The fact that PHP uses it inspires confidence. However, as it had not been updated in a long time, I sent an email to the mailing list and got a nice answer from a maintainer. One week later, a new release came out (0.13.7.3) with UTF-8 and UTF-16 support. Just excellent!

I will focus on using re2c on a null-terminated C string, so I won't show the buffer refill method. That will probably be the topic of another article.

Let's now try to make a lexer that identifies numbers and a few tokens like +, - and *.
(Yes, the famous calculator! But that is just for this article; I will go deeper into PL/SQL in the next ones.)

So, first our class definition:

#ifndef LEXER_HPP
#define LEXER_HPP

#include <string>

class Lexer {
public:

    enum token_t {
        TOKEN_END = 0,
        TOKEN_MUL,
        TOKEN_PLUS,
        TOKEN_MINUS,
        TOKEN_NUMBER
    };

    Lexer( const char *s );

    token_t scan();
    
    std::string getTokenValue() const;

private:
    const char *m_content;
    
    const char *m_start;
    const char *m_cursor;
    const char *m_limit;
    const char *m_marker;
    const char *m_ctxmarker;
};


#endif

Then our "re" file, which will generate our .cpp file:

#include "lexer.hpp"
#include <cstring>   // strlen

using namespace std;

Lexer::Lexer( const char *s ) : m_content(s)
{
    m_start = m_cursor = m_content;
    m_limit = m_content+strlen(m_content);
}

std::string Lexer::getTokenValue() const
{
    return string(m_start,m_cursor-m_start);
}

/*!max:re2c */

Lexer::token_t Lexer::scan()
{
    m_start = m_cursor;

    #define YYCTYPE char
    #define YYCURSOR m_cursor
    #define YYLIMIT m_limit
    #define YYMARKER m_marker
    #define YYCTXMARKER     m_ctxmarker
    #define YYFILL(n)   

    /*!re2c
        re2c:indent:top      = 1;
        re2c:yyfill:enable   = 0;
        
        '\000'              { return TOKEN_END; }
        [0-9]+              { return TOKEN_NUMBER; }
        "+"                 { return TOKEN_PLUS; }
        "-"                 { return TOKEN_MINUS; }
        "*"                 { return TOKEN_MUL; }
    */
}

Some explanations:
- YYLIMIT is the end of the buffer; for us this is the end of our string
- YYCURSOR is the position of the character being read
- YYMARKER and YYCTXMARKER are used when re2c has to backtrack (for instance when one token has to be followed by something)

What is important, very important even, is to guard the end of your string: match the null character with the rule '\000', identified as TOKEN_END.

Then you just have to run re2c like below, to generate the lexer.cpp file:
re2c -i -o lexer.cpp lexer.re

-i     --no-debug-info  Do not generate '#line' info (useful for versioning).
-o of  --output=of      Specify the output file (of) instead of stdout

And now we use the lexer in our main.cpp:

#include "lexer.hpp"
#include <iostream>

using namespace std;

int main() {
    const char * content = "1+4*2-1+10*2+";

    Lexer lexer(content);
    
    Lexer::token_t token;
    
    do {

        token = lexer.scan();
        
        switch (token) {
            case Lexer::TOKEN_NUMBER:
                cout << "Num: " << lexer.getTokenValue() << endl;
                break;
            case Lexer::TOKEN_PLUS:
                cout << "+: " << lexer.getTokenValue() << endl;
                break;
            case Lexer::TOKEN_MINUS:
                cout << "-: " << lexer.getTokenValue() << endl;
                break;
            case Lexer::TOKEN_MUL:
                cout << "*: " << lexer.getTokenValue() << endl;
                break;
            case Lexer::TOKEN_END:
                cout << "end reached" << endl;
                break;
        };
    }
    while( token != Lexer::TOKEN_END );
    
    cout << "finished!" << endl;

    return 0;
}

giving the output:

Num: 1
+: +
Num: 4
*: *
Num: 2
-: -
Num: 1
+: +
Num: 10
*: *
Num: 2
+: +
end reached

finished!

Isn't that simple?

The next article will cover Lemon... another fantastic tool, which generates a LALR(1) parser (written by the creator of SQLite and used in SQLite).

See you soon!

lundi 12 mai 2014

DBLite has been updated

Hello,

Today brings a big update to DBLite.

I added savepoints,
fixed a bug with transactions in the destructor,
introduced the Objective-C type NSString wherever it made sense (for iOS, you need to compile the .cpp as an Objective-C++ source file),
and added flags to control the session (read-only, create, read-write) when opening the database.

And most importantly, DBLite is now fully unit tested. I used Xcode; the tests are not yet publicly available, but they will be in the next few days.

Here is the wiki (still incomplete; it just shows the basics to get started):
https://gitorious.org/sylisa-dblite/pages/Home

I wonder whether I should migrate to GitHub, as it at least has a basic issue tracker...

Also, don't hesitate to send any feedback; I am all ears.

See you,
Sylvain