mercredi 19 mai 2010

How to implement an regex replace as a sqlite extension using ICU and boost

Hello,

I am currently implementing a sqlite dll extension for having the regexp replace...
it is so useful that I decided to share it with you all.

first of all, I went with boost + icu because I met some issues with Qt. I still didn't find any workaround for using Qt...

I am working on a mac os x 10.6.3 and I don't know if it works on other platform, maybe yes, maybe not... Let me know :-)

I made a c++ file that I named sqliteextensions.cpp

#include "/usr/local/include/sqlite3ext.h"
SQLITE_EXTENSION_INIT1


#include <unicode/unistr.h>
#include
<boost/regex/icu.hpp>
using namespace std;

extern "C" void sqlite3_regexp_replace(sqlite3_context *context, int argc, sqlite3_value **argv){
UnicodeString column_value_uft8 = UnicodeString::fromUTF8((const char*)sqlite3_value_text(argv[0]));
const unsigned char* pattern_utf8 = sqlite3_value_text(argv[1]);
UnicodeString pattern_replace_utf8 = UnicodeString::fromUTF8((const char*)sqlite3_value_text(argv[2]));
const boost::u32regex e = boost::make_u32regex(pattern_utf8); //from docs: UTF-8 when char size = 1
UnicodeString us = boost::u32regex_replace(column_value_uft8, e, pattern_replace_utf8);
string cs;
us.toUTF8String(cs);

sqlite3_result_text( context, cs.c_str(), -1, SQLITE_TRANSIENT );
}

extern "C" int sqlite3_extension_init(
sqlite3 *db,
char **pzErrMsg,
const sqlite3_api_routines *pApi
){
SQLITE_EXTENSION_INIT2(pApi)
sqlite3_create_function(db, "regexp_replace", 3, SQLITE_UTF8, 0, sqlite3_regexp_replace, 0, 0);
return 0;
}

then you compile this file using the command line below:

g++ -O3 -shared src/sqliteextensions.cpp \
-arch x86_64 -Xarch_x86_64 \
-framework CoreFoundation \
-L/usr/local/lib/ -lboost_regex \
-licudata -licui18n -licuio -licule -liculx -licutu -licuuc \
-o ../libsqliteextensions.dylib

now run sqlite3 (for me located in /usr/local/bin/ to have the latest version that I installed)
and type this command:

.load ../../libsqliteextensions.dylib

now you can use the regexp_replace function like:

update mytable set myfield = regexp_replace(myfield,'^[ \t]+','')

... and much more ... That's really great, isn't it?
it made my day :-)

I forgot to mention that I used ICU 4.2 and boost 1.43.
boost has been compiled with ICU, to make the regex lib aware of unicode.

Write to you soon.
Sylvain





4 commentaires:

  1. Sylvian, could you publish this as a DLL that I can load with sqlite? I don't know how to compile this, but I would like to do a search & replace inside an sqlite db, and reading your blog, I think this is what do the trick, right?

    thanks

    gert

    RépondreSupprimer
    Réponses
    1. unfortunately I don't have the time to publish and maintain a dll. However I gave all the instructions to build your lib on mac os x, I don't know how it would work on windows through.

      Supprimer
  2. Hello,

    It seems that the #include lines are in incomplete, no ?

    Your extension interest me, I try to compile boost with ICU support.

    RépondreSupprimer