mercredi 19 mai 2010

How to implement an regex replace as a sqlite extension using ICU and boost


I am currently implementing a sqlite dll extension for having the regexp replace...
it is so useful that I decided to share it with you all.

first of all, I went with boost + icu because I met some issues with Qt. I still didn't find any workaround for using Qt...

I am working on a mac os x 10.6.3 and I don't know if it works on other platform, maybe yes, maybe not... Let me know :-)

I made a c++ file that I named sqliteextensions.cpp

#include "/usr/local/include/sqlite3ext.h"

#include <unicode/unistr.h>
using namespace std;

extern "C" void sqlite3_regexp_replace(sqlite3_context *context, int argc, sqlite3_value **argv){
UnicodeString column_value_uft8 = UnicodeString::fromUTF8((const char*)sqlite3_value_text(argv[0]));
const unsigned char* pattern_utf8 = sqlite3_value_text(argv[1]);
UnicodeString pattern_replace_utf8 = UnicodeString::fromUTF8((const char*)sqlite3_value_text(argv[2]));
const boost::u32regex e = boost::make_u32regex(pattern_utf8); //from docs: UTF-8 when char size = 1
UnicodeString us = boost::u32regex_replace(column_value_uft8, e, pattern_replace_utf8);
string cs;

sqlite3_result_text( context, cs.c_str(), -1, SQLITE_TRANSIENT );

extern "C" int sqlite3_extension_init(
sqlite3 *db,
char **pzErrMsg,
const sqlite3_api_routines *pApi
sqlite3_create_function(db, "regexp_replace", 3, SQLITE_UTF8, 0, sqlite3_regexp_replace, 0, 0);
return 0;

then you compile this file using the command line below:

g++ -O3 -shared src/sqliteextensions.cpp \
-arch x86_64 -Xarch_x86_64 \
-framework CoreFoundation \
-L/usr/local/lib/ -lboost_regex \
-licudata -licui18n -licuio -licule -liculx -licutu -licuuc \
-o ../libsqliteextensions.dylib

now run sqlite3 (for me located in /usr/local/bin/ to have the latest version that I installed)
and type this command:

.load ../../libsqliteextensions.dylib

now you can use the regexp_replace function like:

update mytable set myfield = regexp_replace(myfield,'^[ \t]+','')

... and much more ... That's really great, isn't it?
it made my day :-)

I forgot to mention that I used ICU 4.2 and boost 1.43.
boost has been compiled with ICU, to make the regex lib aware of unicode.

Write to you soon.

ICU UnicodeString to UTF-8


This morning, I spent most of the time to know how to convert an ICU UnicodeString to an UTF-8 string.

the result is so simple, but no documentation, nowhere on google land...
The ICU documentation is obviously not so up-to-date.
I am using the latest version of ICU thought (4.2).

so to convert UnicodeString to std::string is simple as:
UnicodeString us = ...;
string cs;

so simple but so complex to find, I ended up to look at the source code...

so I hope it will be helpful for others....