Simple Naive Bayes Classifier for PHP

Recently Hacker News has been flooded with articles discussing, or at least mentioning, the Naive Bayes classifier algorithm. It’s a basic algorithm for classifying a set of words into a category based on prior learning of words and their probabilities. It sounds simple enough, but without an actual technical guide it’s far from trivial, since most of the information out there is too messy for newbies like myself.

Just today, there was an article by Alexandru Nedelcu about the Naive Bayes classifier here, which is exactly what I was looking for. It’s simple, to the point and, most importantly, outlines the benefit of using the algorithm with practical examples. The code is in Ruby, but the article is so well written that you don’t have to look at the source code.

So I somewhat forked and ported the idea into PHP and voila, the PHP counterpart is available at https://github.com/tistaharahap/Simple-Naive-Bayes-Classifier-for-PHP. It’s still very basic, just a proof of concept with MySQL as its persistent storage. The Store is abstracted so you can write your own Store with any database you’d like.
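To make the idea concrete, here is a minimal sketch of the technique in Python. It is not the PHP code from the repository: the class and method names (`MemoryStore`, `NaiveBayes`, `train`, `classify`) are made up for illustration, the in-memory store only stands in for a database-backed Store, and add-one smoothing is my own assumption about how to handle unseen words.

```python
import math
import re
from collections import defaultdict


class MemoryStore:
    """Hypothetical in-memory store; a database-backed store would offer the same methods."""

    def __init__(self):
        self.word_counts = defaultdict(lambda: defaultdict(int))  # category -> word -> count
        self.doc_counts = defaultdict(int)  # category -> number of trained documents

    def add(self, category, words):
        self.doc_counts[category] += 1
        for word in words:
            self.word_counts[category][word] += 1

    def categories(self):
        return list(self.doc_counts)

    def word_count(self, category, word):
        return self.word_counts[category].get(word, 0)

    def total_words(self, category):
        return sum(self.word_counts[category].values())

    def vocabulary_size(self):
        return len({w for counts in self.word_counts.values() for w in counts})

    def doc_count(self, category):
        return self.doc_counts[category]

    def total_docs(self):
        return sum(self.doc_counts.values())


class NaiveBayes:
    def __init__(self, store):
        self.store = store

    @staticmethod
    def tokenize(text):
        # Lowercase the text and split it into simple word tokens.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, category, text):
        self.store.add(category, self.tokenize(text))

    def classify(self, text):
        words = self.tokenize(text)
        vocab = self.store.vocabulary_size()
        best, best_score = None, -math.inf
        for category in self.store.categories():
            # log P(category) + sum of log P(word | category), with add-one smoothing.
            score = math.log(self.store.doc_count(category) / self.store.total_docs())
            total = self.store.total_words(category)
            for word in words:
                score += math.log((self.store.word_count(category, word) + 1) / (total + vocab))
            if score > best_score:
                best, best_score = category, score
        return best


classifier = NaiveBayes(MemoryStore())
classifier.train("spam", "cheap pills buy now")
classifier.train("spam", "buy cheap watches now")
classifier.train("ham", "meeting notes for the project")
classifier.train("ham", "lunch with the project team")
print(classifier.classify("buy cheap pills"))  # prints "spam"
```

Because the classifier only talks to the store through a handful of methods, swapping the in-memory store for one backed by MySQL or MongoDB should not require touching the classification logic.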

My focus is on creating code that will scale for big documents, and yes, MySQL won’t be a definite winner here for scalability, but I’m using it for now to make learning easier. I’m planning on creating a HandlerSocket Store as well as a MongoDB Store.

The code in the repository is not ready for prime time yet; however, feel free to fork, port or do anything you feel is right with it. Have a great time ;)

Small is NOT so small

It’s been a few months since I last posted on my own blog. It’s the 24th day of 2011 and it seems that months have passed by just like that. In the 24 days that have passed, every single day has been crammed with all the aspects of being a startup. Just today, I went from a programmer, a cable crimper, a business partner, a troubleshooter, a mobile app consultant and a colleague to a friend for a friend, all in just one day. Multiply that by 24 and that’s exactly what’s been going on. The dynamics are mind blowing.

I can’t help but wonder. What I’m doing right now is looking back, kinda being nostalgic, to figure out how so much energy can be exuded with far less sleep than ever before. Physics defines E (energy) as m (mass) times c (the speed of light) squared. There is no unknown variable there, although I have to admit my mass fluctuates, but that pales into insignificance when multiplied by the square of the speed of light, which is roughly 300 million meters/second. And NO, the speed of light is a constant and therefore always has the same value, while mass is the dynamic part.

Well, mass is the variable and this is where energy can be increased or decreased, in direct proportion. With Einstein’s theory of energy, it’s safe to say that without mass, it’s just light. The same light that currently transports zillions of gigabits of traffic to every connected computer around the world. Best of all, at least for now, light is unlimited and available anywhere in the world. What the world lacks is mass.

These past months, we’ve been having a 40% problem at Urbanesia. Our sudden drop in traffic was just weird. We made every effort to tackle the drop, from A/B testing new designs to optimizing code to serve pages faster, like it was do or die. Then we found a really troubling fact: our Google Analytics JavaScript code was not loading as expected in most parts of Urbanesia.

With the new look, we introduced a new framework for Urbanesia. We hate inline JS and CSS, so we created a framework that parses JS and CSS, minifies them and puts them in separate files. The Google Analytics loader JS was one of them. We put it at the bottom like any other sane website. Well, that was the problem. Most of our visitors’ Internet connections were not fast enough to grab the loader file in time to keep up with their engagement with the website.

So, as with every good programming practice, we refactored the code. Affectionately, we put the code up top, before the closing HEAD tag, and pointed ga.js to load from our CDN. The result was a 40% difference! This is a big lesson for everyone at Urbanesia. With Google Analytics measuring the right traffic numbers, our approach to enabling Urbanesia to scale is now sane again.

During the process, we treated ourselves to a new Cloud Server at Biznet and boy, that has made another big impact on our traffic. From the beginning of the new look for Urbanesia, we concentrated our efforts on making sure our application servers could be migrated to multiple servers instantly. The big roadblock was that we didn’t have a good enough API to support this. So we went on to create one of the best APIs to date for Urbanesia.

By having a great API, we managed to localize each part of Urbanesia so that it scales in proportion. Optimizing Apache to handle this many API calls per second is not trivial. Apache’s great flexibility became its own worst nightmare. Our Apache configuration was slowly hand tuned to cope with the changes. Thank God for nginx!

Our cloud server is a single core, 1 GB RAM server and it’s serving half of Urbanesia’s traffic with that little guy named nginx. Nginx as a proxy load balancer is amazing. Just by having 1 nginx instance, we boosted our concurrent traffic by 300%. With proper caching of static content, this has been a relief. Now that we have 2 nginx instances, we can cope with multi-Mbps traffic constantly.

In parallel, we transitioned Urbanesia from a 100% MySQL-oriented website to a split of 70% MySQL and 30% MongoDB. The problem with MySQL is that it can’t keep up with our increasing needs and we don’t wanna scale hardware yet. So it’s back to the applications we use. We put all of our high-volume, low-value data in MongoDB. The best thing about MongoDB is that it’s really fast! The catch is that MongoDB uses a lot, and I mean a lot, of physical memory to be able to do this. Think of it as MySQL + Memcache on steroids. To keep MongoDB performing like this, we revamped our database server to cope with MongoDB’s need for memory.

So as you can see, we have logically separated Urbanesia into partitions. Now we can scale Urbanesia according to the needs of each partition. A Googler told me that “Faster loading of pages equals more revenue”. We measure revenue not necessarily in $$$. User engagement is one of our metrics. Well, the words were proven. We hit a new record for user engagement this January. The total number of reviews as of 24 January 2011 has beaten the best month of 2010 and we still have 7 days to get more reviews.

Our marketing and development teams are moving at the right pace, both independently and as one big team. This is key to our sudden increase in performance in January 2011. We learned to work it out as a team and make decisions based on the data collected. We A/B test all of our changes and introduced a heat map. The data we gained was invaluable to our efforts. Because Urbanesia is so vast, we couldn’t just highlight everything all at once; we needed to move single-mindedly, reinforcing everything one step at a time.

The last 3 months of 2010 were allocated to creating a rock solid foundation for 2011 and it worked like a charm. I’m very proud to say that our team of superheroes is incredible. The dedication and affection towards Urbanesia are not what you can expect in any ordinary workplace. Salute to every one of us!

In closing, I learned an important lesson about scaling. Scaling is an art; nothing is certain until you get to the point where every destination is mapped and prepped for change. That means our infrastructure must be ready for changes at any time while maintaining a high level of stability. From the business side, well, when you get the software, hardware and human resources right, you’ve just gotten your most valuable asset: a living and breathing product manifested in all of our work. This will pay the bills, just gotta have faith :D

Urbanesia API Wrapper Released – Merry Christmas!

Before anything else, I would love to say MERRY CHRISTMAS to everyone, have a great holiday and may the Christmas spirit make all of us ready for 2011. God bless you all :) These past 2 days I’ve been working on a wrapper class for Urbanesia’s OAuth and xAuth authentication for our API. The main idea was to keep it as simple and as easy as possible. Here’s my Christmas present :)

To use it, you must sign up for Consumer Tokens. We’re still working on an automated system for that, so in the meantime you can email superhero [at] urbanesia [dot] com for early access. I used Git for version control this time to experience something new this Christmas. The files are already uploaded to GitHub. To get a taste of Urbanesia’s vast data, you can go here and spend some time with the README; after that, creativity is the limit :D
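For a feel of what a wrapper like this takes off your hands, here is a rough Python sketch of signing a request with OAuth 1.0 HMAC-SHA1. It is not taken from the Urbanesia wrapper: the endpoint URL, the keys and the function names are all hypothetical, and the `x_auth_*` parameter names assume the API follows the xAuth convention popularized by Twitter.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    # RFC 3986 encoding: everything except unreserved characters is escaped.
    return quote(str(value), safe="-._~")


def signature_base_string(method, url, params):
    # Parameters are encoded, sorted, then joined as key=value pairs.
    encoded = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in encoded)
    return "&".join([method.upper(), percent_encode(url), percent_encode(normalized)])


def sign_request(method, url, params, consumer_secret, token_secret=""):
    # The signing key is the consumer secret and token secret joined by "&".
    key = f"{percent_encode(consumer_secret)}&{percent_encode(token_secret)}"
    base = signature_base_string(method, url, params)
    digest = hmac.new(key.encode("utf-8"), base.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Hypothetical values for illustration only; real requests need a fresh nonce and timestamp.
params = {
    "oauth_consumer_key": "my-consumer-key",
    "oauth_nonce": "abc123",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_timestamp": "1293235200",
    "oauth_version": "1.0",
    "x_auth_mode": "client_auth",
    "x_auth_username": "someone",
    "x_auth_password": "secret password",
}
signature = sign_request("POST", "https://api.example.com/oauth/access_token",
                         params, "my-consumer-secret")
print(signature)
```

The signature would then be sent along as `oauth_signature` with the rest of the parameters, which is exactly the kind of bookkeeping a wrapper class is meant to hide.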

All code is released as Open Source under the Apache License, Version 2.0.

To wrap things up, Merry Christmas to everyone and have a great 2011!

Sphinx – Fulltext Search Engine – Part 1

It has been a few weeks since my first encounter with this Egyptian-named gem called Sphinx. At first glance, it looks complicated when you’re staring at a ready-made sphinx.conf. However, after careful redesigning and tinkering, it turns out to be one of the most flexible yet lightweight fulltext search engines available today. There are others, but nothing as light, fast and sleek as Sphinx. The cold truth is that, as far as I know, Sphinx is built around SQL-based databases. Since Urbanesia is already using MySQL as our backend, we’re lucky.

The first and most difficult part of learning Sphinx is its installation routine. Numerous times I have failed to compile Sphinx on my MacBook and also on CentOS servers. That made me stay away from it for far too long. So after a few Googling sessions, I thought it was time to tame the beast. The first step was to compile Sphinx on my MacBook.

What you’re gonna need:

  1. Sphinx source code provided here. I downloaded the latest 1.10 Beta version, spoilers: Realtime Indexes :)
  2. expat library provided here.
  3. libiconv library provided here.

Once you’re all set, we’re gonna go ahead and start the fiesta. Just a note, this tutorial downloads everything to /opt/sources and installs everything under the /usr/local directory. You’re free to tinker.

    # Become root so the installs can write to /usr/local
    sudo -s
    cd /opt/sources
    # Build and install expat
    tar xfz expat-2.0.1.tar.gz
    cd expat-2.0.1
    ./configure --prefix=/usr/local
    make && make install
    cd ..
    # Build and install libiconv
    tar xfz libiconv-1.12.tar.gz
    cd libiconv-1.12
    ./configure --prefix=/usr/local
    make && make install
    cd ..
    # Build and install Sphinx; -arch i386 forces a 32 bit build on Snow Leopard
    tar xfz sphinx-1.10-beta.tar.gz
    cd sphinx-1.10-beta
    ./configure --prefix=/usr/local/sphinx \
        CPPFLAGS="-I/usr/local/include -I/opt/local/include -I/Applications/xampp/xamppfiles/include -arch i386" \
        LDFLAGS="-L/usr/local/lib -L/opt/local/lib -arch i386" \
        CFLAGS="-O -arch i386" \
        CXXFLAGS="-O -arch i386"
    make -j4 install

So you have now successfully compiled Sphinx and installed it on your MacBook. In any case, as long as it’s a Unix-flavored OS, the routine is basically the same. Only on Mac OS X Snow Leopard did I have to put the compiler in 32 bit mode, because otherwise it mistakenly overrode all flags to 64 bit.

The next part of the tutorial will be about generating your own sphinx.conf. Until then!


Batista R Harahap
Jl. Bango II/29C, Pondok Labu
Cilandak, DKI Jakarta, 12450 Indonesia
62817847023
