

Naive Bayes Classifier – Revisited

During the last week, I’ve been following up on a side project to do machine learning with Urbanesia’s comprehensive data. A lot of late night reading and fiddling with unfamiliar code were the highlights of my week. I want to elaborate on my implementations and on how several kinds of technologies affect benchmarks, particularly classification performance.

The repo for the code is on GitHub here.

Between the first batch of code and now, I have made lots of changes to the code and also to the data store. I wasn’t sure at first which database would bring the best performance. I’m testing on fairly low-spec hardware: a Late 2011 MacBook Air with 4 GB of DDR3, an SSD and a 1.7 GHz Intel Core i5, which is nothing compared to a real server. By the way, although relatively low spec, she’s got a name: Claire.

My first challenge was to abstract the data stores and deal with the algorithm later. To keep things familiar and easy, MySQL was the first store I dealt with. After getting the tables ready, I coded the algorithm with help from Alexandru Nedelcu’s excellent Hacker News posting on implementing a Naive Bayes Classifier in Ruby. That produced the alpha version.

The alpha performed really badly: it took more than 1,000 seconds to classify a single word. MySQL was, as expected, not up to the task. Since the data is actually a collection of words, I was intrigued to use MongoDB as the data store. Since the abstraction layer was already there, writing a MongoDB store was quite painless, and I hoped to get better results. The benchmark showed that with MongoDB it took more than 400 seconds to classify a single word. Still not good enough. I wasn’t prepared to write scheduled backend services that would blow up the servers with at least 50,000 users, not to mention the 200,000+ businesses we have. It would be a sysadmin’s nightmare.

Real work was catching up with side projects, so I decided to take a break. Last week I managed to get some time to write more code. I read through Hacker News to look for the perfect NoSQL database for the data we have. I remembered a friend of mine, Dondy Bappedyanto, talking about Redis and how it is a superset of Memcache. So I went straight to Redis.io and compiled the source code.

Disclaimer: I knew the algorithm wasn’t as optimized as I would have liked with the MySQL and MongoDB stores. I wanted to focus on macro optimizations first and do micro optimizations afterwards.

Redis is quite unique because it’s “Memcache-like”, storing data as key-value pairs. The logic changes dramatically, and learning more about Redis’ data types helps a lot. My aim was to study Redis while doing the project, so I opted to write the code with primitive data types first and optimize along the way. So with a lousy algorithm and a not-so-optimized data model in Redis, I classified a keyword and it was instant love. It only took ~1 second.

So in my mind, I already had the optimization I wanted on a macro level; it was time to get dirty. Since I enjoy new stuff as it comes up, I researched other implementations of the Naive Bayes Classifier in other languages. I was thinking about implementing a Node.js + Socket.io proxy to handle communication with our V2 client-side JavaScript, and was interested in learning more about Node.js.

A quick Google search introduced me to several Node.js modules that do the job. One that I was particularly interested in was Classifier by Heather Arthur. I read through the source code and found some clever methods to speed things up: get all the data first and do the calculations afterwards. But I was curious about Node.js and wanted to learn to code with it. So I wrote a more optimized version of my previous PHP algorithm and implemented it in JavaScript. I wanted to know how my code would perform against the Classifier Node.js module. Both used Redis as the data store.

The quick answer is that both my code and the Classifier module achieved sub-second performance, classifying single keywords in ~300 milliseconds. This was a great morale boost, but the fun only lasted a while. It turns out that sometimes both implementations won’t return results on medium to large datasets. Being a newbie with Node.js, I didn’t know what to do. My guess is that it has something to do with memory, because neither implementation emitted its finish event. It could be a Node.js problem, or rather a problem with the redis and hiredis node modules.

This made me code in PHP again. I heavily modified the PHP implementation to get the data first and calculate later. I was surprised by the result: it took only ~0.01 seconds to classify a single keyword after the optimization. This gave me the idea to do the calculation in PHP and use Node.js + Socket.io as a frontend for JavaScript clients.
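The “get the data first, calculate later” idea can be sketched as follows. This is a minimal illustration, not the code from the repo: the `$counts` array is a hypothetical stand-in for whatever one round trip to Redis would return, the category names and numbers are made up, and the prior is estimated from per-category word totals as a simplification. Laplace smoothing and log probabilities are assumed here as common choices; the original implementation may do this differently.

```php
<?php
// Hypothetical counts, as if fetched from the data store in one go.
// 'total' is the number of word occurrences seen for the category.
$counts = [
    'food'  => ['total' => 6, 'words' => ['sate' => 3, 'nasi' => 2, 'kopi' => 1]],
    'hotel' => ['total' => 4, 'words' => ['kamar' => 3, 'kopi' => 1]],
];

/**
 * Classify a list of words with a Naive Bayes model.
 * All counts are passed in up front, so no I/O happens inside the loops.
 */
function classify(array $words, array $counts): string
{
    // Vocabulary size and grand total are needed for smoothing and priors.
    $vocab = [];
    $grandTotal = 0;
    foreach ($counts as $cat) {
        $vocab += $cat['words'];          // array union keeps each word once
        $grandTotal += $cat['total'];
    }
    $vocabSize = count($vocab);

    $best = '';
    $bestScore = -INF;
    foreach ($counts as $name => $cat) {
        // Work in log space so many small probabilities do not underflow.
        $score = log($cat['total'] / $grandTotal);
        foreach ($words as $word) {
            $seen = $cat['words'][$word] ?? 0;
            // Laplace (add-one) smoothing avoids log(0) for unseen words.
            $score += log(($seen + 1) / ($cat['total'] + $vocabSize));
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $name;
        }
    }
    return $best;
}

echo classify(['sate'], $counts), "\n";   // food
echo classify(['kamar'], $counts), "\n";  // hotel
```

The point of the shape is that the loops only touch arrays already in memory, so the cost of the data store is paid once per classification instead of once per word and category.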

Since it was really painless to do WebSocket with Socket.io, it took only a few minutes to produce the Node.js frontend available here. In a subjective benchmark, it took 68 milliseconds to classify and deliver the result to JavaScript clients. This was a near-realtime result, and I had found my solution.

Last night was full of fiddling with the algorithm, trying to get the best accuracy out of it, and as of today the PHP implementation is at version 0.3.0. A coding session this afternoon led to a helper that produces a blacklist (stopwords) from a collection of text. I couldn’t just import the most frequent words into the blacklist collection because that is really subjective and depends on the language. Urbanesia’s data is a mix of Indonesian and English, so it will take more time to analyze. If there’s an acceptable automation method, I will share it in the repo.
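One possible shape for such a helper is sketched below. It is an assumption for illustration, not the helper from the repo: the function name, the document-frequency heuristic and the 80% threshold are all hypothetical. It only proposes candidates, since deciding what is a real stopword in mixed Indonesian and English text still needs a human eye.

```php
<?php
/**
 * Return the words that appear in at least $minShare of the documents.
 * These are candidates for review, not a finished blacklist.
 */
function stopword_candidates(array $documents, float $minShare = 0.8): array
{
    $docFreq = [];
    foreach ($documents as $text) {
        // Split on anything that is not a letter; lowercase for matching.
        $words = preg_split('/[^\p{L}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        // Count each word once per document (document frequency).
        foreach (array_unique($words) as $word) {
            $docFreq[$word] = ($docFreq[$word] ?? 0) + 1;
        }
    }

    $threshold = $minShare * count($documents);
    $candidates = array_keys(array_filter($docFreq, fn ($n) => $n >= $threshold));
    sort($candidates);
    return $candidates;
}

$docs = [
    'Sate ayam di Jakarta',
    'Kopi enak di Bandung',
    'Hotel murah di Bali',
];
print_r(stopword_candidates($docs)); // only "di" appears in every document
```

Counting documents rather than raw occurrences keeps one very long text from dominating the list.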

The conclusion of this project is to think less and do more. Machine learning algorithms are available throughout the Internet. Smart and talented developers before and after us will keep finding new ways to organize data; it’s the implementation that counts. Each problem has its own domain, and I’m sure my code will not cater to all problems. However, learning by doing is also an excellent experience.

A Naive Bayes Classifier calculates probabilities on the assumption that each keyword is independent of the other keywords being classified, so it’s really suited to mining preferences, related content, etc. But in cases where a group of keywords is actually what we want to know about, its accuracy won’t be so great. This calls for another solution. If you have any ideas about this, please do comment; I would love to know what you think.
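For reference, the independence assumption is usually written as below, where c is a category and w_1 to w_n are the keywords being classified (the symbols are chosen here for illustration and do not come from the repo):

```latex
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

Each keyword contributes its own factor, so a phrase whose words only carry meaning together is scored as if the words had occurred separately.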

Cheers!

Jajan for Android Open Sourced at Github

Jajan for Android was open sourced on GitHub a few hours ago. I personally hope that by looking at the source code, more and more developers will see how easy it is to create an Android application. I wrote most of the code on 7 August 2011 in under 4 hours. Using the ready-made libraries already available within Android, along with other third-party libraries, helped ease the complications.

The source code is NOT perfect; there are a lot of places where it could be optimized much more aggressively. Most of the optimization will definitely lie within the ListView. In any case, it loads 100 search results; you can make this endless by loading in increments of your choice.

The code is available at https://github.com/tistaharahap/jajan/.

Excerpts from the README shown below:

JAJAN by Urbanesia
==================

Jajan is a simple app to showcase Urbanesia's API v1.0 and how you can extend for your own apps.

As of this writing, the initial commit is at sync with Jajan's binaries at Android Market which is version 1.1.1. Upcoming Jajan versions will NOT be published from the codebase here in Github, this repository is treated as an example for future third party apps by you.

Jajan is available in multiple platforms, go to , if your device is one of the supported platform, it will redirect to your device's application store or it may have you download a binary for your platform.

Mid 2011 – Learning Curves

It’s not exactly mid 2011 anymore; almost a month has passed, but I feel it’s still reasonable to title this blog post as it is. In the most cliché way of saying this: a lot has happened. One effort that stands out from all the others is keeping complicated things from becoming complicated. Everything that has been going on for the last 6 months is inspired by a quote: Less is More.

Murphy’s law states that anything that can go wrong will go wrong. No matter how good the preparation, execution and anticipation, there is still the slightest possibility that something WILL go wrong. When it does, headaches are bound to happen. That’s why keeping things simple makes troubleshooting much less likely to cause more headaches.

Just about any object-oriented programming language gives us Exceptions to play around with. However, even when we try to catch Exceptions, it’s not a failsafe mechanism for pinning down the source of a problem. In a production environment, especially one where a single node is serving a lot of services, the possible sources of a problem multiply with the number of services running at any given time.

Late last year, a Googler confirmed that the faster a page is served to visitors, the more revenue it generates. This has held true every time. It’s the principal rule of thumb for any website. I can’t count how many man-hours are spent trying to squeeze seconds into milliseconds. As any best practice will tell you, optimizing (upgrading) the application and the brain is the answer to scaling properly. It has been a nightmare and also sweet dreams over and over again. I guess it’s never-ending: you scale when it’s time to, and never when you don’t need to.

Following on from the last paragraph, a lot has changed in the way I code for both front ends and back ends. Having started out as a web developer with PHP, I tend to be spoiled by how relatively painless the language is compared with other languages. On the hardware side, servers are equipped with the best resources money can buy. These two represent all the performance woes that any website will endure.

Coding for front ends, it’s hard to really practice scaling if you don’t know the big picture. Whether it’s PHP, JavaScript or Java, it’s difficult to know for sure. Statistics and benchmarking may help, but they’re not really enough. Lab tests are controlled conditions where we simulate a production environment. Even if the application, OS and hardware are all exactly the same, it’s still not the same. As developers, really knowing the big picture can save a lot of time and headaches.

The problem is how to communicate all this to other developers. Every developer is unique, but they all share the same trait, which is curiosity. A good lead developer should know how to get the best out of the team through what my friend at Yahoo! Indonesia calls Progressive Disclosure. It’s actually a User Experience (UX) technique, but I feel it’s usable in real life too. Practicing it means a relatively longer learning curve, but the principles that are captured and practiced will last a lifetime. After all, everything instant goes away instantly.

Anyone who has coded for back ends is by nature a wide thinker. They are adept at making frameworks and APIs that will be used over and over by other developers. They tend to think 2-3x harder and take more time to produce far fewer lines of code than others, but their results are more often than not bug free. Bugs happen when they mean them to happen. In translation: they haven’t thought of the condition yet, which usually surfaces when the code is in a production environment.

People usually can’t stand back end people because they curse a lot more than the lines of code they produce. Including them in daily routine coding tasks is suicidal. That’s why bridging, by getting people to bond after hours or over a few bottles of beer, is important. Every developer is human, and humans are social creatures. Although geeky looking, developers still feel the urge to see the world through the eyes of others.

Enough has been said and written about the love-hate relationship between Developers VS Designers. At any of the big names in the tech world, the pride of one over the other is balanced. I don’t really care which matters more; what I care about is what the two can achieve together.

Developers might not know how a design is forged. Designers think about human interactions as part of their work. The psychology of the visitors is the key to capturing valuable insights into how well a design is performing. Nowadays on the web and on mobile devices, we see more creative and innovative ways of turning dull data into meaningful information because of design. Take Facebook, for instance: the data in one of your friends’ profiles matters to you as a friend, but it has no value to others who have no interest in your friend. Leveraging heavy JavaScript, Facebook has in my opinion created a simple approach to a nasty set of data. Make that 500 million friends.

To do all the above takes research and careful thinking, which most likely isn’t credited enough by developers. The same goes the other way around. I myself am a developer; all of my code stays hidden behind those silky designs. Every line of code is worth at least 0.5 seconds of my time. During those 0.5 seconds, I must make more than one decision, because the code can and most probably will branch into another condition that I must eliminate.

So going back to Developers VS Designers, I feel it’s important to get things straight from the moment the project starts. It takes two to tango and, above all, teamwork is more important than personal gain. The recipe for success will vary across teams, but one thing is for sure: communication between the two is crucial, and I mean real communication, like actually talking face to face. Body language, tone, words, etc. count a great deal in shaping the same perspective and perception of the matters at hand.

There is still a lot to go, but the last 6 months have been great. I managed to entertain myself with a lot of new knowledge while digging into my passion deeper and deeper. I honestly don’t know how to quit; the only thing I know I’m good at is My Passion, wherever it takes me.

CodeIgniter Session With Memcache + Anti Bots!

Last night was a thrilling change of routine. Urbanesia was crippled by the unprecedented growth of our MongoDB databases. I must admit that MongoDB is like Memcache on steroids; well, it overdosed. MongoDB doesn’t have any mechanism to limit its memory usage; the only limit we can define is the size of its individual files. Therefore, something had to be done!

The second flaw was with CodeIgniter, by design. By default, CodeIgniter uses its own Session handling mechanism, using either a cookie or a database. The supported databases are limited to the drivers available for CodeIgniter. Well, we hacked it to use MongoDB a few months ago.

The boomerang was that CodeIgniter, again by default, does not filter bots in its session mechanism. Urbanesia is very attractive to bots, so most of our sessions belonged to bots, which equals junk data. The garbage collector for sessions was also very primitive. We had to do something about this.

We wanted a fast, simple yet elegant solution to tackle the problems above. MySQL was out of the question, of course: insert/update activity would surely lock tables and we couldn’t afford that. So we turned to Memcache. Its most important built-in feature was the ability to limit memory usage, which gave us a garbage collector for stale sessions without any extra code at all!

As far as I know, there is no Memcache session handling available for CodeIgniter, so I went ahead and did a whole redo of our MY_Session library to accommodate Memcache as our Session storage engine. The first thing to do was to filter the bots that frequently visit Urbanesia and deny them sessions; a cookie will do just fine for them.

function __detectVisit() {
	$this->CI->load->library('user_agent');
	// Lowercase the user agent so matching is case-insensitive.
	$agent = strtolower($this->CI->input->user_agent());

	// Markers must be lowercase, since $agent is lowercased above
	// (the original "PEAR" entry could never match).
	$bot_strings = array(
		"google", "bot", "yahoo", "spider", "archiver", "curl",
		"python", "nambu", "twitt", "perl", "sphere", "pear",
		"java", "wordpress", "radian", "crawl", "yandex", "eventbox",
		"monitor", "mechanize", "facebookexternal", "bingbot"
	);

	foreach($bot_strings as $bot) {
		if(strpos($agent, $bot) !== false) {
			return "bot";
		}
	}

	return "normal";
}
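To try the same substring check outside CodeIgniter, here is a standalone sketch. The function name `detect_visit` and the shortened marker list are assumptions for illustration; the point to notice is that every marker has to be lowercase, because the user agent is lowercased before matching.

```php
<?php
/**
 * Return "bot" when the user agent contains a known bot marker,
 * otherwise "normal". The marker list is shortened for illustration.
 */
function detect_visit(string $userAgent): string
{
    // Markers must be lowercase because the agent is lowercased below.
    $markers = ['google', 'bot', 'spider', 'crawl', 'curl', 'python', 'pear'];
    $agent = strtolower($userAgent);

    foreach ($markers as $marker) {
        if (strpos($agent, $marker) !== false) {
            return 'bot';
        }
    }
    return 'normal';
}

echo detect_visit('Mozilla/5.0 (compatible; Googlebot/2.1)'), "\n";                 // bot
echo detect_visit('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7) Safari/534'), "\n"; // normal
echo detect_visit('PEAR HTTP_Request class'), "\n";                                 // bot
```

Substring matching like this will also flag any legitimate browser whose user agent happens to contain one of the markers, which is the trade-off for keeping the check this cheap.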

This detection is quite primitive, but it works and it satisfied our need to filter the most frequent bots. The next step was to build namespaces adjusted to some of CodeIgniter’s built-in Session handling mechanisms.

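function __build_namespace($sess_id, $ip_addr = 0, $user_agent = '') {
	// Append the session ID to whatever $this->namespace currently holds.
	$this->namespace .= $sess_id;
	// Append the IP (as an integer) only when IP matching is enabled.
	if($this->sess_match_ip == TRUE && $ip_addr > 0)
		$this->namespace .= '#'.ip2long($ip_addr);
	// Append a hash of the user agent only when user-agent matching is enabled.
	if($this->sess_match_useragent == TRUE && $user_agent != '')
		$this->namespace .= '#'.md5($user_agent);
}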

The three parameters are all components of a standard CodeIgniter Session. Since CodeIgniter gives us options like sess_match_ip and sess_match_useragent, it’s important to adjust the namespace accordingly, so that it acts as a filter of its own. One of the most difficult parts was deciding whether to use JSON or a serialized array to store custom user data. I decided to use JSON in the end. Here’s a code snippet that sets a session value in Memcache.

$this->CI->memsess->set($this->namespace, json_encode($this->userdata), $this->sess_expiration);
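The key building and the JSON round trip can be tried in isolation with the sketch below. Everything in it is an assumption for illustration: `build_session_key`, the `sess:` prefix and the plain array standing in for the Memcache client are hypothetical, and expiry is left out because an array has no notion of it.

```php
<?php
/**
 * Build a session key from the session ID plus optional IP and
 * user-agent components, mirroring the idea of __build_namespace().
 */
function build_session_key(string $sessId, string $ip = '', string $userAgent = '', string $prefix = 'sess:'): string
{
    $key = $prefix . $sessId;

    // ip2long() returns false for anything that is not a valid IPv4 address.
    $ipLong = $ip !== '' ? ip2long($ip) : false;
    if ($ipLong !== false) {
        $key .= '#' . $ipLong;
    }
    if ($userAgent !== '') {
        $key .= '#' . md5($userAgent);
    }
    return $key;
}

// Hypothetical array-backed store standing in for a Memcache client.
$store = [];
$key = build_session_key('abc123', '127.0.0.1', 'Mozilla/5.0');
$store[$key] = json_encode(['user_id' => 42, 'lang' => 'id']);

// Passing true as the second argument decodes to an associative array.
$userdata = json_decode($store[$key], true);
echo $key, "\n";
echo $userdata['user_id'], "\n"; // 42
```

Because the IP and user agent are part of the key, a request that presents the same session ID from a different address or browser simply misses the cache, which is what makes the namespace act as a filter.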

FYI, I used another library called memsess, short for Memcache Sessions (lol), to let me shard Memcache arrays. I wanted an exclusive Memcache instance used solely to store sessions. The main reason was to keep session data as tidy as possible, meaning there is no other data that will push session data out unless we tell it to. This makes the Memcache instance far more predictable. Most of the code was derived from CI_Session and modified to use Memcache as storage. I will not go into the full details of the library; instead I’m going to give the code for the sess_read() method. I’m pretty sure it’s enough for you to experiment on your own.

function sess_read() {
	// Kick out bots!
	if($this->is_bot) {
		$this->sess_destroy();
		return FALSE;
	}

	$session = $this->CI->input->cookie($this->sess_cookie_name);

	if($session === FALSE) {
		return FALSE;
	}

	if ($this->sess_encrypt_cookie == TRUE) {
		$session = $this->CI->encrypt->decode($session);
	} else {
		$hash	 = substr($session, strlen($session)-32);
		$session = substr($session, 0, strlen($session)-32);

		if ($hash !==  md5($session.$this->encryption_key)) {
			$this->sess_destroy();
			return FALSE;
		}
	}

	$session = $this->_unserialize($session);

	if (
		!is_array($session)
		OR ! isset($session['session_id'])
		OR ! isset($session['ip_address'])
		OR ! isset($session['user_agent'])
		OR ! isset($session['last_activity'])
	) {
		$this->sess_destroy();
		return FALSE;
	}

	if (($session['last_activity'] + $this->sess_expiration) < $this->now) {
		$this->sess_destroy();
		return FALSE;
	}

	if ($this->sess_match_ip == TRUE AND $session['ip_address'] != $this->CI->input->ip_address()) {
		$this->sess_destroy();
		return FALSE;
	}

	if (
		$this->sess_match_useragent == TRUE
		AND trim($session['user_agent']) != trim(substr($this->CI->input->user_agent(), 0, 50))
		) {
		$this->sess_destroy();
		return FALSE;
	}

	// Build namespace!
	$this->__reset_namespace();
	$this->__build_namespace($session['session_id'], $session['ip_address'], $session['user_agent']);

	$query = $this->CI->memsess->get($this->namespace);
	if(empty($query)) {
		$this->sess_destroy();
		return FALSE;
	}

	$row = json_decode($query);
	if(isset($row->user_data) AND $row->user_data != '') {
		$custom_data = $this->_unserialize($row->user_data);
		if(is_array($custom_data)) {
			foreach($custom_data as $key => $val) {
				$session[$key] = $val;
			}
		}
	}

	$this->userdata = $session;
	unset($session);

	return TRUE;
}

There you go, a glimpse into Session management in CodeIgniter with Memcache. This is the product of an experiment born of need. I’m sure it can be done in smarter ways; the sky is the limit ;)


Batista R Harahap
Jl. Bango II/29C, Pondok Labu
Cilandak , DKI Jakarta , 12450 Indonesia
62817847023
