Naive Bayes Classifier – Revisited

Over the last week, I’ve been following up on a side project to do machine learning with Urbanesia’s comprehensive data. A lot of late night reading and fiddling with unfamiliar code were the highlights of my week. I want to elaborate on my implementations and on how different technologies affect the benchmarks, particularly classification performance.

The repo for the code is on GitHub here.

Between the first batch of code and now, I have made lots of changes to the code and also to the data store. I wasn’t sure at first which database would give the best performance. I’m testing on fairly low spec hardware: a Late 2011 MacBook Air with 4 GB of DDR3, an SSD and a 1.7 GHz Intel Core i5, which is nothing compared to a real server. By the way, although she’s relatively low spec, she’s got a name: Claire.

My first challenge was to abstract the data stores and deal with the algorithm later. To keep things familiar and easy, MySQL was the first store I dealt with. After getting the tables ready, I coded the algorithm with help from Alexandru Nedelcu’s excellent Hacker News posting on implementing a Naive Bayes Classifier in Ruby. That produced the alpha version.
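As a rough sketch of what such a store abstraction could look like: the example below is in Python rather than the language used in the repo, and every name in it (`Store`, `MemoryStore`, `train`) is hypothetical. The idea is only that the classifier talks to a small interface, so a MySQL, MongoDB or Redis backend can be swapped in without touching the algorithm.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class Store(ABC):
    """Hypothetical storage interface; the classifier only talks to this."""

    @abstractmethod
    def increment(self, category, word, by=1):
        """Add `by` to the count of `word` seen in `category`."""

    @abstractmethod
    def word_count(self, category, word):
        """Return how often `word` was seen in `category` (0 if never)."""

    @abstractmethod
    def category_total(self, category):
        """Return the total number of words seen in `category`."""

    @abstractmethod
    def categories(self):
        """Return the known category names."""


class MemoryStore(Store):
    """In-memory backend; a database-backed class would expose the same methods."""

    def __init__(self):
        self._counts = defaultdict(lambda: defaultdict(int))

    def increment(self, category, word, by=1):
        self._counts[category][word] += by

    def word_count(self, category, word):
        # Use .get() so a lookup never creates an empty category as a side effect.
        return self._counts.get(category, {}).get(word, 0)

    def category_total(self, category):
        return sum(self._counts.get(category, {}).values())

    def categories(self):
        return sorted(self._counts)


def train(store, category, text):
    """Feed every whitespace-separated word of `text` into the store."""
    for word in text.lower().split():
        store.increment(category, word)


store = MemoryStore()
train(store, "food", "Nasi goreng nasi")
```

Because `train` only depends on the `Store` interface, benchmarking another backend comes down to passing in a different object.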

The alpha sucked really badly in terms of performance: it took more than 1,000 seconds to classify a single word. As expected, MySQL was not up to the task. Since the data is actually a collection of words, I was intrigued by the idea of using MongoDB as the data store. With the abstraction layer already there, writing a MongoDB store was quite painless, and I hoped for better results. Once the code was done, the benchmark showed that with MongoDB it still took more than 400 seconds to classify a single word. Still not good enough. I wasn’t prepared to write scheduled backend services that would blow up the servers with at least 50,000 users, not to mention the 200,000+ businesses we have. It would be a sysadmin’s nightmare.

Real work caught up with my side projects, so I took a break until last week, when I managed to get some time to write more code. I read through Hacker News looking for the perfect NoSQL database for the data we have. I remembered a friend of mine, Dondy Bappedyanto, talking about Redis and how it is a superset of Memcache. So I went straight to Redis.io and compiled the source code.

Disclaimer: I knew the algorithm wasn’t as optimized as I would have liked with the MySQL and MongoDB stores. I wanted to focus on macro optimizations first and do micro optimizations afterwards.

Redis is quite unique because it’s “Memcache-like”, storing data as key-value pairs. The logic changes dramatically, and learning more about Redis’ data types helps a lot. My aim was to study Redis while doing the project, so I opted to write the code with primitive data types first and optimize along the way. So with a lousy algorithm and a not-so-optimized data model in Redis, I classified a keyword and it was instant love. It only took ~1 second.

So in my mind, I had already got the optimization I wanted on a macro level, and it was time to get dirty. Since it’s in my nature to enjoy new stuff as it comes up, I researched implementations of the Naive Bayes Classifier in other languages. I was thinking about implementing a Node.js + Socket.io proxy to handle JavaScript communication with our V2 client side code, and I was interested in learning more about Node.js.

A quick Google search introduced me to several Node.js modules that do the job. The one I was particularly interested in was Classifier by Heather Arthur. I read through the source code and found some clever methods to speed things up: get all the data first and do the calculations afterwards. But I was curious about Node.js and wanted to learn to code with it. So I wrote a more optimized version of my previous PHP algorithm and implemented it in JavaScript. I wanted to know how my code would perform against the Classifier Node.js module. Both used Redis as the data store.

The quick answer is that both my code and the Classifier module achieved sub-second performance, classifying single keywords in ~300 milliseconds. This was a great morale boost, but the fun only lasted a while. It turns out that sometimes both implementations won’t spit out results on medium to large datasets. Being a newbie with Node.js, I didn’t know what to do. My guess is that it’s got something to do with memory, because neither implementation emitted the finish events. It could be a Node.js problem, or rather a problem with the redis and hiredis node modules.

This made me code in PHP again. I heavily modified the PHP implementation to get the data first and calculate later. I was surprised by the result: it took only ~0.01 second to classify a single keyword after the optimization was done. This gave me the idea to do the calculation in PHP and use Node.js + Socket.io as a frontend for JavaScript clients.
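The “get the data first, calculate later” idea can be illustrated with a small sketch. This is not the code from the repo: it is written in Python, `CountingKV` is a dict-backed stand-in for a key-value store such as Redis, and the key names (`total:<category>`, `count:<category>:<word>`) are made up for the example. It also assumes equal priors for all categories and simple add-one smoothing. The only point is to show how batching turns many round trips into one.

```python
import math


class CountingKV:
    """Dict-backed stand-in for a key-value store that counts round trips."""

    def __init__(self, data):
        self._data = dict(data)
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1
        return self._data.get(key, 0)

    def mget(self, keys):
        # One request returns many values, like a multi-get in a real store.
        self.round_trips += 1
        return [self._data.get(k, 0) for k in keys]


def _log_score(total, counts, vocab_size):
    # Add-one smoothing keeps unseen words from producing log(0).
    return sum(math.log((c + 1) / (total + vocab_size)) for c in counts)


def classify_one_by_one(kv, categories, words, vocab_size):
    """Fetch every counter with its own request, scoring as we go."""
    scores = {}
    for cat in categories:
        total = kv.get(f"total:{cat}")
        counts = [kv.get(f"count:{cat}:{w}") for w in words]
        scores[cat] = _log_score(total, counts, vocab_size)
    return max(scores, key=scores.get)


def classify_batched(kv, categories, words, vocab_size):
    """Fetch all counters in a single request, then do the math locally."""
    keys = []
    for cat in categories:
        keys.append(f"total:{cat}")
        keys.extend(f"count:{cat}:{w}" for w in words)
    values = kv.mget(keys)
    stride = len(words) + 1  # one total plus one count per word
    scores = {}
    for i, cat in enumerate(categories):
        chunk = values[i * stride:(i + 1) * stride]
        scores[cat] = _log_score(chunk[0], chunk[1:], vocab_size)
    return max(scores, key=scores.get)


data = {
    "total:food": 10, "count:food:nasi": 5, "count:food:goreng": 3,
    "total:cafe": 10, "count:cafe:kopi": 6,
}
slow, fast = CountingKV(data), CountingKV(data)
slow_result = classify_one_by_one(slow, ["food", "cafe"], ["nasi", "goreng"], 4)
fast_result = classify_batched(fast, ["food", "cafe"], ["nasi", "goreng"], 4)
```

Both functions pick the same category, but the first makes one request per counter while the second makes a single request in total, and that difference grows with the number of categories and keywords.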

Since it was really painless to do WebSocket with Socket.io, it took only a few minutes to produce the Node.js frontend, available here. In a subjective benchmark, it took 68 milliseconds to classify and deliver the result to JavaScript clients. This was a near-realtime result, and I had found my solution.

Last night was full of fiddling around with the algorithm, trying to get the best accuracy out of it, and between last night and today the PHP implementation has reached version 0.3.0. A coding session this afternoon led to a helper that produces a blacklist/stopwords from a collection of text. I couldn’t just import the most frequent words into the blacklist collection because that’s really subjective and depends on the language. Urbanesia’s data is a mix of Indonesian and English, so it will take more time to analyze. If there’s an acceptable automation method, I will share it in the repo.
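One way such a helper could work, purely as an illustration and not the helper from the repo, is to count in how many documents each word appears and list the words above some threshold as candidates for manual review. The function name `stopword_candidates`, the `min_doc_ratio` parameter and the sample texts below are all hypothetical.

```python
import re
from collections import Counter


def stopword_candidates(documents, min_doc_ratio=0.5):
    """Return words that appear in at least `min_doc_ratio` of the documents.

    The result is only a list of candidates: with mixed Indonesian and
    English text, a human still has to decide which ones are stopwords.
    """
    doc_freq = Counter()
    for text in documents:
        # A set, so each word is counted once per document.
        doc_freq.update(set(re.findall(r"[a-z]+", text.lower())))
    cutoff = min_doc_ratio * len(documents)
    return sorted(word for word, n in doc_freq.items() if n >= cutoff)


docs = [
    "makan di warung yang enak",
    "tempat makan yang murah",
    "coffee shop yang nyaman",
    "the best coffee shop",
]
candidates = stopword_candidates(docs)
strict = stopword_candidates(docs, min_doc_ratio=0.75)
```

In this toy sample the looser threshold also flags content words like “coffee” and “makan”, which is exactly why the most frequent words can’t be imported into a blacklist blindly.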

The conclusion of this project is to think less and do more. Algorithms for machine learning are available throughout the Internet. I mean, smart and talented developers before and after us will keep finding new ways to organize data; it’s the implementation that counts. Each problem has its own domain and I’m sure my code will not cater to all problems. However, learning by doing is also an excellent experience.

A Naive Bayes Classifier calculates probabilities by treating each keyword as independent of the other keywords being classified, so it’s really suited to mining preferences, related content, etc. But in cases where a group of keywords is actually what we want to know about, the Naive Bayes Classifier’s accuracy won’t be so great. This calls for another solution. If you have any ideas about this, please do comment; I would love to know what you think.
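To make the independence assumption concrete, here is a tiny Python sketch with made-up probabilities (the `food` table and the 0.5 prior are hypothetical). Each keyword contributes its own factor to the score, so the classifier cannot tell “nasi goreng” as a phrase apart from the same two words appearing separately or in reverse order.

```python
import math


def log_score(words, word_probs, prior):
    # log P(category) + sum of log P(word | category):
    # every word contributes independently of its neighbours.
    return math.log(prior) + sum(math.log(word_probs[w]) for w in words)


# Hypothetical per-word probabilities for a single "food" category.
food = {"nasi": 0.4, "goreng": 0.3, "kopi": 0.05}

phrase = log_score(["nasi", "goreng"], food, prior=0.5)
reversed_phrase = log_score(["goreng", "nasi"], food, prior=0.5)
same = math.isclose(phrase, reversed_phrase)
```

The two scores are the same because the score is just a product of per-word terms. Any signal that lives in the grouping of keywords is invisible to it.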

Cheers!

Windows 8 RTM & Visual Studio 2012 – Urbanesia on Windows 8

My first experience with Hello World was on an old 8088 XT that showed a primitive BASIC IDE to hack code in. Well, now the Urbanesia team, together with past members of the team, has created a native Windows 8 app for Urbanesia. We were in it from the start, when Windows 8 was seeded as a Developer Preview. Our first IDE was Visual Studio 11 Beta, which is now Visual Studio 2012.

Urbanesia is a BizSpark member and therefore we gain benefits such as being among the first to enjoy Microsoft products that have yet to be released publicly. Our MSDN account lets us download almost all of Microsoft’s commercial, development and enterprise products and use them for 3 years without complicated, expensive fees. I downloaded Windows 8 Pro and Visual Studio 2012 Ultimate.

Our Windows 8 Bootcamp back in August taught us the core of coding for Windows 8. It’s relatively easy and stress free if you’re already accustomed to Open Source flavours. We created a basic application in a few hours and learned how to effectively structure web services data. We gained all of that knowledge really quickly and mostly painlessly.

Windows 8 is now approaching its launch date, and Microsoft gave us an ARM tablet with Windows RT installed to test our development efforts. To be honest, the tablet is great but we didn’t know what to do with it. We used it mostly to test the upcoming iteration of Urbanesia’s web frontend. I got a really low spec Windows laptop and decided to install Windows 8 on it and do some development work for our app.

Let me tell you this: to develop for Windows 8, you must install Visual Studio 2012 on a Windows 8 device. Trying to develop for Windows 8 on anything older than Windows 8 will give you a friendly warning that you’re fucked. This friendly warning made me download a Windows 8 ISO image from Microsoft. It turned out that our MSDN subscription included Windows 8 RTM, so I followed through.

I installed Windows 8 on that crappy Windows laptop without any trouble and found the laptop’s performance acceptable when I logged in. That wasn’t the case with the Windows 7 installation. Microsoft did a great job with their new OS, really.

When Visual Studio 2012 was installed, I wasn’t expecting any trouble with our source code, but nothing great is produced without first encountering problems, right? To keep it simple, the application didn’t work at all. I spent the better part of my Sunday scouring Google for answers. Before the Manchester United game (which they won), I couldn’t find what wasn’t working.

FYI, we are coding in C# for our Windows 8 application.

After the game, I was gonna give up, but inspiration usually comes when you’re about to give up. I hacked my way into the source code again, and below is a list of gotchas to pay attention to when you’re converting older Visual Studio 11 Beta projects to Visual Studio 2012 projects:

  • After you realize that you’re fucked, close the solution for the project you’re working on and create a new project.
  • Close that new project you’ve just created and open up the primary project.
  • Copy paste your Common\StandardStyles.xaml file somewhere.
  • Open up explorer and navigate your way to the new project.
  • Go to the Common folder and copy paste everything to your primary project’s Common folder.
  • Now open up the newly pasted Common\StandardStyles.xaml and copy paste all of your previously created custom DataTemplates from your old copy.
  • Go to Shai Raiten’s excellent guide to upgrading Metro apps from Beta to RC here.
  • As you can see, there are a number of changes to naming conventions for classes, XAML styles and static class methods. For each XAML style item, do a Find/Replace. Yes, it sucks but it works. The same goes for classes and static class methods if you use any in your code.
  • Clean your solution and be hopeful. Run it now.
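If the manual Find/Replace gets tedious, the renames can also be scripted. The sketch below is only an illustration in Python: `OldTitleStyle` and `NewTitleStyle` are placeholders, not real style names from either Visual Studio version, so the actual old and new names have to come from the upgrade guide. It also assumes the files are UTF-8 and that the project is backed up or under version control first.

```python
import re
from pathlib import Path


def apply_renames(text, renames):
    """Replace whole identifiers in `text` according to the `renames` dict."""
    if not renames:
        return text
    # Longest names first, and \b so that partial identifiers are left alone.
    names = sorted(renames, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(lambda m: renames[m.group(1)], text)


def rename_in_tree(root, renames, suffixes=(".xaml", ".cs")):
    """Rewrite matching files under `root` and return the paths that changed."""
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            original = path.read_text(encoding="utf-8")
            updated = apply_renames(original, renames)
            if updated != original:
                path.write_text(updated, encoding="utf-8")
                changed.append(path)
    return changed


# Placeholder mapping; fill in the real old -> new names from the guide.
renames = {"OldTitleStyle": "NewTitleStyle"}
sample = 'Style="{StaticResource OldTitleStyle}" and OldTitleStyleExtra'
converted = apply_renames(sample, renames)
```

Calling `rename_in_tree` with the project folder and the filled-in mapping applies the same replacement to every .xaml and .cs file in one go.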

So what did this teach me? Microsoft is getting it right over time, but they don’t really like early adopters. They make us bleed with the BREAKING changes in Visual Studio 2012 and offer us only white papers, which I don’t like to read. Even the Microsoft Indonesia dev team was not aware of these changes. A sad fact, but it’s true.

However, I got help from Pak Risman, Microsoft Indonesia’s Developer & Platform Director. He showed me the right way to do things with code, which is something I understand.

To wrap things up, I’ll be submitting the app to the Windows Store soon and will hopefully satisfy the QA team over at Microsoft. Cheers!

Products & Technology

Tons of blogs and various other reading sources discuss products & technology from different perspectives and from different places. What is successful in one country (area) could be successful in other places, or it may fail horribly. The point is, products are handmade with technology as their driver, a fact that holds true anywhere in the world.

So what’s the deal here in Indonesia? Usually products & technology play catch-up with each other, depending on the product owner’s focus. But as a product becomes more and more mature, it is technology that drives all the innovation. This paragraph may sound like I’m proposing that a Minimum Viable Product (MVP) is the way to go, but I don’t think that is necessarily so.

The question that we should be asking ourselves is essentially about the product itself. What are we trying to build? What will it accomplish? Will it be a solution, or a chance for users to do things enjoyably? The million dollar question, as always: will it be “feasible” as a business?

I just read an article at VentureBeat about SAP’s initiative toward startups here. There’s a great quote that I feel needs to be built into every product and every effort to build one.

Build the Future

A bold quote and also the truth. Anyone in the tech scene must know SAP and their products, their enterprise magic to automate, index, manage and, most importantly, act upon the foundation of data (analytics) generated by their businesses. Their revenue is the kind of number any company would aim for, and the quote above is an affirmation of the company’s culture. Data abundance is surely a positive indicator of the future, right?

I am a very open person when it comes to the latest advancements in technology and, more often than not, I plunge into all kinds of sci-fi imaginings. Well, here in Indonesia, technology is generally advancing at a slower rate. We have cities like Jakarta, the capital, which no doubt has the most citywide fiber optic coverage. We also have Jogjakarta, where you can get free wifi while you eat at hawker stalls. But those two cities are both on the island of Java. I can’t speak for cities on other islands; I’ve never been there, so I’m not speculating. Information is steadily expanding to the places where the general population lives.

My point is, we need more people in the tech scene to have the guts to define who we are technologically as Indonesians. Yes, we need businesses but, as I said in earlier paragraphs, businesses need the technology. We are a nation of the curious and of tinkerers. On the criminal side, the outside world doesn’t trust credit cards issued in Indonesia because fraud levels are high here. This shows the kind of technical skills we have. It’s a matter of putting them to good use for the rest of the population, who don’t know shit about technology.

Startup incubators are flourishing in Indonesia, and they take technical and non-technical founders. They train and consult the founders to create awesome products through their startups. I want to highlight non-technical founders and call them toonies. Toonies are founders who have the same passion as Techies. Don’t think of my term as a put-down; it’s not. The more toonies, the more startups here, and that’s a good thing for all of us. Competition will surely create innovation, education for the users and, most importantly, a solid foundation for startups to become businesses.

At the moment, I’m at Urbanesia, which was founded by a toonie. As a team, we’ve had many differences in the past, and I’m sure we will in the future, but I believe this is a good thing as far as the startup is concerned. We learned organically how to manage teams and create products. Yes, right now the products are still finding their identity, but we’re getting there. Experiments and experience are taking us to new heights. There are so many things I’d love to share, and I will when the time is right. Fundamentally, the most difficult part of creating a product and the technology behind it is not the product itself, it’s 2 things: mindset and communication.

I can’t say how much I love being in a world that I’ve passionately loved since my first “Hello World“. To wrap things up, all the randomness above leads to this: there is no such thing as a perfect product or, even more miserably, a perfect technology to support products. Finding ways to cope is the bread and butter of any startup. My best thinking about product & technology is to just create, create, fail, create, succeed, create, fail, fail, create, succeed, succeed, fail, create, create and create. You just don’t give up. It sounds simple, but I know firsthand that it is nowhere near simple; that’s why mindset counts. Communication follows mindset, and when both are there, kickass products are there for the taking.

No product sucks; its creators do, and that’s a learning curve every product owner should recognize. It takes man-hours and a whole lot of effort to create, things that are usually taken lightly. Toonies or Techies, we’re in the business of delivering happiness to our users now, to build futures.

Urbanesia – Open Source & Microsoft

Today I was a speaker at Microsoft’s SQL on PHP event, and I’m sharing the slides from the presentation below. It was a fun moment of sharing experiences, laughter and geekdom.


Batista R Harahap [email protected]
Jl. Bango II/29C, Pondok Labu
Cilandak , DKI Jakarta , 12450 Indonesia
62817847023
