Random thoughts.: 2015

I think I have found my favorite tool when it comes to clustering - ELKI. Lots of algorithms, minimal interface, and well implemented. I have a few cribs about the UI/UX, but overall it's one of the best options out there. It also helps that it has a great implementation of one of my favorite algorithms - OPTICS.

If you are in the habit of using Weka, for small tasks or maybe because you started off with it, be wary of using it for clustering. Recently, I had to use Weka for cluster analysis for legacy reasons, and I am far from being a happy customer. This short post provides some instances where Weka and ELKI gave different results on the same problem.

DBSCAN

The following images show the clusters discovered in the same dataset using eps = 0.5 and minpts = 5 (these are parameters to the DBSCAN algorithm). The original dataset has 31 clusters. Different clusters discovered are represented by different colours.

Left: Weka - 1 cluster, right: ELKI - 24 clusters

As you can see, for the same settings of parameters Weka thinks of the whole dataset as one cluster, while ELKI discovers 24 clusters. In fact this happens to be true for a range of these parameters. The following heatmaps show the number of clusters as eps and minpts are varied.

Number of clusters. Left: Weka, right: ELKI

Weka sees a maximum of 2 clusters - as denoted by the white strip to the extreme left of the plot. ELKI sees quite some variation - at the bottom left there are around 3000 clusters, while for the rest of the plot it is close to 1 (I know it looks like 0, but the shades for 0 and 1 aren't distinguishable visually and the minimum number of clusters can be 1).

If you think about it, what ELKI reports makes sense - at some setting of the parameters you would expect each point in the dataset to be a cluster. Since there are 3100 points in this dataset, ELKI sees about as many clusters in the bottom left corner. Hence, Weka seems to report incorrect results.
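To make the role of eps and minpts concrete, here is a minimal brute-force DBSCAN sketch in plain Python. It is not ELKI's or Weka's code, and the eight toy points are made up for illustration; the only thing it is meant to show is that the cluster count should move from "every point on its own" to "one big cluster" as eps grows.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    # Brute-force eps-neighbourhoods; each point counts as its own neighbour.
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:  # core point: keep expanding
                queue.extend(neighbours[j])
    return labels

def n_clusters(labels):
    return len(set(labels) - {NOISE})

# Hypothetical toy data: two well-separated 2x2 squares.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

for eps in (0.5, 1.5, 20):
    print(eps, n_clusters(dbscan(points, eps, min_pts=1)))
```

With min_pts = 1 this reports 8, 2 and 1 clusters for the three eps values. A correct implementation run on the real dataset should show the same kind of variation, which is what the ELKI heatmap does.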

I tried comparing Weka and ELKI on another dataset, this time comparing cluster purity for a range of parameters. Again, in the case of Weka we see that the purity doesn't vary much, whereas for ELKI it varies over a wide range (as you'd expect).

Cluster purity. Left: Weka, right: ELKI
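For reference, cluster purity can be computed as in the sketch below: each cluster votes for its most common true label, and purity is the fraction of points that agree with their cluster's vote. This is the usual textbook definition and not necessarily the exact script behind the plots; the labels in the example are invented, and treating noise as just another cluster is an assumption.

```python
from collections import Counter, defaultdict

def purity(true_labels, cluster_labels):
    """Fraction of points whose true label is the majority label of their cluster."""
    members = defaultdict(Counter)
    for truth, cluster in zip(true_labels, cluster_labels):
        members[cluster][truth] += 1
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(true_labels)

# Hypothetical ground truth and three candidate clusterings.
truth = ["a", "a", "a", "b", "b", "b"]
one_big_cluster = [0, 0, 0, 0, 0, 0]
perfect = [0, 0, 0, 1, 1, 1]
mixed = [0, 0, 1, 1, 1, 1]

print(purity(truth, one_big_cluster))
print(purity(truth, perfect))
print(purity(truth, mixed))
```

Note that a clustering that lumps everything into one cluster gets a purity equal to the share of the largest class, so a purity that barely moves across parameter settings is consistent with the clustering itself barely changing.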

The responses to this question on Stack Overflow suggest that this happens in Weka because automatic normalization of distances is imposed. Even if this is true, I would still call the implementation incorrect - normalization here should not be a default, and the documentation does not warn you about it.
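If normalization is indeed the cause, the effect is easy to reproduce. The sketch below assumes min-max scaling of every attribute to [0, 1] (my assumption about what "normalization" means here, not something I have verified in Weka's source) and uses a made-up grid of points. On the raw data eps = 0.5 is tiny compared to the spacing between points; after scaling, the same eps covers half the range of each attribute.

```python
from math import dist

def min_max_scale(points):
    """Rescale every coordinate to [0, 1] independently (assumed normalization)."""
    dims = list(zip(*points))
    lows = [min(d) for d in dims]
    spans = [(max(d) - min(d)) or 1.0 for d in dims]  # avoid dividing by zero
    return [tuple((x - lo) / s for x, lo, s in zip(p, lows, spans))
            for p in points]

def neighbour_fraction(points, eps):
    """Fraction of ordered pairs of distinct points that lie within eps."""
    n = len(points)
    close = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and dist(points[i], points[j]) <= eps
    )
    return close / (n * (n - 1))

# Hypothetical data: a 9x9 grid with spacing 5, so raw coordinates span 0..40.
raw = [(x, y) for x in range(0, 41, 5) for y in range(0, 41, 5)]

print(neighbour_fraction(raw, eps=0.5))
print(neighbour_fraction(min_max_scale(raw), eps=0.5))
```

On the raw grid no two points are within eps of each other. After scaling, a large share of all pairs are, so any density-based method would chain everything into a single cluster. That matches the "1 cluster" picture from Weka.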

k-means

I was a little surprised that the k-means implementation in Weka is buggy. This is one of the simpler algorithms out there!

On the same dataset I've used above, running k-means with k=32 doesn't terminate. It does not terminate even when you set the maximum number of iterations to 500. I had the clustering run for around 40 min before I decided to kill it.

ELKI gave me this in under a second:

k-means, with k = 32

Unlike the problems with DBSCAN, this problem seems to be dataset-specific. For a bunch of datasets I tried, k-means worked as expected.
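For comparison, here is a bare-bones sketch of Lloyd's k-means with an explicit iteration cap. It is not Weka's or ELKI's implementation, and the data and seed are invented. The point is only that with a cap in place, the loop cannot run longer than max_iter passes over the data, so a run that goes on for 40 minutes on 3100 points is not something the algorithm itself explains.

```python
import random
from math import dist

def kmeans(points, k, max_iter=500, seed=0):
    """Lloyd's algorithm. Returns (centroids, assignment, iterations_used)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct data points as seeds
    assignment = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: nearest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            return centroids, assignment, iteration  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment, max_iter  # hit the cap without converging

# Hypothetical toy data: two well-separated 2x2 squares.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

centroids, assignment, iterations = kmeans(points, k=2)
print(assignment, iterations)
```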

Performance

I guess a discussion on performance is moot if your results are incorrect. But here are some numbers (from the ELKI website) further strengthening the case for ELKI. I have highlighted DBSCAN and OPTICS since those were the algorithms I was interested in. The other numbers are equally impressive.

ELKI Benchmarks

With this we come to the end of this short post. If you haven't used ELKI yet, I hope I have convinced you to give it a shot!

I recently came across the book The Watchmaker of Filigree Street. While it is an enjoyable read overall, the bit that interested me is a particular skill of one of the lead characters, "Mori". Mori can see the future. Mori's faculties are so sensitive that people do not have to do things to make a particular future feasible; they only need to intend to do certain things. Mori would know what future would result if those intentions were executed. Also, since there are multiple possibilities about how the future can roll out, Mori is good at guessing only when one of the possibilities seems dominant. For example, Mori can guess that a die is about to fall, but he cannot guess what face it would show because the outcome is truly random.

Now switch to the movie Push (2009), starring Chris Evans when he still wasn't Captain America. It is about mutants with different kinds of superpowers. We are introduced to a category called Watchers. Watchers can see the future. Here too, we have the notion of many possible futures, and what a Watcher sees changes based on what happens in the present. We meet the "Pop Girl", a powerful Watcher, who, like Mori, needs people only to decide on doing something before she can see the relevant future.

(You are a computer geek if you thought of the word "stack" just because I have said 'push' and 'pop' in the same paragraph. Like I did. ;) )

The reason I mention these characters together is that both stories have people with no clairvoyant powers whatsoever trying to outsmart them. I think this is an interesting setup. Think about it - your enemy is prescient; when you even think of a strategy to fight them, they already know. How do you fight someone like that?

Interestingly, both stories deal with this differently:

Mori vs Grace: Grace relies on randomness. When she travels she allows coin flips to do a lot of the decision-making. At a particular point in the story she needs a package to be carried by a device (trying hard to avoid spoilers here) - and she succeeds because the device has the capability to occasionally move randomly. Mori knows something is up, even that there is a package on the move, but he is at a loss to guess precisely. Remember how he can see dice falling but can't guess outcomes? This is exactly what happens now.

Pop Girl vs Nick: Nick relies on neither he nor his team knowing the plan till the very last moment. Knowing the plan leads to intentions, and intentions lead to the Pop Girl seeing the future. So if you have a plan that is already set in motion, but you have nothing to do with it since you do not know about it yet, Pop Girl does not see it as a future you are involved in.

How does Nick pull this off? He thinks up a plan - on the other side Pop Girl starts seeing a future - writes letters to the members in his team, including himself, detailing out the part of the plan they are to execute, with instructions to everyone to read their letters at predetermined times. Pop Girl does not see the whole plan yet. Nick then has his memory wiped out starting from the time when he thought of the plan. The memory-erasure is done by another kind of mutant - who is also instructed to hand over the self-addressed letter to Nick just after the erasure session.

Once Nick's memory is gone, so are his intentions around the now forgotten plan - and Pop Girl stops seeing a definitive future for Nick or his team. At a later point in time, when everyone has seen his or her part of the plan, it is possibly too late for the Pop Girl to do anything (she can tune into a future immediately, but she cannot translocate immediately). Note that just reading a few instructions in a letter is probably not as potent as realizing how they fit into the overall plan - blind instructions do show you some kind of a future, but your understanding of how the whole thing works, which leads to very specific intentions, is better fodder for Pop Girl. This adds another layer of vagueness that a Watcher must contend with.

Both stories leave many questions unanswered and, since they play around with time, (possibly) have loopholes. For example:

Mori's visions of a future where he has picked up a new skill imbue him with that skill now. So if in the future he is to speak flawless English, he starts speaking English now. How very Grandfather Paradox-y!
Why doesn't Pop Girl fall back to using her visions from the time before Nick had himself erased? Also, erasing would have been an intention - so she would have known that her visions before the erasing were good to go on with.

I am sure one can come up with possible explanations. But leaving details aside, and thinking of these ideas as only high-level suggestions instead of fleshed-out strategies, I liked how two different approaches - randomness and "just-in-time" plans were explored in the stories.


Saturday, December 19, 2015

Clustering: ELKI over Weka


Monday, August 17, 2015

Mori and Pop Girl - how do you fight someone who knows the future?
