Sparky Dots
Tuesday, August 2, 2016
AWS EC2 spot pricing
Wednesday, July 27, 2016
Saturday, June 25, 2016
Weird Spark bug?
1.5.0-cdh5.5.0

scala> df.filter("ad_market_id = 4 and event_date = '2016-05-23'").show
+----------+------------+
|event_date|ad_market_id|
+----------+------------+
+----------+------------+

scala> df.filter("ad_market_id = 4").filter("event_date = '2016-05-23'").show
+----------+------------+
|event_date|ad_market_id|
+----------+------------+
+----------+------------+

scala> df.filter("ad_market_id = 4").orderBy("event_date").filter("event_date = '2016-05-23'").show
+----------+------------+
|event_date|ad_market_id|
+----------+------------+
|2016-05-23|           4|
+----------+------------+
Tuesday, March 22, 2016
Home Depot Kaggle competition started
I am running some cleaning, spell-checking, and initial feature generation on my 33-node AWS Spark cluster.
I might not be able to put a lot of effort into it, but I will make sure I make at least one submission with basic features.
Thursday, January 7, 2016
Merger trait - common functionality of merging networks
Saturday, December 5, 2015
Concept diagram for Machine Translation
Thursday, November 5, 2015
Conditional probability on partitioned space
Monday, August 31, 2015
Clean tmux cheat sheet
Thursday, August 20, 2015
Google Deep Dream Generator
I thought this image might work well:
Friday, June 5, 2015
Spark MLlib Review
Iterative methods are at the core of Spark MLlib. Given a problem, we guess an answer, then iteratively improve the guess until some stopping condition is met (Krylov subspace methods are one example of this pattern). Improving an answer typically involves a pass over all of the distributed data, aggregating some partial result on the driver node. This partial result is some representation of the model, for instance an array of numbers. The stopping condition can be some sort of convergence of the sequence of guesses, or reaching the maximum number of allowed iterations.
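To make the pattern concrete, here is a minimal sketch in plain Python (no Spark), assuming a least-squares model fitted by ordinary gradient descent rather than a Krylov method. The "partitions" are just lists standing in for distributed data, and the names partial_gradient and fit are made up for this illustration.

```python
def partial_gradient(partition, weights):
    """Per-partition work: sum the squared-error gradients of the local rows."""
    grad = [0.0] * len(weights)
    for features, label in partition:
        error = sum(w * x for w, x in zip(weights, features)) - label
        for i, x in enumerate(features):
            grad[i] += error * x
    return grad


def fit(partitions, weights, step=0.3, max_iter=5000, tol=1e-9):
    """Improve the guess until it stops changing or max_iter is reached."""
    n = sum(len(p) for p in partitions)
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # One pass over all the data: a partial result per partition ...
        partials = [partial_gradient(p, weights) for p in partitions]
        # ... aggregated into a single array of numbers on the "driver".
        total = [sum(column) for column in zip(*partials)]
        new_weights = [w - step * g / n for w, g in zip(weights, total)]
        change = max(abs(a - b) for a, b in zip(new_weights, weights))
        weights = new_weights
        if change < tol:  # convergence of the sequence of guesses
            break
    return weights, iterations


# Toy data for y = 1 + 2x, split across two "partitions".
# Each row is ((bias term, x), y).
partitions = [
    [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0)],
    [((1.0, 2.0), 5.0), ((1.0, 3.0), 7.0)],
]
weights, iterations = fit(partitions, [0.0, 0.0])
print(weights, iterations)
```

The model here is just the two-element weights list, which is the kind of small partial result that is cheap to aggregate on a driver after every pass.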
Thursday, May 7, 2015
Batcher's odd-even merging network
I couldn't find a closed-form formula for odd-even network node partner calculation. The only available implementations were recursive and not very elegant. Here is the code that was provided on Wikipedia.
So I decided to work out a simpler and more intuitive solution to odd-even merge-based sorting network partner calculation, and here it is:
Also, here I put up a little interactive sorting network generator. Of course, I updated that Wikipedia article, to make it easier for learners :)
Here is the best performance analysis of this network that I could find.
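For readers who want something runnable, here is a minimal sketch of a commonly used non-recursive way to list the comparators of an odd-even merge sorting network. It is not the partner formula mentioned above, the function names are made up for this illustration, and n is assumed to be a power of two.

```python
from itertools import product


def oddeven_merge_sort_pairs(n):
    """List the (low, high) comparator pairs, in order, for n inputs.

    Assumes n is a power of two.
    """
    pairs = []
    p = 1
    while p < n:          # p: size of the sorted runs being merged
        k = p
        while k >= 1:     # k: distance between compared elements
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    # Only compare elements that fall inside the same merge block.
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        pairs.append((i + j, i + j + k))
            k //= 2
        p *= 2
    return pairs


def apply_network(pairs, values):
    """Run the compare-exchange operations over a copy of values."""
    out = list(values)
    for low, high in pairs:
        if out[low] > out[high]:
            out[low], out[high] = out[high], out[low]
    return out


pairs = oddeven_merge_sort_pairs(8)
print(len(pairs))
print(apply_network(pairs, [5, 7, 1, 8, 2, 6, 3, 4]))
```

By the zero-one principle, a comparator network that sorts every sequence of 0s and 1s sorts every input, so checking all 256 binary inputs is enough to validate the 8-input network.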
Tuesday, May 5, 2015
Digit recognition with Multiclass SVM on Spark MLlib
To test this multi-class classifier, we can try it on the handwritten digit recognition problem. Get the handwritten digits data from here. Accuracy is only 74% with 100 iterations. Maybe it can't get much better with this construction. A different way of constructing multi-class classifiers from binary SVMs is to use pairwise (one-vs-one) schemes with some adjustments as described here, and also another method described here. The scikit-learn SVM classifier performs better out of the box (with an RBF kernel, accuracy is in the high 90s), but the sklearn implementation is not scalable. Hopefully Spark MLlib will be able to beat this in the future, when more sophisticated (high-level abstraction) ML pipeline API features come online.
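As a rough illustration of the pairwise (one-vs-one) idea, here is a minimal sketch in plain Python. It trains one binary classifier per pair of labels and predicts by majority vote. The binary learner is a nearest-centroid stand-in, not an SVM, and all function names are made up for this example.

```python
from collections import Counter
from itertools import combinations


def train_centroid_binary(samples_a, samples_b, label_a, label_b):
    """Stand-in binary learner (NOT an SVM): pick the label of the nearer centroid."""
    def centroid(samples):
        return [sum(column) / len(samples) for column in zip(*samples)]

    center_a, center_b = centroid(samples_a), centroid(samples_b)

    def predict(x):
        dist_a = sum((xi - ci) ** 2 for xi, ci in zip(x, center_a))
        dist_b = sum((xi - ci) ** 2 for xi, ci in zip(x, center_b))
        return label_a if dist_a <= dist_b else label_b

    return predict


def train_one_vs_one(data_by_label, train_binary=train_centroid_binary):
    """Train one binary classifier for every unordered pair of labels."""
    return [
        train_binary(data_by_label[a], data_by_label[b], a, b)
        for a, b in combinations(sorted(data_by_label), 2)
    ]


def predict_one_vs_one(classifiers, x):
    """Each pairwise classifier casts one vote; the most voted label wins."""
    votes = Counter(classifier(x) for classifier in classifiers)
    return votes.most_common(1)[0][0]


# Toy 2-D data with three classes (hypothetical, for illustration only).
data = {
    0: [(0.0, 0.0), (0.0, 1.0)],
    1: [(5.0, 5.0), (5.0, 6.0)],
    2: [(10.0, 0.0), (10.0, 1.0)],
}
classifiers = train_one_vs_one(data)
print(len(classifiers), predict_one_vs_one(classifiers, (9.0, 1.0)))
```

With k classes this scheme trains k(k-1)/2 binary classifiers, so the ten digits would need 45 of them, each trained only on the rows of its two classes.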
For comparison, here are some results with tree classifiers. With RandomForest (30 trees, Gini, depth 7) accuracy goes up to 93%. Adding extra second-order interactions (Spark doesn't support kernels in classification yet, but here is a simple feature transformation that adds second-order feature interactions) and increasing the allowed tree depth to 15 brings accuracy to 97%. So there is a lot of room for improvement in the multiclass-to-binary classifier reduction.
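One way such a transformation could look is sketched below, assuming "second-order interactions" means appending every pairwise product of the original features, squares included. This is an illustration in plain Python, not the transformation linked above, and the function name is made up.

```python
from itertools import combinations_with_replacement


def add_second_order(features):
    """Return the original features followed by all pairwise products x_i * x_j, i <= j."""
    x = list(features)
    return x + [a * b for a, b in combinations_with_replacement(x, 2)]


print(add_second_order([2.0, 3.0]))  # [2.0, 3.0, 4.0, 6.0, 9.0]
```

A vector of d features grows to d + d(d+1)/2 entries, so for image-sized inputs the expanded vector gets large quickly.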
Saturday, March 23, 2013
specialized memory
I tried the task myself and could not keep track of more than five numbers—and I was given much more time than the brainy ape. In the study, Ayumu outperformed a group of university students by a wide margin. The next year, he took on the British memory champion Ben Pridmore and emerged the "chimpion."
The Brains of the Animal Kingdom http://online.wsj.com/article/SB10001424127887323869604578370574285382756.html
Sunday, December 3, 2006
Our galaxy: 1 out of ~125,000,000,000.
There are hundreds of billions of stars in a galaxy and there are hundreds of billions of galaxies out there.
Some estimate that there are ~40,000,000,000,000,000,000,000 stars. I don't know how to even comprehend such a huge number.
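A quick sanity check on those two figures, taking the ~125 billion galaxies from the title and the ~40 sextillion stars from the estimate above (both are rough estimates, so the result is only an order of magnitude):

```python
galaxies = 125_000_000_000                        # ~125 billion, from the title
stars_total = 40_000_000_000_000_000_000_000      # ~4 x 10^22, from the estimate

# Implied average number of stars per galaxy.
stars_per_galaxy = stars_total // galaxies
print(f"{stars_per_galaxy:,}")  # 320,000,000,000
```

That is about 320 billion stars per galaxy on average, which agrees with "hundreds of billions of stars in a galaxy".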
Others even came up with results that there are ~10 stars for every grain of sand on all of Earth's beaches.
Can I just say that the Universe is mindbogglingly huge and we are so insignificant on that scale?