Posts

Integrations of AWS services - Big Data

In computer science, a linked list is a linear collection of data elements whose order is not given by their physical placement in memory. Instead, each element points to the next. It is a data structure consisting of a collection of nodes which together represent a sequence.
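As a rough illustration of the idea above, here is a minimal sketch of a singly linked list in Python. The names `Node`, `LinkedList`, and `push_front` are hypothetical and chosen only for this example; they do not come from the post itself.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class Node:
    """One element of the list: a value plus a reference to the next node."""
    value: Any
    next: Optional["Node"] = None


class LinkedList:
    """Hypothetical minimal singly linked list, for illustration only."""

    def __init__(self) -> None:
        self.head: Optional[Node] = None

    def push_front(self, value: Any) -> None:
        # The new node points at the old head, then becomes the head itself.
        self.head = Node(value, self.head)

    def __iter__(self) -> Iterator[Any]:
        # Order comes from following the `next` references,
        # not from where the nodes happen to sit in memory.
        node = self.head
        while node is not None:
            yield node.value
            node = node.next


items = LinkedList()
for value in (3, 2, 1):
    items.push_front(value)

print(list(items))  # [1, 2, 3]
```

Inserting at the front only rewires one reference, which is why it takes constant time regardless of the list's length.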

Just Pandas Things

This post isn't a friendly tutorial for beginners, but a friendly introduction to pandas weirdness: 1) pandas is column-major, which is why row-based operations are slow; 2) SettingWithCopyWarning, or why we can't have nice things; 3) indexing and slicing; 4) accessors; 5) data exploration; 6) common pitfalls.

Build a distributed system with multiple nodes of Spark, Kafka, Hadoop, and YARN on Linux

PC hardware is now so cheap that buying a couple of extra machines and wiring them into the same computing pool could make a very cost-effective expansion. This is what we are going to build, and we're going to use Ubuntu Linux to do it. Linux can take cluster computing tasks like these in its stride, and you don't need to fork out for a licence for every machine.

R Programming - Training Ensemble Models with Full CPU Cores

R supports parallel computation through its core parallel package. The doParallel package provides a parallel backend built on top of it. The caret package is used for developing and testing machine learning models in R. This package, like others such as plyr, supports multicore CPU speedups if a parallel backend is registered before the supported functions are called.