BI Platform interviews Giovanni

16 Feb

BI Platform recently interviewed me about all things data science. I talked at length about how we do data science at our clients and shared tips on how to begin a career in the field.


Import partitioned Google Analytics data in Hive using Parquet

14 Feb

I was recently working on importing Google Analytics data into an Amazon EMR cluster. This post details the various issues you might run into if you try to do the same!


Moving from Excel to R

13 Feb

At Data Driven Commerce, Vincent Warmerdam talked about the value of open source data tools in a modern-day workplace.


GoDataDriven open source contribution: February 2017 edition

03 Feb

I find that, for a service company, we are quite active in the open source world. However, this remains pretty hidden in practice. So I thought I could start to change that by publishing, every once in a while, the various contributions we make to open source projects, both old and new.


Monitoring HBase with Prometheus

29 Jan

This article shows how to monitor HBase with Prometheus by exposing an HTTP server that serves JMX beans in the Prometheus metric format, and how to visualize the metrics in Grafana.


How to land a job in data science

27 Jan

I was recently invited to give a talk at the PyData Amsterdam meetup. There I talked about how you can land a job in data science.


How to write code using the Spark DataFrame API: a focus on composability and testing

27 Jan

I was recently thinking about how we should write Spark code using the DataFrame API. In this post I'll guide you through the different options.


Join us on February 23rd for Google Hashcode

23 Jan

A fun algorithms and optimization competition where you can solve real Google problems.


Use an SSH key to access your cloud resources with a SOCKS proxy

08 Jan

Securely access your cloud resources with a SOCKS proxy. This post shows how to create an SSH key, use the public key to create a new Linux machine, and configure a SOCKS proxy to access remote websites.


Solving hard data problems with causal data science

29 Dec

It is tempting for organizations to find biased answers in their data and draw faulty conclusions, such as confusing correlation with causation. Adam Kelleher, lead data scientist at BuzzFeed, emphasizes that this is not without risk.