How to Start a Data Science Project in Python
Many blog posts show complicated machine learning methods and cutting-edge technologies, putting Data Scientists around the world in a constant state of FOMO. But what are the fundamentals? How should you structure your project? What is the minimum set of tools you need? This post gives a few pointers for setting up your projects so that you reach Production-Ready Data Science as soon as possible.
Project structures often grow organically to suit people's needs, leading to different project structures within a team. You can consider yourself lucky if, at some point, you or someone on your team finds an obscure blog post with a somewhat sane structure and enforces it in your team.
Many years ago I stumbled upon ProjectTemplate for R. Since then I've tried to get people to use a good project structure. More recently DrivenData (what's in a name?) released their more generic Cookiecutter Data Science.
The main philosophies of those projects are:
- A consistent and well-organized structure allows people to collaborate more easily.
- Your analyses should be reproducible and your structure should enable that.
- A project starts from raw data that should never be edited; consider raw data immutable and only edit derived sources.
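The third principle can be sketched in code: a processing step reads from data/ and writes only to output/, never touching the raw files (the clean function and file names below are placeholders, not part of any real project):

```python
from pathlib import Path

RAW_DIR = Path("data")       # immutable: read-only raw data
OUTPUT_DIR = Path("output")  # derived: safe to delete and regenerate

def clean(text: str) -> str:
    """Placeholder transformation: collapse whitespace and lowercase."""
    return " ".join(text.split()).lower()

def process_file(name: str) -> Path:
    """Read a raw file and write the cleaned version to output/.

    The raw file is only ever read, never modified.
    """
    raw = (RAW_DIR / name).read_text()
    OUTPUT_DIR.mkdir(exist_ok=True)
    out_path = OUTPUT_DIR / name
    out_path.write_text(clean(raw))
    return out_path
```

Deleting output/ and rerunning the script should always reproduce the same derived files from the untouched raw data.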
I couldn't help inventing my own project structure, and my minimal structure looks something like this:
example_project/
├── data/                  <- The original, immutable data dump.
├── figures/               <- Figures saved by notebooks and scripts.
├── notebooks/             <- Jupyter notebooks.
├── output/                <- Processed data, models, logs, etc.
├── exampleproject/        <- Python package with source code.
│   ├── __init__.py        <- Make the folder a package.
│   └── process.py         <- Example module.
├── tests/                 <- Tests for your Python package.
│   └── test_process.py    <- Tests for process.py.
├── environment.yml        <- Virtual environment definition.
├── README.md              <- README with info of the project.
└── setup.py               <- Install and distribute your module.
You can find an example here.
It mostly follows the other structures:
- raw data is immutable and goes to data/
- processed data and derived output go to different folders such as figures/ and output/
- notebooks go to notebooks/
- project info goes in the README.md
- and the project code goes to a separate folder (exampleproject/).
I try to make a full-fledged Python package (plus tests) out of my project structure so that the step between prototyping
and productionizing is as small as possible.
setup.py allows me to install the package in a virtual environment and use it in my notebooks (more on this in a later blog post).
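A minimal setup.py for the structure above could look like this (the name, version and description are just placeholders):

```python
from setuptools import setup, find_packages

setup(
    name='exampleproject',
    version='0.1.0',
    description='Example Data Science project',
    packages=find_packages(exclude=['tests']),
)
```

Running pip install -e . in your activated environment installs the package in development mode, so from exampleproject import process works in any notebook while edits to the source code are picked up immediately.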
It doesn't really matter which structure you pick, as long as it fits your workflow and you stick with it for a while. Try to understand the philosophies of the projects and pick the structure that suits your needs.
Projects should be independent of each other: you don't want your new experiments to mess up your older work. We do this partly by putting the files of different projects in different folders but you should also use separate Python environments.
Virtual environments are isolated environments that separate dependencies of different projects and avoid package conflicts.
Each virtual environment has its own packages and its own package versions.
Environment A can have
numpy version 1.11 and
pandas version 0.18 while environment B only has
pandas version 0.17.
I like conda virtual environments because they're well suited for Data Science
(read here why).
Create a new conda virtual environment called
example_project with Python 3.5:
$ conda create --name example_project python=3.5
Make sure your virtual environment is activated (leave out the
source if you're on Windows):
$ source activate example_project
... and you're now ready to install your favourite packages!
$ conda install pandas numpy jupyter scikit-learn
When you're switching to a different project, run a
source deactivate and activate the project's virtual environment.
Once you get the hang of the
deactivate-flow, you'll find that virtual environments are a lightweight tool to
keep your Python environments separated.
By exporting your environment definition (i.e. all installed packages and their versions) to a file, your projects will also
be easily reproducible.
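With conda, exporting and recreating an environment is a two-command affair (the file name environment.yml matches the project structure above):

```shell
# Export the active environment, including all package versions:
$ conda env export > environment.yml

# Recreate the environment on another machine:
$ conda env create -f environment.yml
```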
If you want a more detailed discussion, check Tim Hopper's post.
Every project should have its own Git repository. Having a repo per project allows you to track the history of a project and maintain complex version dependencies between projects.
Alternatively, you can choose to have one repository with multiple projects, putting all the knowledge in a single place. The downside, however, is that it often ends up with ugly merge conflicts: Data Scientists are generally not that fluent with Git. In addition to a lot of Git frustration, it makes your projects less independent of each other.
The easiest way to get a Git repository is to create one on a host like GitHub and clone it:
$ git clone https://github.com/hgrif/example-project.git
You can then set up your project structure in this empty folder.
If you followed this guide and already created a folder with some files, first initialize a git repository on your machine:
$ git init
Then create a new git repository on your host, get its link and run:
$ git remote add origin https://github.com/hgrif/example-project.git
This adds the remote repository at
https://github.com/hgrif/example-project.git and names it origin.
You probably have to push your current
master branch to origin:
$ git push --set-upstream origin master
Add a .gitignore to your project directory
so that you don't accidentally add figures or data to your repository.
I generally start with a
.gitignore for Python and add the folders data/, figures/ and
output/ so that Git ignores these folders.
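A starting point for such a .gitignore could look like this (the folder names are taken from the structure above):

```
# Python artifacts
__pycache__/
*.py[cod]
.ipynb_checkpoints/

# Data and generated output: regenerate, don't commit
data/
figures/
output/
```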
Now that Git is set up, you can
git add and
git commit to your heart's content!
You can avoid some of the repetitive tasks by using some tooling!
The easiest way to use virtual environments is to use an editor like PyCharm that supports them.
You can also use autoenv or direnv to activate a
virtual environment and set environment variables when you
cd into a directory.
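With autoenv, for example, a one-line .env file in the project root is enough to activate the environment whenever you cd into the project (assuming autoenv is installed and the example_project environment from earlier exists):

```shell
# .env -- executed by autoenv on entering the directory
source activate example_project
```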
Having a good setup for your Data Science projects makes it easier for other people to work on your projects and makes them more reproducible. A good structure, a virtual environment and a git repository are the building blocks for every Data Science project.