Data Munging in Pandas - Tricks and Pitfalls
July 15, 2016
Data munging is the least sexy part of "the sexiest job of the 21st century". According to authoritative popular sources like @BigDataBorat, data scientists spend 80% of their time cleaning up data. So we need to be really good at it!
pandas is an awesome Python library for manipulating any data that fits in a spreadsheet-like format. If you're familiar with R, it's R's data frame concept implemented in Python. Since the goal of data munging in data science is often to generate a matrix to be fed into a machine learning model, pandas is a valuable tool for data scientists who use Python.
At the beginning of my time at Metis's Data Science Bootcamp, I decided to learn everything I could about pandas by diving deep into the docs. The docs are good, but there is a very steep learning curve and the syntax can be tricky sometimes. At Metis, I got the nickname "The Panda King" because I helped my fellow students solve their pandas problems. There was even a Slack emoji made (thanks, Hannah).
Tutorial: Pandas Tricks & Pitfalls
I did a workshop for my class at Metis called "Pandas Tricks & Pitfalls", and the Jupyter Notebooks are available on GitHub for anyone who wants to go through and learn more about pandas. There are two notebooks: the main notebook, pandas_tricks.ipynb, which has examples and explanations, and the worksheet notebook, pandas_exercises.ipynb, with exercises you can try as you follow along. Please contact me if you find anything confusing or vague. I'm also always happy to answer your pandas questions!
Start here: https://github.com/IanLondon/pandas_tricks
Covered in Pandas Tricks and Pitfalls:
- Understanding groupby in pandas
- Working with dates and times
- Binning with resample and cut
- MultiIndexing
- Some plotting tricks with matplotlib and seaborn
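
For a quick taste of three of these topics, here is a minimal sketch of groupby, resample, and cut on a tiny made-up table. It is not taken from the notebooks: the DataFrame, its column names (timestamp, store, sales, sales_bin), and the bin edges are all hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration (not from the notebooks)
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2016-07-01", "2016-07-02", "2016-07-09", "2016-07-10"]
        ),
        "store": ["A", "B", "A", "B"],
        "sales": [10, 20, 30, 40],
    }
)

# groupby: split the rows by store, then sum the sales within each group
sales_by_store = df.groupby("store")["sales"].sum()

# resample: bin rows by time period; this needs a DatetimeIndex,
# so the timestamp column is moved into the index first ("W" = weekly bins)
weekly_sales = df.set_index("timestamp")["sales"].resample("W").sum()

# cut: bin a numeric column into labeled intervals, here (0, 25] and (25, 50]
df["sales_bin"] = pd.cut(df["sales"], bins=[0, 25, 50], labels=["small", "large"])

print(sales_by_store)
print(weekly_sales)
print(df)
```

Roughly speaking, resample is groupby for time periods, and cut does for arbitrary numeric ranges what resample does for dates. The notebooks go into much more depth on each.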