“Data cleaning” sounds like a combination of two of my least favorite things: math and cleaning. I’ve never been the sharpest at doing math in my head (calculators in phones are a direct blessing from God), and I think we can all agree that putting laundry away is one of the most tedious tasks in the world. Going into data cleaning, I definitely had some preconceived notions about what the process was going to look like, but I still tried to keep a positive outlook.
And to be honest, I’m so glad I kept an open mind, because I really enjoyed the process of cleaning data and looking at data sets as a whole. Everything I disliked about a data set I could remove (within reason, of course), and I found myself enjoying looking for patterns in each of the data sets I collected or created.
The first data set I looked at was from the United States Department of Agriculture, specifically their data on certified organic farmland acreage, livestock numbers and farm operations. I mentioned earlier that I thought data cleaning was going to be tedious, so I wanted to choose something I thought would be interesting, which is how I ended up picking some data about cows. How could I be bored when I was thinking about baby cows?
After cleaning up some of the data, this is the final chart I ended up with. I changed some key elements of the original chart that I think made it easier to read and understand. The original listed years sporadically from 1996 onward, but in my edited version I kept only the years that appeared consistently, which happened to be 2002-2008. As a result, I also had to update the change column to reflect the years I removed from the original. It’s really interesting to note how much the numbers fluctuate in almost every category from year to year. There’s never a steady increase or decrease, which is good information to have when you’re looking for patterns.
Things I would like to see answered with this data set include whether there are patterns between different types of cows, whether the total number of turkeys starts to decline as the number of people who don’t eat meat goes up, and why broilers are going up, even though I have no idea what those are.
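For anyone curious, the trimming step described above could be sketched in Python roughly like this. It is only a sketch under assumptions: the numbers are made-up placeholders rather than the USDA figures, the `beef_cows` column name and both helper functions are hypothetical, and the change column is assumed to be a percent change from the first kept year to the last.

```python
# Hypothetical rows: placeholder values, NOT the real USDA numbers.
rows = [
    {"year": 1997, "beef_cows": 100},
    {"year": 2002, "beef_cows": 140},
    {"year": 2003, "beef_cows": 120},
    {"year": 2005, "beef_cows": 150},
    {"year": 2008, "beef_cows": 130},
    {"year": 2011, "beef_cows": 170},
]


def keep_years(rows, first, last):
    """Keep only the rows whose year falls within [first, last]."""
    return [r for r in rows if first <= r["year"] <= last]


def percent_change(rows, column):
    """Percent change in `column` from the earliest kept year to the latest.

    Assumption: this is how the change column is defined.
    """
    ordered = sorted(rows, key=lambda r: r["year"])
    start, end = ordered[0][column], ordered[-1][column]
    return (end - start) / start * 100


kept = keep_years(rows, 2002, 2008)
change = percent_change(kept, "beef_cows")

print([r["year"] for r in kept])
print(round(change, 1))
```

The point of recomputing the change after filtering is that the old value would still be measuring from a year that is no longer in the chart.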
The next data set I cleaned up was from the United States Census Bureau, and it covered the number of same-sex couples in the United States from 2005-2017. I honestly was a little confused by this data set, because gay marriage wasn’t legalized nationwide in the United States until 2015, so I was left wondering how ten years of “married couples” could even exist on paper. I’m going to assume that these were gay couples who lived together and considered themselves married, even though they weren’t legitimized by the state, and filled out the census form that way. Which works for me, honestly.
I noticed a note at the bottom of this form that said the census questionnaire changed after 2009, and as a result a lot more gay couples have been counted since that time. With that in mind, I decided to cut the data down to 2010-2017, because the earlier data left out a lot of gay couples and seemed wrong as a result. I noticed that there wasn’t a big jump in the number of married gay couples in 2015, and I wondered why that is. Not knowing how the original information was gathered is an obstacle I wasn’t expecting. What questions were participants asked to have the data end up this way? Is there a way to clarify data if you don’t know how it was originally obtained? I’m not sure of the answer now, but I hope I can discover it in the future. Another question I’d like to answer is where this data is coming from. Are they polling mostly in large cities? I’d like to see a breakdown of rural couples vs. city couples.
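One way to check for a "jump" in a particular year is to look at the percent change from each year to the next. The sketch below assumes the pre-2010 rows have already been dropped; the counts are made-up placeholders, not Census Bureau figures, and `year_over_year` is a hypothetical helper.

```python
# Hypothetical counts by year: placeholder values, NOT Census Bureau figures.
married_couples = {
    2010: 100,
    2011: 110,
    2012: 121,
    2013: 130,
    2014: 143,
    2015: 150,
}


def year_over_year(counts):
    """Percent change from each year to the next, keyed by the later year."""
    years = sorted(counts)
    return {
        later: (counts[later] - counts[earlier]) / counts[earlier] * 100
        for earlier, later in zip(years, years[1:])
    }


changes = year_over_year(married_couples)
for year, pct in changes.items():
    print(year, round(pct, 1))
```

A year with a real jump would show a percent change clearly larger than its neighbors, which is easier to spot than by eyeballing the raw counts.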
For my own data set, I decided to take inspiration from the fact that the Tony Awards are on tonight. I got most of my information from Playbill.com. To begin, I compiled a chart detailing the winners in the musical categories between 2008 and 2017. I included nine of the possible categories musicals could win awards in, and when individual actors won, I associated them with the musical they were a part of.
As I was compiling this data, I already had a slew of questions brewing in my head that I could answer with this chart. Is there any correlation between winning a specific category and winning Best Musical? What are the chances of winning Best Musical if that musical hasn’t won (or has won) in any other categories? Which category is the most likely to have a winner that hasn’t won in any other category? This is so much more fun than math or putting away laundry I almost can’t believe it, wow. I can see this data potentially displayed in a bubble chart, a tree map or a flow chart. This is a formal apology to data cleaning and data sets: you’re nowhere near as bad as math, and I’m completely, officially excited to use you again in the future.
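A chart like this could be kept as one row per year and category, which makes it easy to ask things like how often the show that wins a given category also wins Best Musical. Here is a minimal sketch of that idea. The show names are placeholders ("Show A" and so on), not the actual winners, the category labels are only illustrative, and `best_musical_overlap` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical rows: (year, category, winning musical).
# Placeholder show names, NOT the actual Tony winners.
awards = [
    (2016, "Best Musical", "Show A"),
    (2016, "Best Score", "Show A"),
    (2016, "Best Choreography", "Show B"),
    (2017, "Best Musical", "Show C"),
    (2017, "Best Score", "Show C"),
    (2017, "Best Choreography", "Show C"),
]


def best_musical_overlap(awards):
    """For each category, the share of years its winner also won Best Musical."""
    # Look up each year's Best Musical winner first.
    best = {year: show for year, cat, show in awards if cat == "Best Musical"}

    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, cat, show in awards:
        # Skip the Best Musical rows themselves and years with no Best Musical row.
        if cat == "Best Musical" or year not in best:
            continue
        totals[cat] += 1
        if show == best[year]:
            hits[cat] += 1

    return {cat: hits[cat] / totals[cat] for cat in totals}


overlap = best_musical_overlap(awards)
print(overlap)
```

With only ten years of awards, shares like these are descriptive counts rather than real odds, but they are a quick way to see which categories tend to travel with Best Musical.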