Make Your Own Toy Dataset
It’s easy to understand why toy datasets are so popular with shiny new data scientists. Whether you’re implementing KNN from scratch on the classic Iris dataset, getting your feet wet on Kaggle with the famous Titanic survivor dataset, or using the MNIST dataset to dive into PyTorch, having a single spreadsheet with a clean, well-studied collection of observations lets you focus your efforts on understanding the mechanics of the algorithm, model, or library, instead of on menial tasks like merging and cleaning your data or dealing with a substantial number of missing values.
That said, there’s one big problem with toy datasets. Do you really need to use machine learning to predict whether water pumps in Tanzania are functioning, when you already have the labels, and would discover their status when collecting enough information for your model to classify new pumps? Is there really a good reason to predict Housing Prices in Boston in 1978 with linear regression, when those houses are long-sold and modern markets are totally incomparable? Are you really going to pretend that you’re going to perform a chemical analysis to determine wine quality, instead of just drinking it straight from the box eight hours into a Netflix binge?
The point I’m trying to hammer home is that toy datasets are tools that serve a very specific purpose: they provide you with predictable outcomes that allow you to quickly determine whether you’re implementing that library, model, or algorithm correctly. But when you try to step outside of that very specific purpose, you find that toy datasets are not an accurate representation of data science as a field. When doing a real-world data science project, you have to deal with datasets that have not been groomed and curated, so toy datasets will only get you so far.
…unless, of course, you build one yourself.
If you’re interested in completely missing the point of this post, feel free to grab my personal toy dataset from GitHub. It consists of over 3000 measurements (length, width, depth, and weight) of two species of captive sea turtle hatchlings collected over nearly 20 years by researchers at NOAA’s Galveston laboratory.
This has been my go-to toy dataset when I want to implement a new algorithm from scratch to understand how it works, compare different models, or explore a new library.
As convenient as this dataset now is for quick little experiments, the real value came from cleaning and curating it myself. The entire process was mired in ambiguity: I knew I wanted a toy dataset, but I didn’t immediately know what I wanted to do with it, which meant a huge amount of EDA just to understand what I was working with. There were a few anomalies where a measurement had been recorded in different units, but at first it looked like the values had simply been omitted. Or that the turtles were just very, very flat. Or perfectly cube-shaped, because columns had been duplicated for some entries. And one batch of abnormally large hatchlings turned out not to be hatchlings at all: they were around the same size as turtles several months older, even though they were the first entry in that collection of measurements, meaning they weren’t actually measured as hatchlings.
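Checks like these are easy to automate once you know what to look for. Here’s a minimal sketch in pandas of the kinds of sanity checks described above, using made-up column names and values (these aren’t the actual headers from the NOAA files):

```python
import pandas as pd

# Hypothetical hatchling measurements; the column names and values are
# illustrative, not taken from the real dataset.
df = pd.DataFrame({
    "length_cm": [4.8, 5.1, 50.2, 4.9],   # one row recorded in mm
    "width_cm":  [3.9, 4.0, 40.1, 3.8],
    "depth_cm":  [2.1, 2.2, 2.1, 0.0],    # a "very, very flat" turtle
})

# A crude unit check: anything several times the column median is suspect
# and was probably recorded in millimeters instead of centimeters.
suspect_units = df[df["length_cm"] > 5 * df["length_cm"].median()]

# Physically implausible zeros were probably missing values in disguise.
suspect_zeros = df[(df == 0).any(axis=1)]

# Duplicated columns (the "perfectly cube-shaped" turtles): transpose so
# that duplicated() compares whole columns against each other.
duplicated_cols = df.T.duplicated()
```

None of these thresholds are definitive; the point is that each anomaly you find by eyeballing the data can be turned into a reusable check.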
The individual CSV files weren’t formatted identically, and documentation was scarce, so I had to dive into academic articles to figure out what the mysterious column headers corresponded to. That detour taught me a lot about both how turtles are measured and how tracking devices are fitted, and it gave me an excuse to google all sorts of bizarre pictures while trying to figure out how those measurements are obtained (spoiler alert: calipers, not measuring tape wrapped around the entire turtle like my idiot self had imagined).
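Reconciling inconsistently formatted files usually boils down to mapping every header variant onto one canonical schema before concatenating. A sketch of that pattern, with invented header variants standing in for the real files:

```python
import io

import pandas as pd

# Two in-memory "files" with inconsistent headers, standing in for the
# real CSVs. Both the header variants and the canonical names here are
# my own invention, not the actual NOAA column names.
file_a = io.StringIO("SCL,SCW,Mass\n4.8,3.9,28.1\n")
file_b = io.StringIO("carapace_length,carapace_width,weight_g\n5.1,4.0,30.4\n")

# Map each file's headers onto one canonical schema.
canonical = {
    "SCL": "length_cm", "carapace_length": "length_cm",
    "SCW": "width_cm",  "carapace_width": "width_cm",
    "Mass": "weight_g", "weight_g": "weight_g",
}

frames = [pd.read_csv(f).rename(columns=canonical) for f in (file_a, file_b)]
turtles = pd.concat(frames, ignore_index=True)
```

The hard part isn’t the code, it’s building the `canonical` mapping, which is exactly where the literature dive comes in.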
In other words, when you’re building a toy dataset, you can’t just look at the properties of the dataset itself, you have to understand the context in which the data appears.
If you’re new to data science and strapped for creativity when it comes to what to add to your portfolio, I can’t recommend building your own toy dataset enough. It might not be a project that will disrupt a multi-billion-dollar market, revolutionize an industry, or shatter the status quo, but you’ll know that data front to back and feel some real ownership over your analysis.
You’ll gain experience hunting down data, merging multiple datasets, cleaning, finding anomalies, rescaling when necessary, subsetting to the useful stuff, and making executive decisions about what makes it into the final product. Once you’ve got your toy dataset, you can step through tutorials in a totally unique context, with results you have to interpret and evaluate on your own, because that analysis hasn’t already been done by everyone else. And the more you use your toy dataset, the more you’ll get used to what it means to train and compare multiple models.
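The “train and compare” loop can be as simple as running a few candidate models through the same cross-validation split. A sketch with scikit-learn, using synthetic stand-in measurements (the distributions below are made up, not fitted to the real turtle data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for two species whose length/weight distributions
# overlap slightly; the means and spreads here are invented.
rng = np.random.default_rng(0)
n = 200
X = np.vstack([
    rng.normal([4.8, 28.0], [0.3, 2.0], size=(n, 2)),   # "species A"
    rng.normal([5.4, 33.0], [0.3, 2.0], size=(n, 2)),   # "species B"
])
y = np.repeat([0, 1], n)

# Score each candidate with the same 5-fold cross-validation.
for model in (KNeighborsClassifier(), LogisticRegression(max_iter=1000)):
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{type(model).__name__}: {scores.mean():.2f}")
```

Because you curated the data yourself, you’re the one who has to decide whether a given accuracy is actually good, which is precisely the skill toy datasets normally let you skip.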
In other words, even though you’re working with a toy dataset, you’re doing real data science.