In my never-ending effort to translate all of that linear algebra I toiled over in college into functioning Java code (make fun of it on github), I came across one operation that gave me a perfect opportunity to use recursion: computing the determinant of a square matrix.

First, we’re going to talk a little bit about the determinant, so we know what computation we’re dealing with. After that, we’ll implement it recursively in Java in the `Matrix` class––this version is readable, but computationally expensive for large matrices. If you want to cut to the chase and see the code, feel free to check out the full `Matrix` class on github. …
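The actual implementation in the post is Java, but the recursive idea––cofactor (Laplace) expansion along the first row––fits in a few lines of Python. This is a minimal sketch for illustration, not the `Matrix` class API; the function name `det` and the list-of-lists representation are my own choices here:

```python
def det(m):
    """Determinant via cofactor expansion along the first row.

    Readable but O(n!): each call spawns n recursive calls on
    (n-1)x(n-1) minors, which is why it's expensive for large matrices.
    """
    if len(m) == 1:            # base case: 1x1 matrix
        return m[0][0]
    total = 0
    for j in range(len(m)):
        # Minor: drop row 0 and column j
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        # Signs alternate +, -, +, ... across the first row
        total += (-1) ** j * m[0][j] * det(minor)
    return total

det([[1, 2], [3, 4]])  # → -2
```

The factorial blow-up is exactly why this version is fine for teaching but not for production; practical libraries use LU decomposition instead.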

The way I’m currently twiddling my thumbs and waiting for the world to end is building my very own linear algebra package in Java (make fun of it on github). It’s a good excuse to revisit Euclidean vector spaces through a programming lens, and to figure out if I really understand them as well as I like to pretend I do.

I always had to double- and triple-check my computations when multiplying matrices on timed linear algebra exams (because I was prone to making mistakes while trying to take shortcuts), so in my head I had built it up as something much worse than it actually turned out to be. …

It’s really easy for me to take linear algebra for granted when doing data science. Whether I’m overlooking the specifics of eigendecomposition in Principal Component Analysis, completely forgetting that every weighted sum is actually a dot product between a vector of inputs and a vector of coefficients, or sticking my fingers in my ears to avoid hearing the “tensor” in TensorFlow, I feel like I’ve been shortchanging my degree in math by not getting into the weeds and looking at the implementations.
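That weighted-sum point is easy to verify in two lines. The numbers below are toy values, purely for illustration:

```python
inputs = [2.0, 3.0, 5.0]   # feature values
coeffs = [0.1, 0.4, 0.5]   # model weights

# The "weighted sum" written out term by term...
weighted_sum = inputs[0] * coeffs[0] + inputs[1] * coeffs[1] + inputs[2] * coeffs[2]

# ...is exactly the dot product of the two vectors
dot = sum(x * w for x, w in zip(inputs, coeffs))
assert weighted_sum == dot
```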

Although there’s nothing in the definition of a vector space that requires vectors to be represented as lists of numbers, we’ll be focusing on vectors that *do* take this form. We’re also going to focus only on vector operations involving vectors and scalars––no matrices yet. …
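As a preview of what “vectors as lists of numbers” buys us, the two workhorse operations (vector addition and scalar multiplication) become componentwise one-liners. The names `add` and `scale` here are hypothetical, not necessarily the ones used in the package:

```python
def add(u, v):
    """Componentwise vector addition; u and v must share a dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return [a + b for a, b in zip(u, v)]

def scale(c, v):
    """Multiply every component of v by the scalar c."""
    return [c * a for a in v]

add([1, 2, 3], [4, 5, 6])  # → [5, 7, 9]
scale(2, [1, 2, 3])        # → [2, 4, 6]
```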

It’s easy to understand why toy datasets are so popular with shiny new data scientists. Whether you’re implementing KNN from scratch on the classic Iris dataset, getting your feet wet on Kaggle with the famous Titanic survivor dataset, or using the MNIST dataset to dive into PyTorch, having a single spreadsheet with a clean, well-studied collection of observations can allow you to focus your efforts on understanding the mechanics of the algorithm, model, or library, instead of on menial tasks like merging and cleaning your data or dealing with substantial numbers of missing values.

That said, there’s one big problem with toy datasets. Do you really *need* to use machine learning to predict whether water pumps in Tanzania are functioning, when you already have the labels, and would discover their status when collecting enough information for your model to classify new pumps? Is there really a *good reason* to predict housing prices in Boston in 1978 with linear regression, when those houses are long-sold and modern markets are totally incomparable? Are you really going to pretend that you’re going to perform a chemical analysis to determine wine quality, instead of just drinking it straight from the box eight hours into a Netflix binge? …

Confession: my personal experience is almost the complete opposite of the title of this article. I actually started with C++ in college, moved to Java to teach AP Computer Science A, and then entered Python territory to work with all of the snazzy data science libraries available. Now that I’m working on a Java project again (a neural network package that is really pushing my understanding of the underlying math to its limits), I’m really paying attention to the little differences in how the languages work. …

AP Statistics seems to love dotplots. They’re easy to make by hand, they quickly give you an idea of what your distribution looks like, and they don’t require any real planning or number-crunching before diving in––compare this to histograms, which require you to know how high the bar is going to be before you draw it, and how many bins you want to use. There’s a simple one-to-one correspondence between observations and dots on the plot, so they’re easy to understand and easy to produce by hand in a testing environment.

When I was teaching the early units in AP Statistics, I got sick of making dot plots by hand, and I wound up turning to a variety of online tools that simply weren’t giving me exactly what I wanted: a no-frills, minimal, straightforward plot like this…
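For reference, the whole trick to a dotplot is the stacking rule: the k-th repeat of a value sits at height k. Here’s a sketch of that rule plus a bare-bones matplotlib wrapper––this is illustrative code, not the actual tool from the post:

```python
from collections import Counter

def stack_dots(data):
    """Turn raw observations into (x, y) dot positions:
    the k-th occurrence of a value is drawn at height k."""
    xs, ys = [], []
    for value, count in sorted(Counter(data).items()):
        xs.extend([value] * count)
        ys.extend(range(1, count + 1))
    return xs, ys

def dotplot(data):
    # Imported here so stack_dots stays usable without matplotlib
    import matplotlib.pyplot as plt
    xs, ys = stack_dots(data)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_yticks([])                       # no-frills: hide the y-axis
    for side in ("left", "top", "right"):   # and most of the frame
        ax.spines[side].set_visible(False)
    return fig, ax
```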

Two fields that often get left on the sidelines in conversations about data science are **Information Theory**, which studies the quantification, storage, and communication of information, and **Coding Theory**, which studies the properties of codes and their respective fitness for specific applications. A wonderful introduction to the two fields is *Information and Coding Theory* by Gareth A. Jones and J. Mary Jones. …

Every new programmer has gone through it. You get an error and all progress screeches to a halt. Maybe you go out on a limb and ask another programmer for help; maybe you post a question on a message board. At some point, you’ll get the slap in the face: “Just f***ing Google it.”

Before diving into strategies about effectively utilizing Google to find a solution for your problem, let’s take a moment to unpack this advice — because it *is* advice, and not just a dismissal.

If you don’t know what the “it” represents in, “Just f***ing Google it,” you’re going to have a difficult time finding the “it” in, “It worked!” …

In my last year of teaching, I was teaching three AP STEM courses at the same time — both AP Statistics and AP Computer Science Principles were new to me that year, but I had taught AP Computer Science A the previous year. I took a risk and invested time into learning how to do statistics with Python, in an attempt to create an overlap between two of my preps. In the end, it definitely paid off for me: I found a new passion and am now pursuing a career in data science.

This blog is the first in a series of standalone posts dedicated to solving basic AP-style statistics problems from scratch using Python. I will keep the use of specialized libraries to a minimum. It is important to keep in mind that only parts of the statistical pipeline can (or should) be automated, and that computing specific values is just one step in problem-solving — interpreting and applying your results is the ultimate goal. Many students tend to get so far into the weeds with number-crunching that they lose sight of the bigger picture. By automating the calculations with code, we can focus on what those calculations actually tell us. …

When a shiny new student of data science or statistics first wanders into the land of hypothesis testing, one of the first conceptual hurdles they’ll have to grapple with is the difference between the *population distribution*, the *sample distribution*, and the *sampling distribution* (of the statistic of interest). This post will explore these three concepts visually by looking at the distribution of heights in Rosner’s FEV dataset (obtained from http://biostat.mc.vanderbilt.edu/DataSets). As usual, the dataset and code can be found on my github.

Our question of interest is: What is the mean height of children involved in the study?

Before we dive in, we need to note something important. When performing hypothesis tests, we want to understand a pattern in the *population of interest*. In practice, obtaining data about the entire population is not generally possible. In this example, however, we will consider our “population” to be only the 654 children involved in the study. Our goal is explicitly not to draw conclusions about the height of children in general, but rather to use sample distributions and the sampling distribution of the mean to estimate information about the entire dataset. Drawing conclusions about the heights of children in the dataset instead of about children in general will allow us to check our work — at the end, we can examine the population distribution to see how effective our methods are. Hopefully, by seeing these methods in action in a controlled environment, we’ll be able to understand their use in cases where we can’t directly check our work. …
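To make the repeated-sampling idea concrete before touching the real data, here’s a standard-library sketch. The “population” below is synthetic, height-like stand-in data (mean 61, sd 5.7––numbers I made up for illustration), not the actual FEV heights:

```python
import random
import statistics

def sample_means(population, n, draws, seed=0):
    """Approximate the sampling distribution of the mean by repeatedly
    drawing samples of size n and recording each sample's mean."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, n)) for _ in range(draws)]

# Synthetic "population" of 654 height-like values (inches)
rng = random.Random(1)
population = [rng.gauss(61, 5.7) for _ in range(654)]

means = sample_means(population, n=30, draws=2000)
# Any one sample's distribution is ragged, but the sample means cluster
# around the population mean with spread roughly sigma / sqrt(n)
```

Because we hold the whole “population” in memory, we can check the punchline directly: the mean of `means` lands very close to `statistics.mean(population)`, and its spread is much narrower than the population’s.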
