Stats Adventure Week 1: A Review – The Unmotivated Millennial

I’d be lying if I said I didn’t need to at least brush up on a couple of really basic concepts, just to validate my assumptions. I believe the stats class I took in college pretty much focused on the normal distribution. This is just a review of basic concepts, which should be pretty straightforward or familiar.

Let’s start with the few definitions that are needed to move forward: mean, median, variance, and standard deviation.

Mean

The average. Add up all the values you have and divide by the number of data points. I think everyone knows this already.
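Just to make it concrete, here’s a minimal sketch in Python. The numbers in `values` are made up for illustration, and I’m assuming the standard library’s `statistics` module is good enough for a review like this.

```python
from statistics import mean

# Made-up sample data, purely for illustration.
values = [1, 1, 4, 6, 8]

# By hand: add everything up and divide by the number of data points.
manual_mean = sum(values) / len(values)

# Same thing using the standard library.
library_mean = mean(values)

print(manual_mean, library_mean)
```

Both approaches should agree, which is really the whole point of the sketch.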

Median

The point at which 50% of the data falls below and 50% falls above. In a set of (1,2,3,4,5), the median is 3. In a set of (1,1,4,6,8), the median is 4. And so on. In a perfectly normal distribution, the median is equal to the mean. The median is more valuable when dealing with distributions that aren’t normal, because a few extreme values can drag the mean around while barely moving the median.
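Here’s a quick sketch of the two sets above, plus a third one I made up (`skewed`) to show why the median is handy when the data is lopsided. It assumes Python’s built-in `statistics` module.

```python
from statistics import mean, median

first = [1, 2, 3, 4, 5]
second = [1, 1, 4, 6, 8]

# Hypothetical lopsided data: same as `second`, but with one huge outlier.
skewed = [1, 1, 4, 6, 80]

print(median(first))   # 3
print(median(second))  # 4

# The outlier drags the mean way up, but the median stays put.
print(mean(skewed))    # 18.4
print(median(skewed))  # 4
```

With an even number of data points there is no single middle value, so `median` averages the two middle ones.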

Variance (σ^2)

Variance is a measure of how spread out your data is. If you have data all over the place, the variance will be high. Conversely, if you have data that’s really close together, the variance is low. It’s calculated by taking the average of the squared distances from the mean.

In other words, take each data point, subtract the mean, and square the result. Now add all of these squared values together and divide by the number of data points, and you’ll be left with your variance. This is important because of what we can derive from it, which is…
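The recipe above can be sketched roughly like this. The data is made up, and I’m assuming the "divide by the number of data points" version, which Python’s `statistics` module calls `pvariance` (population variance).

```python
from statistics import pvariance

# Made-up data for illustration.
values = [1, 2, 3, 4, 5]

# Step 1: find the mean.
mean = sum(values) / len(values)

# Step 2: subtract the mean from each point and square the result.
squared_distances = [(x - mean) ** 2 for x in values]

# Step 3: average the squared distances.
manual_variance = sum(squared_distances) / len(values)

print(manual_variance)    # 2.0
print(pvariance(values))  # same value from the standard library
```

One thing to watch for: `statistics.variance` (no `p`) divides by one less than the number of data points, which is the version meant for samples, so it gives 2.5 here instead of 2.0.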

Standard Deviation (σ)

It’s the square root of the variance. It’s helpful because variance is measured in units^2, which is kind of abstract, so standard deviation measures how spread out your data is in the same human-readable units as the data itself.
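As a small sketch (same made-up data as you might use for variance, and again assuming the population versions from Python’s `statistics` module), the standard deviation is just a square root away:

```python
import math
from statistics import pstdev, pvariance

# Made-up data for illustration.
values = [1, 2, 3, 4, 5]

# Standard deviation is the square root of the variance.
manual_std = math.sqrt(pvariance(values))

# The standard library can also do it in one step.
library_std = pstdev(values)

print(round(manual_std, 4))   # 1.4142
print(round(library_std, 4))  # 1.4142
```

If the values were, say, heights in centimeters, the variance would be in square centimeters while the standard deviation would be back in plain centimeters.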

The Normal Distribution

AKA the bell curve. This distribution shows up when we’re measuring things like the height of a population or the weight of a box of cereal. The most important things to know about this distribution are that its mean is in the center, and the empirical rule: nearly all of the data (99.7%) will fall within three standard deviations of the mean.

The other important piece to know about the normal distribution is that you are able to determine how much data falls within a certain number of standard deviations of the mean. For example, about 68% of the data falls within one standard deviation, 95% within two, and 99.7% within three.
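If you want to check those percentages yourself, here’s a minimal sketch using `statistics.NormalDist` (available in Python 3.8+). The helper name `share_within` is mine, not anything standard.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

So the 68 / 95 / 99.7 figures are rounded versions of what the normal curve actually gives you.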

Z-Score

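There’s this other thing called a z-score, which measures how many standard deviations a data point is away from the mean. To get it, take the data point, subtract the mean, and divide by the standard deviation. So it could be 0.3 standard deviations, 1.8, etc., and it’s negative when the point is below the mean. This is helpful because there’s this thing called a z-table that can tell you how much of the data in a normal distribution falls below that z-score.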
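Here’s a rough sketch of a z-score in action. The height numbers (mean of 170, standard deviation of 10, data point of 188) are made up, and I’m using `statistics.NormalDist` (Python 3.8+) as a stand-in for looking the value up in a z-table.

```python
from statistics import NormalDist

# Hypothetical numbers: heights in cm with mean 170 and standard deviation 10.
population_mean = 170
population_std = 10
height = 188

# z-score: how many standard deviations the point sits from the mean.
z = (height - population_mean) / population_std
print(z)  # 1.8

# What a z-table would tell you: the share of data below that z-score,
# assuming the data really is normally distributed.
share_below = NormalDist().cdf(z)
print(round(share_below, 4))  # 0.9641
```

In other words, under these assumptions a height of 188 would be taller than roughly 96% of the population.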

After talking to people in the industry, it seems like z-scores or z-tables aren’t really used, but I remember there was a big focus on this stuff in school so I thought it was worth a mention.

That’s the quick review of some basic concepts. I don’t think any of this should be new or unfamiliar (if it is, I don’t know what to tell you, really…). More to come.

Happy St. Patty’s Day!
