Stats Adventure Week 2: Probability

Chapter 1

Literally chapter 1. We should all have a general concept of probability: it's effectively a function that yields some value between 0 and 1. Random variables come in two flavors, discrete and continuous, and the distinction changes how probabilities are calculated. Let's talk about these first.

Types of Probabilities

Discrete: Any data that takes finitely many (or countably infinitely many) values is considered discrete. This means anything that can be counted. Think of the number of customers that sign up, or the number of cars that pass by my house.

Continuous: Variables that can take any value in a range are considered continuous. This mostly has to do with measurement. An example would be the amount of snow that falls. Did 3 inches fall, or 3.01? Or 3.001? Or 3.000001? etc.

Probability Density Function

When dealing with discrete variables with equally likely outcomes, we can calculate probabilities by taking (# of successes)/(# of total outcomes). For example, what's the probability of heads on a fair coin? There are 2 equally likely outcomes (heads or tails), and therefore, 1/2 is the probability.
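That counting rule is simple enough to sketch in a few lines of Python. This is just an illustration of the (# of successes)/(# of total outcomes) idea; the die example is mine, not from the text above.

```python
from fractions import Fraction

def probability(successes, total_outcomes):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(successes, total_outcomes)

# Fair coin: 1 success (heads) out of 2 outcomes
p_heads = probability(1, 2)
print(p_heads)  # 1/2

# Rolling an even number on a fair six-sided die: 3 successes out of 6 outcomes
p_even = probability(3, 6)
print(p_even)  # 1/2
```

Using Fraction keeps the answer exact instead of a rounded decimal.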

However, when dealing with continuous variables, it's different. Take the snow example above. What's the probability it will snow exactly 3 inches? Not 2.99 inches, or 3.01 inches, exactly 3? It's effectively 0, because it might snow close to 3 inches, but it's extremely unlikely it will snow exactly 3.

This is where density functions come in. A probability density function (pdf) is a curve that describes how probability is spread across values. You can't read it like a normal plot though, because for any single value of x, the probability is 0. To get a probability from it, you take an integral, i.e. the area under the curve over a range. In the snow example, we can measure the probability of between 2.99 and 3.01 inches, which will be > 0. I'll leave it there because I don't feel like going into calculus. Just remember that area under the curve = probability with a pdf.
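Here's a small sketch of the area-under-the-curve idea, approximating the integral numerically so no calculus is required. The snowfall distribution is an assumption on my part (normal with a hypothetical mean of 3 inches and standard deviation of 0.5 inches), purely for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def area_under_pdf(a, b, mu, sigma, steps=10_000):
    """Approximate the integral of the pdf over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(normal_pdf(a + (i + 0.5) * width, mu, sigma) * width for i in range(steps))

# A single point has zero width, so zero area -> zero probability
p_exact = area_under_pdf(3.0, 3.0, 3.0, 0.5)

# A small interval around 3 inches has a small but nonzero probability
p_range = area_under_pdf(2.99, 3.01, 3.0, 0.5)

print(p_exact, p_range)
```

The exact-3-inches "probability" comes out to 0, while the 2.99-to-3.01 window comes out small but positive, which is the whole point of the pdf.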

Summary

Discrete counts. Continuous measures. PDFs aren't just a dumb file type that you can't edit. Area under the curve. That's pretty much what I took away from these sections. These are foundations for the next piece, which will go into parametric distributions. Specifically, the Binomial, Poisson, Normal, and Exponential distributions. More to come.

 

Stats Adventure Week 1: A Review

I’d be lying if I said I didn’t need to at least touch up on a couple of really basic concepts, just to validate my assumptions. I believe the stats class I took in college pretty much focused on the normal distribution. This is just a review of basic concepts which should be pretty straightforward or familiar.

Let’s start on the few definitions that are needed to move forward: mean, median, variance, and standard deviation.


Mean

The average. Add up all the values you have and divide by the number of data points. I think everyone knows this already.

Median

The point at which 50% of the data falls below and 50% falls above.  In a set of (1,2,3,4,5), the median is 3. In a set of (1,1,4,6,8), the median is 4. And so on. This should be equal to the mean in a perfectly normal distribution. This is more valuable when dealing with distributions that aren’t normal.

Variance (σ^2)

Variance is a measure of how spread out your data is. If you have data all over the place, the variance will be high. Conversely, if you have data that’s really close together, variance is low. It’s calculated by taking the average squared distances to the mean.

In other words, take each data point, subtract the mean, and square the value. Now add all of these squared values together and divide by the number of data points and you’ll be left with your variance. (Technically that’s the population variance; sample variance divides by n − 1 instead, but the idea is the same.) This is important because of what we can derive from it, which is…

Standard Deviation (σ)

It’s the square root of variance. It’s helpful because variance is in units^2, which is kind of abstract, so standard deviation puts the spread back in the original, human-readable units.
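All four of these definitions are one-liners with Python's built-in statistics module. The data set below is just an example I made up; pvariance/pstdev are the population versions (divide by n, matching the description above), while statistics.variance/stdev would divide by n − 1.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)           # sum of values / number of values
median = statistics.median(data)       # middle value (or average of the middle two)
variance = statistics.pvariance(data)  # average squared distance to the mean
std_dev = statistics.pstdev(data)      # square root of the variance

print(mean, median, variance, std_dev)
```

For this data set the mean is 5, the median is 4.5, the variance is 4, and the standard deviation is 2.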


The Normal Distribution

AKA the bell curve.  This distribution shows up when we’re measuring things like the height of a population or the weight of a box of cereal. The most important parts of this distribution are that its mean sits in the center, and the empirical rule: nearly all of the data (99.7%) will fall within three standard deviations of the mean.

The other important piece to know about the normal distribution is that you are able to determine how much data falls within a certain number of standard deviations from the mean. For example, 68% of data falls within one standard deviation. 95% within two. 99.7% within three.
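You can verify those 68/95/99.7 figures yourself using the standard normal CDF, which Python's math.erf gives you directly (the erf-based formula is a standard identity, not something from the text above):

```python
import math

def normal_cdf(z):
    """P(Z < z) for a standard normal variable, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Fraction of data within k standard deviations of the mean
for k in (1, 2, 3):
    within = normal_cdf(k) - normal_cdf(-k)
    print(f"within {k} sd: {within:.1%}")
```

Running this prints roughly 68.3%, 95.4%, and 99.7%, so the empirical rule is just a rounded version of these exact values.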

Z-Score

There’s this other thing called a z-score which effectively measures how many standard deviations a data point is away from the mean. So it could be 0.3 standard deviations, 1.8, etc. This is helpful because there’s this thing called a z-table that can tell you how much data falls below that z-score.
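The z-score formula and the z-table lookup are both easy to do in code; math.erf plays the role of the table. The height numbers below (mean 70", standard deviation 3") are hypothetical, just to have something concrete:

```python
import math

def z_score(x, mean, std_dev):
    """How many standard deviations x is from the mean."""
    return (x - mean) / std_dev

def normal_cdf(z):
    """Fraction of data falling below z -- what a z-table would tell you."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical heights: mean 70", sd 3". How unusual is 75"?
z = z_score(75, 70, 3)   # about 1.67 standard deviations above the mean
below = normal_cdf(z)    # about 95% of the data falls below 75"
print(round(z, 2), round(below, 2))
```

This is also why z-tables feel dated in practice: any stats library (or one line of math.erf) replaces the lookup.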

After talking to people in the industry, it seems like z-scores or z-tables aren’t really used, but I remember there was a big focus on this stuff in school so I thought it was worth a mention.


That’s the quick review of some basic concepts. I don’t think any of this should be new or unfamiliar (if it is, I don’t know what to tell you, really…). More to come.

Happy St. Patty’s Day!

I don’t know, but it’s ok because it doesn’t have any swears in it.

The Time has Come…

The End of an Era

So the day I’ve been dreading is finally here. 4 years as an analyst and I’ve been able to skirt by with my (extremely) limited knowledge of statistics. Every now and then over the course of my professional career, a stats problem or question would come my way and I’d be able to say enough words like ‘distribution’ or ‘bias’ to squeak out of trouble before I started babbling like an idiot.

Over the past 4 years, I’ve been able to work on my skills as an analyst. Specifically, business acumen and programming. I can make a business case and support it with persuasive data, dashboard like a boss, and build a slick automation that would have left 2014 me in awe. I knew how to do exactly none of these when I first joined the workforce and often impress myself with how far I’ve come.

However, I’m finding myself lacking compared to some of my peers. I’ve noticed that many of the high performers in the industry not only have business understanding AND programming, but they are well versed in statistical concepts.

That being said, I’m obviously at a disadvantage here. Which means I’ve either got to stay where I am and hope for the best, or do something to better myself and grow. So here’s the verdict: I’m going to learn stats.

By learning stats I don’t mean things like what a mean, median, or standard deviation is, either. I’m talking full-on probability using density functions, Bayesian stats, time series, and Markov chains. I hear these terms so much and nod blankly at the person talking about them. No more. I’m gonna do it, and I’m going to do it in two months.

Why two months? Summer’s right around the corner and once that warm weather hits I’ll have zero motivation to stay inside and bury my face in equations and strange symbols. I want to golf and grill and hike and camp and cruise with the windows down.

Which is what brings me to the point of this blog. I’ve tried learning stats before. It’s not easy. The concepts are difficult and retention requires practice. I’m going to take what I learn each {time period TBD} and summarize it here so future me can go back and remember it without relearning. If you’re interested in Data Science or Analytics, maybe you’ll find it interesting… eh, maybe not.

Either way, it’s gonna happen.
