Lesson 3

Outliers/influential points

Why This Matters

Imagine you're trying to figure out how much ice cream people eat based on how hot it is outside. Most people eat more ice cream when it's hot and less when it's cold. But what if one person eats a whole gallon of ice cream every day, no matter the weather? Or what if someone eats no ice cream even on the hottest day? Unusual data points like these, called **outliers** and **influential points**, can really mess up your predictions! They're like that one weird ingredient in a recipe that changes the taste of the whole dish. If you don't handle them carefully, they can badly distort your statistical models (your predictions).

In statistics, especially when we're looking at how two things relate (like temperature and ice cream sales), these special points can either just be 'different' (outliers) or they can actually 'pull' our entire prediction line towards them (influential points). Knowing the difference and how to spot them helps us make much better and more reliable predictions about the world.

Key Words to Know

Outlier — A data point that is unusually far away from the general pattern of the other data points, especially in the vertical (y) direction.

Influential Point — A data point that, if removed, would significantly change the slope or y-intercept of the least-squares regression line.

Leverage — The potential for a data point to influence the regression line, which is higher for points with x-values far from the mean of the x-values.

Least-Squares Regression Line — The 'best-fit' straight line that minimizes the sum of the squared vertical distances (residuals) from the data points to the line.

Residual — The difference between the actual observed y-value and the y-value predicted by the regression line (observed y - predicted y).

Residual Plot — A scatterplot of the residuals against the explanatory variable (x), used to check the appropriateness of the linear model and identify outliers.

Explanatory Variable (x) — The variable that is thought to explain or predict changes in the response variable.

Response Variable (y) — The variable that measures the outcome of a study or the variable being predicted.
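
To make these definitions concrete, here is a minimal Python sketch that computes the least-squares line, the residuals, and the leverage of each point for a tiny data set. The data values and the function names (`fit_line`, `residuals`, `leverages`) are made up for illustration and are not part of the lesson. The leverage formula is the standard one for a line with a single explanatory variable.

```python
# Made-up data for illustration only (x = explanatory, y = response).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residuals(xs, ys):
    """Residual = observed y - predicted y, one value per data point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def leverages(xs):
    """Leverage of each point: 1/n + (x - x_bar)^2 / Sxx.

    It depends only on the x-values and grows as x moves away from the mean.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]


slope, intercept = fit_line(xs, ys)
res = residuals(xs, ys)
lev = leverages(xs)

print(f"line: y-hat = {intercept:.2f} + {slope:.2f}x")
for x, y, r, h in zip(xs, ys, res, lev):
    print(f"x={x}  y={y}  residual={r:+.2f}  leverage={h:.2f}")
```

In this sketch the points at x = 1 and x = 5 get the highest leverage because they are farthest from the mean x-value of 3, whatever their y-values are.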

What Is This? (The Simple Version)

Think of it like a tug-of-war game where you're trying to draw a straight line through a bunch of dots on a graph. Most dots are pretty close to each other, but then you have a few dots that are way out there.

Outlier: This is a data point that just doesn't fit in with the rest of the data. It's like that one kid in a class of 12-year-olds who is 6 feet tall. They're unusual, but they might not necessarily change the average height of the class that much.
- On a scatterplot (a graph showing how two things relate), an outlier is a point that is far away from the general pattern of the other points in the y-direction (up and down). It has a weird 'response' value compared to what we'd expect.
Influential Point: This is a point with real pulling power! It's like that one super strong person on a tug-of-war team who can pull the entire rope (and everyone on it) far to their side. An influential point changes the slope or y-intercept of your regression line (the 'best fit' line you draw through the dots) quite a bit.
- An influential point is usually an outlier in the x-direction (left and right), meaning it has an unusually high or low 'explanatory' value. Because it's so far out on the x-axis, it has a lot of 'leverage' to pull the line towards it. It only uses that leverage if it falls away from the trend of the other points; a far-out point that sits right on the trend leaves the line almost unchanged.

So, an outlier is just 'different', but an influential point is 'different AND powerful' because it can really mess up your predictions!

Real-World Example

Let's say you're a teacher trying to figure out if there's a relationship between how many hours students study for a test and their score on the test. You collect data from 10 students.

Most students who study 2 hours get around a 70, students who study 4 hours get around an 85, and students who study 6 hours get around a 95. This looks like a nice, positive relationship: more study time usually means higher scores.

Now, imagine two special students:

Student A (Outlier): This student studied for 5 hours (which is pretty normal, not super high or low study time), but they only got a score of 30! Maybe they were sick, or just had a really bad day. This point (5 hours, 30 score) would be far below the line of other students. It's an outlier in the y-direction (their score is unusually low for their study time).
Student B (Influential Point): This student studied for 15 hours (way more than anyone else!) and got a perfect 100. Because they studied so much more than everyone else, their data point is far to the right on your graph. This point (15 hours, 100 score) has a lot of 'leverage'. The trend from the other students gains roughly 6 points per extra hour, so extending it to 15 hours would predict a score far above 100 (which is impossible). Student B's actual score of 100 sits well below that extended line, so including this student pulls the right end of the 'best fit' line downwards and makes it flatter. Your model would then suggest that each extra hour of studying helps less than it really does for typical students. This student is an outlier in the x-direction and is likely an influential point because their extreme x-value gives them a lot of power to change the line's slope.
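
The effect of each student can be checked numerically. The sketch below is only an illustration: the eight 'typical' students are invented data that roughly match the pattern described above, and their hours were chosen so that the mean study time is exactly 5 hours (Student A's study time). That choice makes the contrast as clean as possible; with real data, a y-outlier near the mean would usually nudge the slope a little rather than not at all.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical data: eight typical students (hours studied, test score).
hours = [2, 3, 4, 5, 5, 6, 7, 8]
scores = [70, 78, 85, 88, 90, 95, 97, 99]

student_a = (5, 30)    # normal study time, unusually low score
student_b = (15, 100)  # extreme study time


def fit_with(extra_point=None):
    """Fit the line to the typical students plus one optional extra student."""
    xs, ys = list(hours), list(scores)
    if extra_point is not None:
        xs.append(extra_point[0])
        ys.append(extra_point[1])
    return fit_line(xs, ys)


slope_base, intercept_base = fit_with()
slope_a, intercept_a = fit_with(student_a)
slope_b, intercept_b = fit_with(student_b)

print(f"typical students only: slope={slope_base:.2f}, intercept={intercept_base:.2f}")
print(f"plus Student A:        slope={slope_a:.2f}, intercept={intercept_a:.2f}")
print(f"plus Student B:        slope={slope_b:.2f}, intercept={intercept_b:.2f}")
```

With these invented numbers, adding Student A lowers the whole line (a smaller intercept) but leaves the slope where it was, while adding Student B cuts the slope to less than half of its original value.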

How It Works (Step by Step)

Here's how you'd typically look for these special points:

1. Plot Your Data: First, make a scatterplot (a graph with dots for each data point) of your two variables. This is like drawing a map of your data.
2. Draw the Line: Calculate and draw the least-squares regression line (the 'best fit' line) through your data. This line tries to get as close as possible to all the dots.
3. Look for Vertical Strays (Outliers): Look for points that are far away from the line in the up-and-down (vertical) direction. These are your potential outliers.
4. Look for Horizontal Strays (Potential Influential Points): Look for points that are far away from the other data points in the left-and-right (horizontal) direction. These points have high 'leverage' and may be influential, especially if they fall away from the trend of the other points.
5. Check the Impact: Remove the suspicious point(s) and re-calculate the regression line. If the line changes a lot (especially the slope), then the point was influential.
6. Analyze Residuals: Look at the residual plot (a graph of the errors). Points with very large residuals (errors) are outliers. A high-leverage point can pull the line so close to itself that its residual looks small, so don't rely on residuals alone to find influential points.
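
The 'Check the Impact' step can be sketched as a leave-one-out refit: remove each point in turn, recompute the slope, and see which removal changes it most. The data and the helper name `slope_without` below are assumptions made for illustration. Six points lie exactly on y = 2x + 1 and one point sits far to the right and well off that trend.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up data: the last point has an extreme x-value and breaks the pattern.
xs = [1, 2, 3, 4, 5, 6, 14]
ys = [3, 5, 7, 9, 11, 13, 10]


def slope_without(xs, ys, i):
    """Slope of the least-squares line refit with point i removed."""
    slope, _ = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    return slope


full_slope, _ = fit_line(xs, ys)

# How much the slope changes when each point is left out, one at a time.
slope_changes = [slope_without(xs, ys, i) - full_slope for i in range(len(xs))]

# The point whose removal moves the slope the most.
most_influential = max(range(len(xs)), key=lambda i: abs(slope_changes[i]))

print(f"slope with all points: {full_slope:.3f}")
for i, change in enumerate(slope_changes):
    print(f"without point ({xs[i]}, {ys[i]}): slope changes by {change:+.3f}")
print(f"most influential point: ({xs[most_influential]}, {ys[most_influential]})")
```

How large a change counts as 'a lot' is a judgment call that depends on the context. This sketch only ranks the points by how much each one moves the slope.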

Why Do We Care? (The Impact)

Imagine you're trying to predict the price of a house based on its size. Most houses follow a pretty clear trend: bigger houses cost more. But what if one ordinary-sized house is falling apart and sells for almost nothing? Or what if a tiny shack costs a million dollars because it's on a super famous beach?

Outliers (like the run-down, ordinary-sized house) can make your predictions less accurate. Its price sits far below the line, so it drags the line down a little and makes the typical prediction error larger. Because its size is close to the average, though, it has little power to tilt the line.
Influential points (like the tiny shack on the famous beach) are even more dangerous. Because the shack is so far out on the 'size' scale (very small) and has an extreme price, it pulls the left end of your prediction line upwards and flattens the slope. Your model might then predict that small houses are more expensive than they truly are, and that extra size adds less to the price than it really does, just because of that one special case.

So, these points can make your model give really bad advice or predictions if you don't understand their effect. It's like trying to navigate with a compass that's been thrown off by a strong magnet – you'll end up going in the wrong direction!

Common Mistakes (And How to Avoid Them)

❌ Mistake 1: Assuming all outliers are bad and should be removed.
- Why it happens: It's tempting to just get rid of points that don't fit.
- ✅ How to avoid: Always investigate! An outlier might be a data entry error (typo), or it might be a genuinely important, unusual observation. For example, a new medical treatment might have one patient who responds incredibly well – that's an outlier, but a very important one to study! Don't remove data without a good reason and always report if you did.
❌ Mistake 2: Confusing an outlier with an influential point.
- Why it happens: Both are 'unusual' data points.
- ✅ How to avoid: Remember the 'tug-of-war' analogy. An outlier is just far away in the y-direction (unusual response). An influential point is far away in the x-direction (unusual explanatory value) and pulls the line towards it. Check if removing the point significantly changes the slope of the regression line. If it does, it's influential.
❌ Mistake 3: Not checking the residual plot.
- Why it happens: Students sometimes only look at the scatterplot.
- ✅ How to avoid: Always create and examine the residual plot. Points with large residuals (the vertical distance from the point to the line) are outliers. The residual plot often makes outliers much easier to spot than on the original scatterplot.
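
As a text stand-in for a residual plot, the sketch below lists each residual and flags any that are more than two residual standard deviations from zero. The data are made up (nine points on y = 2x + 1, with the middle one moved far above the pattern), and the 'two standard deviations' cutoff is a common rule of thumb assumed here for illustration, not a formal test.

```python
import math


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up data: y = 2x + 1, except the point at x = 5 (it would be 11).
xs = list(range(1, 10))
ys = [2 * x + 1 for x in xs]
ys[4] = 25

slope, intercept = fit_line(xs, ys)
res = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Standard deviation of the residuals, using n - 2 degrees of freedom.
s = math.sqrt(sum(r ** 2 for r in res) / (len(xs) - 2))

# Rule of thumb (an assumption): flag residuals larger than 2s in size.
flagged = [i for i, r in enumerate(res) if abs(r) > 2 * s]

for i, (x, r) in enumerate(zip(xs, res)):
    mark = "  <-- possible outlier" if i in flagged else ""
    print(f"x={x}  residual={r:+.2f}{mark}")
```

A large residual flags an outlier in the y-direction. A high-leverage point can pull the line close to itself and end up with a modest residual, so the remove-and-refit check is still needed to judge influence.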

Exam Tips

1. Always draw a scatterplot first! Visualizing the data is the easiest way to spot potential outliers or influential points.
2. When asked to identify an influential point, explain *why* it's influential (e.g., 'It has a large x-value and pulls the line towards it, significantly changing the slope').
3. If you identify an outlier, discuss its potential impact on the regression model (e.g., 'It might lower the R-squared value because it sits far from the line' or 'It might make the slope steeper/flatter').
4. Never remove an outlier without a clear justification (like a data entry error) and always mention if you did, and how it affected your results.
5. Remember that an influential point is almost always an outlier in the x-direction (high leverage), but an outlier in the y-direction isn't necessarily influential if its x-value is close to the mean.
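
When discussing the impact of an outlier, it helps to compare the fit with and without the point. The sketch below uses invented data and a hypothetical helper `fit_stats` to compare the slope and R-squared value before and after adding one y-direction outlier whose x-value is close to the mean.

```python
def fit_stats(xs, ys):
    """Return (slope, r_squared) for the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)  # square of the correlation r
    return slope, r_squared


# Made-up data with a strong positive linear pattern.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 5, 8, 9, 12, 13, 16]

# One unusual point: x is near the mean (4.5), y is far above the pattern.
outlier = (4, 20)

slope_without, r2_without = fit_stats(xs, ys)
slope_with, r2_with = fit_stats(xs + [outlier[0]], ys + [outlier[1]])

print(f"without outlier: slope={slope_without:.2f}, R-squared={r2_without:.2f}")
print(f"with outlier:    slope={slope_with:.2f}, R-squared={r2_with:.2f}")
```

In this example the outlier lowers R-squared sharply while the slope barely moves, which matches tip 5: a y-direction outlier near the mean x-value weakens the fit without being very influential.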