Sometimes, we want to detect and exclude outliers in a Pandas DataFrame with Python.
In this article, we’ll look at how to detect and exclude outliers in a Pandas DataFrame with Python.
How to detect and exclude outliers in a Pandas DataFrame with Python?
To detect and exclude outliers in a Pandas DataFrame with Python, we can build a boolean mask with NumPy and use it to return a new DataFrame that only has the values within 3 standard deviations of the mean.
To do this, we can write:
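import numpy as np
import pandas as pd

# 200 samples drawn from a standard normal distribution
df = pd.DataFrame({'Data': np.random.normal(size=200)})
# Keep only the rows within 3 standard deviations of the column mean
new_df = df[np.abs(df.Data - df.Data.mean()) <= 3 * df.Data.std()]
print(new_df)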
We create a Pandas DataFrame with a single column, Data, that holds 200 samples drawn from a normal distribution with np.random.normal.
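Then we pick the values that are within 3 standard deviations of the mean with df[np.abs(df.Data-df.Data.mean()) <= (3*df.Data.std())]. The comparison returns a boolean Series, and indexing df with it keeps only the rows where it is True.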
And we assign the returned DataFrame to new_df.
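Since random normal data rarely has values more than 3 standard deviations from the mean, the filter is easier to see on data with a known outlier. The following is a minimal sketch under assumptions that are not part of the original example: the seed, the hand-placed value 25.0, and the variable names values and within are all hypothetical.

```python
import numpy as np
import pandas as pd

# Assumption: seeded generator so the run is repeatable.
rng = np.random.default_rng(0)
# Assumption: one hand-placed extreme value (25.0) appended to 200 normal draws.
values = np.append(rng.normal(size=200), 25.0)
df = pd.DataFrame({'Data': values})

# Same filter as in the article: keep rows within 3 standard deviations of the mean.
within = np.abs(df.Data - df.Data.mean()) <= 3 * df.Data.std()
new_df = df[within]

print(len(df), len(new_df))           # row counts before and after filtering
print(25.0 in new_df.Data.values)     # whether the planted value survived
```

Note that the planted value also pulls the mean and inflates the standard deviation it is compared against, so the threshold depends on the outliers themselves.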
Therefore, new_df is something like:
Data
0 0.300805
1 -0.474140
2 -0.326278
3 0.566571
4 -1.391077
.. ...
195 0.500637
196 0.341858
197 -1.058419
198 -0.565920
199 -1.008344
[200 rows x 1 columns]
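according to print. The values are random, so they differ from run to run. In this run, all 200 rows are within 3 standard deviations of the mean, so no rows were dropped.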
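Printing new_df only shows the rows that were kept. To see which rows were detected as outliers, one option is to store the boolean mask and invert it with ~. This is a rough sketch rather than part of the original example: the seed, the two planted values -30.0 and 30.0, and the names within and outliers are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: seeded draws plus two hand-placed extreme values.
rng = np.random.default_rng(1)
df = pd.DataFrame({'Data': np.append(rng.normal(size=200), [-30.0, 30.0])})

# True for rows within 3 standard deviations of the mean.
within = np.abs(df.Data - df.Data.mean()) <= 3 * df.Data.std()

outliers = df[~within]   # rows flagged as outliers
new_df = df[within]      # rows kept

print(outliers)
print(f"dropped {len(outliers)} of {len(df)} rows")
```

Keeping the mask in its own variable means the same condition gives both the excluded rows and the filtered DataFrame, so the two always add up to the original.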
Conclusion
To detect and exclude outliers in a Pandas DataFrame with Python, we can build a boolean mask with NumPy and use it to return a new DataFrame that only has the values within 3 standard deviations of the mean.