Categories
Python Answers

How to detect and exclude outliers in a Python Pandas DataFrame?

To detect and exclude outliers in a Python Pandas DataFrame, we can use the zscore function from the scipy.stats module.

For instance, we write

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.randn(100, 3))
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

to create the df dataframe with some random values created from NumPy.

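Then we call stats.zscore to compute the z-score of each value within its column, and np.abs to take the absolute value. Comparing the result with < 3 and calling all(axis=1) gives a boolean mask that is True only for rows where every column is within 3 standard deviations of its column mean.

And we put that mask in df[] to return the rows that match the condition.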
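
To see the filter drop a known outlier, here is a minimal sketch that plants one extreme value and computes the z-scores with pandas alone instead of SciPy. The column names a, b, c and the planted value are assumptions made up for illustration. Passing ddof=0 to std matches the default of stats.zscore.

```python
import numpy as np
import pandas as pd

# Reproducible random data; column names are illustrative only
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=["a", "b", "c"])

# Plant an obvious outlier in row 5
df.loc[5, "b"] = 50.0

# Column-wise z-scores; ddof=0 mirrors the scipy.stats.zscore default
z = (df - df.mean()) / df.std(ddof=0)

# Keep only rows where every column is within 3 standard deviations
mask = (z.abs() < 3).all(axis=1)
filtered = df[mask]
```

The row with the planted value no longer appears in filtered, while df itself is left unchanged.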

How to split a Python Pandas dataframe based on groupby?

To split a Python Pandas dataframe based on groupby, we can use the groupby method and then call get_group to get a data frame for each group.

For instance, we write

gb = df.groupby('ZZ')    
[gb.get_group(x) for x in gb.groups]

to call groupby to group the rows by the values in the ZZ column.

And then we use a list comprehension to loop over the group keys in gb.groups and call get_group on the gb grouped object with each key x to return the data frame for that group.
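
As a small illustration, the sketch below runs the same split on a made-up data frame (the val column and its values are assumptions, not from the original). It also shows an alternative that iterates over the grouped object directly to build a dictionary keyed by group.

```python
import pandas as pd

# Hypothetical sample data with a ZZ column to group on
df = pd.DataFrame({"ZZ": ["a", "b", "a", "c"], "val": [1, 2, 3, 4]})

gb = df.groupby("ZZ")

# One data frame per group, in the order of the group keys
frames = [gb.get_group(x) for x in gb.groups]

# Alternative: iterating the grouped object yields (key, frame) pairs
by_key = {key: group for key, group in gb}
```

The dictionary form is handy when we need to look up a group's data frame by its key later on.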

How to sort a Python Pandas dataframe by one column?

To sort a Python Pandas dataframe by one column, we call sort_values.

For instance, we write

final_df = df.sort_values(by=['2'], ascending=False)

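to call sort_values with the by argument set to ['2'] to sort by the column labeled '2'. Here the label is the string '2', so if the column label is the integer 2, we pass in by=[2] instead.

And we set ascending to False to sort the rows by that column in descending order. sort_values returns a new data frame and leaves df unchanged.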
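
Here is a minimal sketch of the same call on a made-up data frame. The column labels '1' and '2' and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data; the column labels are the strings '1' and '2'
df = pd.DataFrame({"1": ["x", "y", "z"], "2": [10, 30, 20]})

# Sort by column '2', largest values first
final_df = df.sort_values(by=["2"], ascending=False)

# Optionally renumber the rows after sorting
final_df_reset = final_df.reset_index(drop=True)
```

The sorted frame keeps the original row labels, so we call reset_index(drop=True) when we want the index to run from 0 again.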

How to do a three-way join of multiple dataframes on columns with Python Pandas?

To do a three-way join of multiple dataframes on columns with Python Pandas, we call the functools.reduce function with pd.merge.

For instance, we write

import pandas as pd
from functools import reduce

dfs = [df0, df1, df2, dfN]
df_final = reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)

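to call reduce with a lambda that merges 2 data frames with the pd.merge method.

We merge them on the name column values.

Since we pass no initial value, reduce starts with the first data frame in dfs and merges each remaining one into the running result in turn. The fully merged data frame is assigned to df_final.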
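
To make the folding concrete, here is a small sketch with three made-up data frames that share a name column. The frames and their x, y, z columns are assumptions for illustration. pd.merge does an inner join by default, so only names present in every frame survive.

```python
from functools import reduce

import pandas as pd

# Three hypothetical frames sharing a 'name' column
df0 = pd.DataFrame({"name": ["a", "b", "c"], "x": [1, 2, 3]})
df1 = pd.DataFrame({"name": ["a", "b", "d"], "y": [4, 5, 6]})
df2 = pd.DataFrame({"name": ["b", "a"], "z": [7, 8]})

dfs = [df0, df1, df2]

# Equivalent to pd.merge(pd.merge(df0, df1, on='name'), df2, on='name')
df_final = reduce(lambda left, right: pd.merge(left, right, on="name"), dfs)
```

To keep names that are missing from some frames, we can pass how='outer' to pd.merge inside the lambda.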

How to calculate Time Difference Between Two Python Pandas Columns in Hours and Minutes?

To calculate time difference between two Python Pandas columns in hours and minutes, we subtract them directly after converting the values to timestamps.

For instance, we write

import pandas
df = pandas.DataFrame(columns=['to','fr','ans'])
df.to = [pandas.Timestamp('2014-01-24 13:03:12.050000'), pandas.Timestamp('2014-01-27 11:57:18.240000'), pandas.Timestamp('2014-01-23 10:07:47.660000')]
df.fr = [pandas.Timestamp('2014-01-26 23:41:21.870000'), pandas.Timestamp('2014-01-27 15:38:22.540000'), pandas.Timestamp('2014-01-23 18:50:41.420000')]
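(df.fr - df.to) / pandas.Timedelta(hours=1)

to create the df data frame with a few columns.

And then we assign timestamp values to the columns, which we create with the pandas.Timestamp constructor.

And then we subtract the timestamps to get a timedelta column and divide it by a one-hour Timedelta to get the difference as a floating-point number of hours. Older pandas versions also accepted astype('timedelta64[h]') here, but pandas 2.0 and later raise an error for that unit, since astype only supports the s, ms, us and ns resolutions.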
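
To get both hours and minutes rather than hours alone, one option is to work from the total number of seconds. The sketch below is one way to do it, with made-up hours and minutes column names and the timestamps trimmed to whole seconds.

```python
import pandas as pd

# Two sample rows; 'hours' and 'minutes' below are illustrative column names
df = pd.DataFrame({
    "to": pd.to_datetime(["2014-01-24 13:03:12", "2014-01-27 11:57:18"]),
    "fr": pd.to_datetime(["2014-01-26 23:41:21", "2014-01-27 15:38:22"]),
})

# Subtracting datetime columns gives a timedelta column
delta = df["fr"] - df["to"]

# Whole minutes elapsed, then split into hours and leftover minutes
total_minutes = delta.dt.total_seconds() // 60
df["hours"] = (total_minutes // 60).astype(int)
df["minutes"] = (total_minutes % 60).astype(int)
```

For the first row the difference is 2 days, 10 hours, 38 minutes and 9 seconds, which gives 58 hours and 38 minutes. Leftover seconds are dropped rather than rounded.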