I have a data frame and a large function, shown below. I want to apply the norm_group function to the data frame's columns, but it takes too long with apply. Is there any way to reduce the run time of this code? Currently it takes 24.4 s per loop.

import pandas as pd
import numpy as np

np.random.seed(1234)
n = 1500000

df = pd.DataFrame()
df['group'] = np.random.randint(1700, size=n)
df['ID'] = np.random.randint(5, size=n)
df['s_count'] = np.random.randint(5, size=n)
df['p_count'] = np.random.randint(5, size=n)
df['d_count'] = np.random.randint(5, size=n)
df['Total'] = np.random.randint(400, size=n)
# Min-max normalize Total within each group; transform keeps the original row index
df['Normalized_total'] = df.groupby('group')['Total'].transform(lambda x: (x - x.min()) / (x.max() - x.min()))
df['Normalized_total'] = df['Normalized_total'].round(2)

def norm_group(a, b, c, d, e):
    if a >= 0.7 and b >= 1000 and c > 2:
        return "Both High "
    elif a >= 0.7 and b >= 1000 and c < 2:
        return "High and C Low"
    elif a >= 0.4 and b >= 500 and d > 2:
        return "Medium and D High"
    elif a >= 0.4 and b >= 500 and d < 2:
        return "Medium and D Low"
    elif a >= 0.4 and b >= 500 and e > 2:
        return "Medium and E High"
    elif a >= 0.4 and b >= 500 and e < 2:
        return "Medium and E Low"
    else:
        return "Low"

%timeit df['Category'] = df.apply(lambda x: norm_group(a=x['Normalized_total'], b=x['group']), axis=1)

24.4 s ± 551 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I have multiple text columns in my original data frame and want to apply a similar kind of function, which takes much longer than this one.

Thanks

1 Answer

You can vectorize with np.select:

df['Category'] = np.select((df['Normalized_total'].ge(0.7) & df['group'].ge(1000),
                            df['Normalized_total'].ge(0.4) & df['group'].ge(500)),
                           ('High', 'Medium'), default='Low'
                          )

Performance:

255 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • Thanks for the answer. I have edited my question. If there are only 2 or 3 conditions, your answer works, but with many if/else statements it is difficult to write them all out in select. Is there any way to tackle multiple conditions?
    – Kumar AK
    Commented Nov 12, 2019 at 18:55
  • @KumarAK Then check the answer I put in the comments; it works with any number of conditions.
    – Umar.H
    Commented Nov 12, 2019 at 18:58
  • You just stack them into np.select with the default being the last one, e.g. np.select([cond1, cond2, cond3], [val1, val2, val3], default=default_val).
    Commented Nov 12, 2019 at 18:59
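
As a rough sketch of what "stacking" the conditions might look like for the edited norm_group function: np.select returns the choice for the first condition that is true in each row, so listing the conditions in the same order as the if/elif chain should reproduce its behavior. The helper name norm_group_vectorized is hypothetical, and the mapping of c, d, e to s_count, p_count, d_count is an assumption, since the question does not say which columns they correspond to.

```python
import numpy as np
import pandas as pd


def norm_group_vectorized(df):
    """Vectorized counterpart of norm_group (hypothetical helper)."""
    a = df['Normalized_total']
    b = df['group']
    # Assumption: c, d, e correspond to these columns; the question does not say.
    c, d, e = df['s_count'], df['p_count'], df['d_count']

    # Shared sub-conditions, computed once for the whole column
    high = a.ge(0.7) & b.ge(1000)
    medium = a.ge(0.4) & b.ge(500)

    # Same order as the if/elif chain: np.select takes the first match per row
    conditions = [
        high & c.gt(2),
        high & c.lt(2),
        medium & d.gt(2),
        medium & d.lt(2),
        medium & e.gt(2),
        medium & e.lt(2),
    ]
    # Labels copied verbatim from the question
    choices = [
        "Both High ",
        "High and C Low",
        "Medium and D High",
        "Medium and D Low",
        "Medium and E High",
        "Medium and E Low",
    ]
    return np.select(conditions, choices, default="Low")
```

With the data frame from the question this would be used as df['Category'] = norm_group_vectorized(df). Rows that match no condition, including rows where Normalized_total is NaN, fall through to the default "Low", as in the else branch.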
