I would like to share a classification issue I ran into while building a model. I have to model an imbalanced binary target with 4 predictors, one of which has wrong values for 45% of the observations. This predictor must be in the model.


*** What I have in my data

  • Number of observations: 10,000
  • 1 target: a binary variable, call it A -> Yes (38) / No (9,962)
  • 4 predictors: VarB (categorical), VarC (categorical), VarD (numeric), VarE (categorical)
  • Issues: For 45% of the observations, the values of VarD are wrong. The remaining 55% were corrected manually by an external team. That correction changed the original definition of VarD, and there is no way to correct the 45% wrong values. In addition, the wrong values affect 18 of the 38 Yes cases (about 47%). Let us call the resulting remediated variable RVarD (55% corrected values, 45% wrong values).
  • Constraint: I must build a binary model that uses the remediated RVarD as one of the predictors, and I cannot use black-box models/tools or overly sophisticated approaches.
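
To make the counts concrete, here is a minimal sketch of the arithmetic, using only the figures quoted above. The variable names are mine and purely illustrative.

```python
# Figures taken from the description above; variable names are illustrative.
n_total = 10_000
n_yes = 38
n_no = n_total - n_yes                      # "No" cases in the full data

share_wrong = 0.45                          # share of rows where VarD is wrong
n_wrong = round(n_total * share_wrong)      # rows with a wrong value
n_corrected = n_total - n_wrong             # rows with a corrected value

yes_in_wrong = 18                           # "Yes" cases among the wrong rows
yes_in_corrected = n_yes - yes_in_wrong     # "Yes" cases left if wrong rows are dropped
no_in_corrected = n_corrected - yes_in_corrected

event_rate_all = n_yes / n_total
event_rate_corrected = yes_in_corrected / n_corrected
share_yes_affected = yes_in_wrong / n_yes

print(f"Rows kept after dropping wrong values: {n_corrected}")
print(f"Yes / No among kept rows: {yes_in_corrected} / {no_in_corrected}")
print(f"Event rate, all rows:  {event_rate_all:.2%}")
print(f"Event rate, kept rows: {event_rate_corrected:.2%}")
print(f"Share of Yes cases with a wrong value: {share_yes_affected:.1%}")
```

The event rate barely changes between the full data and the corrected subset, but the absolute number of Yes cases is roughly halved.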

*** Possible solutions, with pros and cons:

  1. Fit a model with the remediated variable (RVarD) and the other predictors after dropping the 45% of observations where RVarD is wrong. That leaves 5,500 observations: target Yes (20) / No (5,480).
  • Pros: Easy to do
  • Cons: Too few cases in the Yes category of the target (20). Performance would be unstable because of this low count.
  2. Find a way to impute the 45% wrong values of RVarD based on the distribution of the 55% corrected values. I could also discretize RVarD and assign a category to the 45% wrong values based on the 55% correct ones.
  • Pros: Simple and quick way to impute
  • Cons: Because the definition changed, it looks like I am comparing apples and bananas.
  3. Fit one model without RVarD and use its coefficients to get predicted probabilities. Fit a second model with only RVarD on the 55% of observations that are correct. Compare the two sets of probabilities and find a scaling factor to link the two models.
  • Cons: Very complicated; the scaling factor and the link between the two models are hard to define properly.
  4. As in the previous option, first fit a model without RVarD and use its coefficients for prediction. Then find a way to bring in the mandatory variable RVarD through business rules or an additional layer.
  • Pros: A statistical model is secured
  • Cons: Complicated; it is hard to define the best rule for the 45% wrong values of RVarD.
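
The discretize-and-assign idea from the imputation option could be sketched roughly as follows. This is only an illustration on synthetic data: the toy values, the choice of four quantile bins, and the helper names `to_bin` and `impute_bin` are my assumptions, not part of the real dataset. The idea is to bin RVarD using the corrected rows only, then draw a bin for each wrong row from the empirical distribution of those bins.

```python
import random
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

random.seed(0)  # reproducible toy example

# Synthetic stand-in for the data (assumption): 5,500 corrected values,
# and 4,500 rows whose recorded value is wrong and therefore ignored.
corrected = [random.gauss(50, 10) for _ in range(5_500)]
n_wrong = 4_500

# 1) Learn cut points on the corrected rows only.
n_bins = 4                                  # assumption: quartile bins
cuts = quantiles(corrected, n=n_bins)       # n_bins - 1 cut points

def to_bin(x):
    """Return the bin index (0 .. n_bins - 1) of a corrected value."""
    return bisect_right(cuts, x)

corrected_bins = [to_bin(x) for x in corrected]

# 2) Empirical distribution of the bins among the corrected rows.
counts = Counter(corrected_bins)
labels = sorted(counts)
weights = [counts[b] for b in labels]

# 3) Assign a bin to each wrong row by sampling from that distribution.
def impute_bin():
    """Draw one bin index with probability proportional to its frequency."""
    return random.choices(labels, weights=weights, k=1)[0]

imputed_bins = [impute_bin() for _ in range(n_wrong)]

print("corrected:", dict(sorted(Counter(corrected_bins).items())))
print("imputed:  ", dict(sorted(Counter(imputed_bins).items())))
```

Note that this random assignment uses neither the target nor the other predictors, and it does not remove the concern about the changed definition of the variable.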

Which one is the most realistic, or how could I improve it? Feel free to propose a different approach; I am open to discussion.

Thanks a lot.
