I would like to share a classification issue I ran into during the modelling process. I have to build a model for an imbalanced binary target from 4 predictors, one of which has wrong values for 45% of the observations. This predictor must be in the model.
*** What do I have in my data?
- Number of observations: 10 000
- 1 target: a binary variable, let's call it A -> Yes (38) / No (9 962)
- 4 predictors: VarB (categorical), VarC (categorical), VarD (numeric), VarE (categorical)
- Issues: For 45% of the observations, the values of VarD are wrong. The remaining 55% were corrected manually by an external team, and that treatment changed the initial definition of VarD. There is no way to correct the wrong 45%. In addition, the wrong values affect about 47% (18 of 38) of the Yes category of the target. Let us call the resulting variable RVarD: corrected for 55% of the observations, wrong for the other 45%.
- Constraint: Build a binary model in which the remediated RVarD must be one of the predictors. I cannot use black-box models/tools or overly sophisticated approaches.
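To make the class balance concrete, here is a minimal arithmetic sketch in Python. It uses only the counts stated above; the variable names are mine, chosen for illustration.

```python
# Counts taken from the description above; nothing else is assumed.
n_total, n_yes = 10_000, 38
n_wrong = round(0.45 * n_total)   # rows where VarD is wrong
n_yes_wrong = 18                  # "Yes" rows among the wrong ones

n_kept = n_total - n_wrong        # rows with a corrected value
n_yes_kept = n_yes - n_yes_wrong  # "Yes" rows left if the wrong rows are dropped

summary = {
    "rows_kept": n_kept,
    "yes_kept": n_yes_kept,
    "event_rate_all": n_yes / n_total,
    "event_rate_kept": n_yes_kept / n_kept,
    "share_yes_lost": n_yes_wrong / n_yes,
}
for name, value in summary.items():
    print(name, round(value, 4))
```

The event rate stays well below 1% with or without the wrong rows, so the number of Yes cases, not the number of rows, is what limits the model.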
*** Solutions with their pros/cons:
- A model with the remediated variable (RVarD) and the other predictors, after dropping the 45% of observations where RVarD is wrong. That leaves 5 500 observations, with target Yes (20) / No (5 480).
  - Pros: Easy way
  - Cons: Too few cases in the Yes category of the target (20), so the model's performance will be unstable.
- Find a way to impute the wrong 45% of RVarD based on the distribution of the corrected 55%. I could also discretize the variable and assign a category to the wrong 45% based on the correct 55%.
  - Pros: Simple and quick way to impute
  - Cons: Since the definition changed, it looks like I am comparing bananas and apples.
- One model without RVarD, whose coefficients are used to predict probabilities. A second model with only RVarD, fitted on the 55% of observations that are correct. Compare the two sets of probabilities and find a scaling factor to link the two models.
  - Cons: Very complicated; it is hard to define the scaling factor properly and to link the two models.
- As in the previous option, fit a first model without RVarD and use its coefficients for prediction. Then find a way to bring in the mandatory RVarD through business rules or an additional layer.
  - Pros: A statistical model is secured
  - Cons: Complicated; it is hard to define the best rule for the wrong 45% of RVarD.
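To illustrate what I mean by the imputation idea in the second option, here is a rough Python sketch, not a final implementation. The toy data, the function names `impute_from_corrected` and `discretize`, and the choice of quartile cut points are all assumptions made up for this example.

```python
import random
import statistics

def impute_from_corrected(values, is_corrected, seed=0):
    """Replace each uncorrected value with a random draw from the corrected ones."""
    rng = random.Random(seed)
    donors = [v for v, ok in zip(values, is_corrected) if ok]
    return [v if ok else rng.choice(donors) for v, ok in zip(values, is_corrected)]

def discretize(values, cut_points):
    """Map each value to a bin index (0 = lowest); cut_points must be sorted."""
    return [sum(v > c for c in cut_points) for v in values]

# Toy data (made up): the first 55% of rows are "corrected", the rest are "wrong".
rng = random.Random(42)
n = 1000
is_corrected = [i < 550 for i in range(n)]
rvard = [rng.gauss(10, 2) if ok else rng.gauss(50, 20) for ok in is_corrected]

imputed = impute_from_corrected(rvard, is_corrected)

# Quartile cut points are computed from the corrected rows only.
donors = [v for v, ok in zip(rvard, is_corrected) if ok]
cuts = statistics.quantiles(donors, n=4)
bins = discretize(imputed, cuts)
```

A random draw like this reproduces the marginal distribution of the corrected values but ignores the target and the other predictors, which is part of why this option feels like comparing bananas and apples.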
Which one is the most realistic, and how could I improve it? Feel free to propose a different approach; I am open to discussion.
Thanks a lot.