import numpy as np
import pandas as pd
from pyod.models.pca import PCA

from pipelines.control import AADP
from pipelines.defaults import initialize_autoencoder, initialize_autoencoder_modified
from pipelines.defaults import dummy_data

# Show every column when printing DataFrames.
pd.set_option("display.max_columns", None)

if __name__ == "__main__":
    # Load the input data set.
    df_data = pd.read_csv("./temperature_USA.csv")

    # Alternative detector (requires `from pyod.models.iforest import IForest`):
    # clf_if = IForest(n_jobs=-1)
    clf_pca = PCA()

    anomaly_detection_pipeline = AADP(
        deactivate_pattern_recognition=True,
        exclude_columns_no_variance=True,
        mark_anomalies_pct_data=0.005,
    )

    # Run the unsupervised pipeline with the PCA detector; the model is not dumped.
    X_output = anomaly_detection_pipeline.unsupervised_pipeline(
        X_train=df_data,
        clf=clf_pca,
        dump_model=False,
    )

    # Write the pipeline output to disk.
    X_output.to_csv("temperatures_anomalies.csv", index=False)
Recent research reports similar results when nominal columns are encoded with significantly fewer dimensions.
- (John T. Hancock and Taghi M. Khoshgoftaar. "Survey on categorical data for neural networks." In: Journal of Big Data 7.1 (2020), pp. 1–41.)
- Tables 2, 4
- (Diogo Seca and João Mendes-Moreira. "Benchmark of Encoders of Nominal Features for Regression." In: World Conference on Information Systems and Technologies. 2021, pp. 146–155.)
- P. 151
Both methods (the modified Z-value, "MOD Z-Value", and the Tukey method) are robust to outliers: they rely on the median and the quartiles rather than the mean and standard deviation, so the location estimate is not biased by extreme values. They also help multivariate anomaly detection algorithms identify univariate anomalies.
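
The two univariate checks can be sketched as follows. This is a minimal illustration, not the pipeline's own implementation: the function names `modified_z_scores` and `tukey_outlier_mask` are hypothetical, the sample values are made up, and the constant 0.6745, the cutoff 3.5, and the fence factor 1.5 are the commonly used conventions, assumed here rather than taken from the pipeline.

```python
import numpy as np


def modified_z_scores(values):
    """Modified Z-value: deviation from the median, scaled by the MAD."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))  # median absolute deviation
    if mad == 0:
        # Assumption for this sketch: with a zero MAD no score is defined,
        # so nothing is flagged.
        return np.zeros_like(x)
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # Z-score for normally distributed data (conventional constant).
    return 0.6745 * (x - median) / mad


def tukey_outlier_mask(values, k=1.5):
    """Tukey method: flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


# Made-up example column with one obvious extreme value.
temperatures = np.array([20.1, 19.8, 20.4, 20.0, 19.9, 20.2, 35.0])

# 3.5 is the conventional cutoff for the modified Z-value (assumed here).
mod_z_mask = np.abs(modified_z_scores(temperatures)) > 3.5
tukey_mask = tukey_outlier_mask(temperatures)

print(temperatures[mod_z_mask])
print(temperatures[tukey_mask])
```

Because the median, the MAD, and the quartiles barely move when a single extreme value is added, both masks flag only the last entry of the example column, whereas a mean and standard deviation computed on the same data would be pulled toward the outlier.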