dirty_cat is a Python package that helps with machine learning on non-curated categories. It provides encoders that are robust to morphological variants, such as typos, in the category strings. It can be used as a drop-in replacement for scikit-learn's OneHotEncoder.
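
To illustrate why such encoders tolerate typos, here is a minimal pure-Python sketch of the general idea of similarity encoding: each value is represented by its string similarity to every reference category, instead of by an exact 0/1 match. This is not dirty_cat's own code or API. The helper names (`ngrams`, `ngram_similarity`, `similarity_encode`), the choice of character 3-grams, and the Jaccard similarity are assumptions made for this example only.

```python
def ngrams(s, n=3):
    """Return the set of character n-grams of a string, padded with one space on each side."""
    padded = f" {s.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings, in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0  # two strings too short to yield any n-gram are treated as identical
    return len(ga & gb) / len(ga | gb)


def similarity_encode(values, categories, n=3):
    """Encode each value as its vector of similarities to the reference categories."""
    return [[ngram_similarity(v, c, n) for c in categories] for v in values]


# Hypothetical data: "londn" is a typo of "london".
categories = ["london", "paris"]
values = ["london", "londn", "paris"]

encoded = similarity_encode(values, categories)
for value, row in zip(values, encoded):
    print(value, [round(x, 2) for x in row])
```

A one-hot encoding would treat "londn" as an entirely new category, unrelated to "london". In this sketch the misspelled value still gets a non-zero score in the "london" column and zero in the "paris" column, so a downstream model can treat it as close to the correct category.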

Website: API documentation and usage examples for the package

GitHub: source code


