DATA SCIENCE — DATA ANALYSIS, PANDAS, OBJECT ORIENTED PROGRAMMING

How to Convert a Pandas Column of Comma-Separated Strings Into Dummy Variables

A simple, quick way to create dummy variables from a Pandas column of comma-separated strings, and how to use it in a scikit-learn pipeline.

Muhammed ÇELİK

Bad data will lead to bad results even with a perfect model ( 2 — Data Exploration — Machine Learning Blog | ML@CMU | Carnegie Mellon University )

1- Defining the Problem and Options

Most data scientists run into the same problems. One that comes up during exploratory data analysis, and again while preparing data for machine learning, is how to handle a column of comma-separated strings. You could simply create dummy variables over the whole dataset, but this has undesirable consequences.
One example is data leakage: the encoding would be derived from rows that later end up in the test set. To avoid leakage, every preprocessing step has to be fitted on the training data only and then applied to new data, which takes time and adds complexity when done by hand. This article therefore covers a simple and quick way to create dummy variables from a Pandas column of comma-separated strings.

Here is a sample dataset with one comma-separated column:

Credit Score Classification Clean Data | Kaggle

2- Inheritance

To understand how we can write our own custom transformers with scikit-learn, we first have to get a little familiar with the concept of inheritance in Python. You can get more information from the link below.
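As a minimal sketch of what inheritance buys us here: a class that inherits from scikit-learn's `TransformerMixin` and `BaseEstimator` only has to define `fit` and `transform`, and it receives `fit_transform` and `get_params` from its parents. The class name `ColumnLowercaser` and the column name `loans` are hypothetical, chosen only for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnLowercaser(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that lowercases one string column."""

    def __init__(self, column):
        # Constructor arguments are stored unchanged so get_params() can find them
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn for this toy example
        return self

    def transform(self, X):
        X = X.copy()
        X[self.column] = X[self.column].str.lower()
        return X


df = pd.DataFrame({"loans": ["Auto Loan", "Home LOAN"]})
lowercaser = ColumnLowercaser(column="loans")

# fit_transform is inherited from TransformerMixin, get_params from BaseEstimator
result = lowercaser.fit_transform(df)
params = lowercaser.get_params()
```

Neither `fit_transform` nor `get_params` is written in the class body; both come from the parent classes, which is the part of inheritance that matters for custom transformers.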

3- scikit-learn Sample Code [“OneHotEncoder”, “OrdinalEncoder”]

Created by the Author How to Converting Pandas Column of Comma-Separated Strings Into Dummy Variables? (github.com)

Here you create an instance named ‘ohe’ of the class ‘OneHotEncoder’ by calling its constructor with the argument ‘ignore’ for the parameter ‘handle_unknown’ and the argument ‘False’ for the parameter ‘sparse’ (renamed ‘sparse_output’ in scikit-learn 1.2). Like other scikit-learn transformers, the OneHotEncoder class has methods such as ‘fit’ and ‘transform’.
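The description above can be sketched roughly as follows. This is not the code from the embedded gist; the category values are made up for illustration, and the `try`/`except` is only there because the `sparse` parameter was renamed `sparse_output` in newer scikit-learn releases.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

try:
    # Newer scikit-learn releases name the parameter `sparse_output`
    ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
except TypeError:
    # Older releases use `sparse`, as described in the text
    ohe = OneHotEncoder(handle_unknown="ignore", sparse=False)

# Hypothetical category values, one column
train = np.array([["Good"], ["Poor"], ["Standard"]], dtype=object)
test = np.array([["Good"], ["Excellent"]], dtype=object)  # "Excellent" is unseen

ohe.fit(train)                 # learn the categories from the training data
encoded = ohe.transform(test)  # unseen categories become all-zero rows
```

Because `handle_unknown` is set to `"ignore"`, the unseen value in `test` does not raise an error; its row is encoded as all zeros.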

4- Defining the GetDummies Class in Python Using an Object-Oriented Programming Approach

Created by the Author How to Converting Pandas Column of Comma-Separated Strings Into Dummy Variables? (github.com)

Here is a sample notebook that applies GetDummies to one column:

https://www.kaggle.com/code/clkmuhammed/credit-score-multi-class-classification-part-2-ml

4–1 We have to create a custom transformer to include this logic in a pipeline.

  • Let’s answer the following questions and use the answers to shape each method.
  1. [__init__] Does the transformer need any parameters?
    - The variables you want to encode, plus the separator. They are stored for use in the other methods.
  2. [FIT] What does the transformer need to learn from the data?
    - The names of the dummy features are computed from the training data and stored for later use.
  3. [TRANSFORM] Which part of the logic transforms the data, given the parameters (in 1) and what was learned (in 2)?
    - The separator parameter (a comma by default in our class) is passed to Series.str.get_dummies(sep=...), whose own default separator is '|'. This produces the dummy columns for the incoming data; the columns and values that are missing compared with the names learned in 2 are then filled in.
  4. [GET FEATURE NAMES OUT] How are the names of all output features retrieved?
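The four answers above could translate into a class along the following lines. This is a minimal sketch and not the author's gist: the parameter names `column` and `sep`, the helper `_dummies`, the choice to fill missing dummy columns with 0, and the sample column `loans` are all assumptions made for illustration.

```python
import re

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GetDummies(TransformerMixin, BaseEstimator):
    """Sketch: dummy-encode one separator-delimited string column."""

    def __init__(self, column, sep=","):
        # 1. [__init__] keep the column name and separator for later use
        self.column = column
        self.sep = sep

    def _dummies(self, X):
        # Drop whitespace around the separator so "a, b" yields "a" and "b"
        pattern = rf"\s*{re.escape(self.sep)}\s*"
        cleaned = X[self.column].str.replace(pattern, lambda m: self.sep, regex=True)
        return cleaned.str.strip().str.get_dummies(sep=self.sep)

    def fit(self, X, y=None):
        # 2. [FIT] learn the dummy column names from the data passed to fit
        self.dummy_columns_ = self._dummies(X).columns.tolist()
        self.other_columns_ = [c for c in X.columns if c != self.column]
        return self

    def transform(self, X):
        # 3. [TRANSFORM] encode, then align to the learned columns:
        # unseen categories are dropped, missing ones are added as 0 (assumption)
        dummies = self._dummies(X).reindex(columns=self.dummy_columns_, fill_value=0)
        return pd.concat([X[self.other_columns_], dummies], axis=1)

    def get_feature_names_out(self, input_features=None):
        # 4. [GET FEATURE NAMES OUT] untouched columns first, then the dummies
        return np.asarray(self.other_columns_ + self.dummy_columns_, dtype=object)


# Hypothetical data: "student" appears only in the test split
train = pd.DataFrame({"id": [1, 2], "loans": ["auto, home", "home"]})
test = pd.DataFrame({"id": [3], "loans": ["student, auto"]})

encoder = GetDummies(column="loans").fit(train)
encoded_test = encoder.transform(test)
```

Because the class follows the `fit`/`transform` convention, an instance can be placed as a step in a scikit-learn `Pipeline` like any built-in transformer.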

4–2 Sample Code for Using Our GetDummies Class

Created by the Author How to Converting Pandas Column of Comma-Separated Strings Into Dummy Variables? (github.com)

As you can see, our dummy columns are computed from the training data only. The same columns are then reused on new data, and any that are missing there are filled in. This makes the step data-leakage-proof.
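The same idea can be shown with plain pandas, outside any class. This is a simplified illustration with made-up values, assuming that missing dummy columns should be filled with 0.

```python
import pandas as pd

# Hypothetical train/test splits of one comma-separated column
train = pd.Series(["auto,home", "home"], name="loans")
test = pd.Series(["student,auto"], name="loans")

# "Fit": learn the dummy column names from the training split only
train_dummies = train.str.get_dummies(sep=",")
learned_columns = train_dummies.columns

# "Transform": encode the test split, then align it to the learned columns.
# "student" was never seen in training, so it is dropped;
# "home" is absent from the test row, so it is added and filled with 0.
test_dummies = test.str.get_dummies(sep=",").reindex(
    columns=learned_columns, fill_value=0
)
```

Nothing about the test split influences `learned_columns`, which is what keeps the encoding free of leakage.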

(Bonus) Encoding with json_normalize

The pandas function json_normalize normalizes semi-structured JSON data into a flat table. So if you convert your data to a JSON-like format, you can use json_normalize for the dummy operation.
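One possible reading of this tip, sketched under assumptions: each comma-separated string is first turned into a small dictionary of `{category: 1}`, and `pd.json_normalize` then flattens the list of dictionaries into a table. The sample values are hypothetical, and the conversion step is an illustration rather than something the article specifies.

```python
import pandas as pd

# Hypothetical comma-separated values
loans = pd.Series(["auto,home", "home", "student,auto"])

# Convert each string into a JSON-like record: one key per category
records = [{name: 1 for name in value.split(",")} for value in loans]

# Flatten the records into a table; categories absent from a row come out
# as NaN, so fill them with 0 and cast back to integers
dummies = pd.json_normalize(records).fillna(0).astype(int)
```

Note that this version computes the columns from whatever data it is given, so on its own it does not protect against leakage the way a fitted transformer does.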

Here is a good article that explains how to create a custom transformer.

If you liked this article, don’t forget to follow, like, and comment.
