Multi-Class Classification in R With Imbalanced Dataset
In machine learning, we come across a large number of datasets. For a beginner who is still learning machine learning algorithms, the datasets are simple and easy to work with. As one gains more experience, the datasets encountered are more often "imbalanced".
Wondering what an "imbalanced" dataset is? I will explain, but before that, let us look at a balanced dataset.
What Is a "Balanced" Dataset?
In a data science class, students are most often given a perfectly balanced dataset to practice on. By balanced, I mean that every level of the output variable in a classification problem has an equal proportion of observations. One example is the famous "iris" dataset.
Below is the R code to load the "iris" dataset and count the number of observations in each species:
data("iris")
table(iris$Species)
setosa versicolor virginica
50 50 50
The dataset is perfectly balanced, as every species has an equal number of observations.
So, classification algorithms tend to produce highly accurate results on it.
What Is an "Imbalanced" Dataset?
In real-world projects, we can rarely expect to work with a balanced dataset. An "imbalanced" dataset is one in which the proportion of observations across the different levels is unequal. For example, I have a "glass" dataset that contains six types of glass. Let us look at the number of observations for each type of glass.
> table(glass$Type)
1 2 3 5 6 7
70 76 17 13 9 29
Now you can see that Type 6 glass has the fewest observations, while Type 2 has the most. So, I can say that it is an imbalanced dataset.
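To put a number on the imbalance, the counts from the table above can be turned into percentages and a largest-to-smallest ratio. This is only a small sketch: the counts are copied by hand from the output shown earlier, and the variable names are my own.

```r
# Class counts copied from the table above (glass types 1, 2, 3, 5, 6, 7).
type_counts <- c("1" = 70, "2" = 76, "3" = 17, "5" = 13, "6" = 9, "7" = 29)

# Share of each type, in percent of all observations.
type_share <- round(100 * type_counts / sum(type_counts), 1)
print(type_share)

# Ratio of the most frequent type to the least frequent type.
imbalance_ratio <- max(type_counts) / min(type_counts)
print(round(imbalance_ratio, 2))
```

The ratio shows that Type 2 is a little more than eight times as frequent as Type 6.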
What Problem Can One Face When Building a Model With an Imbalanced Dataset?
Here, I will build a model using the imbalanced glass dataset and check the accuracy of my model. I will use the k-NN algorithm.
Inspecting the structure of the glass dataset shows that the "Type" variable is in integer format. Use the "factor()" function to change it into a factor.
The "Type" variable is now converted into a factor. After this, we can look at the proportion of observations in each of the different glass types.
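The steps just described (checking the structure, converting "Type" to a factor, and looking at the proportions) might look roughly like the sketch below. Since the original glass file is not available here, the code builds a stand-in data frame: the "Type" counts match the table shown earlier, but the columns x1 and x2 are made-up placeholder features, not the real measurements.

```r
# Stand-in for the glass data. Type counts match the article's table;
# x1 and x2 are hypothetical placeholder features on different scales.
set.seed(1)
glass <- data.frame(
  x1   = rnorm(214, mean = 1.5, sd = 0.01),
  x2   = rnorm(214, mean = 70, sd = 5),
  Type = rep(c(1L, 2L, 3L, 5L, 6L, 7L), times = c(70, 76, 17, 13, 9, 29))
)

# Look at the structure: Type is reported as an integer column.
str(glass)

# Convert the class label to a factor so R treats it as categorical.
glass$Type <- factor(glass$Type)
str(glass$Type)

# Proportion of observations in each level, in percent.
type_prop <- prop.table(table(glass$Type))
print(round(type_prop * 100, 1))
```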
You can see that the proportion of Type 6 glass is very low compared to Type 2 glass.
Now, let us look at the first 6 observations in our dataset.
Notice that there is a scaling problem in the dataset. So let's normalize the entire dataset using a normalization function, after removing the "Type" variable, as I don't want it to get normalized.
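The text does not say which normalization function is used, so the sketch below assumes min-max scaling to the range 0 to 1, a common choice before k-NN. The helper name normalize() and the stand-in data frame (placeholder features x1 and x2, with "Type" counts taken from the table shown earlier) are assumptions for illustration only.

```r
# Hypothetical helper: min-max scaling of a numeric vector to [0, 1].
# Note: a constant column would divide by zero here.
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Stand-in for the glass data (x1 and x2 are made-up placeholder features).
set.seed(1)
glass <- data.frame(
  x1   = rnorm(214, mean = 1.5, sd = 0.01),
  x2   = rnorm(214, mean = 70, sd = 5),
  Type = factor(rep(c(1, 2, 3, 5, 6, 7), times = c(70, 76, 17, 13, 9, 29)))
)

# Leave out the Type column, then normalize every remaining column.
feature_cols <- setdiff(names(glass), "Type")
glass_norm <- as.data.frame(lapply(glass[feature_cols], normalize))

# Every feature should now have a minimum of 0 and a maximum of 1.
print(summary(glass_norm))
```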
The dataset is now normalized.
Let's split the normalized dataset into training and test data using random sampling.
To partition the "Type" variable as well, split the original glass dataset using the same row indices.
Let's look at the proportions in the training and test data.
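A random split and the comparison of class proportions could be sketched as follows. The 70/30 split ratio, the seed, and all variable names are assumptions, since the text does not state them. The data frame is again a stand-in with placeholder features, so the printed proportions only illustrate the mechanics.

```r
set.seed(123)  # arbitrary seed, only to make the sketch repeatable

# Stand-in for the glass data (x1 and x2 are made-up placeholder features).
glass <- data.frame(
  x1   = rnorm(214, mean = 1.5, sd = 0.01),
  x2   = rnorm(214, mean = 70, sd = 5),
  Type = factor(rep(c(1, 2, 3, 5, 6, 7), times = c(70, 76, 17, 13, 9, 29)))
)
normalize  <- function(x) (x - min(x)) / (max(x) - min(x))
glass_norm <- as.data.frame(lapply(glass[c("x1", "x2")], normalize))

# Assumption: 70% of the rows go to training, the rest to testing.
n         <- nrow(glass_norm)
train_idx <- sample(n, size = round(0.7 * n))

train_data <- glass_norm[train_idx, ]
test_data  <- glass_norm[-train_idx, ]

# Split the labels with the same row indices so they stay aligned.
train_type <- glass$Type[train_idx]
test_type  <- glass$Type[-train_idx]

# Class proportions (in percent) in each part.
print(round(prop.table(table(train_type)) * 100, 1))
print(round(prop.table(table(test_type)) * 100, 1))
```

Because the sampling ignores the class label, nothing forces the rare types to appear in the same proportion in both parts.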
There is a huge difference in the proportions of the different levels between the training and test datasets. A k-NN model using k = 3 is built.
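A k-NN model with k = 3 could be fitted with knn() from the "class" package, as in the sketch below. This is not necessarily the exact code behind the article: the split ratio, the seed and the variable names are assumptions, and the data frame is a stand-in whose features x1 and x2 are random placeholders. The accuracy it prints therefore says nothing about the real glass data and only shows how the calculation is done.

```r
library(class)  # recommended package that provides knn()

set.seed(123)   # arbitrary seed; knn() breaks ties at random

# Stand-in for the glass data (x1 and x2 are made-up placeholder features).
glass <- data.frame(
  x1   = rnorm(214, mean = 1.5, sd = 0.01),
  x2   = rnorm(214, mean = 70, sd = 5),
  Type = factor(rep(c(1, 2, 3, 5, 6, 7), times = c(70, 76, 17, 13, 9, 29)))
)
normalize  <- function(x) (x - min(x)) / (max(x) - min(x))
glass_norm <- as.data.frame(lapply(glass[c("x1", "x2")], normalize))

# Assumed 70/30 random split, with labels split by the same indices.
train_idx  <- sample(nrow(glass_norm), size = round(0.7 * nrow(glass_norm)))
train_data <- glass_norm[train_idx, ]
test_data  <- glass_norm[-train_idx, ]
train_type <- glass$Type[train_idx]
test_type  <- glass$Type[-train_idx]

# Classify each test row by majority vote among its 3 nearest training rows.
pred <- knn(train = train_data, test = test_data, cl = train_type, k = 3)

# Confusion matrix and overall accuracy on the test data.
confusion <- table(Predicted = pred, Actual = test_type)
print(confusion)
accuracy <- sum(diag(confusion)) / sum(confusion)
print(round(accuracy, 3))
```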