Data visualization using pandas and classification of iris species

Visualize the iris dataset using pandas and classify the species using k-nearest neighbors (sklearn)

Category: Machine Learning Tags: Python, Python 3



Scikit-learn provides the iris flower dataset, on which we can practice visualization and classification. The dataset contains length and width measurements of the sepals and petals of three iris species. The figure below shows the sepal and petal of an iris flower:

Fig 1: Iris Flower Sepal and Petal


Let’s take a look at the data provided in this dataset. Create a file:

from sklearn import datasets

# load the dataset
iris = datasets.load_iris()
# print the feature names
print('features: %s'%iris['feature_names'])
# print the iris species (target categories)
print('target categories: %s'%iris['target_names'])
# print the shape of the data array
print("Shape of data: {}".format(iris['data'].shape))
# print the first five rows as a sample
print("sample of data: \n{}".format(iris['data'][:5]))


features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
target categories: ['setosa' 'versicolor' 'virginica']
Shape of data: (150, 4)
sample of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

In the output above we can see that each flower has four features: sepal length, sepal width, petal length and petal width. The three species of iris are setosa, versicolor and virginica. There are 150 rows (data points) in total, 50 samples for each species. The end of the output shows a sample of the data array, where each row has 4 columns.
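The 50-samples-per-species figure can be checked directly from the target array. The following is a minimal sketch; the variable name counts is our own and is not part of the original files.

```python
import numpy as np
from sklearn import datasets

# load the dataset
iris = datasets.load_iris()
# count how many samples carry each target label (0, 1, 2)
counts = np.bincount(iris['target'])
# pair each species name with its sample count
for name, count in zip(iris['target_names'], counts):
    print("{}: {} samples".format(name, count))
```

Each species should be reported with 50 samples.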

Now we want to classify the species, and for that we first need to check whether the data is separable. Let’s visualize the iris data in a scatter plot. Create a file:

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from pandas.plotting import scatter_matrix

# load the dataset
iris = datasets.load_iris()
# create a pandas data frame from the data rows and feature names
dataframe = pd.DataFrame(iris['data'], columns=iris['feature_names'])
# plot a scatter matrix comparing every feature with every other, colored by species
scatter_matrix(dataframe, c=iris['target'], marker='o', s=10, alpha=.8)
# display the figure
plt.show()

Above we have used matplotlib and pandas to plot the data. pandas has a DataFrame class that frames the data as a table: it takes the data rows as a numpy array together with the column names. The scatter_matrix function takes the data frame, and we pass the target as the c argument so that each point is colored by its species. Since this is supervised learning, we already know the target, which is the species of iris.



Fig 2: Iris species scatter plot


We can see that pandas plots 16 graphs, in which each feature is compared with every other feature. With 4 features it plots 4*4 = 16 graphs, and the 4 on the diagonal pair a feature with itself, so by default they show that feature's histogram instead of a scatter plot. The species form largely distinct clusters on the sepal and petal measurements, so we can use these features to train the classifier.
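The separation seen in the plot can also be checked numerically by comparing the average measurements per species. This is a small sketch, not part of the original files; the column name species and the variable species_means are our own choices.

```python
import pandas as pd
from sklearn import datasets

# load the dataset and frame it as a table
iris = datasets.load_iris()
dataframe = pd.DataFrame(iris['data'], columns=iris['feature_names'])
# add a readable species column (the name 'species' is an assumption of this sketch)
dataframe['species'] = iris['target_names'][iris['target']]
# mean of every feature for each species
species_means = dataframe.groupby('species').mean()
print(species_means)
```

In the resulting table, the petal measurements of setosa are far smaller than those of the other two species, which matches the clearly isolated cluster in the scatter plot.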

We will use K-Nearest Neighbors (KNN) to classify the species. Create a file:

import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# load the dataset
iris = datasets.load_iris()
# split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(iris['data'], iris['target'])
# initialize knn with 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5)
# train the classifier
knn.fit(x_train, y_train)
# predict the species of the test rows
predicted_species = knn.predict(x_test)
print("predicted data\n{}".format(predicted_species))
# print the accuracy as a percentage: the share of predictions that match y_test
print("accuracy: {:.2f}%".format(100*np.mean(predicted_species == y_test)))

The sklearn.neighbors module has the class KNeighborsClassifier, which we use to classify the species with KNN. The train_test_split function splits the data for training and testing: by default 75% of the data (112 rows out of 150) is used to train the algorithm and 25% (38 rows) is used to test it. The fit method trains the classifier and the predict method classifies new items. predict takes an array of test data and returns the predicted species as an array, where 0, 1 and 2 stand for setosa, versicolor and virginica respectively. We calculate the accuracy as the mean of the element-wise comparison between the predicted and the actual test labels, that is, the fraction of correct predictions. Because the split is random, the predictions and the accuracy can vary from run to run.
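The split sizes and the accuracy calculation can be verified with a short sketch. Note that random_state=0 is an arbitrary seed added here only to make the split repeatable, and the score method is used as a cross-check; neither appears in the original file.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# random_state=0 is an assumption of this sketch, used to get the same split every run
x_train, x_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], random_state=0)
# the default split keeps 75% of the rows for training and 25% for testing
print("train rows: {}, test rows: {}".format(x_train.shape[0], x_test.shape[0]))

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
# accuracy computed by hand: fraction of predictions equal to the true labels
manual_accuracy = np.mean(knn.predict(x_test) == y_test)
# accuracy computed by the classifier's own score method
builtin_accuracy = knn.score(x_test, y_test)
print("manual: {:.4f}, score(): {:.4f}".format(manual_accuracy, builtin_accuracy))
```

The first line of output should report 112 training rows and 38 test rows, and the two accuracy values should agree, since score reports the mean accuracy on the given test data.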


predicted data
[1 0 1 2 2 1 2 0 0 0 1 0 2 2 0 2 1 0 1 0 0 0 1 2 2 1 2 1 0 0 1 1 2 1 1 2 2 0]
accuracy: 97.37%
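The numeric labels in the predicted data can be turned back into species names by indexing target_names. A minimal sketch follows; the five labels are copied by hand from the start of the output above, and the variable names are our own.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
# first five predicted labels, copied from the output above
predicted = np.array([1, 0, 1, 2, 2])
# index the array of species names with the labels to translate them
predicted_names = iris['target_names'][predicted]
print(predicted_names)
# prints ['versicolor' 'setosa' 'versicolor' 'virginica' 'virginica']
```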

Last modified on 11 October 2018
Nikhil Joshi