Mining from data is not a simple task; sometimes it can get very complex. Instead of writing all the complicated code yourself, you can use pre-built code, written by experts and given to us in the form of libraries. Using libraries has many advantages.
In this blog, I introduce 4 of the most popular Python libraries for data mining:
Numpy
Pandas
Scikit-learn
Matplotlib
Numpy
Numpy is a math library that supports many operations on arrays, from simple to complex.
import numpy as np
Show some basic statistics of an array:
arr = [2, 5, 4, 9, 2, 5, 2, 4, 3]
print("Sum of arr: ", np.sum(arr))
print("Mean of arr: ", np.mean(arr))
print("Median of arr: ", np.median(arr))
print("Standard deviation of arr: ", np.std(arr))
print("Variance of arr: ", np.var(arr))
print("Max value of arr: ", np.max(arr))
Sum of arr: 36
Mean of arr: 4.0
Median of arr: 4.0
Standard deviation of arr: 2.1081851067789197
Variance of arr: 4.444444444444445
Max value of arr: 9
We can create arrays using numpy.
# Create an array
# values run from 5 up to, but not including, 20
# consecutive values increase by 2
np.arange(5, 20, 2)
array([ 5, 7, 9, 11, 13, 15, 17, 19])
# Create an array of 5 elements
# values are from 3 to 10, equally spaced
np.linspace(3, 10, 5)
array([ 3. , 4.75, 6.5 , 8.25, 10. ])
# Create array of all zeros with customized size
np.zeros((2, 3))
array([[0., 0., 0.],
[0., 0., 0.]])
# Create array of customized size
# with the same initial values
np.full((3, 2), 10)
array([[10, 10],
[10, 10],
[10, 10]])
Numpy can do element-wise operations.
np.log(arr)
array([0.69314718, 1.60943791, 1.38629436, 2.19722458, 0.69314718,
1.60943791, 0.69314718, 1.38629436, 1.09861229])
np.sin(arr)
array([ 0.90929743, -0.95892427, -0.7568025 , 0.41211849, 0.90929743,
-0.95892427, 0.90929743, -0.7568025 , 0.14112001])
np.power(arr, 2)
array([ 4, 25, 16, 81, 4, 25, 4, 16, 9])
Combining arrays:
b = [[2, 4, 6], [3, 5, 7]]
c = [[10, 15, 20], [25, 30, 35]]
# you can specify axis for combination
# axis=0 means row-wise
np.concatenate([b, c], axis=0)
array([[ 2, 4, 6],
[ 3, 5, 7],
[10, 15, 20],
[25, 30, 35]])
# axis=1 means column-wise
np.concatenate([b, c], axis=1)
array([[ 2, 4, 6, 10, 15, 20],
[ 3, 5, 7, 25, 30, 35]])
Reshape an array:
np.reshape(b, (1, 6))
array([[2, 4, 6, 3, 5, 7]])
Above are the most commonly used Numpy operations. There are many, many others (the list seems endless to me) that you can pick from to suit your needs. If you want to know more about Numpy, take a look at the Numpy reference documentation.
Pandas
Pandas, short for Panel Data, is a library for representing data as a data-frame.
When studying and practicing data mining, we often have a dataset that is best presented as a table, where each row is a sample and each column is a feature. This kind of data is splendidly supported by Pandas. Using Pandas, you can easily handle and wrangle your data.
import pandas as pd
To load a dataset into a Pandas DataFrame, the 2 most common ways are:
# Input from a csv file
df = pd.read_csv('dataset.csv')
# Input from a dict
df = pd.DataFrame({
    'feature1': [1, 2, 5, 4, 2],
    'feature2': [5, 2, 3, 1, 1],
    'feature3': ['A', 'BC', 'D', 'BC', 'D'],
})
print(df)
feature1 feature2 feature3
0 1 5 A
1 2 2 BC
2 5 3 D
3 4 1 BC
4 2 1 D
You can check the data-type of each column.
df.dtypes
feature1 int64
feature2 int64
feature3 object
dtype: object
There are 2 ways to access a column:
print('Access column using dot symbol')
print(df.feature1)
print('Access column using brackets')
print(df['feature1'])
Access column using dot symbol
0 1
1 2
2 5
3 4
4 2
Name: feature1, dtype: int64
Access column using brackets
0 1
1 2
2 5
3 4
4 2
Name: feature1, dtype: int64
You can simply create a new column like below:
df['feature4'] = df['feature1'] + df['feature2']
print(df)
feature1 feature2 feature3 feature4
0 1 5 A 6
1 2 2 BC 4
2 5 3 D 8
3 4 1 BC 5
4 2 1 D 3
or below:
df['feature5'] = 100
print(df)
feature1 feature2 feature3 feature4 feature5
0 1 5 A 6 100
1 2 2 BC 4 100
2 5 3 D 8 100
3 4 1 BC 5 100
4 2 1 D 3 100
To remove a column:
df = df.drop(['feature5'], axis=1)
print(df)
feature1 feature2 feature3 feature4
0 1 5 A 6
1 2 2 BC 4
2 5 3 D 8
3 4 1 BC 5
4 2 1 D 3
Look at the first few rows:
print(df.head(2))
feature1 feature2 feature3 feature4
0 1 5 A 6
1 2 2 BC 4
Get an overview of our data:
print(df.describe())
feature1 feature2 feature4
count 5.000000 5.00000 5.000000
mean 2.800000 2.40000 5.200000
std 1.643168 1.67332 1.923538
min 1.000000 1.00000 3.000000
25% 2.000000 1.00000 4.000000
50% 2.000000 2.00000 5.000000
75% 4.000000 3.00000 6.000000
max 5.000000 5.00000 8.000000
Apply a change to every value in a column:
df.feature1 = df.feature1.apply(lambda x: x + 1)
print(df)
feature1 feature2 feature3 feature4
0 2 5 A 6
1 3 2 BC 4
2 6 3 D 8
3 5 1 BC 5
4 3 1 D 3
Get correlations between each pair of numerical columns:
print(df.corr())
feature1 feature2 feature4
feature1 1.000000 -0.327327 0.569495
feature2 -0.327327 1.000000 0.590301
feature4 0.569495 0.590301 1.000000
Still, there are many, many more operations you can do with Pandas; the above only scratches the surface of a deep sea. Head over to the Pandas reference documentation for a complete guide!
Scikit-Learn
Scikit-Learn (or sklearn for short) greatly simplifies machine learning algorithms and related functions. Data preprocessing, feature engineering, model selection, validation testing, and other such complex tasks, which would normally require intricate algorithms and a lot of coding, can all be done with sklearn in just a few lines of code.
import sklearn
For example, we can apply a standard-scaler as below:
(Applying a standard scaler means each value has the mean subtracted and is then divided by the standard deviation, i.e. the z-score transformation.)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df.feature1 = scaler.fit_transform(
    df.feature1.values.reshape((-1, 1))
).ravel()
print(df)
feature1 feature2 feature3 feature4
0 -1.224745 5 A 6
1 -0.544331 2 BC 4
2 1.496910 3 D 8
3 0.816497 1 BC 5
4 -0.544331 1 D 3
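As a quick sanity check, we can reproduce the scaled column by hand. This snippet is my own verification sketch, not part of the original walkthrough; it relies on StandardScaler's default of using the population standard deviation (ddof=0).
# Verification sketch: recompute the z-score manually.
# feature1 held [2, 3, 6, 5, 3] just before scaling (after the earlier +1).
x = np.array([2, 3, 6, 5, 3], dtype=float)
z = (x - x.mean()) / x.std()   # np.std defaults to ddof=0, like StandardScaler
print(z)   # matches the scaled feature1 column above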
One-hot encoding works similarly:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse=False)  # in scikit-learn >= 1.2, this argument is named sparse_output
encoded_arr = encoder.fit_transform(
    df.feature3.values.reshape((-1, 1))
)
encoded_df = pd.DataFrame(
    encoded_arr,
    columns=encoder.categories_[0]  # use the encoder's own (sorted) categories for column names
)
df = pd.concat((df, encoded_df), axis=1)
df = df.drop(['feature3'], axis=1)
print(df)
feature1 feature2 feature4 A BC D
0 -1.224745 5 6 1.0 0.0 0.0
1 -0.544331 2 4 0.0 1.0 0.0
2 1.496910 3 8 0.0 0.0 1.0
3 0.816497 1 5 0.0 1.0 0.0
4 -0.544331 1 3 0.0 0.0 1.0
Note from the above example that sklearn returns its values as Numpy arrays (not DataFrames), so after getting output from sklearn, we need to do a bit of work to turn the output back into a DataFrame.
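As a general pattern (a minimal sketch of my own, not from the original example), you can wrap a transformer's Numpy output back into a DataFrame by supplying column names and reusing the original index:
# Hedged sketch: turn a transformer's numpy output back into a DataFrame.
# The variable names here are illustrative, not from the post.
scaled = StandardScaler().fit_transform(df[['feature1', 'feature2']])
scaled_df = pd.DataFrame(scaled, columns=['feature1', 'feature2'], index=df.index)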
For the next example, let me train a predictive model!
We will use the first 2 columns (‘feature1’ and ‘feature2’) as predictor variables, and ‘feature4’ as the target variable. We fit a Linear Regression model, training on the first 2 samples, and get predictions for the rest.
from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(df[['feature1', 'feature2']][:2], df['feature4'][:2])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
We have done the training. You can print out the coefficients of the regression as below:
print("coefficients: ", reg.coef_)
coefficients: [-0.14380566 0.63405088]
And make predictions:
reg.predict(df[['feature1', 'feature2']][2:])
array([4.34050881, 3.1702544 , 3.36594912])
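The top of this section also mentioned validation; as a hedged sketch (the test_size and random_state below are arbitrary illustrative choices, not from the original), a held-out evaluation could look like this:
from sklearn.model_selection import train_test_split

# Illustrative sketch: hold out part of the data for evaluation.
# test_size=0.4 and random_state=0 are arbitrary example values.
X = df[['feature1', 'feature2']]
y = df['feature4']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
model = linear_model.LinearRegression().fit(X_train, y_train)
print("R^2 on held-out samples:", model.score(X_test, y_test))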
That’s so simple and easy to use, right?
Check out the Scikit-learn reference documentation to learn more about this library.
Matplotlib
Matplotlib is a comprehensive library for visualization in Python. It supports most of the basic plots that we need when starting with data science.
As this post is already pretty lengthy, and as I have previously published a post about Matplotlib, please follow that post to have a look at how Matplotlib works and to see some simple examples.
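Just to give a small taste here, a minimal sketch (the column choices below are my own illustration, not from that post) could look like:
import matplotlib.pyplot as plt

# Minimal illustrative sketch: scatter two columns of our DataFrame.
plt.scatter(df['feature1'], df['feature4'])
plt.xlabel('feature1 (standardized)')
plt.ylabel('feature4')
plt.title('feature4 vs. feature1')
plt.show()
Even this small snippet shows how little code a basic plot requires.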