# Numpy, Pandas, Scikit-learn and Matplotlib

Mining data is not a simple task, and it can sometimes get very complex. Instead of writing all the complicated code yourself, you can use existing code, written by experts and distributed in the form of libraries. Using libraries has many advantages:

• You don’t have to write long and involved code.
• Widely used libraries are heavily tested, so their functions are far less likely to contain bugs that produce wrong output than code you write from scratch.
• Some functions, algorithms, and modules are very difficult to implement correctly yourself, but library authors who specialize in them have already done the work.
• When you use a popular library, others can more easily understand your code and help you when you need it.
• By using libraries, you are standing on the shoulders of giants. You need less time to complete your work and can therefore deliver more value.

In this post, I introduce 4 of the most popular Python libraries for data mining.

## Numpy

Numpy is a math library that supports many operations on arrays, from simple to complex.

`import numpy as np`

Compute some basic statistics of an array.

```arr = [2, 5, 4, 9, 2, 5, 2, 4, 3]

print("Sum of arr: ", np.sum(arr))
print("Mean of arr: ", np.mean(arr))
print("Median of arr: ", np.median(arr))
print("Standard deviation of arr: ", np.std(arr))
print("Variance of arr: ", np.var(arr))
print("Max value of arr: ", np.max(arr))```
``````Sum of arr:  36
Mean of arr:  4.0
Median of arr:  4.0
Standard deviation of arr:  2.1081851067789197
Variance of arr:  4.444444444444445
Max value of arr:  9``````
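One detail worth knowing: by default `np.std` and `np.var` divide by N (population statistics). The sketch below, which reuses the same `arr`, shows how the `ddof=1` argument switches to the sample versions that divide by N - 1. This is also what Pandas reports by default, so the two libraries can give slightly different numbers for the same data.

```python
import numpy as np

arr = [2, 5, 4, 9, 2, 5, 2, 4, 3]

# Default: population statistics (divide by N)
pop_var = np.var(arr)
pop_std = np.std(arr)

# ddof=1: sample statistics (divide by N - 1)
sample_var = np.var(arr, ddof=1)
sample_std = np.std(arr, ddof=1)

print("Population variance:", pop_var)
print("Sample variance:", sample_var)  # 5.0
```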

We can create arrays using numpy.

```# Create an array
# values are from 5 to smaller than 20
# consecutive values increase by 2
np.arange(5, 20, 2)```
``array([ 5,  7,  9, 11, 13, 15, 17, 19])``
```# Create an array of 5 elements
# values are from 3 to 10, equally spaced
np.linspace(3, 10, 5)```
``array([ 3.  ,  4.75,  6.5 ,  8.25, 10.  ])``
```# Create array of all zeros with customized size
np.zeros((2, 3))```
``````array([[0., 0., 0.],
[0., 0., 0.]])``````
```# Create array of customized size
# with the same initial values
np.full((3, 2), 10)```
``````array([[10, 10],
[10, 10],
[10, 10]])``````

Numpy can do element-wise operations.

`np.log(arr)`
``````array([0.69314718, 1.60943791, 1.38629436, 2.19722458, 0.69314718,
1.60943791, 0.69314718, 1.38629436, 1.09861229])``````
`np.sin(arr)`
``````array([ 0.90929743, -0.95892427, -0.7568025 ,  0.41211849,  0.90929743,
-0.95892427,  0.90929743, -0.7568025 ,  0.14112001])``````
`np.power(arr, 2)`
``array([ 4, 25, 16, 81,  4, 25,  4, 16,  9])``
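The ordinary arithmetic and comparison operators are also element-wise, as long as the data is a numpy array rather than a plain Python list (with a list, `arr * 2` repeats the list instead of doubling each value). A minimal sketch, with variable names chosen only for illustration:

```python
import numpy as np

arr = np.array([2, 5, 4, 9, 2, 5, 2, 4, 3])

doubled = arr * 2     # multiply every element by 2
shifted = arr + 10    # add 10 to every element
squared = arr ** 2    # same result as np.power(arr, 2)
is_large = arr > 4    # element-wise comparison gives a boolean array

print(doubled)
print(arr[is_large])  # keeps only the elements greater than 4: [5 9 5]
```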

Combining arrays:

```b = [[2, 4, 6], [3, 5, 7]]
c = [[10, 15, 20], [25, 30, 35]]

# you can specify the axis to combine along
# axis=0 stacks the arrays vertically, adding rows
np.concatenate([b, c], axis=0)```
``````array([[ 2,  4,  6],
[ 3,  5,  7],
[10, 15, 20],
[25, 30, 35]])``````
```# axis=1 joins the arrays side by side, adding columns
np.concatenate([b, c], axis=1)```
``````array([[ 2,  4,  6, 10, 15, 20],
[ 3,  5,  7, 25, 30, 35]])``````

Reshape an array:

`np.reshape(b, (1, 6))`
``array([[2, 4, 6, 3, 5, 7]])``
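When reshaping, you can pass -1 for one dimension and let numpy work out its size from the others. The sketch below assumes the same values as `b`, converted to a numpy array so the `.reshape` method is available. The `reshape((-1, 1))` form is the one used in the Scikit-Learn examples later in this post.

```python
import numpy as np

b = np.array([[2, 4, 6], [3, 5, 7]])

col = b.reshape(-1, 1)   # 1 column; numpy infers 6 rows
flat = b.reshape(-1)     # 1-D array with all 6 elements
wide = b.reshape(3, -1)  # 3 rows; numpy infers 2 columns

print(col.shape, flat.shape, wide.shape)  # (6, 1) (6,) (3, 2)
```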

Above are some of the most commonly used numpy operations. There are many, many others that you can use as needed. If you want to know more about Numpy, take a look at the Numpy references.

## Pandas

Pandas, whose name is derived from “panel data”, is a library for representing data in a data-frame.

When studying and practicing data mining, we often have a dataset that fits naturally into a table, where each row is a sample and each column is a feature. This kind of data is very well supported by Pandas, which makes it easy to handle and wrangle your data.

`import pandas as pd`

The 2 most common ways to load a dataset into a Pandas DataFrame are:

```# Input from a csv file
# (the path is only a placeholder; point it at your own file)
# df = pd.read_csv('path/to/your_file.csv')

# Input from a dict
df = pd.DataFrame({
    'feature1': [1, 2, 5, 4, 2],
    'feature2': [5, 2, 3, 1, 1],
    'feature3': ['A', 'BC', 'D', 'BC', 'D'],
})
print(df)```
``````   feature1  feature2 feature3
0         1         5        A
1         2         2       BC
2         5         3        D
3         4         1       BC
4         2         1        D``````

You can check the data-type of each column.

`df.dtypes`
``````feature1     int64
feature2     int64
feature3    object
dtype: object``````

There are 2 ways to access a column:

```print('Access column using dot symbol')
print(df.feature1)
print('Access column using brackets')
print(df['feature1'])```
``````Access column using dot symbol
0    1
1    2
2    5
3    4
4    2
Name: feature1, dtype: int64
Access column using brackets
0    1
1    2
2    5
3    4
4    2
Name: feature1, dtype: int64``````
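The two styles are not fully interchangeable. The sketch below uses a small hypothetical frame (`demo`, with column names picked to show the problem) to illustrate cases where only the bracket form works: a name containing a space, a name that clashes with an existing DataFrame method, and selecting several columns at once.

```python
import pandas as pd

# Hypothetical frame, used only for this illustration
demo = pd.DataFrame({'feature 1': [1, 2], 'count': [5, 2]})

col = demo['feature 1']                 # dot access cannot express a name with a space
counts = demo['count']                  # demo.count is the DataFrame.count method, not the column
subset = demo[['feature 1', 'count']]   # a list of names returns a DataFrame

print(counts.tolist())  # [5, 2]
```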

You can create a new column from existing ones like below:

```df['feature4'] = df['feature1'] + df['feature2']
print(df)```
``````   feature1  feature2 feature3  feature4
0         1         5        A         6
1         2         2       BC         4
2         5         3        D         8
3         4         1       BC         5
4         2         1        D         3``````

or fill a new column with a constant value:

```df['feature5'] = 100
print(df)```
``````   feature1  feature2 feature3  feature4  feature5
0         1         5        A         6       100
1         2         2       BC         4       100
2         5         3        D         8       100
3         4         1       BC         5       100
4         2         1        D         3       100``````

To remove a column:

```df = df.drop(['feature5'], axis=1)
print(df)```
``````   feature1  feature2 feature3  feature4
0         1         5        A         6
1         2         2       BC         4
2         5         3        D         8
3         4         1       BC         5
4         2         1        D         3``````

Look at the first few rows:

`print(df.head(2))`
``````   feature1  feature2 feature3  feature4
0         1         5        A         6
1         2         2       BC         4``````

Get an overview of our data:

`print(df.describe())`
``````       feature1  feature2  feature4
count  5.000000   5.00000  5.000000
mean   2.800000   2.40000  5.200000
std    1.643168   1.67332  1.923538
min    1.000000   1.00000  3.000000
25%    2.000000   1.00000  4.000000
50%    2.000000   2.00000  5.000000
75%    4.000000   3.00000  6.000000
max    5.000000   5.00000  8.000000``````

Apply a change to a column:

```df.feature1 = df.feature1.apply(lambda x : x + 1)
print(df)```
``````   feature1  feature2 feature3  feature4
0         2         5        A         6
1         3         2       BC         4
2         6         3        D         8
3         5         1       BC         5
4         3         1        D         3``````
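For simple arithmetic like this, `apply` is not strictly needed: operators on a column are already element-wise. The sketch below uses a small stand-alone frame (`demo`, an assumption made so the snippet runs on its own) to show that both forms give the same result. The vectorized form is usually shorter and faster, while `apply` remains useful for logic that cannot be written as column arithmetic.

```python
import pandas as pd

# Stand-alone copy of the original feature1 values
demo = pd.DataFrame({'feature1': [1, 2, 5, 4, 2]})

via_apply = demo['feature1'].apply(lambda x: x + 1)
via_vectorized = demo['feature1'] + 1

print(via_apply.equals(via_vectorized))  # True
```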

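Get correlations between each pair of numerical columns (recent Pandas versions need `numeric_only=True` to skip the text column `feature3`):

`print(df.corr(numeric_only=True))`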
``````          feature1  feature2  feature4
feature1  1.000000 -0.327327  0.569495
feature2 -0.327327  1.000000  0.590301
feature4  0.569495  0.590301  1.000000``````

Still, there are many more operations you can do with Pandas; the above only scratches the surface. See the Pandas references for a complete guide!

## Scikit-Learn

Scikit-Learn (sklearn for short) greatly simplifies working with Machine Learning algorithms and related functions. Data preprocessing, feature engineering, model selection, validation, and other tasks that would otherwise require complex algorithms and a lot of code can be done with sklearn in just a few lines.

`import sklearn`

For example, we can apply a standard scaler as below:
(applying a standard scaler subtracts the mean from the data and divides by the standard deviation, i.e. the z-score transformation.)

```from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df.feature1 = scaler.fit_transform(
    df.feature1.values.reshape((-1, 1))
).ravel()
print(df)```
``````   feature1  feature2 feature3  feature4
0 -1.224745         5        A         6
1 -0.544331         2       BC         4
2  1.496910         3        D         8
3  0.816497         1       BC         5
4 -0.544331         1        D         3``````
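To see what the scaler does under the hood, the z-score can be computed by hand with numpy. This sketch starts from the `feature1` values before scaling (2, 3, 6, 5, 3) and uses the population standard deviation (numpy's default), which is also what `StandardScaler` uses.

```python
import numpy as np

# feature1 values before scaling
x = np.array([2, 3, 6, 5, 3], dtype=float)

# z-score: subtract the mean, divide by the population standard deviation
z = (x - x.mean()) / x.std()

print(np.round(z, 6))  # [-1.224745 -0.544331  1.49691   0.816497 -0.544331]
```

After the transformation the column has mean 0 and standard deviation 1.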

One-hot encoding works in a similar way:

```from sklearn.preprocessing import OneHotEncoder
# sparse_output=False returns a dense array
# (on scikit-learn older than 1.2 the parameter is named sparse)
encoder = OneHotEncoder(sparse_output=False)
encoded_arr = encoder.fit_transform(
    df.feature3.values.reshape((-1, 1))
)
# categories_ lists the labels in the same order as the output columns
encoded_df = pd.DataFrame(encoded_arr,
    columns=encoder.categories_[0]
)
df = pd.concat((df, encoded_df), axis=1)
df = df.drop(['feature3'], axis=1)
print(df)```
``````   feature1  feature2  feature4    A   BC    D
0 -1.224745         5         6  1.0  0.0  0.0
1 -0.544331         2         4  0.0  1.0  0.0
2  1.496910         3         8  0.0  0.0  1.0
3  0.816497         1         5  0.0  1.0  0.0
4 -0.544331         1         3  0.0  0.0  1.0``````

Note from the above example that sklearn transformers return numpy arrays by default (not DataFrames), so after getting output from sklearn we need a bit of extra work to turn it back into a DataFrame.
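A minimal sketch of that extra work, assuming a small hypothetical frame named `demo`: wrap the returned array in a DataFrame and re-attach the original column names and index.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column frame, used only for this illustration
demo = pd.DataFrame({
    'feature1': [1, 2, 5, 4, 2],
    'feature2': [5, 2, 3, 1, 1],
})

scaled_arr = StandardScaler().fit_transform(demo)  # numpy array, shape (5, 2)

# Re-attach the original column names and row index
scaled_df = pd.DataFrame(scaled_arr, columns=demo.columns, index=demo.index)

print(type(scaled_arr).__name__)   # ndarray
print(scaled_df.columns.tolist())  # ['feature1', 'feature2']
```

Recent scikit-learn versions (1.2 and later) also offer `set_output(transform="pandas")` on transformers, which should make this step unnecessary.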

For the next example, let me train a predictive model!
We will use the first 2 columns (‘feature1’ and ‘feature2’) as predictor variables and ‘feature4’ as the target variable. We fit a Linear Regression model on the first 2 samples and get predictions for the rest. (Two samples are far too few for a meaningful model; this only illustrates the API.)

```from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(df[['feature1', 'feature2']][:2], df['feature4'][:2])```
``````LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)``````

We have done the training. You can print out the coefficients of the regression as below:

`print("coefficients: ", reg.coef_)`
``coefficients:  [-0.14380566  0.63405088]``

And make predictions for the remaining rows:

`reg.predict(df[['feature1', 'feature2']][2:])`
``array([4.34050881, 3.1702544 , 3.36594912])``
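A prediction from a linear regression is just the dot product of the features with the coefficients, plus the intercept. The sketch below checks this on a small hypothetical dataset (`X`, `y` and `X_new` are made up for illustration and are not the DataFrame above), built so that the relationship y = 2*x1 + 3*x2 + 1 holds exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data where y = 2*x1 + 3*x2 + 1 holds exactly
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3]], dtype=float)
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

reg = LinearRegression().fit(X, y)

X_new = np.array([[5.0, 5.0]])

# Reproduce predict() by hand: features times coefficients, plus intercept
manual = X_new @ reg.coef_ + reg.intercept_

print(reg.predict(X_new), manual)
```

Both values should agree (26 here, up to floating-point rounding), which shows where `reg.coef_` and `reg.intercept_` enter the prediction.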

That’s so simple and easy to use, right?