Mining from data is not a simple task; sometimes it can get very complex. Instead of writing all the complicated code yourself, you can use pre-built code, written by experts and given to us in the form of libraries. Using libraries has many advantages.
In this blog, I introduce 4 of the most popular Python libraries for data mining:
Numpy
Pandas
Scikit-learn
Matplotlib
Numpy
Numpy is a math library that supports many operations on arrays, from simple to complex.
import numpy as np
Show some basic statistics of an array:
arr = [2, 5, 4, 9, 2, 5, 2, 4, 3]
print("Sum of arr: ", np.sum(arr))
print("Mean of arr: ", np.mean(arr))
print("Median of arr: ", np.median(arr))
print("Standard deviation of arr: ", np.std(arr))
print("Variance of arr: ", np.var(arr))
print("Max value of arr: ", np.max(arr))
Sum of arr: 36
Mean of arr: 4.0
Median of arr: 4.0
Standard deviation of arr: 2.1081851067789197
Variance of arr: 4.444444444444445
Max value of arr: 9
We can create arrays using numpy.
# Create an array
# values run from 5 up to, but not including, 20
# consecutive values increase by 2
np.arange(5, 20, 2)
array([ 5, 7, 9, 11, 13, 15, 17, 19])
# Create an array of 5 elements
# values are from 3 to 10, equally spaced
np.linspace(3, 10, 5)
array([ 3. , 4.75, 6.5 , 8.25, 10. ])
# Create array of all zeros with customized size
np.zeros((2, 3))
array([[0., 0., 0.],
[0., 0., 0.]])
# Create array of customized size
# with the same initial values
np.full((3, 2), 10)
array([[10, 10],
[10, 10],
[10, 10]])
Numpy can do element-wise operations.
np.log(arr)
array([0.69314718, 1.60943791, 1.38629436, 2.19722458, 0.69314718,
1.60943791, 0.69314718, 1.38629436, 1.09861229])
np.sin(arr)
array([ 0.90929743, -0.95892427, -0.7568025 , 0.41211849, 0.90929743,
-0.95892427, 0.90929743, -0.7568025 , 0.14112001])
np.power(arr, 2)
array([ 4, 25, 16, 81, 4, 25, 4, 16, 9])
Combining arrays:
b = [[2, 4, 6], [3, 5, 7]]
c = [[10, 15, 20], [25, 30, 35]]
# you can specify axis for combination
# axis=0 means row-wise
np.concatenate([b, c], axis=0)
array([[ 2, 4, 6],
[ 3, 5, 7],
[10, 15, 20],
[25, 30, 35]])
# axis=1 means column-wise
np.concatenate([b, c], axis=1)
array([[ 2, 4, 6, 10, 15, 20],
[ 3, 5, 7, 25, 30, 35]])
Reshape an array:
np.reshape(b, (1, 6))
array([[2, 4, 6, 3, 5, 7]])
Above are the most commonly used Numpy operations. There are many, many others (the list seems endless to me) that you can pick from to suit your needs. If you want to know more about Numpy, take a look at the Numpy reference documentation.
Pandas
Pandas, short for Panel Data, is a library for representing data as a data-frame.
When studying and practicing data mining, we often have a dataset that is best presented as a table, where each row is a sample and each column is a feature. This kind of data is splendidly supported by Pandas. Using Pandas, you can easily handle and wrangle your data.
import pandas as pd
To load a dataset into a Pandas DataFrame, the 2 most common ways are:
# Input from a csv file
df = pd.read_csv('dataset.csv')
# Input from a dict
df = pd.DataFrame({
    'feature1': [1, 2, 5, 4, 2],
    'feature2': [5, 2, 3, 1, 1],
    'feature3': ['A', 'BC', 'D', 'BC', 'D'],
})
print(df)
feature1 feature2 feature3
0 1 5 A
1 2 2 BC
2 5 3 D
3 4 1 BC
4 2 1 D
You can check the data-type of each column.
df.dtypes
feature1 int64
feature2 int64
feature3 object
dtype: object
There are 2 ways to access a column:
print('Access column using dot symbol')
print(df.feature1)
print('Access column using brackets')
print(df['feature1'])
Access column using dot symbol
0 1
1 2
2 5
3 4
4 2
Name: feature1, dtype: int64
Access column using brackets
0 1
1 2
2 5
3 4
4 2
Name: feature1, dtype: int64
You can simply create a new column like below:
df['feature4'] = df['feature1'] + df['feature2']
print(df)
feature1 feature2 feature3 feature4
0 1 5 A 6
1 2 2 BC 4
2 5 3 D 8
3 4 1 BC 5
4 2 1 D 3
or below:
df['feature5'] = 100
print(df)
feature1 feature2 feature3 feature4 feature5
0 1 5 A 6 100
1 2 2 BC 4 100
2 5 3 D 8 100
3 4 1 BC 5 100
4 2 1 D 3 100
To remove a column:
df = df.drop(['feature5'], axis=1)
print(df)
feature1 feature2 feature3 feature4
0 1 5 A 6
1 2 2 BC 4
2 5 3 D 8
3 4 1 BC 5
4 2 1 D 3
Look at the first few rows:
print(df.head(2))
feature1 feature2 feature3 feature4
0 1 5 A 6
1 2 2 BC 4
Get an overview of our data:
print(df.describe())
feature1 feature2 feature4
count 5.000000 5.00000 5.000000
mean 2.800000 2.40000 5.200000
std 1.643168 1.67332 1.923538
min 1.000000 1.00000 3.000000
25% 2.000000 1.00000 4.000000
50% 2.000000 2.00000 5.000000
75% 4.000000 3.00000 6.000000
max 5.000000 5.00000 8.000000
Apply a change to every value in a column:
df.feature1 = df.feature1.apply(lambda x: x + 1)
print(df)
feature1 feature2 feature3 feature4
0 2 5 A 6
1 3 2 BC 4
2 6 3 D 8
3 5 1 BC 5
4 3 1 D 3
Get correlations between each pair of numerical columns:
print(df.corr())
feature1 feature2 feature4
feature1 1.000000 -0.327327 0.569495
feature2 -0.327327 1.000000 0.590301
feature4 0.569495 0.590301 1.000000
Still, there are many, many more operations you can do with Pandas; the above only scratches the surface of a deep sea. Head over to the Pandas reference documentation for a complete guide!
Scikit-Learn
Scikit-Learn (or sklearn for short) greatly simplifies machine learning algorithms and related functions. Data preprocessing, feature engineering, model selection, validation testing, and other such complex tasks, which would normally require intricate algorithms and a lot of coding, can all be done with sklearn in just a few lines of code.
import sklearn
For example, we can apply a standard-scaler as below:
(Applying a standard scaler means each value has the mean subtracted and is then divided by the standard deviation, i.e. the z-score transformation.)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df.feature1 = scaler.fit_transform(
    df.feature1.values.reshape((-1, 1))
).ravel()
print(df)
feature1 feature2 feature3 feature4
0 -1.224745 5 A 6
1 -0.544331 2 BC 4
2 1.496910 3 D 8
3 0.816497 1 BC 5
4 -0.544331 1 D 3
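As a quick sanity check, we can reproduce the scaled column by hand. This snippet is my own verification sketch, not part of the original walkthrough; it relies on StandardScaler's default of using the population standard deviation (ddof=0).
# Verification sketch: recompute the z-score manually.
# feature1 held [2, 3, 6, 5, 3] just before scaling (after the earlier +1).
x = np.array([2, 3, 6, 5, 3], dtype=float)
z = (x - x.mean()) / x.std()   # np.std defaults to ddof=0, like StandardScaler
print(z)   # matches the scaled feature1 column above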
One-hot encoding works similarly:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse=False)  # in scikit-learn >= 1.2, this argument is named sparse_output
encoded_arr = encoder.fit_transform(
    df.feature3.values.reshape((-1, 1))
)
encoded_df = pd.DataFrame(
    encoded_arr,
    columns=encoder.categories_[0]  # use the encoder's own (sorted) categories for column names
)
df = pd.concat((df, encoded_df), axis=1)
df = df.drop(['feature3'], axis=1)
print(df)
feature1 feature2 feature4 A BC D
0 -1.224745 5 6 1.0 0.0 0.0
1 -0.544331 2 4 0.0 1.0 0.0
2 1.496910 3 8 0.0 0.0 1.0
3 0.816497 1 5 0.0 1.0 0.0
4 -0.544331 1 3 0.0 0.0 1.0
Note from the above example that sklearn returns its values as Numpy arrays (not DataFrames), so after getting output from sklearn, we need to do a bit of work to turn the output back into a DataFrame.
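As a general pattern (a minimal sketch of my own, not from the original example), you can wrap a transformer's Numpy output back into a DataFrame by supplying column names and reusing the original index:
# Hedged sketch: turn a transformer's numpy output back into a DataFrame.
# The variable names here are illustrative, not from the post.
scaled = StandardScaler().fit_transform(df[['feature1', 'feature2']])
scaled_df = pd.DataFrame(scaled, columns=['feature1', 'feature2'], index=df.index)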
For the next example, let me train a predictive model!
We will use the first 2 columns (‘feature1’ and ‘feature2’) as predictor variables, and ‘feature4’ as the target variable. We fit a Linear Regression model, training on the first 2 samples, and get predictions for the rest.
from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(df[['feature1', 'feature2']][:2], df['feature4'][:2])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
We have done the training. You can print out the coefficients of the regression as below:
print("coefficients: ", reg.coef_)
coefficients: [-0.14380566 0.63405088]
And make predictions:
reg.predict(df[['feature1', 'feature2']][2:])
array([4.34050881, 3.1702544 , 3.36594912])
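The top of this section also mentioned validation; as a hedged sketch (the test_size and random_state below are arbitrary illustrative choices, not from the original), a held-out evaluation could look like this:
from sklearn.model_selection import train_test_split

# Illustrative sketch: hold out part of the data for evaluation.
# test_size=0.4 and random_state=0 are arbitrary example values.
X = df[['feature1', 'feature2']]
y = df['feature4']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
model = linear_model.LinearRegression().fit(X_train, y_train)
print("R^2 on held-out samples:", model.score(X_test, y_test))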
That’s so simple and easy to use, right?
Check out the Scikit-learn reference documentation to learn more about this library.
Matplotlib
Matplotlib is a comprehensive library for visualization in Python. It supports most of the basic plots that we need when starting with data science.
As this post is already pretty lengthy, and as I have previously published a post about Matplotlib, please follow that post to have a look at how Matplotlib works and to see some simple examples.
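Just to give a small taste here, a minimal sketch (the column choices below are my own illustration, not from that post) could look like:
import matplotlib.pyplot as plt

# Minimal illustrative sketch: scatter two columns of our DataFrame.
plt.scatter(df['feature1'], df['feature4'])
plt.xlabel('feature1 (standardized)')
plt.ylabel('feature4')
plt.title('feature4 vs. feature1')
plt.show()
Even this small snippet shows how little code a basic plot requires.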