Introduction to Python and Jupyter


Hi everybody!

In this blog post, I would like to introduce you guys to Python, the most common programming language for practicing Machine Learning and Data Mining (ML and DM) today, and to Jupyter, a convenient environment for writing Python code.

Python

Let’s start with Python.

How to choose a language?

When selecting a programming language to learn and use, what are your criteria? Let me guess, some of the most crucial factors should be:

Easy to learn. Apart from those who enjoy learning languages and exploring the differences among them, most of us prefer to spend less time on the language itself and more time on the content of our programs. For Machine Learning researchers like us, whose priority is predictive algorithms rather than coding, a language that is simple and easy to pick up is a great choice.
Versatile. We don’t want a calculator that supports only addition; it is much better if it also provides subtraction, division, multiplication, and even exponentiation, logarithms, etc. The more versatile our programming language is, the less often we need to learn another language to write a different kind of program. In ML and DM, versatility means being able to handle the various phases of data processing as well as supporting many algorithms.
Fast. Of course, our time is limited. Just as we want our Wi-Fi to be fast and our homework done quickly (to have time to watch TV), we also want a language that runs as fast as possible. Note that, often, the simpler a language looks to us, the slower it runs, so there is a trade-off.

With the criteria listed above, Python seems to be the strongest candidate for our purpose.

Python’s basic properties

Python is a high-level, interpreted and general-purpose programming language, first released by Guido van Rossum in 1991.

High-level: the language hides most of the hardware details from you. The higher the level, the closer the programming language is to our human languages; the lower the level, the closer it is to the hardware and to raw binary (0/1). High-level languages are generally easier to learn.
Interpreted: there are essentially 2 types of programming languages, interpreted and compiled.
With a compiled language, you first write all your code, then instruct your computer to compile it (i.e. translate the whole program into a form the machine can execute), and only then does the computer run your program. An interpreted language works differently: you can write several lines of code, let your computer execute those lines, write some more lines, have those executed too, and so on.
An interpreted language is suitable if your next action depends heavily on the output of the previous one. In our ML and DM context, for example, when we analyze data, the next step (like transforming a categorical variable into a numerical variable) should only be executed if the previous check (is the variable indeed categorical?) gives the expected answer (yes, the variable really is categorical).
General-purpose: Python was not created for one specific purpose; it suits many. While using Python for ML and DM is our aim, many others use Python to develop websites, write shell scripts, automate system tasks, etc.
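The check-then-transform workflow described above can be sketched in plain Python. This is only an illustration: the `colors` data and the `is_categorical` helper are made up for this example, and treating "all values are strings" as "categorical" is a simplifying assumption. In real projects you would probably rely on a library such as pandas.

```python
# A made-up column of data (hypothetical example values).
colors = ["red", "green", "red", "blue", "green"]


def is_categorical(values):
    """Hypothetical helper: treat a column as categorical if every value is a string."""
    return all(isinstance(v, str) for v in values)


# Step 1: inspect the data first.
column_is_categorical = is_categorical(colors)
print(column_is_categorical)

# Step 2: transform only after step 1 has confirmed the column is categorical.
if column_is_categorical:
    # Give each distinct category an integer code, in alphabetical order.
    codes = {category: index for index, category in enumerate(sorted(set(colors)))}
    encoded = [codes[c] for c in colors]
    print(encoded)
```

In an interactive session you would run step 1, look at the result, and only then decide whether to write and run step 2.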

For ML and DM, there is a very big community of researchers and developers working in Python, together with a huge number of free, reliable, open-source libraries that can fulfill most of our needs (e.g. NumPy, pandas, scikit-learn and Matplotlib).

Version

To say a little bit about its history: the first version of Python was released in 1991, Python 2 followed in 2000, and the newest major version, Python 3, came out in 2008.

Note that Python 3 is not fully backward-compatible with Python 2, so be careful if you want to migrate your code from Python 2 to 3.
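To give a feeling for what "not fully backward-compatible" means, here is a small sketch of two well-known differences, written as Python 3 code. The variable names are just for illustration, and there are many more differences than these two.

```python
# Python 3: print is a function, so the parentheses are required.
# In Python 2, `print "Hello"` was a valid statement; in Python 3 it is a SyntaxError.
print("Hello")

# Python 3: `/` always performs true division, even on two integers.
# In Python 2, 7 / 2 evaluated to 3 (integer division).
true_quotient = 7 / 2

# `//` is floor division and gives 3 for these operands in both versions.
floor_quotient = 7 // 2

print(true_quotient, floor_quotient)
```

Code that relies on the old behavior of `/` still runs in Python 3 without any error, but it can silently produce different numbers, which is why a migration deserves some care.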

A brief comparison with R

R is another language that is frequently used for ML and DM. In the past, R, not Python, was the most popular choice in this field. This has changed only in the last several years, as Python not only jumped to the top in ML and DM but also became the second-most favored programming language in the world in general.

R is well supported by a large number of useful libraries for statistical analysis, and it has a long history as an essential research tool, so why is R being surpassed by Python? I have used both R and Python, and the following are the reasons why I prefer Python over R:

Python is more intuitive. At least for newcomers, Python is simpler and easier to learn. R seems to suit experienced scientists better.
Python is more product-oriented. R is mainly meant for research, and it can be quite hard to integrate R code into your product, while Python is a general-purpose language and you can easily make your code a part of your pipeline.
Python is generally faster than R, at least in my experience.

In conclusion, I suggest that you guys, especially if you are new to this field and are wondering what language to use, go with Python, or more specifically, Python 3 (as Python 2 is old and will no longer be supported after 1 Jan 2020).


All the code I will be using in my blog posts on Machine Learning and Data Mining is in Python 3, and most of it is written in an environment called Jupyter Notebook (or its newer version, Jupyter Lab). Hence, let me introduce Jupyter to you guys.

Jupyter

Jupyter Notebook is a web-based interactive environment for writing and running Python code.

Jupyter Lab is an updated version of Jupyter Notebook with additional features (e.g. opening multiple notebooks in the same tab), which make it more similar to a general IDE.

For simple usage, Jupyter Notebook seems to be enough. However, as things evolve, we should also adapt. It is quite likely that Jupyter Lab will eventually replace the old Notebook, given that it is backward-compatible with Jupyter Notebook.

Below is a screenshot of a sample notebook created with Jupyter Notebook.

You can see that it is quite simple. At the top is the name of our notebook, followed by a toolbar and, lastly, the notebook itself, which consists of many cells. Right below each cell is its corresponding output. For example, in the first code cell, I write a command to print 'Hello friends', and the greeting appears just below this cell.

A cell can contain Python code (of course) or Markdown, which is essentially formatted text. You can see that in my notebook above, I use 2 cells as Markdown: the first contains 'My introduction to Jupyter' and the second contains 'Example of shared-memory'.

Remember that even though we can run each cell separately, all cells share the same memory space. This is different from running separate programs, each of which gets its own memory: here, all the cells in the same notebook share a common memory. For example, in the second cell, I declare a variable x and assign the value 100 to it. This value of x is then queried in the third cell, which shows that the second and third cells share the same memory space.
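The shared-memory behavior can be sketched as follows, with each notebook cell shown as a commented section of a single script. Cells 2 and 3 follow the description above; cell 4 is an extra step I add here purely for illustration.

```python
# --- Cell 2: declare a variable ---
x = 100

# --- Cell 3: run separately, but it still sees x from cell 2 ---
print(x)

# --- Cell 4 (added for illustration): a change made here is visible
# --- to every cell that is run afterwards
x = x + 1
print(x)
```

One consequence is that the order in which you run the cells matters: running cell 3 before cell 2 in a fresh notebook fails with a NameError, because x does not exist yet.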

You can try using Python with an IDE (like Spyder or PyCharm) and compare the experience with using Jupyter. If it is for data mining, data analysis or machine learning, I bet you will love Jupyter. It is much more comfortable.

For more instructions on how to use Jupyter, you can click on Help in the toolbar or go to Jupyter’s website.

Install Python and Jupyter

There are primarily 2 ways to install Python and Jupyter:

Manual installation. You can go to Python’s website and download the most suitable Python installer for your machine, depending on your OS (e.g. Windows or Ubuntu), 32-bit or 64-bit, Python 2 or Python 3 (Python 3, please, as Python 2 is being decommissioned), etc.
After you download and install Python, you can use pip to install Jupyter, just like installing any other library.
Package installation (recommended). You can install a distribution (e.g. Anaconda) that bundles all the common libraries for Python (Jupyter is included as well). This saves you a lot of time finding compatible versions of those libraries when you need them. If you have Anaconda on your machine, most of the frequently used libraries for DM and ML are already installed, for example, numpy, scipy, matplotlib, pandas, seaborn, statsmodels and scikit-learn.
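Whichever route you choose, a quick way to see which of the libraries named above are available is to ask Python itself. The sketch below uses only the standard library; the `libraries` list and the `status` dictionary are names I picked for this example.

```python
import importlib.util

# Import names of the libraries mentioned above.
# Note that scikit-learn is imported under the name "sklearn".
libraries = ["numpy", "scipy", "matplotlib", "pandas", "seaborn", "statsmodels", "sklearn"]

# find_spec returns None when a top-level module cannot be found,
# so this checks availability without actually importing anything.
status = {name: importlib.util.find_spec(name) is not None for name in libraries}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

With a manual installation, you would expect several of these to show up as missing until you install them one by one.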


Conclusion

Python is a simple, easy-to-learn, versatile and “quite” fast programming language, which makes it arguably the best choice for Data Mining and Machine Learning at the moment.

Jupyter Notebook (or Lab) is a lightweight, simple and comprehensive environment for managing your data analysis code.

To install Python, as well as Jupyter and other general libraries for ML and DM, I recommend downloading and using Anaconda, a single package that contains everything you need.

References:

Wikipedia about Python: link
Python official website: link
Jupyter official website: link
Anaconda official website: link
Business insider’s list of top popular programming languages: link

Tung M Phung's Blog
