Introduction to Anaconda, Jupyter and Python Programming
The basics of working with Python to manipulate data
What is Object Oriented Programming?
Python is an Object Oriented Programming language, but what does this actually mean? “Object-oriented” is the name of a programming paradigm, which is a concept or idea about how a programming language should be used to solve specific problems. There are many programming paradigms, and a single language may support more than one of them.
Why care about programming paradigms?
Anytime we use a programming language to solve a problem, we are building software. When we are building software, we need to consider a few things:
Usability: who is the intended user, and is the software designed in a user-friendly way? This is where User Experience Design (UX Design) and Product Management become relevant, because if software has a “front-end” (i.e., a user-facing interface), we need to build it in a way that works for the people using it. Consider your own experience with certain programs or websites - do you ever find yourself thinking that a layout doesn’t make sense or that a critical piece of information is missing? This is what UX design tries to solve.
Scalability: is the software designed in a way that can manage the volume required today, a year from now, and five years from now? How fast is the business expected to grow? How easy is it to add new functions and features? Technical debt is often costly and a major concern for businesses. The term refers to the ongoing cost of code, often legacy (old) code, that requires a lot of human, storage and processing resources to maintain. Thus, it’s imperative that we build solutions which can scale with the business and minimize the amount of technical debt.
Maintenance: how easy is it to make updates and changes to the software? Depending on the dynamics and complexity of the business, more or less regular maintenance might be required.
Of course, there are many more considerations when we are building software, but programming paradigms provide us with frameworks to program software components in efficient, scalable, maintainable ways.
Types of Programming Paradigms
Imperative programming is a style of programming where detailed instructions are given to a computer on how exactly to execute a program. For example:
Establish a database connection
Connect to database
Write a SQL query
Execute the SQL query
Download the SQL query output
Store the query output in a DataFrame
Filter the DataFrame for a specific requirement
Calculate mean of column A
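The steps above can be sketched as one top-to-bottom script. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the real database, the table name `sales` and its columns `region` and `a` are hypothetical, and a plain list of rows stands in for the DataFrame so that the sketch needs nothing beyond the standard library.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real database.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sales (region TEXT, a REAL)")
connection.executemany(
    "INSERT INTO sales (region, a) VALUES (?, ?)",
    [("north", 10.0), ("south", 20.0), ("north", 30.0)],
)

# Write and execute the SQL query, then fetch its output.
query = "SELECT region, a FROM sales"
rows = connection.execute(query).fetchall()
connection.close()

# Filter the output for a specific requirement (here: the north region).
north_rows = []
for region, a in rows:
    if region == "north":
        north_rows.append((region, a))

# Calculate the mean of column a, one explicit step at a time.
total = 0.0
for region, a in north_rows:
    total += a
mean_a = total / len(north_rows)
print(mean_a)  # 20.0
```

Every instruction is spelled out in order, and nothing is packaged for reuse. That is the defining trait of the imperative style.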
Procedural programming takes imperative programming a step further: instead of listing every instruction one after another, the programmer groups the instructions into sub-processes, or functions. For example:
Download data from a SQL database
establish database connection
connect to database
write and execute the SQL query
Store data in a DataFrame
download SQL query
convert to a DataFrame
filter DataFrame
calculate mean of column A
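The same task might look like this once the steps are grouped into functions. As before, this is a sketch rather than a reference implementation: the function names, the in-memory SQLite database, and the `sales` table are all hypothetical, and a list of dictionaries stands in for the DataFrame.

```python
import sqlite3


def download_data(query):
    """Connect to the database, run the query, and return column names and rows."""
    # Assumption: an in-memory SQLite database stands in for the real database.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE sales (region TEXT, a REAL)")
    connection.executemany(
        "INSERT INTO sales (region, a) VALUES (?, ?)",
        [("north", 10.0), ("south", 20.0), ("north", 30.0)],
    )
    cursor = connection.execute(query)
    column_names = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    connection.close()
    return column_names, rows


def store_data(column_names, rows):
    """Convert raw rows into a list of dictionaries (a stand-in for a DataFrame)."""
    return [dict(zip(column_names, row)) for row in rows]


def filter_records(records, region):
    """Keep only the records for one region."""
    return [record for record in records if record["region"] == region]


def calculate_mean(records, column):
    """Return the mean of one column."""
    return sum(record[column] for record in records) / len(records)


column_names, rows = download_data("SELECT region, a FROM sales")
records = store_data(column_names, rows)
north = filter_records(records, "north")
print(calculate_mean(north, "a"))  # 20.0
```

Each function can now be reused or changed on its own, which is the main practical gain over the single long script.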
Functional programming prioritizes functions and introduces the concept of pure functions. A pure function relies only on its inputs to produce a result, and given the same inputs, it produces exactly the same result each time. For example, we may define a function to calculate the average of a list of values; we can then apply this function to any list of values. The function creates an internal (local) variable called “result”; once the function finishes, that variable disappears, and the function never modifies anything outside itself. Here is how we might write something like this:
def calculate_mean(values):
    # "result" is local: it exists only while the function runs
    result = sum(values) / len(values)
    return result
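To make the idea of purity concrete, the sketch below contrasts the mean function with a deliberately impure one. The names `running_total` and `add_to_total` are hypothetical and exist only for this illustration.

```python
def calculate_mean(values):
    # Pure: the output depends only on the input, and nothing outside is touched.
    result = sum(values) / len(values)
    return result


running_total = 0  # state that lives outside any function


def add_to_total(value):
    # Impure: reads and modifies a variable defined outside the function.
    global running_total
    running_total += value
    return running_total


print(calculate_mean([1, 2, 3]))  # 2.0
print(calculate_mean([1, 2, 3]))  # 2.0 again: same input, same output
print(add_to_total(5))  # 5
print(add_to_total(5))  # 10: same input, different output
```

The second function gives a different answer on each call because it depends on hidden state, which is exactly what pure functions avoid.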
Finally, object-oriented programming organizes code into objects. Each object has some information (properties or attributes) and some actions (functions or methods) that it can perform. We use a concept called a class, which establishes the blueprint of how an object should be organized. Then, we can create objects (or instances) from each class. For example, we can define a class which has certain features - in the case below, rows and columns - and certain functions, for example, calculating the mean or the standard deviation of a column. Then, we can create one DataFrame for a SQL query that returns customer data, and a separate DataFrame for a SQL query that returns transaction data. Both DataFrames will have the same characteristics (e.g. rows and columns) and access to the same functions (e.g. we can calculate column means in both).
class DataFrame(
features: rows, columns
functions: calculate mean, calculate standard deviation)
customer_data = DataFrame(SQL query)
transaction_data = DataFrame(SQL query)
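A runnable version of this outline might look like the sketch below. It is a toy class written for illustration, not the pandas DataFrame; the constructor takes column names and rows directly instead of a SQL query, and the sample data is made up.

```python
import statistics


class DataFrame:
    """Toy blueprint for tabular data (an illustration, not the pandas class)."""

    def __init__(self, columns, rows):
        # Features (attributes): every instance stores its own columns and rows.
        self.columns = columns
        self.rows = rows

    def _column_values(self, column):
        # Helper: collect all values of one column by its position.
        index = self.columns.index(column)
        return [row[index] for row in self.rows]

    def calculate_mean(self, column):
        return statistics.mean(self._column_values(column))

    def calculate_standard_deviation(self, column):
        return statistics.stdev(self._column_values(column))


# Two separate objects (instances) created from the same class.
customer_data = DataFrame(["customer_id", "age"], [(1, 30), (2, 40), (3, 50)])
transaction_data = DataFrame(["transaction_id", "amount"], [(101, 12.5), (102, 7.5)])

print(customer_data.calculate_mean("age"))
print(transaction_data.calculate_mean("amount"))
print(customer_data.calculate_standard_deviation("age"))
```

Both objects hold different data but share the same methods, because both were built from the same blueprint.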
Using Different Programming Paradigms
Python is often described as an object-oriented programming (OOP) language because the data structures we use, such as lists and strings, are objects created from classes. However, we can also write procedural or functional code with Python. In practice, we will use some combination of paradigms to build programs, whether they are for data science, analytics, or automation. In a later module, we will discuss systems design which will touch on some of these strategies.
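A quick way to see this for yourself is to ask Python what class an everyday value belongs to. The short sketch below uses only built-in types; the variable names are arbitrary.

```python
numbers = [3, 1, 2]
greeting = "hello"

# Every value is an instance of some class.
print(type(numbers))  # <class 'list'>
print(type(greeting))  # <class 'str'>
print(isinstance(greeting, str))  # True

# Methods are the actions that the class defines for its instances.
numbers.sort()
print(numbers)  # [1, 2, 3]
print(greeting.upper())  # HELLO

# The same program can also use a plain function in a procedural or functional style.
print(sum(numbers) / len(numbers))  # 2.0
```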
Integrated Development Environments for Python
To write Python code, we need a place to write and run it, such as an Integrated Development Environment (IDE). Some popular choices include VS Code, Jupyter Notebook, and PyCharm. For beginners, I usually recommend Jupyter Notebook because it is easy to use and allows us to write and run code iteratively. You can access Jupyter Notebook by downloading Anaconda, which installs Python on your computer along with Jupyter Notebook and JupyterLab. Anaconda also allows you to manage environments. Alternatively, you can download VS Code directly and use a package manager such as Homebrew (brew) to install Python and Jupyter Notebook; however, this is slightly more advanced. For now, we will stick with the basics.
What is Anaconda?
You can think of Anaconda as a “toolbox” for all of your Python needs. By downloading Anaconda, you can access various IDEs (with Jupyter being the most popular one), libraries and resources. Once Anaconda is downloaded, the home screen should look something like this:
From here, you can LAUNCH Jupyter Notebook or JupyterLab. They are more or less the same, except that JupyterLab lets you open multiple notebooks within a single browser tab, while Jupyter Notebook opens each notebook in a separate tab.
In the Environments tab, you should see a base (root) environment. Environments are isolated configurations of Python for specific projects. For example, suppose we are currently using Python 3.9, but a project relies on a function that was available in Python 3.8 and is no longer available in 3.9. You can create a new environment with Python 3.8 installed for that specific project. This gives you the flexibility to have multiple versions of Python, or of any Python library, on your device, which you can access for various programs.
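If you are ever unsure which environment a notebook is running in, Python can report it. The sketch below uses only the standard library. Reading the `CONDA_DEFAULT_ENV` environment variable is an assumption about how conda labels the active environment, so the code falls back to a placeholder message when the variable is not set.

```python
import os
import sys

# Version of the Python interpreter running this code, e.g. (3, 9).
python_version = sys.version_info[:2]

# Folder of the active Python installation or environment.
environment_path = sys.prefix

# Assumption: conda stores the active environment's name in this variable.
conda_environment = os.environ.get("CONDA_DEFAULT_ENV", "no conda environment detected")

print(python_version)
print(environment_path)
print(conda_environment)
```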
This is necessary because when we build software, we depend on existing language capabilities. For example, suppose that in Python 3.8 you could calculate the standard deviation and variance with one function, but in Python 3.9 this functionality were split into two separate functions. If you upgraded Python on your device and could no longer access 3.8, any software which uses the 3.8 function would need to be changed. And of course, software has many lines of code - it’s not realistic to change every line just because a new version of Python has been released.
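The standard deviation scenario above is hypothetical, but a real case of version dependence is `statistics.fmean`, which was added to the standard library in Python 3.8. The sketch below shows one way code might guard against running on an older version; the sample values are made up.

```python
import statistics
import sys

values = [1.5, 2.5, 3.5]

if sys.version_info >= (3, 8):
    # statistics.fmean only exists from Python 3.8 onwards.
    mean = statistics.fmean(values)
else:
    # Older interpreters fall back to the function that has existed longer.
    mean = statistics.mean(values)

print(mean)  # 2.5
```

Guards like this work for a single call, but they do not scale to a whole codebase, which is why pinning a Python version per environment is usually the more practical approach.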
The same issue exists with libraries. Sometimes, libraries are not compatible with new versions of Python because the library developers may not have caught up to the changes made by Python developers. This is the nature of open-source software. Again, we would want to have an environment which works with the various versions which are compatible with each other.
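To check which version of a library the current environment provides, one option is the standard library's `importlib.metadata` module (available from Python 3.8). The helper name `installed_version` and the package names in the loop are only examples; any of them may be missing on your machine, in which case the helper returns `None`.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Example package names; replace them with the libraries your project uses.
for name in ["pandas", "numpy"]:
    print(name, installed_version(name))
```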
You won’t have to worry much about environments as a beginner, but once you begin working on more complex projects, this will become necessary.
What is Jupyter Notebook?
Jupyter Notebook is an IDE that allows us to write and execute Python code. It also works with R, but for the purpose of this module we will focus on Python. You can try Jupyter online first, without downloading Anaconda, to get a feel for the software. After opening Jupyter, from Anaconda or the online version, you should see a screen like this. Code is written in an input cell, and the output is displayed immediately below it when the cell is run. Try it yourself!
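If you are not sure what to type, here is a small example to paste into the first cell and run. The numbers are arbitrary.

```python
values = [2, 4, 6]
mean = sum(values) / len(values)
print(mean)  # 4.0
```

The printed result should appear directly under the cell. Change the numbers in the list and run the cell again to see the output update.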
In the next module, we will dive into the basics of Python syntax and code structures.