Data Analysis with Python - Full Course for Beginners (Numpy, Pandas, Matplotlib, Seaborn)
FULL TRANSCRIPT
Welcome to our data analysis with Python tutorial. My name is Santiago and I will be your instructor.
This is a joint initiative between freeCodeCamp and RMOTR. In this tutorial, we'll
explore the capabilities of Python and the entire PyData stack to perform data analysis.
We'll learn how to read data from multiple sources such as databases, CSV and Excel files,
how to clean and transform it by applying statistical functions, and how to create beautiful
visualizations. We'll show you all the important tools of the PyData stack: pandas, matplotlib,
seaborn and many others. This tutorial is going to be useful both for Python beginners
who want to learn how to manage data with Python, and for traditional data analysts
coming from Excel, Tableau, etc. You'll learn how programming can power up your day-to-day
analysis. So let's get started.
Welcome to our data analysis with Python tutorial. My name is Santiago and I am an instructor at RMOTR,
an online data science academy. This tutorial is the result of a joint effort by RMOTR and
freeCodeCamp, and it's totally free. It includes slides, Jupyter notebooks and coding
exercises. Let me tell you a little bit more about RMOTR. We're an online, hands-on data
science academy. We specialize in data science, including data analysis, programming and machine
learning. We have a complete course catalog and we're adding more content every month.
If you're interested in learning data science or data analysis, check us out. As part of
this joint effort between freeCodeCamp and RMOTR, you can get a 10% discount on your
first month by using the following discount coupon. Let's quickly review the contents
of this tutorial. In the description of this video, we have included direct links to each
section, so you can jump between them. This is the first section and we are going to discuss
what data analysis is. We'll also talk about data analysis with Python and why programming
tools like Python, SQL and pandas are important. In the following section, we'll show you a real
example of data analysis using Python, so you can see the power of it. We will not explain
the tools in detail. It's just a quick demonstration for you to understand what this tutorial is
about. The following sections will be the ones explaining each tool in detail. There
are two more sections that I want to especially point out. The first one is section number
three, the Jupyter tutorial. This is not mandatory, and you can skip it if you already know how
to use Jupyter notebooks. Also, there is the last section, Python in under 10 minutes. This is just a
recap of Python. If you're coming from other languages, you might want to take this first.
If that's the case, again, you can use the links in the video description to jump straight
to it. All right, now let's define what data analysis is. I think the Wikipedia article
summarizes it perfectly: a process of inspecting, cleansing, transforming and modeling data
with the goal of discovering useful information, informing conclusions and supporting decision
making. Let's analyze this definition piece by piece. The first part of the process of
data analysis is usually tedious. It starts by gathering the data, cleaning it and
transforming it for further analysis. This is where Python and the PyData tools excel.
We're going to be using pandas to read, clean and transform our data. Modeling data means
adapting real-life scenarios to information systems, using inferential statistics to see
if any pattern or model arises. For this we're going to be using the statistical analysis
features of pandas and visualizations from matplotlib and seaborn. Once we have processed
the data and created models out of it, we'll try to derive conclusions from it, finding interesting
patterns or anomalies that might arise. The word information here is key. We're trying
to transform data into information. Our data might be a huge list of all the purchases
made in Walmart in the last year. The information will be something like "Pop-Tarts sell better
on Tuesdays". This is the final objective of data analysis. We need to provide evidence of our
findings, create readable reports and dashboards, and aid other departments with the information
we've gathered. Multiple actors will use your analysis: marketing, sales, accounting, executives,
etc. They might need to see a different view of the same information. They might all need
different reports or levels of detail. What tools are available today for data analysis?
We've broken these down into two main categories. Auto-managed tools are closed products, tools
you can buy and start using right out of the box. Excel is a good example. Tableau and
Looker are probably the most popular ones for data analysis. At the other extreme, we
have what we call programming languages, or we can call them open tools. These are not sold
by an individual vendor; they are a combination of languages, open source libraries and products.
Python, R and Julia are the most popular ones in this category. Let's explore the advantages
and disadvantages of them. The main advantage of closed tools like Tableau or Excel is that
they are generally easy to learn. There is a company writing documentation, providing
support and driving the creation of the product. The biggest disadvantage is that the scope
of the tool is limited; you can't cross its boundaries. In contrast, using Python
and the universe of PyData tools gives you amazing flexibility. Do you need to read data
from a closed API using secret key authentication, for example? You can do it. Do you need to
consume data directly from AWS Kinesis? You can do it. A programming language is the most
powerful tool you can learn. Another important advantage is the general scope of a programming
language. What happens if Tableau, for example, goes out of business? Or if you just get bored
of it and feel like your career is stuck and you need a career change? Learning how to
process data using a programming language gives you freedom. The main disadvantage of
a programming language is that it's not as simple to learn as a tool. You need to
learn the basics of coding first, and it takes time. Why are we choosing Python to do data
analysis? Python is the best programming language to learn to code. It's simple, intuitive
and readable. It includes thousands of libraries to do virtually anything, from cryptography
to IoT. Python is free and open source. That means that there are thousands of eyes, very smart
people, looking at the internals of the language and the libraries. From Google to Bank of America,
major institutions rely on Python every day, which means that it's very hard for it to
go away. Finally, Python has a great open source spirit. The community is amazing, the
documentation is exhaustive, and there are a lot of free tutorials around. Check out
conferences in your area; it's very likely that there is a local group of Python developers
in your city. We couldn't be talking about data analysis without mentioning R. R is also
a great programming language. We prefer Python because it's easier to get started with and more
general in the libraries and tools it includes. R has a huge library of statistical functions,
and if you're in a highly technical discipline, you should check it out. Let's quickly review
the data analysis process. The process starts by getting the data. Where is your data coming
from? Usually it's in your own database, but it could also come from files stored in a
different format, or a web API. Once you've collected the data, you will need to clean
it. If the source of the data is your own database, then it's probably in the right shape.
If you're using more extreme sources like web scraping, then the process will be more
tedious. With your data clean, you'll now need to rearrange and reshape the data for
better analysis: transforming fields, merging tables, combining data from multiple sources,
etc. The objective of this process is to get the data ready for the next step. The process
of analysis involves extracting patterns from the data that is now clean and in shape, capturing
trends or anomalies. Statistical analysis will be fundamental in this process. Finally,
it's time to do something with that analysis. If this was a data science project, we could
be ready to implement machine learning models. If we focus strictly on data analysis, we'll
probably need to build reports, communicate our results and support decision making.
Let's finish by saying that in real life, this process isn't so linear. We're usually
jumping back and forth between the steps, and it looks more like a cycle than a straight
line. What is the difference between data analysis and data science? The boundaries
between data analysis and data science are not very clear. The main differences are that
data scientists usually have more programming and math skills. They can then apply these
skills in machine learning and ETL processes. Analysts, on the other hand, have better
communication skills, creating better reports with stronger storytelling abilities. By the
way, the chart you're seeing right here is available in the notes in case you
want to check out the source code. Let's explore the Python and PyData ecosystem, all the
tools and libraries that we will be using. The most important libraries that we will
be using are pandas for data analysis, and matplotlib and seaborn for visualizations.
But the ecosystem is large and there are many useful libraries for specific use cases. How
do Python data analysts think? If you're coming from a traditional data analysis background, using
tools like Excel and Tableau, you're probably used to having a constant visual reference of
your data. All these tools are point-and-click. This works great for a small amount of data,
but it's less useful when the number of records grows. It's just impossible for humans to visually
reference too much data, and the processing gets incredibly slow. In contrast, when we
work with Python, we don't have a constant visual reference of the data we're working
with. We know it's there. We know what it looks like. We know the main statistical properties
of it, but we're not constantly looking at it. This allows us to work with millions
of records incredibly fast. This also means you can move your data analysis processes
from one computer to another, for example to the cloud, without much overhead. And finally,
why would you like to add Python to your data analysis skills? Aside from the advantages
of freedom and power, there is another important reason: according to PayScale, data analysts
who know Python and SQL are better paid than the ones who don't know how to use programming
tools. So that's it. Let's get started. In our following section, we'll show you a real-world
example of data analysis with Python. We want you to see right away what you will
be able to do after this tutorial.
We're gonna start this tutorial by working with a real example of data analysis and data
processing with Python. We're not going to get into the details yet. The following sections
will explain what each one of the tools does, what is the best way to apply and combine
them, and the details of them. In general, this is just for you to have a quick, high-level
reference of the day-to-day work of data analysts, data managers and data scientists using
Python. The first data set that we're going to use is a CSV file that has this form. You
can find it right here, under the data directory. The data we're going to use is this; I
have just transformed it into a spreadsheet so we can look at it from a more
visual perspective. But remember, as we said in the introduction, as data analysts we are
not constantly looking at the data. We don't have a constant visual reference.
We are more driven by the understanding of the data in the back of our heads:
we understand what the data looks like and what its shape is, and that's what is
conducting our analysis. So the first thing we're going to do is read
this CSV into Python, and you can see how simple it is: just one line of code gets us
the CSV read into Python. Then we're going to get a quick reference. This is what the
data frame that we have created looks like. Data frame is a special term; it's a special
data structure we use in pandas. Again, we're going to see that in detail in
the pandas part of this tutorial. The data frame is pretty much the CSV representation,
but it has a few more things enforced. For example, each column has a strict data
type, and we will not be able to change it, etc. It's a better way to conduct our
analysis. The shape of our data frame tells us how many rows and how many columns we have.
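A minimal sketch of that first step, assuming pandas is installed. The inline CSV text, its column names and its values are made-up stand-ins for the real file, which isn't reproduced in this transcript; with a real file you would pass its path to `read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for the CSV file in the data directory (rows and column names are made up).
csv_text = """Date,Customer_Age,Age_Group,Country,State,Order_Quantity,Unit_Cost,Unit_Price,Profit,Cost,Revenue
2016-01-05,19,Youth,Canada,British Columbia,8,45,120,590,360,950
2016-01-06,49,Adults,Australia,New South Wales,23,45,120,1366,1035,2401
2016-01-07,36,Adults,United States,Kentucky,4,420,769,1396,1680,3076
2016-01-08,28,Young Adults,France,Essonne,2,1,2,2,2,4
"""

# One line reads the CSV; parse_dates turns the Date column into datetimes.
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(sales.head())   # first rows, as a quick visual reference
print(sales.shape)    # (number of rows, number of columns)
print(sales.dtypes)   # every column carries a single data type
```
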
So you can imagine that with this number of rows, it's not so simple to follow
a visual representation of it. It's pretty much infinite scrolling at this point, with
100,000 rows. But the way we work is that, immediately after we load our data, we want to
find some sort of reference for the shape and the properties of the data we're working
with. For that, we're first going to do an info to quickly understand the columns
we're working with. In this case, we have date, which is a datetime field; we have
day, month and year, which are just complementary to date; we have the customer age, which is
an integer, which makes sense, right? Age group, you can see it right here, it's age group
youth. Customer gender. We have an idea, again, of the entire data set. We know the
columns we have, and we also know how large it is. We don't care what's in between.
We will probably be cleaning it, but we don't need to actually start looking row by row
with our very limited eyes. We have a better understanding of the structure
of our data this way. And going one step further, we will also have a better understanding
of the statistical properties of this data frame with the describe method. For all the
numeric fields, I can have an idea
of their statistical properties. So for example, I know that the average age in
this data set is 35 years old. I also know that the maximum age in this
sales data is 87 years old, and I know the minimum is 17 years old. Again, I
can start building my understanding of the statistical properties of it. So
in this case, the median of my age is very close to the mean. This is
telling me something, and the same thing is going to happen for each one of the
columns that we are using.
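As a rough illustration of the info and describe steps, here is a small sketch. The data frame below is invented for the example (the column names are assumptions, not the real data set), so its statistics will not match the numbers quoted above.

```python
import pandas as pd

# Tiny made-up sample; the real data set has far more rows and columns.
sales = pd.DataFrame({
    "Customer_Age": [17, 28, 35, 35, 49, 87],
    "Age_Group": ["Youth", "Young Adults", "Adults", "Adults", "Adults", "Seniors"],
    "Unit_Cost": [1, 45, 45, 420, 1266, 2171],
    "Profit": [-30, 2, 590, 1366, 1396, 15096],
})

sales.info()                 # column names, non-null counts and data types
summary = sales.describe()   # count, mean, std, min, quartiles and max per numeric column
print(summary)

# Individual statistics are also available directly on a column.
age = sales["Customer_Age"]
print(age.mean(), age.median(), age.min(), age.max())
```

Comparing the mean with the median, or spotting a negative minimum in a column like profit, is the kind of quick check this summary makes possible without reading individual rows.
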
For example, we have a negative profit here, and we have very large values here. Are these
correct? Maybe there's a mistake. Again, by having a quick statistical view of
our data, we're going to be driving the process of analysis without the need to constantly
look at all the rows that we have. It's a more general, holistic overview.
So we're gonna start with unit cost. Let's see what it looks like. We're going
to do a describe only of unit cost, which is pretty much what we had right here in
the previous line. What we did there was for the entire data frame, for the entire data;
in this case, we're just focusing on the unit cost column. The mean, the
median, all those fields we already know pretty much from this, and we're gonna quickly plot
them. We're going to use these tools to visualize them. And it's the same tool: it's pandas
that's on top, right? It's using matplotlib. So the visualization is created
with matplotlib, but we're doing it directly from pandas. And again, don't worry, this
is all explained in the pandas lessons. So this is unit cost, right? This is the box plot
we have just created. We have the box, which shows us the first and
third quartiles and the median, the whiskers, and then we see all the outliers that we have right here.
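A sketch of the box plot step, assuming pandas and matplotlib are installed. The values are made up, and the 1.5 x IQR rule computed at the end is the common box plot convention for flagging outliers, shown here as an assumption about how the chart decides which points to draw separately.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; not needed inside a notebook

import pandas as pd

# Made-up unit costs standing in for the real column.
unit_cost = pd.Series([1, 2, 8, 13, 45, 45, 420, 1266, 2171], name="Unit_Cost")

print(unit_cost.describe())

# pandas delegates the drawing to matplotlib and returns the matplotlib Axes.
ax = unit_cost.plot(kind="box", figsize=(4, 6))

# Points further than 1.5 * IQR above the third quartile are treated as outliers.
q1, q3 = unit_cost.quantile(0.25), unit_cost.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = unit_cost[unit_cost > upper_fence]
print(outliers)
```
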
So we see that a product that costs around $500 is considered to be an outlier. And the
same thing if we do a density plot, right? So this is what it looks like. We're going
to draw two more charts, in which we're going to point out the mean and
the median right on the distribution charts. And we're going to do a quick histogram of
the costs of our products. Moving forward, we're going to talk about age groups and
the age of the customer. At any moment, we can always do something like sales.head()
here to get a quick reference. We know that the age of the customer is expressed in
actual years, but the customers have also been categorized into four
age groups: seniors, youth, young adults and adults. So these categories
were created to better understand these groups, and we can count them with
value_counts. We can quickly get a pie chart out of it, or we could get a bar chart out
of it. As you can see right here, we're doing an analysis of our data. We see that adults
are the largest group, in our data at least. So moving forward, what
about a correlation analysis? What is the correlation between some of our properties? We will probably
have high correlation, for example, between profit and unit cost, or order
quantity. That's kind of expected, but it's something that we can check right here. This
is the correlation matrix. Let's look for profit; profit is right here. Blue is
high correlation: the diagonal, which is blue, is where the correlation equals one.
So high correlation is blue, and we see that profit has a lot of
positive correlation with unit cost and unit price. Negative correlation is shown in dark
red. So we again can have a quick idea. Let's see, for example, here profit has negative
correlation with order quantity, which is interesting, right? We would dig deeper
into that. Of course, profit has a high positive correlation with revenue.
And again, it's just a quick correlation analysis. We can also do a quick scatter plot to analyze
the customer age and the revenue, to see if there is any correlation there.
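The correlation matrix behind a chart like this can be computed in one call. This is a minimal sketch with invented numbers and assumed column names; only the numeric columns are kept before calling `corr`, which computes the Pearson correlation by default.

```python
import pandas as pd

# Made-up sample; column names are assumptions for illustration.
sales = pd.DataFrame({
    "Order_Quantity": [8, 23, 4, 2, 30, 1],
    "Unit_Cost": [45, 45, 420, 1, 2, 1266],
    "Profit": [590, 1366, 1396, 2, 40, 1000],
    "Revenue": [950, 2401, 3076, 4, 100, 2266],
    "Age_Group": ["Youth", "Adults", "Adults", "Young Adults", "Adults", "Seniors"],
})

# Pairwise correlation between every pair of numeric columns.
corr = sales.select_dtypes("number").corr()
print(corr.round(2))

# A single pair can also be checked directly.
profit_vs_revenue = sales["Profit"].corr(sales["Revenue"])
print(profit_vs_revenue)
```

The diagonal is always 1 because each column is perfectly correlated with itself, which is why it shows up as the strongest color in the plotted matrix.
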
And the same thing for revenue and profit. This is obvious, right? We
can quickly draw a diagonal here, so there is a linear dependency
between these variables. So, a few more box plots, in this case for understanding the
profit per age group, so we can see how the profit changes depending
on the customer's age. And a few more box plots: we're creating this grid
of year, customer age, unit cost, etc., for multiple things. So moving forward, something
that we can quickly do when we're working with Python, especially with pandas, is reshape
our data or derive new columns from other columns. This is pretty common in Excel.
We can create this revenue per age column. If you're here in Google Sheets, you're
going to do something like revenue per age, and you're going to do something like equals
revenue divided by age. I don't remember if this is the correct formula we're using, but it's just
for you to have a reference. And we're going to pretty much extend this whole thing.
There we go. Oh, well, it's processing, and I have 100,000 rows, so you can see how slow
it is. Let's compare that to the way Python works. I'm gonna execute this thing.
It was instant, you know, extremely fast. It was all calculated, and it seems that we have
the same results, as expected. And we can quickly plot it both
in a density plot and in a histogram, as you can see right there. Not that revenue
per age is going to be relevant in any case; it's just to show you the capabilities of
what we can do. Let's analyze... well, we're gonna create a new column, which is
calculated cost: the quantity of the order times
the unit cost. Extremely simple formula, very fast process. And we're gonna get right
here how many rows have a different value than what was provided by cost. So what we're
doing right here is quickly checking if the cost provided by the data set at some
point doesn't align with the actual cost we are calculating. Are there any mistakes
that were made by, I don't know, the original system, or people doing data entry? If this
new column is different from cost, we want to know about that. And that doesn't happen.
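The derived columns and the consistency check described here could look roughly like this. The sample rows and the column names (`Revenue_per_Age`, `Calculated_Cost`) are assumptions made for the sketch.

```python
import pandas as pd

# Made-up rows; column names are assumptions for illustration.
sales = pd.DataFrame({
    "Customer_Age": [19, 49, 36, 28],
    "Order_Quantity": [8, 23, 4, 2],
    "Unit_Cost": [45, 45, 420, 1],
    "Cost": [360, 1035, 1680, 2],
    "Revenue": [950, 2401, 3076, 4],
})

# Column arithmetic is vectorized: one expression covers every row at once.
sales["Revenue_per_Age"] = sales["Revenue"] / sales["Customer_Age"]
sales["Calculated_Cost"] = sales["Order_Quantity"] * sales["Unit_Cost"]

# Count the rows where the recomputed cost disagrees with the provided one.
mismatches = (sales["Calculated_Cost"] != sales["Cost"]).sum()
print(mismatches)
```

There is no formula to drag down a column as in a spreadsheet; the expression is applied to the whole column in one step, which is why it stays fast with many rows.
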
So again, a quick regression plot. In this case, it's very obvious that there is
some linear dependency between calculated cost and profit. So, more formulas: in this case,
calculated revenue is cost plus profit. We're adding a little bit more. There is no difference
between the revenue and the calculated revenue that we are computing, so that all makes sense.
We're going to do a quick histogram of the revenue. We can, for example, add 3% to all
the prices that we are using. We need to increase prices. How are you going to do that? Well,
it's very simple with Python: we're just going to increase everything by 0.03, multiplying by 1.03. And
now all the prices have changed.
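A short sketch of both steps, again with invented rows and assumed column names: recomputing revenue as cost plus profit, and then raising every price by 3%.

```python
import pandas as pd

# Made-up rows; column names are assumptions for illustration.
sales = pd.DataFrame({
    "Unit_Price": [120.0, 120.0, 769.0, 2.0],
    "Cost": [360, 1035, 1680, 2],
    "Profit": [590, 1366, 1396, 2],
    "Revenue": [950, 2401, 3076, 4],
})

# Recompute revenue and count the rows that disagree with the provided value.
sales["Calculated_Revenue"] = sales["Cost"] + sales["Profit"]
revenue_mismatches = (sales["Calculated_Revenue"] != sales["Revenue"]).sum()
print(revenue_mismatches)

# Raising every price by 3% means multiplying the whole column by 1.03.
sales["Unit_Price"] *= 1.03
print(sales["Unit_Price"])
```
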
What else are we going to be able to do? Quick filtering. Let's get all the sales from the
state of Kentucky, right? So these are all the sales from the state of Kentucky. We can
get only the average of the sales by this age group, and only for revenue. So
all these filtering options are extremely simple to get with Python. In this case, we
say: give me all the sales from this age group, and also from this country,
and we're gonna get the average revenue from the groups that we are selecting. And again,
to modify the data, we can make just a few quick modifications. In this case, we're
going to say: for all the sales from this country, we're going to increase the revenue by a factor of
1.1. I don't know why; I'm just doing it arbitrarily. It's just for me to show you how it works.
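These filters can be sketched with boolean conditions and `.loc`. The rows are invented, the column names are assumptions, and France is picked arbitrarily as the example country since the transcript does not name one.

```python
import pandas as pd

# Made-up rows; column names and the choice of France are assumptions.
sales = pd.DataFrame({
    "Country": ["United States", "United States", "France", "Canada"],
    "State": ["Kentucky", "California", "Essonne", "British Columbia"],
    "Age_Group": ["Adults", "Youth", "Adults", "Adults"],
    "Revenue": [3076.0, 950.0, 1000.0, 2401.0],
})

# All the sales from one state.
kentucky = sales.loc[sales["State"] == "Kentucky"]

# Average revenue for one age group; the second .loc argument picks the column.
adults_mean = sales.loc[sales["Age_Group"] == "Adults", "Revenue"].mean()

# Two conditions are combined with &, each wrapped in parentheses.
mask = (sales["Age_Group"] == "Adults") & (sales["Country"] == "France")
adults_france_mean = sales.loc[mask, "Revenue"].mean()

# The same kind of selection can be used to modify the data in place.
sales.loc[sales["Country"] == "France", "Revenue"] *= 1.1
print(sales)
```
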
So far, so good. Again, we've done a couple of things. You don't need to know about the details;
we will actually go through them in the NumPy and pandas sections of this tutorial. This is
just for you to have a quick reference of it. There are exercises associated with these
lectures, so if you want to pause right now and get into the exercises, that's going
to be very helpful. We're going to move forward now with the second lecture, in which we will
be using a database, the Sakila database, and we're going to be reading data. Instead of from
a CSV file, as we did before, we're going to read data now from a database. Reading
data from a SQL database is as simple as it is from an Excel file or a CSV file, as we
were doing with our previous example. And once you've read the data, which is what we're going
to do now, the process is the same. So what we have right here is a query, a SQL query.
If you don't know about SQL, you can check our courses or other courses online. Basically,
we're pulling the data from the database. This is one of the advantages of Python:
there are connectors for pretty much every database provider out there: Oracle,
Postgres, MySQL, SQL Server, etc. In this particular example, we're going to be using
MySQL. So once you construct the query and you pull the data from the database, then
the process is the same. We have just converted this outside data into a data frame that
we can use with our Python skills. The first step, as usual, is to check the shape, info and
description of our data frame. In this case, we want to, again, understand
the structure of it. We want to know how many rows we have, 16,000. We want to know
a little bit more about our
columns: how many records we have for each one of them, and the type
of each one of these columns. And we also want to have a better statistical understanding
of our data, so we do a quick describe, and we have more details about it. If we want
to focus on individual columns, we can just do that. In this case, we're gonna
focus on film rental rate, pretty much how much you pay to rent a film. We're
gonna see the kind of distribution we have. We can call it a distribution, but it's pretty much
a categorical field in this case. Basically, the rentals are divided into three main categories
of prices: 0.99, 2.99 and 4.99. So this box plot, this pretty much perfect, never
seen in real life box plot, gives you those prices. Moving forward, we can also
very quickly do a categorical analysis, understanding the distribution of rentals
between cities. We have two cities, and it's pretty much even, as you can see right
here. Creating new columns and reshaping the data for further analysis, etc., is relatively
simple. In this case, we're going to analyze the return on rentals: which
films are going to be more profitable for the company, dividing the rental rate,
how much we charge, by the cost, how much it costs us to acquire the film. In
this case, we can see the distribution of that. Most rentals are here at the
beginning, and then we have more profitable rentals, where we're making up to 60%.
And we can quickly analyze the mean and the median of it to have a quick idea of all
that.
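To make the database step concrete without a running server, the sketch below swaps in an in-memory SQLite database from Python's standard library; this is an assumption for illustration, not the connection used in the lecture. The table, its columns and its rows are invented, and the multiplication by 100 only expresses the return as a percentage.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the real database (table, columns and rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rental ("
    "rental_id INTEGER PRIMARY KEY, rental_date TEXT, "
    "film_rental_rate REAL, film_replacement_cost REAL, rental_store_city TEXT)"
)
conn.executemany(
    "INSERT INTO rental VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2005-05-24 22:53:30", 2.99, 21.99, "Lethbridge"),
        (2, "2005-05-24 22:54:33", 2.99, 16.99, "Woodridge"),
        (3, "2005-05-24 23:03:39", 4.99, 9.99, "Lethbridge"),
        (4, "2005-05-24 23:04:41", 0.99, 14.99, "Woodridge"),
    ],
)

# The query result lands directly in a data frame.
df = pd.read_sql(
    "SELECT * FROM rental;",
    conn,
    index_col="rental_id",
    parse_dates=["rental_date"],
)
conn.close()

print(df.shape)
print(df["film_rental_rate"].value_counts())    # how many rentals per price
print(df["rental_store_city"].value_counts())   # how many rentals per city

# Return on each rental: what we charge divided by what the film costs us.
df["rental_gain_return"] = df["film_rental_rate"] / df["film_replacement_cost"] * 100
print(df["rental_gain_return"].mean(), df["rental_gain_return"].median())
```

With another database, only the connection object would change; the query, the data frame and everything after it stay the same.
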
Finally, selection and indexing. If you want to go deeper into the
data, you want to zoom in, you want to have a better understanding, so you start
filtering. In this case, we can filter by customer, but you could also do it per city,
per state, per film, per price category, etc.
It's very simple to filter and zoom in on one particular characteristic of
your data, so you can perform a more detailed analysis. In this case, we have all the
films rented by customers with the last name Hanson, which doesn't mean it's the
same person. But again, it's very simple to filter that. And here we can very
quickly see which films have the highest replacement
cost. Basically, what we're doing is isolating the films that
have the highest replacement cost. And we can also see right here, just for you to have
an idea, all the films that are in the category PG or PG-13. It's very simple to filter
that data. So this is the process we usually follow. We import the data, we reshape it
somehow and create columns. There is an important process of cleaning that we're not highlighting
in this part of the tutorial; we're going to talk about it in the tutorial itself. There's
the process of cleaning, then reshaping, creating new columns, combining data and creating visualizations.
This is the process we're following here with our Python skills, but there's a ton
more to add, as you might imagine, from creating reports to running machine learning processes,
creating linear regressions, etc. For now, this is just a quick overview of the
process we follow. Starting now, we're gonna move forward with more details of each
one of the individual tools. We're going to talk about Jupyter
notebooks. We're going to talk about NumPy. We're going to talk about pandas, we're going
to talk about matplotlib, seaborn, etc. The first thing we're going to
see is what this whole thing is that I've been using, this Jupyter Notebook. If you
don't have experience with it, I want you to have
an idea of how it works. And then we're going to move forward to the individual tools, NumPy,
pandas, etc. Remember, there are exercises also associated with this particular lecture,
so you can always go back and work with them once you get a better understanding
of the tools we are using.
Before we jump into the actual data analysis course and start talking about Python,
pandas and all the tools we're going to use to import files, read data from databases, etc.,
I want to show you the environment that we work with. It's our primary environment, it's
the tool that we use 99% of the time, and it's Jupyter Notebook. There are going to be different
terms here. I'm going to be referring to it as Jupyter Notebook, but as you are going
to see in this part of our tutorial, Jupyter is actually
a whole ecosystem of tools, and it's a very interesting project. Jupyter is a free and
open source ecosystem of multiple tools. Primarily, we're gonna talk about,
first, what a Jupyter Notebook is. What you're seeing right here, and you're gonna see it live
in a second, is the thing we're going to use. And we are
also going to talk about JupyterLab, which is the evolution of the regular Jupyter
Notebook. I think this could be familiar to you already. Usually the
question is: what's the difference between Jupyter Notebook and JupyterLab? Well, the
difference is that JupyterLab is just a nicer interface on top of Jupyter notebooks. It's
not just the plain notebook. This is a notebook that I'm scrolling right now, but there's also the
addition of a tree view, Git tools, a command palette
and multiple other things. You can open some files with a nice preview in it, etc. So
JupyterLab and Jupyter Notebook are similar; JupyterLab is, again, the evolution of
Jupyter Notebook, and that's what we're using. Again, Jupyter is a free and open source
project, so anybody can install it, anybody can download it. It's very simple to get it
set up on your local computer. In this case, we're using something called Notebooks.ai.
It's a project that provides a Jupyter environment for free in the cloud, so you don't need to
install things locally and you don't need to keep things in sync on your own hard drive.
That means you don't need to back it up, for example, because it's just a service;
it all works in the cloud. That said, I want to tell you that we have compiled a
very quick list of everything we're going to talk about in this part of the tutorial.
It's just a thread with multiple hints on how to use Jupyter notebooks.
So after the video, after the course, if you forget some of these concepts, you can always
go back to it. It's a quick reference for you to have. So let's get started. Why
do we use a Jupyter Notebook? Because it's an interactive, real-time environment to
explore our data and to do our data analysis. It's a tool where you fire
commands and it immediately responds with something back. It's a very interactive
tool. When we're working with data analysis, the main difference with some
other tools like, for example, Excel, Tableau, etc., is that we are not constantly looking
at the data. There is no visual reference like, for example, you have in Excel, right?
So in Excel, you're constantly looking at the data. You have it in front of you; there
are 100,000 cells and you can scroll and see them. The problem is that that's not scalable,
right? Nobody can work with 100,000 rows in their mind; we
will always forget something. So the way we work with Python in data analysis is by always
having a reference of what our data looks like, but at the back of our head; we're
not constantly looking at it. We're like that person from The Matrix, you know, the
commander who tells people to get in and out. We're basically
asking questions of the
data and keeping a picture in our mind of how that's going to work. We're not constantly
looking at it; we just have a reference in the back of our heads of what
our data looks like. So that's why this tool is very useful. This tool is also useful
if you're just training your Python skills, or other programming language skills, because
what you're going to see is that it's just a regular Python interpreter. In this case, I can execute
some code: two times one, or actually one plus three, there we go. And the result
is four. Right. So this is a fully featured Python interpreter. The good thing
is that, again, it's going to respond to us pretty much immediately: I type a command
and I immediately get a response. I can do a print here, "Hello world", and I
immediately get a response. I can do "Hello world" times three. Again, it's
a fully featured Python interpreter, but it's not being accessed from
a terminal, which you can also do. This is the good thing about JupyterLab: it has a terminal.
I can run python, right, and I can do two times three, and I get an answer back. But
that is not convenient for working with our data. We need something a little bit more interactive
that we can also mix with documents. That's going to be the advantage of a Jupyter notebook.
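As a minimal sketch of the interpreter examples just described, each of these lines can be
typed into a code cell or into a terminal Python session. The variable names are only there
for illustration.

```python
# Arithmetic: a notebook cell shows the value of its last expression.
total = 1 + 3
print(total)      # 4

# Printing text.
print("Hello world")

# Multiplying a string by an integer repeats it.
repeated = "Hello world" * 3
print(repeated)

# The terminal example: two times three.
result = 2 * 3
print(result)     # 6
```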
So what's the way we work with Jupyter notebooks? There are a few very
important concepts that we are going to follow. A Jupyter notebook is just a sequence of multiple
cells, okay? Everything is a cell. And as you can see when I click on these cells,
even if something doesn't look like a cell, it is. You will see that this blue
bar right here is pretty much following me, because I'm clicking on a cell and
selecting that particular cell. Everything happens within a cell. If I want to execute
some code I can do, again, one plus five, and I get a result back, right?
That's how it works. So I'm creating a cell, I'm deleting a cell, I create another
cell again. Everything happens within a cell, and I'm going to tell you how to add
cells, how to remove them, how to execute code, etc. The interesting thing about a cell
is that it can either be Python code (or any other programming language you're using; in
this case it's a Python data analysis course), as we were doing
before with one plus three, or it can be what we call Markdown, okay,
which is a text-formatting syntax for writing text that will be rendered, sort of as HTML,
in the output. So in this case, this is what the source code of the Markdown looks like. In
Markdown, any line that starts with a pound sign (#) is going to be a title. The
biggest title you can have uses just one pound sign, and then you keep adding
pound signs to reduce the size; in this case, this is a level-three title. And then you can
have, for example, a quote, bold text, italics and a link, right? So let me actually
copy the cell and open the source code. There we go. So this is a link, right
here, and it's rendered as a link. So Markdown is a
text-formatting tool, or syntax, we could say: we just have some
rules to use in our text, and Markdown knows how to interpret them and
return a formatted document from them. So for example, here we have a green
divider, which is a picture, and we know it's a picture because it starts with an exclamation
mark; that's what you're seeing right here. So again, a cell can be either
Python code or Markdown. Markdown is an entire thing on its own. You can find
a free tutorial online; it's fairly simple to get started with. And it's also
very important, because when you're creating
your reports you want them to look pretty, and you can use Markdown for that. And what we're
going to see later is that you can export these notebooks and generate PDFs, right.
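As a small sketch of the Markdown rules described above, here is some hypothetical Markdown
source held in a Python string (the wording, the link target and the image path are made up
for illustration), plus a tiny helper that counts the leading pound signs to get the title
level. In a notebook you would type the Markdown straight into a Markdown cell instead.

```python
# Hypothetical Markdown source; the text, URL and image path are placeholders.
markdown_source = """\
# Biggest title
### Level-three title
> This is a quote
**This is bold** and *this is italics*
[This is a link](https://example.com)
![This is a picture](divider.png)
"""


def heading_level(line):
    """Return the title level (number of leading '#'), or 0 if the line is not a title."""
    stripped = line.lstrip()
    hashes = len(stripped) - len(stripped.lstrip("#"))
    # A Markdown title needs a space after the pound signs.
    if 0 < hashes <= 6 and stripped[hashes:hashes + 1] == " ":
        return hashes
    return 0


for line in markdown_source.splitlines():
    print(heading_level(line), line)
```

Only the first two lines report a level above zero; the quote, bold, link and picture lines
are ordinary content as far as this helper is concerned.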
So this whole thing can be a PDF or an HTML page. After you're done with your
data analysis, you can hand over to whoever asked for the analysis a PDF report, which
is pretty neat. So moving forward: again, any cell is going to be either Markdown or
code. This one is code, and you can switch the modes. You
can say this cell is code, or actually, let's make it Markdown. Right now, even if it
contains code, it doesn't matter; it's not executing anything, because the cell
is interpreted as Markdown. Now I've switched it back to code, and now it works.
It can also be raw, but to be honest, we don't use raw very often. So again, you
have this general cell type. The cell we're using, what type is it? Is it code, is
it Markdown? You can switch it with the selector right here. So a few more
things that I have to tell you right away, so you can start internalizing them. It's
going to take some time to get used to it, but once you get used to it, you're going to move
very fast in your data analysis with Python and Jupyter notebooks. The first thing is, as
you're seeing right here, every cell has been given an execution number. The cells
will be moved, right; you will be moving them around. But
you will always know which one executed before another one, and that's because every execution
you run is assigned an execution number. In this case, this is the seventh time I have
executed code. If I execute code again, for example, I don't know, two times two, this
is the eighth time that I've executed code. And if I move this cell right here, then if you're
reading this top down, you will not be fooled, right? You will understand what happened.
The cell was moved; the structure of the notebook changed. But this cell was
executed after this other cell, right? This one is eight and this one is seven. So the execution order
is always preserved. That's an important thing. Something else: you're seeing me
change the structure and do things with the notebook without using any menu. That's
because I know how to use keyboard shortcuts to run most of these commands.
So for example, how can I add a new cell? This is a Markdown cell, this is a code
cell. If I need a cell before this one, what's the command that I'm going to issue
in order to create the cell? In this case, the command is the letter A. I
just type A, and there is a new cell created. How can I delete the cell? By pressing
the D key two times. And again, this is all in the reference we built. So
for example, right here, you can see it at some point.
Here: you can press A to create a new cell above, and you can press B to create a new
cell below. So let me put something here; this is a reference. I'm going to
press the letter B and it's going to create a cell below the currently selected one.
So the selection is here, in blue. Let me delete this one. I hit B, and
again, it creates a cell below the previously selected one. If I hit A, it
creates a cell above the previously selected one. So these are the mnemonics
for creation: A for above, B for below. Something else that's very important: why is it
that when I'm in this cell and I
hit the letter A, literally just the letter A on my keyboard, no Control,
no Command, just A, it creates a new cell and doesn't type an "a" inside the document,
right? Right here, if I type A, it adds an actual "a" character to the cell. Why didn't
that happen before? You're going to notice that when I change to what I'm going to call
a mode in a second, the content of the cell is grayed out. So
now when I press the letter A it actually creates a cell; it's not
adding content to the cell itself. If I go back again to the other mode (and I'm going
to give you a better explanation in a second), if I type anything, in this case "a", it's
actually appended to the text within it. So this is my introduction to cell modes, and this is very
important. The Jupyter notebook is a mode-based editor, right? There are other
mode-based editors, for example Vim or vi, in which
the behavior of the editor changes depending on the mode that is currently activated.
So for example, in this case I am in edit mode, because any character that I type will
be appended to the cell: a, b, c, d, etc. If I switch out of edit mode to what we're
going to call command mode, now the cell is grayed out, and any
key that I hit is going to do something different associated with that key. So A is going to
create a new cell above, B is going to create a new cell below, double D is going to delete
the cell, right? So that's the important part about modes. That's one of the most important
things to understand about how to work with Jupyter notebooks: the mode that you're currently
working in. And there are only two modes, so it's fairly simple. This is command mode.
And we recognize command mode because the cell is grayed out. When we get into edit
mode, there is a regular prompt, as you were seeing before, and the cell
is actually subject to editing. That's the way we can tell. How are you going
to switch between modes? In this case, I'm in edit mode. If I'm using my mouse, just pointing,
I can click outside, and I get out of edit mode into command mode. If I click
inside, I'm going back again to edit mode. But let me tell you something right away:
we don't like to use our mouse; we don't like to point and click, because
that's very slow. We like to use our keyboard; we move very fast with our keyboard. So how
are you going to switch from edit mode back to command mode? That's going to
be with the Escape key. To go from edit to command, hit the Escape key; it's going
to switch out of edit mode into command mode. And if you actually want to make modifications
to the cell, basically, you want to get into edit mode, you hit the Return
key; that's going to get you into edit mode again. So we have tackled multiple things
already. Again, we said that in Jupyter notebooks we're going to use Python code very quickly
to interact with our data. We need a real-time, you know, "I'm asking, you're answering"
type of editor, and that's what the Jupyter notebook is. The Jupyter notebook has these two modes,
edit mode and command mode. And then there are the cells, which are pretty much everything: the
cell is the fundamental part of the notebook. A cell has two main types:
it's either code or Markdown, right? And now I'm going to start showing
you more features, and I'm going to show you the most important commands
and, of course, what the keyboard shortcuts for those commands are, so you can move freely
and work with Jupyter notebooks in the most efficient way. So let's get started.
First of all, among the most important commands is moving, right, navigating. It's
very simple to navigate: just use your arrow keys, up and down, and you're
going to move around in your notebook. If you want to switch the cell type, going
from Markdown to code, etc., you can use this drop-down, or you can press specific
keys to switch to either Markdown or Python. For Markdown, you
hit the M key; that's going to make it Markdown. For Python, you hit the Y key;
that's going to make it Python code. So M and Y are going to switch you back and forth.
Keep an eye on the selector: as I hit Y, M, Y, M, it switches between code
and Markdown.
What else? How can you execute code? Once you have typed your code and you want to
execute it, there are two types of execution you can run. The first one keeps
the selection: the currently selected, active cell stays in the same place.
That's done by keeping the Ctrl key pressed and hitting Return. That's going
to run the code in the cell, and the currently selected cell will remain
the same. So I've run this a couple of times already, and the
currently highlighted cell stays the same. I can change that by using Shift+Return.
I keep the Shift key pressed and I hit Return, and it's going to execute the code, but it will
immediately move the selection to the following cell. That's
useful when you have multiple cells and you want to execute one after the other. You can
keep hitting Shift+Return, Return, Return, Return, and it keeps you moving
from top to bottom. All right, so Ctrl+Return or Shift+Return: the execution is the
same; the difference is just what happens with the currently selected cell. We already saw how
to create cells: with the A key we create a cell above, with the B key we create a cell below.
To delete a cell, you hit the D key two times, one after the other,
very quickly; DD is going to delete the cell. What happens if you made a mistake
and you want to undo the previously issued command? Well, the mnemonic here is
Ctrl+Z. That's the mnemonic, not the command:
you only need to press the Z key, you don't need Ctrl, and it's going to undo
whatever you did in your previous command. All right, so A, B, DD for deletion, and then Z to
undo. All the commands we're seeing have a counterpart in this toolbar
or in the command palette. So for example, right here, I could run this code by pressing
this play button right here. You see the execution number is changing. There are multiple
commands, and you can search for them right here if you don't remember. And the neat thing about
it is that it actually shows the shortcut to issue the same command. So let's say you
don't remember how to execute and stay in the same cell, or execute and move on; you can search
for "run", and you can see the name and the actual shortcut
right there. So at least for your first days or month working with Jupyter
notebooks, you will usually need to go back to these commands and try to remember the
shortcuts. With time and practice, those will just come naturally. So moving
forward, what else? We have a few other commands. In this case, we have something to cut and
paste a cell somewhere else. That's going to be X to cut it, or
you can also use the scissors here. And to paste it, you can use
these buttons, or you can just press the V key; V is going to paste it
wherever you're currently standing. So I'm going to cut it, removing it
from here, and I'm going to paste it below there. Or you can also copy it: instead
of cutting it, you can press the C key, which is going to copy. And then you can
choose where you want to paste it. In this case, we have duplicated the same cell.
And look at something interesting here: the execution count remains the same. So again,
there is this unique identifier for your executions, which means that you know when
and where something was executed. Moving forward, we're going to run some code here and
import some tools, so you can see some characteristics or advantages of Jupyter notebooks and why
we use them so often compared to, for example, the regular Python terminal.
One very important thing is visualizations. We, as data analysts, are constantly taking
data and expressing it through images or animations, right, but most commonly images.
The main library we use in Python is matplotlib, and matplotlib is a first-class citizen
in Jupyter notebooks, which means that you can just create figures with matplotlib
and they will show up directly in your notebook without the need to do anything
crazy. Can you imagine showing this beautiful picture in a terminal? That's
very hard, of course. So again, that's one of the main advantages of a Jupyter notebook.
Moving forward, what we're going to do first is get some
data from a public API. There is this Cryptowatch service, which basically has
crypto information: Bitcoin, Ether, etc. You can check the docs; we can actually open
them. It's going to give you market data. You can check the docs and how you can get,
in this case, BTC, Bitcoin, to euro. Let's see if we can change it to USD, the USD price.
There we go. So this is the current price of Bitcoin: result, price, etc. And we're
actually going to do markets; we have Kraken, BTC/USD. Let's actually issue
the same query we're going to use, which is open, high, low, close: OHLC. And don't worry,
this looks ugly, but this is actually what we're using. There's a list of results, right,
for all the different candles, as we call them; we get the open price, close price,
high price and low price. So we're going to issue these requests
over the internet to this API, the Cryptowatch API, so we can get information
about Bitcoin to do some analysis. I say Bitcoin, but you can actually get it for Ether or
for other types of cryptocurrencies. So the function we're defining
is get_historic_price. It's a very simple function that uses pandas, which is one
of the most important tools we're going to be using in this course, and the requests library,
which is also a very well-known library for Python. What we're going to do here is
get Bitcoin and Ether prices for an entire week, right, so from
February 25 up to today, depending on when I'm shooting this
video. And we're going to get a quick reference of the prices: open, high, low, close. In
this case, we have information per hour. This is something you can actually
change in the request you're making to the API; you can change the candle
size. In this case, we're keeping it per hour, so we have by-the-hour information about Bitcoin
in this particular market, which is Bitstamp. Here we have the day and the
hour, and then the open, close, highest price and lowest price, and
also the volume that was traded within this time period. And we're going to immediately plot
the price. We see that in this time, which I thought was an entire day, the price dropped;
it's actually a few days, like an entire week. The price dropped from $9,600 to below
9,000, so it was a pretty significant drop. Let's see how Ether performed. We have here
all the records and how it moved. So this is what I mean when I tell you that when you're doing
data analysis with a programming tool like Python or R, you're not constantly looking at the
data. What I'm showing you right here are just the first five records of what we actually have.
Let's check that. We actually have 169 records, okay, 169 records, and this is per hour. So
if we do 169 hours divided by 24 hours, we have roughly seven days, right (169 hourly records
span 168 hours, which is exactly seven days). So we have seven days
of data, 169 records, and then we have a little bit more information. I'm
going to get to that in a second. But basically, this is what I'm telling you. 169 records is,
to be honest, something you could be seeing in a spreadsheet. But I want you to get the concept
here: we're not just looking at our data; we have it in our brain. We know what
shape it has, we know how many records it has, we know things like the standard
deviation and the mean, right? So for the close price: what's the standard deviation,
what's the average, the mean, the median, right? We have information about
our data. It's sitting behind, you know, in our brain, but we're not looking at it. And
this is a very simple example with only 169 records, but in real life we're
dealing with millions of records, so it's impossible to see them all. Have you ever tried
scrolling in an Excel spreadsheet through millions of records? It's crazy. It's not
possible. It's just unusable. So that's, again, the way we work with data analysis in Python
and R and other tools. We don't constantly keep an eye on the data. We know the shape
of it, and we just have these quick references, like: show me the first five records, show me
the last five records, show me this chunk down there. But that's it. So again,
these are the visualizations we're creating in Jupyter notebooks. Again, it's just very
simple to get the plot done right there. We're also going to see in Jupyter notebooks a
few other pretty neat things. The first one is that we can use another library, which
is called Bokeh. The difference is that Bokeh has charts that are interactive.
So I'm moving it right here; it uses JavaScript and it's interactive. Look back again
at what we had here: this is a static chart. It's just a PNG; you can actually export it
as a PNG, and there is nothing else you can do with it. With Bokeh, you get
dynamically generated, interactive charts. I can zoom in on a piece of data, right,
I can move it around, I can do whatever I want with it. I can refresh and reset it
to whatever it was. The difference is: if you're
working with data dynamically in your analysis, in your exploration, then Bokeh is
a very good tool, because you can zoom in, right? So what's going on here? Let's look
at these things. If we're working on a mean-reverting strategy, for example, we see a
high volume, we see a low volume, the mean is going to be here. So we see some mean reversion
in there. It's very interesting. If you need to, for example, export a PDF or export a huge
HTML file, then static images are probably going to be better. So that's the difference
between them. To be honest, matplotlib is a lot more popular than Bokeh. We use matplotlib
a lot more because we also have a few other tools, like seaborn, that make
it very easy to access and use. What else? Jupyter notebooks work very well with
Excel files and with other file formats: CSVs, XML, etc. And that's also
thanks to JupyterLab. JupyterLab can immediately interpret and open CSV
files, and with some extensions it can open XLS files, XML files and JSON files; it has a very
nice editor and tree view for JSON. So the JupyterLab environment combined with Python Jupyter
notebooks will give you a good idea of Jupyter in general. In this case, we have just
saved a file. I'm not going to execute this; you can try it out. But you can execute and run what
we have just done and export this crypto file as an Excel spreadsheet. So you can just click
on it here and you can basically download it, open it and see what it has.
There we go.
So let me reduce the size of this thing. There we go. You can see that we have just exported
two sheets, in this case Bitcoin and Ether, right, with the data that we had
in our previous notebook. So that's, again, the combination of Jupyter,
Python and JupyterLab, which are tools that just work
very well together. So we're going to keep moving forward. In this tutorial, we're going to
talk more about data analysis in general, and we're going to talk about Python; we're going
to do a quick review of Python. Maybe when I was running these commands, you
felt a little bit lost about what I was doing. So we're going to do a quick review
of Python and all that. And of course, we're going to get deep into data analysis
with pandas and some other tools. I want to tell you something before we finish this
chapter: it's very important for you to get familiar
with Jupyter notebooks, because you're going to spend a ton of time with them.
It's a very, very valuable skill that you can get: if you get proficient and comfortable
with Jupyter notebooks, you know, creating cells, deleting cells, cutting, pasting, moving
things around, etc., then for generating reports Jupyter notebooks are going to be excellent.
So keep an eye on it. Keep practicing; it's the only way to learn it.
Keep practicing, and keep the command palette open, so you can always check it if you forget.
How can I cut a cell? Well, here it is: the command is X, right? It's going to just tell
you up front. Keep an eye on it, keep working with it and practicing. Once you get
familiar with Jupyter notebooks, you're going to move very, very fast. Remember, you have
this nice list of compiled commands and references that you can always access if you need
extra help. And we're going to keep moving forward now with more data analysis.
Now it's time to talk about NumPy, one of the most important libraries in the Python
ecosystem for data processing in general. It's the one that got pretty much everything
started. If you trace NumPy back, it's a very long-established library, 20 years old,
maybe. It's an extremely important library; I'm not going to say popular,
and I'm going to explain why in just a second. But it's a very, very important library in
the Python ecosystem for data processing. NumPy is
a numeric computing library: it's just for processing numbers, for calculating things with
numbers, and that's it. So NumPy has a very limited scope, we could say, and it is, on purpose,
a very simple library when you look at it and when you look at the API, which is very
consistent, by the way. Why is NumPy so important? Well, in Python, numeric processing, just
pure Python processing of numbers, is very slow. Okay, Python is not slow in itself compared
to other programming languages. But when you go down to very deep levels of performance,
when you are processing large amounts of data and you need to squeeze even, you
know, that tiny byte at the end of your pipeline, when you need to squeeze every FLOP out of
your CPU, then Python is not the right tool for that; not Python as a pure Python programming
language. NumPy actually solves that. NumPy is a very efficient numeric processing library
that sits on top of Python and gives you the same API you're going to work with
when just writing Python code, as you're seeing here. But at a low level, it's going to be using
high-performance numeric computations, arrays of numbers, efficient representations,
etc. That's it for NumPy. It's extremely simple from an API perspective,
but it's extremely powerful. Why did I say that it's not so popular, and yet so
important? Well, because in reality we don't usually employ NumPy directly. You will not
see yourself using NumPy that often, but you will be using other tools in Python, like,
for example, pandas and matplotlib, and they all work on top of NumPy. They all
rely on NumPy for their numeric processing. So that's why NumPy is so important.
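To give a rough feel for the overhead being described, here is a sketch that uses only the
standard library, not NumPy itself. It compares the memory reported for a single Python
integer object with the per-item size of a typed array from the array module, which stores
fixed-size numbers back to back, similar in spirit to the compact arrays of numbers mentioned
above. The exact size of a Python integer depends on your Python build, so no specific figure
is claimed for it.

```python
import sys
from array import array

# A plain Python int is a full object, with bookkeeping beyond the number itself.
python_int_bytes = sys.getsizeof(5)

# A typed array stores raw fixed-size numbers; 'b' means signed 1-byte integers.
small_ints = array("b", [5, 10, 15])
bytes_per_item = small_ints.itemsize

print("One Python int object:", python_int_bytes, "bytes")
print("One item in array('b'):", bytes_per_item, "byte")
```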
So, at least for this part of the tutorial on NumPy, I'm going to divide it into two
pieces. The first one is going to be a very detailed, low-level explanation of how NumPy
works, why we need to use NumPy, and what the differences are between different byte
sizes for numbers. We're going to talk about integers, but this also applies to decimals
and other data types, and to why you need very low-level, optimized numbers. Now, you
can skip this part. You're going to find in the description of this tutorial
the precise moment in time, so you can just skip and go directly to the second part, which
is when we actually start using NumPy and I show you how to create arrays, how to make
computations, etc. So for now, we're going to divide it in two parts. We're going to
start first with the low-level explanation, which you can skip if you want, because
it's not going to be crucial; you can easily use NumPy without it. We have found that
for some of our students it's important to understand the low-level basics,
especially if you don't have a computer science background. It can help you, you know,
raise your level of understanding of computers and of how to make your computations
more efficient. But don't worry: if you don't want to go through that now, it's
fine. You can skip this part and come back later at any other moment. You
don't need it to use NumPy; seriously, you don't need it. It's going to be beneficial,
but you don't absolutely need it, so you can just skip it and come back later. So with that
said, let's actually go into a deep
understanding and explanation of how computers store integers, numbers, in memory, and what
bytes, bits, etc. are. In order to understand why NumPy is so important, we have to go back
again to the basics: what numbers are, how they are represented in computers, etc. As
you might know already, a computer can only process ones and zeros, bits. It can't process
decimal numbers, to be more correct; it can only process ones and
zeros. A computer is always just storing and processing ones and zeros. It's a binary machine.
Your memory, the random access memory (RAM) in your computer, is
the central place where your computer stores the data that it's actively processing, right?
So you have, for example, a hard drive, which stores long-term data. But the computer can't
process data directly from your hard drive. Before doing that, it has to load it into
your RAM, your random access memory. Usually a computer is going to have,
what, 8 gigabytes, 16, 32; it doesn't matter. Let's say you have eight gigabytes of memory.
That at some point translates to the number of bits that your computer can store.
So if you follow what we have right here, you can see the total number of
bits available in a regular computer with eight gigabytes of memory. Why is this important?
Because, again, the objective of this part, at
least, is to explain how you can squeeze the most out of every single bit in your computer,
right? How can you make it more efficient for your numeric processing, both in storage,
using less memory for the same data, and in speed, making your calculations faster?
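As a quick worked example of the bit count mentioned above, here is a sketch that converts
eight gigabytes into bits. It assumes that one gigabyte means 1024 x 1024 x 1024 bytes (the
binary convention); with the decimal convention of one billion bytes the figure would be
somewhat smaller.

```python
BITS_PER_BYTE = 8
BYTES_PER_GIGABYTE = 1024 ** 3   # assumption: binary gigabyte

memory_gigabytes = 8
total_bytes = memory_gigabytes * BYTES_PER_GIGABYTE
total_bits = total_bytes * BITS_PER_BYTE

print(f"{total_bytes:,} bytes")   # 8,589,934,592 bytes
print(f"{total_bits:,} bits")     # 68,719,476,736 bits
```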
So in terms of physical storage, or actually memory storage, right, how can we
optimize to use the least amount of memory for a given problem? That's the
objective. To optimize it, we need to understand how integers in the
decimal numeral system are represented in binary, right? This table right here shows
you the first nine numbers, 0, 1, 2, 3, 4, etc., and their binary representation in your computer.
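The table described above can be reproduced with Python's built-in format function, as a
minimal sketch. The variable name table is just for illustration.

```python
# Decimal numbers 0 through 8 next to their binary representation.
table = [(number, format(number, "b")) for number in range(9)]

for number, binary in table:
    print(number, binary)
```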
Let's say you want to store the age of a user, which is 32. You can't store
32 as such, because your computer, again, doesn't know about decimals; it only knows about binary.
To do that, you need to find the correct representation of 32 in ones and zeros,
which is not this one, to be honest; I'm just making it up as we
go. But again, you need to know the correct binary representation of this number in order
to store that data. How can you know that? Well, there is this whole binary arithmetic,
right? There's a whole part of math dedicated to binary. It doesn't matter for now, but I'm
going to just build the intuition so you can have a better understanding, and if
you're interested, you can dig deeper later. Basically, any decimal number needs to
be stored in a binary format, which of course only takes ones and zeros. What we usually
do is keep adding positions of zeros and ones, right? In this case, we have
the number zero and the number one; that's fine. Once we need to store the number two, we
need to increase the number of positions, right? So we
go to one-zero for the number two. For the number three, it's one-one. And when we need
to go to the number four, we need to increase positions again, because we only have two
symbols, zero and one. So as you're seeing right here, up to this level we need only
one position. Up to this level, we need two positions. At this level, we need three positions.
And these levels going to need four positions. And you'll see how the size of each of these
is increasing. And it has a
an explanation behind it that we're going to see in a second. So the question is how
many decimal numbers you can store with n bytes and bits, sorry, bits. So let's say
we have n bits. And let's say n is equals to three. That means that you only have three
positions, right three bits, how many total decimal numbers, you can store with it? Well
we can store 000 we can store zero, we can store 100 we can start stores are you one
zero, right? So in this size, we can store up to here, we can store up to seven numbers
111 is equals to seven was, once we've filled all the positions, right, we've reached the
limit, right? The largest number, the largest binary for this amount of symbols or positions.
That's the number seven. So these means that with three numbers, you can start from zero
from zero, here, zero up to one, one. In total, you can store eight decimal numbers, here
you have eight decimal numbers 012345678, total decimal numbers from zero to seven.
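As a rough check of the three bit example above, here is a minimal sketch in plain Python. The variable names are only illustrative; it lists every value that fits in n bits next to its binary form.

```python
# Number of bit positions; 3 matches the example in the lesson.
n = 3

# Every value that fits in n bits: 0 up to 2**n - 1.
values = list(range(2 ** n))

for value in values:
    # Render the number in binary, zero-padded to n digits (for example 5 -> "101").
    print(value, format(value, f"0{n}b"))

# All n positions set to one gives the largest value that fits.
largest = int("1" * n, 2)

print("distinct values:", len(values))
print("largest value:", largest)
```

Changing n to 4 shows the same pattern with sixteen values, from 0000 up to 1111.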
The equation behind this, if you want it, is as follows. If you have n equals three, then in order
to know how many decimal numbers you can store with those bits, it's two to the power of
n, which in this case is a total of eight. So if we go back to our drawings, we said that with
three bits, we can store up to eight decimal numbers. And again, the equation is that two to
the power of n is going to give you how many decimal numbers you can store. You can always do
the opposite process using a logarithm and get how many bits you're going to need
to store a given decimal number. I'm not going to get into that, so we don't complicate
it, but again, the math behind it is extremely simple. So now, moving forward, we're going
to delete this whole thing. Why is this important? When you're working
with your data, when you're doing your data analysis, you know what
type of data you're working with. They're all numbers, but numbers usually have
a connotation behind them. So let's say that you have here a table of people,
and you have the total net worth of each person, and you also have the age of the person. The
age is a value that will range between, what, zero, just born, to,
I don't know, 120, we can say. I don't know what the maximum age registered right now is for
the oldest human being, but zero to 120 seems reasonable. In your other column,
the net worth of this person, the range is completely different. We can go from something
like $0 up to, I don't know, $60 billion, I think, for Mark Zuckerberg or Jeff Bezos or one
of those. So we go from zero to 60 billion in this case, if these are dollars. What happens
if this is a highly devalued currency? We would have to go to trillions. So these
two columns, even though they're just plain numbers, and we can say they are both integers,
have a different connotation, and they
will have different requirements in terms of storage size. So if we say that
age goes from zero to 120, we don't need so many bits to store it in memory.
We can do the math, actually: how many bits do we need in order to store 120?
Well, if you do the math, you will see that two to the
power of seven is 128. So if you have seven bits here,
you're going to store from zero up to 1111111, which is actually 127. This
number, all ones, seven ones in binary, is equal to 127 in decimal. In total, we can
store 128 numbers, zero included, up to 127. So that means that for our age column,
the size we need to use is going to be seven
bits per user, or customer, or person, whatever. What about this number right here, if we
have to go up to a couple of billions? Well, in that case, the number is a little bit more
complicated. We're going to need, for example, we can say 64 or 32. It's actually 64,
probably, but with 32 bits you can store from zero up to this value.
So again, I don't know about the currency we're using or anything, so we can assume.
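The "how many bits do we need" arithmetic above can be sketched with Python's built in `int.bit_length()`. The two upper bounds are assumptions taken from the ranges mentioned in the lesson (an age of 120 and a net worth of 60 billion), not real data.

```python
# Assumed upper bounds, taken from the ranges mentioned in the lesson.
max_age = 120
max_net_worth = 60_000_000_000  # 60 billion

# int.bit_length() returns how many bits are needed to represent the value.
age_bits = max_age.bit_length()              # 7
net_worth_bits = max_net_worth.bit_length()  # 36

print("age needs", age_bits, "bits")
print("net worth needs", net_worth_bits, "bits")

# Largest unsigned value for a few widths: 2**n - 1.
for n in (7, 8, 32, 64):
    print(n, "bits ->", 2 ** n - 1)
```

Under these assumptions, a couple of billions still fits in 32 bits (the largest unsigned 32 bit value is 4,294,967,295), while 60 billion needs 36 bits and would therefore call for the next standard size, 64 bits.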
But here, we need 32
bits in order to store that. And now you can do the math: how much memory space do
you need in order to process this data? How many records do you have? If you have only
1,000 records, that's not significant. You can use whatever; you can use 64 bits here
to store the age, and you're not going to have a problem. But what happens if you have
more? What happens if you have the entire population of the earth? You have
7 billion records here, and then every bit that you're saving in these
columns is going to be important, because it's going to take a ton of space. And of course,
you have a ton more columns. What happens if you are processing trillions of records
from financial transactions? You want to be very efficient
and optimize every single bit you can. And that means, again, selecting the correct
number of bits for the columns you're currently processing. So, so far, so good:
again, we understand that the number in decimal that we need to store has a correspondence
with bits, and eight bits is one byte. And the more we can optimize that, the less
memory we're going to use for our applications. Where does NumPy come into place? Why are we
talking about all this in these NumPy lessons? Well, the idea is that NumPy
is a library that has very advanced numeric processing and will let
you select the number of bits you want to use for an integer. Even more, let's say
you forget about NumPy and you want to process this thing with pure Python. So you write x equals
five, for example. Working with Python, you create a number; we're storing an age as a five.
How many bytes, how many bits, do you think this simple variable takes in memory?
Well, even though we think it should be around, what, three bits, or
eight, let's say, to be simplistic, in reality, for Python, this is going to take
around 28 bytes. Okay, so we are wasting a ton of memory in order to store this number.
And why is that? Well, because Python is a high level, object oriented programming language.
The reasoning behind it is that Python is simple to write, and also simple to read
and to code on top of. But in order to create that simplicity, it
wraps all the numbers in objects, which have all these attributes that, if you know advanced
Python, you're going to recognize are not necessary just to hold a number. So this is taking a ton of
memory, and a regular, very simple number in Python ends up consuming many times more
memory than it should. And this is where NumPy comes into place. In NumPy, you
can control the size of the numbers you create, in terms of bits. You
can say, I want to create a number that has only eight bits, and that's it: you're
going to create a one byte integer, and you're very precise about how much memory it takes.
You can create a number that actually needs a little bit more space: we're going
to type np.int, and here, if we press Tab, we're going to get autocompletion for 16 bit,
or 8, or 32, or 64. So we can actually be a lot more precise about the number of bits
that we need. And this is extremely important for, again, our high level processing. On top
of that, NumPy is an array processing library. NumPy is 99% about processing arrays, constantly
processing arrays. The data structures we have in Python, the built in data structures,
for example the list or the dictionary, are not optimized for high level computing.
So if you have a list of numbers in Python, let's say you have, I don't
know, l equals [3, 2, 4], you have three numbers in your list. In Python, it's
not guaranteed that the list is going to contain all the numbers, 3, 2, 4,
in contiguous positions; the list holds references, and the number objects might sit in
separate positions in
memory. On top of that, you can't rely on advanced CPU directives and instructions for
processing matrices, because Python, again, is wrapping these things in
objects, so there is no access to these high performance, low level instructions. With NumPy,
that changes, because when you create an array in NumPy, you say, I want to create an array
of three numbers, and they are all int8. Forget about the positions here: these
are not bytes, I'm using this drawing as a general representation of memory. So
in that case, in NumPy, when you create this three element int8 array, it's going
to create those three elements in contiguous positions in memory, 3, 2, 4, and they
are only going to take the amount of memory we said they were going to take. And
on top of that, we can rely on a bunch of very efficient low level instructions from
your CPU for matrix calculations. This is something that is a little bit more advanced.
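A minimal sketch of the three element int8 array described above, using the values 3, 2, 4 from the drawing. The name `arr` is illustrative; the attributes used are standard `ndarray` attributes.

```python
import numpy as np

# Three numbers stored as one-byte integers.
arr = np.array([3, 2, 4], dtype=np.int8)

print(arr.itemsize)               # bytes per element
print(arr.nbytes)                 # total bytes used by the elements
print(arr.flags["C_CONTIGUOUS"])  # whether the elements sit in one contiguous block
```

The `nbytes` figure counts only the element data; the array object itself carries a small fixed overhead on top of that.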
And it's something that has exploded in the past 10 years: CPUs with richer instruction
sets, and the same thing for GPUs. You might have heard that, especially with machine learning
and all that, we need fast array processing when we are storing features and weights and
so on, but that's a topic for a different story. But again, the idea is that
we can use all these important and very efficient low level directives from
our CPU, which makes our computations a lot faster. So again, as a recap, you don't need
to know all this to work with NumPy. That's the first thing. Second, you don't need to
get extremely conscious about all the numbers you use. At the beginning, you're
just going to use NumPy as it is, and you're going to use just the default types that it
picks: int8, or int32, or int64, that's fine.
But then, when you get into bottlenecks, when you're working with larger amounts
of data, then you might need to get into the details of the size
of the integers that you're using. And this all applies to floats. I'm just using integers
because it's simpler, but it all applies to floats. So again, the main advantage of NumPy
is that it has built in, very fast arrays that take advantage of CPU instructions
for matrices and arrays and all that. And it also has very efficient representations
of numbers, which are not the regular objects of Python. Again, to recap, you don't need
all this. If you want to get into more details, I recommend you get a little bit more understanding
of binary arithmetic and computer architecture, how numbers are
stored in memory, etc., especially for floats, which have a completely different representation.
So with that said, we're going to see now how we actually use NumPy without worrying
so much about the low level details, and that's the beauty of NumPy. So we have already done
our low level explanation of binary arithmetic, why NumPy is important and all that. If
you skipped it, that's perfectly fine; you will not need it. The reason to include it
was that if you're in this tutorial, you're probably looking for fast and efficient options
to process large volumes of data, and that's when all those things come into play. So,
without further ado, let's just get started and start using NumPy as a library. So again,
as I told you, NumPy is a very simple library for array processing and numeric processing.
It has a few objects: numbers, integers, floats, arrays, and that's it.
It's very simple, but it's extremely powerful. So, in NumPy, we're going to create these
arrays, which look a lot like Python lists, but there are going to be significant differences.
The first one is, of course, performance. If you go to the previous part, when we were
discussing the binary representation of an array of numbers in Python and in NumPy, you're
going to see the difference between them. So in this case, we're creating two arrays,
and you will see that the creation is extremely simple. The only thing that changes is that
we need to add this np.array, and then we're passing, in this case, a list of numbers.
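The array creation just described might look like the sketch below. The values in `a` and `b` are made up for illustration, since the transcript does not spell them out.

```python
import numpy as np

# Hypothetical sample values; in practice the data usually comes from a file or database.
a = np.array([1, 2, 3, 4])
b = np.array([0, 0.5, 1, 1.5, 2])

print(a)
print(b)
print(type(a))
```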
This is something we will usually be reading from external sources. Now, how can you access
individual elements of a NumPy array? This works in the same way as with a Python list:
you can say give me the first element, give me the second element, and it's zero indexed,
like, again, a Python list. Slicing works the same way. So in this case, a from zero up to
something, or a from one up to three: you're just giving the lower limit and the upper limit of
the index. Negative indexing and steps all work in the same way as with a Python
list. So if you know how to use a Python list, you will know how to use a NumPy array. There
is one new thing right here that is different from a Python list, and it's what is called
multi indexing. Let's say you have an array, in this case b, and you need to extract three
elements out of it: you need the elements at the first position, the third position and the last
position. You can just type b[0], b[2], b[-1], and this
also works for a list. Or you can use, again, multi indexing, which is: from b, I want
to select the elements at 0, 2 and -1, the first element, the third element and the last element.
So you pass another list containing the indices of the elements that you want
to select. And in this case, the important part is the result: it's another NumPy array.
It's not just individual elements; you're creating another NumPy array, which again,
if you're processing it, is going to be a lot faster. So arrays have types associated. And
this is related to what we were speaking about before. As a NumPy array is stored in contiguously
assigned memory, the NumPy library needs to know the type of the objects you're
storing. You can't just store, you know, anything, a string or a number, within it, because it will
not be able to provide performance and optimizations for arrays of inconsistent sizes. So for
example, when we created this array that only had integers, by default NumPy selected int64.
That is because of the platform, it's a 64 bit platform. You can tune this and you can select
other sizes; we're going to see that in a second. When we created the array b, which contains decimals
or floats, it assigned a different type, which is float64. Again, the default size is always
64, at least on this platform that is 64 bits: it's going to be float64 and
int64. You can always change that. You can say, actually, even though these
are all integers, I want you to create them using a float type, or, as we saw in our previous
video, we can say it should actually be of type int8, so smaller integers, for better
performance. All right. So moving forward, we are also going to see a few other types, like,
for example, strings or regular objects. But as you're going to see, there is
no point in storing these things in NumPy. NumPy stores numbers, dates, Booleans, but not
regular individual objects as we're seeing right here. There is a way to store strings,
it's perfectly valid and it has its own type, and it's related to the
Unicode representation in memory, etc. But again, NumPy is usually used for numeric processing.
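The one dimensional selection and the array types discussed above can be sketched together as follows. The arrays `a` and `b` hold made up values, and the default types printed will depend on the platform, as the lesson points out.

```python
import numpy as np

a = np.array([1, 2, 3, 4])           # hypothetical integers
b = np.array([0, 0.5, 1, 1.5, 2])    # hypothetical floats

first, second = a[0], a[1]           # zero-indexed access, like a list
middle = a[1:3]                      # slice: positions 1 and 2 (3 is excluded)
every_other = a[::2]                 # slicing with a step of 2
last = a[-1]                         # negative indexing

# Multi indexing: pass a list of positions, get a new NumPy array back.
picked = b[[0, 2, -1]]

# Default types are chosen by NumPy and depend on the platform.
print(a.dtype, b.dtype)

# The type can be set explicitly when the array is created.
as_float = np.array([1, 2, 3, 4], dtype=np.float64)
as_small = np.array([1, 2, 3, 4], dtype=np.int8)
print(as_float.dtype, as_small.dtype)
```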
So the idea of NumPy arrays is that we can create multi dimensional arrays.
What we had created before is a one dimensional array, just one dimension.
You can create matrices, which in this case are two dimensional: we have two rows and
three columns. And NumPy has a ton of attributes and functions to work with multi dimensional
arrays. So the first thing we're going to see is the shape of the array, which is two
rows by three columns; how many dimensions it has, it has one vertical and one horizontal, so
we have two dimensions; and the total size of the array, which in this case
is six, the total number of elements we have. Let's go one dimension further.
Let's create a three dimensional object, a three dimensional array, which is basically
a cube. In this case, for B, we have that the shape is two by two by three, the number
of dimensions is three, and the size is the total count of elements, 12. You always have
to be careful when you're creating these multi dimensional arrays. If the
dimensions don't match, like in this case right here, where we have this second list
that has fewer elements in it, then the dimensions will not match, and it will
just tell you, sorry, that the array is of type object, and the shape is
only two: it only has two elements, this one element and this other element. So in
this case, we've done it wrong, basically, and you have to be careful when
creating these objects by hand. So how can you index and slice matrices? We've done
it for a one dimensional array, where we were selecting individual elements: give
me the first element, give me the second element, etc. How can we do it with a matrix? With
a matrix, what we're going to do is going to be very similar to what we did before.
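Before indexing into matrices, it may help to see the attributes described above (shape, number of dimensions, size) on concrete arrays. The values inside `A` and `B` are assumptions for illustration; only their shapes, two by three and two by two by three, come from the lesson.

```python
import numpy as np

# Hypothetical 2 x 3 matrix: two rows, three columns.
A = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

# Hypothetical 2 x 2 x 3 array: two blocks, each with two rows and three columns.
B = np.array([
    [[12, 11, 10], [9, 8, 7]],
    [[6, 5, 4], [3, 2, 1]],
])

print(A.shape, A.ndim, A.size)
print(B.shape, B.ndim, B.size)
```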
The difference is that now we have to account for multiple dimensions. When I say give me
A at one, is it the column at one, or is it the row at one? Well, as you can see, it's
the row. So the rows right here are 0, 1, 2. And there is also the other dimension,
so the columns are 0, 1, 2, in terms of index positions for our slicing. So here,
how can you get the first element of the second row? In that case,
you're going to first select the second row, and then select
the first element, and what you get is the number four. But there is a better way, which
is by using the multi dimensional selection of NumPy. In this case, you're going to say,
from this matrix, I want to select, and here you're
going to pass dimension one, dimension two, dimension three, dimension four, etc. These
are selectors for each one of the dimensions that you're passing. In this case, we say,
at row level, the position one, the second element, and at column level,
we want the first element in it. And it's the same thing as we did before. The advantage
of this syntax, and of keeping it in mind and remembering it, is that it will also let you add slicing.
So you'd say, I want to select everything from dimension one, which is rows. So
in this case, you say from zero up to two, which is these two rows; the two is not included,
the upper limit works the same as in Python. And then
you can also pass the other dimensions. You say, I want to select every row, that's
fine, but then at column level, I only want to select the elements
up to two. So these two, and these two, and these two: 1, 2, 4, 5, 7, 8. This all works
as intuitively as it gets. This syntax is the important thing that you need to keep in
mind. Moving forward, for modification, you can say I want to assign this new array to
this entire row. So if the dimensions match, that is going to work: now 10
is assigned to the second row. Or you can just use what we usually call an expand operation.
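The matrix selection and assignment walked through above can be sketched like this. It assumes a three by three matrix holding 1 through 9, which matches the values read out in the lesson (4 for the first element of the second row; 1, 2, 4, 5, 7, 8 for the first two columns). The replacement row of tens is an assumption, and the changes are made on a copy so the original stays available for comparison.

```python
import numpy as np

A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

row = A[1]             # a single index selects a row: [4, 5, 6]
chained = A[1][0]      # select the row first, then the element
direct = A[1, 0]       # one selector per dimension, same element
top_rows = A[0:2]      # rows 0 and 1; the upper limit is excluded
left_cols = A[:, :2]   # every row, columns 0 and 1

modified = A.copy()                     # work on a copy, leave A untouched
modified[1] = np.array([10, 10, 10])    # assign a whole row; shapes must match
modified[2] = 99                        # a scalar is expanded across the row

print(left_cols)
print(modified)
```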
We're just going to say, for row number two, I want to assign the number 99, and NumPy is
going to take care of expanding it into the corresponding array, given the number of dimensions
that you have. So far, selection is simple. What we're also going to see is that NumPy
has a huge advantage in containing a ton of operations you can perform on top of your
arrays and matrices, your multi dimensional arrays in general. So the first ones are
all the basic summary methods we have. Given an array, all these methods are already
built in: the sum, the mean or average, the standard deviation, the variance, etc. And that
also works for matrices. So in this case, we can get the sum, the mean, the standard deviation,
or we can do it per axis. This is very useful. Here, let's compare
these two, there we go: we can get the sum of the first column, the second
column or the third column, or we can get it for the first row, the second row and the third row.
So it's either this dimension, axis equals zero, which gives one sum per column, or the other
dimension, axis equals one, which gives one sum per row. Or, if you have more dimensions, you can just
keep increasing the number of the axis, and that's just going to work as expected.
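A short sketch of the summary methods and the per axis variants described above, again assuming a three by three matrix with the values 1 through 9.

```python
import numpy as np

A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

total = A.sum()      # sum of every element
mean = A.mean()      # average of every element
std = A.std()        # standard deviation
var = A.var()        # variance

col_sums = A.sum(axis=0)   # collapse the rows: one sum per column
row_sums = A.sum(axis=1)   # collapse the columns: one sum per row

print(total, mean, std, var)
print(col_sums)
print(row_sums)
```

The `axis` argument names the dimension that gets collapsed, which is why `axis=0` yields one value per column.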
Broadcasting and vectorized operations: this is a fundamental topic that we're going to talk
about, and it's going to be extremely related to Boolean arrays. These are a few new
things that you have to keep in mind when working with NumPy. And now we're going to
talk about vectorized operations and broadcasting, which can be a counterintuitive topic at the
beginning, but then you're going to understand how much sense it makes. It's one of the fundamental
pieces of NumPy. We've seen how NumPy works in a very general way; we saw the multi dimensional
arrays and all those advantages. But you might be thinking, I mean, I don't need another
library just to compute the sum or the mean. When I show you the vectorized operations
and broadcasting part, it's going to make a little bit more sense why NumPy is so
important. So to get started, we're going to have this array, which is a, and that's
just a very simple array. Vectorized operations are operations performed between
arrays and arrays, or between arrays and scalars, like in this case right here, which are extremely
fast; they're optimized to be extremely fast. In this case, what we're going to do is
add 10 to the entire array. We're also going to see an example
of what happens without NumPy, with pure Python.
But what it means is, let me show you the results, that each one of the elements
within the array will have the same operation applied to it. So usually, that's the concept
of vectorizing an operation: you have the number, and then this operation is applied to each
one of the elements in this other one, so here, and here, and
here, and here, to result in this new array. The operation is expressed at an array level:
we say a plus 10, that's it. But then again, internally, this is broadcast
to each one of the individual elements within the array. And this works for a plus 10, or,
well, a times 10, for example, where we're also applying the times 10
operation to each one of the elements in the array, resulting in a new array with the
result of that operation. And this resulting in a new array is very important, because,
as we're going to see, NumPy is an immutable first library: any operation
you perform on an array will not modify it, but will return a new array. If we check
the state of a, you're going to see that the elements are the same, it has never changed;
we are creating a new array and returning it. There are ways to override this behavior
if you want. All these operations we're performing this way also have the
interface of plus equals, minus equals, times equals, etc., which will indeed modify the
array. In this case, we're making a broadcasting operation, adding 100 to each one of the elements
in this array, and now this operation was not immutable: a was modified, and it hasn't
returned a new array. If you remember from your pure Python skills, the counterpart
of vectorized operations is list comprehensions, in which you're expressing an operation for
each one of the elements in your collection. So that's a list comprehension, and
it's pretty similar to what we're doing with NumPy. The main difference is that this
is all optimized and it's extremely fast. So, these vectorized
operations, this broadcasting, don't need to be only between arrays and scalars;
they can also be between arrays and arrays. So in this case, we have a and we have b, which I'm
showing you right here, and we can do something like a plus b. And what you're seeing is that
there is a correspondence, so zero plus 10, one plus 10, two plus 10, right?
Let me do it in this way: 1 and 10, 2 and 10, 3 and 10. There we go. And that's the result
that we get right here. So for this to work, you of course need the arrays to
be aligned and to have the same shape.
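The vectorized operations above can be tried with the sketch below. It assumes `a` holds 0, 1, 2, 3 and `b` holds four tens, which matches the sums read out in the lesson; the in place example uses a copy named `c` so that `a` stays unchanged for comparison.

```python
import numpy as np

a = np.arange(4)                  # array([0, 1, 2, 3])
b = np.array([10, 10, 10, 10])

plus_ten = a + 10                 # new array; a itself is not modified
times_ten = a * 10                # same idea with multiplication
pairwise = a + b                  # element by element, shapes match

# Pure Python counterpart: a list comprehension over a regular list.
l = [0, 1, 2, 3]
l_times_ten = [x * 10 for x in l]

# The augmented operators modify the array in place instead of returning a new one.
c = a.copy()
c += 100

print(plus_ten, times_ten, pairwise)
print(a)   # unchanged
print(c)   # modified in place
```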
But when that does work, then the operation is extremely fast in memory, and it's
a vectorized operation like the ones we've seen so far. Why is this topic of vectorized operations
so important? Well, because of the following, which is Boolean arrays. And this is a very,
very, very important thing. If you don't completely get it now, I ask you, please, to go and
check the exercises we have for this lesson, because we're going to use it a ton. And
we're going to see that in pandas the same syntax, the same primitives of Boolean
arrays, apply, and we're going to use the same things. So why are Boolean arrays
similar to vectorized operations? Well, all these operations we've performed here
are just arithmetic operations, mathematical operations, plus something, times something,
etc. If you look at the operators that you have in your programming language, there are
not only mathematical operators, like plus or minus or times; you also have Boolean
operators. And the question now is going to be what happens when you apply Boolean operations,
when you apply Boolean operators to an array. So given our array, what ways did we have
to select different numbers? For example, in this case, we need the first and last elements:
we pass 0 and -1. That's the way we saw with NumPy. We also saw the traditional
Python one, so we can say a[0], and we also want to get a[-1]. So this is the
first way of selecting these elements, and we know there's a second way with multi index
selection. And there is a third way, and this is new, which is with Boolean arrays, right
here. So in this case, we're going to say I want to select the elements in this order,
and you're going to pass either True or False depending on whether you want to select the element
or not. So if you have four elements, you have to pass four Boolean values, saying,
I want to select this element, I don't want to select these ones,
and I do want to select this element right here. So I want the first
one and the last one, and the result will be the same, 0 and 3. So, so far, it's nothing
terribly new. I mean, this is new, but it's not extremely complicated. We are showing
you a brand new way of selecting data: you can select with regular Python indexing, multi indexing, or
a Boolean array. Now, you might be thinking, well, I'd have to manually write True, False, False, True,
True, False, for I don't know how many records; if you have a million records, this is not scalable,
you will not sit and write all the Trues and Falses. But this is actually very important,
because these arrays are the ones that are the result of broadcasting Boolean operations.
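The three ways of selecting the first and last element mentioned above can be compared side by side. This sketch assumes `a` holds 0, 1, 2, 3, as in the lesson.

```python
import numpy as np

a = np.arange(4)   # array([0, 1, 2, 3])

# 1. Regular indexing, one element at a time.
first, last = a[0], a[-1]

# 2. Multi indexing with a list of positions.
by_multi_index = a[[0, -1]]

# 3. A Boolean array: one True or False per element.
by_mask = a[[True, False, False, True]]

print(first, last)
print(by_multi_index)
print(by_mask)
```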
So we saw, again, regular arithmetic operations like this, but we also have it for Boolean
operations. So what happens if we ask for a greater than or equal to the number two,
and the array a is this right here, 0, 1, 2, 3? Then the result is False for zero, False for
one, because they are not greater than or equal to two, True for number two, of course, and
also True for number three. So all the individual elements that match this condition will have
True, and False in the other cases. This is the power of Boolean arrays: we will be able now
to combine these operations. So now we can do a, indexed by a greater than or equal to two,
that is, the elements of a that are greater than or equal to the number two. The
advantage of this is just filtering: we're filtering numeric arrays very quickly
with a very familiar syntax, a greater than or equal to two, and we just provide that as the
index of the operation. It's pretty much what is happening right here: we're saying use
this Boolean array, it's a Boolean list, a Python list with Booleans, to filter,
or sorry, to select elements based on that. But the question is, how do we construct that
list of Booleans? Well, in this case, we have constructed it by including a predicate, by
including a condition that needs to be matched. The result, again, is filtering. It's a query
method: you're looking up some data, you're saying, give me all the elements that
match this condition. So, for example, these values can of course be calculated:
you can say, give me all the elements that are greater than the mean. Or you can actually
provide other Boolean operators, like, for example, all the elements that are
not greater than the mean, which means they're less than or equal to the mean. Or you can also
include other Boolean operators like or and and. So or and and in NumPy are expressed
with a pipe or an ampersand, because we can't use just the regular or and and
of Python; it's a good choice they've made here. So again, this is the
concept of Boolean arrays: we are going to construct these arrays that are just
Booleans, based on conditions. So we have this matrix, and we're going to
say I want to select this one, and this one, and this one, etc. So in that case, this
is the result right here. And we can generate a dynamic Boolean
array; we never manually type all these, we don't sit and say True, False, False, True,
etc. We just run a query, a filtering operation, a Boolean operation, which results in a Boolean
array, and now we can use it for filtering. So again, the idea here is that the operations
we saw in broadcasting before, like a times 10, are also defined for Boolean operators. Boolean
operators return Boolean arrays, which can be used in filtering. That's the idea of all
of it. And you can even combine these operations: you can say a equals zero or a equals one, or
a less than or equal to two and also divisible by two. You can combine all these queries.
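The filtering described above can be sketched as follows, assuming `a` holds 0, 1, 2, 3. In NumPy, or is written with a single `|`, and is written with a single `&`, and not with `~`, with each condition wrapped in parentheses. The "divisible by two" condition is one reading of the combined query mentioned in the lesson.

```python
import numpy as np

a = np.arange(4)   # array([0, 1, 2, 3])

# A comparison is broadcast to every element and returns a Boolean array.
mask = a >= 2      # array([False, False,  True,  True])

# Using that Boolean array as the index filters the data.
filtered = a[a >= 2]

# The value compared against can itself be computed.
above_mean = a[a > a.mean()]             # the mean is 1.5
not_above_mean = a[~(a > a.mean())]      # ~ negates the condition

# Conditions combine with | (or) and & (and); parentheses are required.
zero_or_one = a[(a == 0) | (a == 1)]
small_and_even = a[(a <= 2) & (a % 2 == 0)]

print(mask)
print(filtered, above_mean, not_above_mean)
print(zero_or_one, small_and_even)
```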
So now it looks a lot more powerful than what we were doing before. So moving forward, talking
about linear algebra very quickly, and we're approaching the end of the NumPy
lesson. The important part about linear algebra is that NumPy already contains
all the most important operations for it, already optimized with low level semantics, so it's going
to be extremely fast: dot products, cross products, transposing matrices,
all of that works as expected. And again, this might be very important, especially,
for example, in machine learning, etc. And finally, to wrap
up what we saw in our binary explanation at the beginning, which you might have skipped,
there is the difference in sizes between NumPy and Python, and the difference in terms of performance
between them. So in Python, a regular number, this is just a regular integer in Python, has a
total size of 28 bytes. Just let this sink in for a second. The total number
of bytes, not bits, bytes, that you need in Python to store a simple number such as the number
one is 28. You need 28 bytes to store just the number one. That is extremely
space consuming; it's not very efficient. Larger numbers will take even more
bytes to store. What's the size of the NumPy integers? Well, we've seen it. For
example, we can create integers with eight bytes. We can create integers with one byte,
which is something like what we have here, np.int8; we already know how many bytes it
has, only one byte. So you have control over how many bytes, or bits, your
numbers will take. And you can see here the difference between the size of an integer
in Python, which is extremely large at 28 bytes, and in NumPy, and also the difference in performance.
Here you also have the difference in the size of lists, which is also significant. But I
want to focus on performance. We have two collections: one list that holds the first 1000
numbers, and a NumPy array that holds the first 1000 numbers, and we're going to perform the
same operation on both of them. Let's do the Python one first. In this case, we're squaring
all the elements in the list and then summing the results. We express it by saying, create a
new list with x times x, sorry, x squared, for x in l, and then sum everything. How much
time does it take? 321 microseconds. We're going to do the same thing with NumPy: we say
np.sum of a squared. And you're going to see that it's a lot faster with NumPy than with
Python. And these are all very tiny operations with small numbers. What happens if we add
more numbers? Let's add more numbers here, and more numbers here, and run the same two
operations. As you see here, the units have even changed: we're still in the microsecond
range with NumPy, and we've gone to the millisecond range in Python. So as the size of your
objects increases, NumPy will prove to be extremely fast compared to Python. There are a few
other functions you can see here, for example, drawing random numbers from a normal
distribution, etc. I'm going to leave these for you to look at if you're interested in them.
Remember, you have the exercises, which can help you solidify all the concepts we discussed.
And we're going to move forward now to work with pandas. We're also going to see
visualizations as we keep moving forward in this data analysis with Python tutorial.
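The size and speed comparison can be reproduced roughly as follows. This is a sketch, not the notebook's code: the collection size `n` and the repeat count are assumptions, and the timings will differ from machine to machine (the 28-byte figure is what a typical 64-bit CPython reports for a small integer):

```python
import sys
import timeit

import numpy as np

# Memory: a Python int is a full object, a NumPy integer is a fixed-width value.
python_int_bytes = sys.getsizeof(1)            # 28 on a typical 64-bit CPython
int8_bytes = np.dtype(np.int8).itemsize        # 1 byte
int64_bytes = np.dtype(np.int64).itemsize      # 8 bytes
print(python_int_bytes, int8_bytes, int64_bytes)

# Speed: the same "square everything, then sum" on a list and on an array.
n = 100_000                                    # hypothetical size
l = list(range(n))
a = np.arange(n, dtype=np.int64)

list_result = sum([x ** 2 for x in l])
numpy_result = int(np.sum(a ** 2))

list_seconds = timeit.timeit(lambda: sum([x ** 2 for x in l]), number=20)
numpy_seconds = timeit.timeit(lambda: np.sum(a ** 2), number=20)
print(f"list: {list_seconds:.4f}s  numpy: {numpy_seconds:.4f}s")
```

In a Jupyter notebook the `%timeit` magic does the same job as `timeit.timeit` here; the explicit `int64` dtype just keeps the squared values from overflowing on platforms where the default integer is 32 bits.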
Now it's finally time to talk about pandas, the most important library that we use for data
analysis with Python on a day-to-day basis. It's a library that will aid you in the entire
process of your data analysis project. You're going to start by getting the data, step one,
from multiple sources like databases, Excel files, CSV files, etc. That's all going to get
into pandas. You're going to be processing the data, right? So you're going to be combining,
merging, doing different types of analysis. You're going to be visualizing the data, a bar
chart for example, with pandas, and you're going to be creating reports. You're also going
to be doing simple statistical analysis, and you're going to be doing machine learning, or
something close to it, with the help of other libraries, but all from the platform that the
pandas library provides. It's, again, one of the most important libraries in the data
analysis and data science ecosystem with Python. pandas has recently released version 1.0,
so we are talking about a very mature library. It's been around for a long time now. And
again, it's the primary library that we use in Python for data analysis and data science. So
I'm going to do a quick introduction to the data structures that pandas has, and we're going
to understand how they work, so we can start building the foundations. I need you to be very
familiar with the way the data structures from pandas work. And then we're going to move
into other things like reading files, grouping data, etc. So
to get things started, we're going to talk about the first data structure that pandas has,
which is the series. In reality, pandas has two main data structures that it uses all the
time: the series and the data frame. The data frame is the one you will probably be more
familiar with. It looks just like an Excel table. But we're going to start with the series
first. Okay, so just stay with me here. We're going to talk about series for a second. In
this case, we have imported pandas, and we have also imported NumPy. As I told you before,
in the NumPy part of this tutorial, NumPy is fundamental for data analysis because every
other library, pandas, matplotlib, they all sit on top of NumPy, and you can see it right
here. We're going to be using some features from NumPy within this lesson, too. So this is a
series in pandas, what you see right here. The concept of a series is this ordered sequence
of elements, all indexed by a given index, of course. And you might think that this looks a
lot like a Python list, right? In this case, we're storing the population of countries, in
millions of inhabitants. It's called g7_pop because we're getting the population of the
Group of Seven; you can consult the Wikipedia page. But basically, we are storing population
here in this series. And again, it looks a lot like a list, but we're going to find a ton of
differences. The first one is that the series has an associated data type. This is something
we saw in NumPy, where a NumPy array couldn't hold different types of objects; we only had
one type of object. In this case, it's float64, so all the numbers of the series will be of
type float64. The underlying data structure that pandas is using to store these objects is a
NumPy array. A second difference we see very quickly is that a series can have a name. So
now when we display the series, we see that it has a name. Right now it might not make a ton
of sense, but once this series is part of a data frame, in the form of a column, the name is
going to make a lot more sense. So moving forward, again, we saw that it has a type. And
again, this is because the data is backed by a NumPy array that you can always consult. You
can check the values of a series, and you're going to get the array that is backing that
pandas series, right, so you can see that it's a NumPy array.
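A small sketch of the series described above. The population figures are illustrative values in millions (assumed for this example, not read off the video), and the name string is likewise an assumption:

```python
import numpy as np
import pandas as pd

# Illustrative G7 populations, in millions of inhabitants (assumed figures).
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

# A series can carry a name; it becomes the column name inside a data frame.
g7_pop.name = "G7 Population in millions"

print(g7_pop)
print(g7_pop.dtype)          # every element shares one data type

# The data is backed by a NumPy array, which you can always inspect.
backing_array = g7_pop.values
print(type(backing_array))
```

Because no index was passed, pandas assigns a default range index from 0 to 6 here.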
Once you have the series we were just looking at here, g7_pop, you can select elements as
you would in a regular list, right? For example, give me the first element, give me the
second element, the last element, etc. And that's because a series inherently has an index,
similar to a list in Python. If I create l equals 'a', 'b' and 'c' (there is something wrong
here, a missing quote), in this list we don't see it, but the idea is that there is an index
here: this is zero, this is one, and this is two, right? In the pandas series, this is a lot
more explicit: each element has an associated index value. And you might think that it's
pretty much the same thing. Both the list and the series are sequences, ordered sequences of
elements. But we're going to see that there is a fundamental difference, and it's that we
can arbitrarily change the index of a series. When we created it, we didn't assign any
indices, so by default it was a range index from zero up to n minus one. But you can
actually say, arbitrarily, what the index of your series is. In this case, the series now
has these indices that we're seeing right here. Why is this important? Because now we're
going to be referring to these values not by sequential position but by a label, by the
index, which has a meaningful name for us humans. So now this thing looks a little bit more
like a dictionary, we could say, than a list. We started out thinking that a series was
similar to a list, but now we can think that a series is similar to a dictionary.
But wait, don't get me wrong here. The series has a fundamental trait: it's still ordered,
something that didn't happen with dictionaries. Dictionaries in Python were not ordered.
Actually, since Python 3.7 they keep insertion order, but we shouldn't be thinking of them
as ordered data structures. A series, in this case, is ordered. So it has both advantages.
It's ordered, Canada is always before France, because that's how we decided to create it,
but it also has names, or labels, or keys associated with the values, as a dictionary does.
So this is creating the series from scratch. As you can see, you can create a series by
passing the index. It doesn't have to be a two-step process where you first create the
series and then add the index; in this case, you can do everything at once. And the indexing
is now going to be done by those labels that make up the index. So g7_pop, we see, has these
countries with these populations. Before the index, when we said, I want to get the
population of Canada, we had to remember the position of Canada. Oh, it's the first of the
countries, so we have to do g7_pop[0]. With the index, we can now just ask, what's the
population of Canada? What's the population of Japan? And as you can see, the syntax is the
same as with a Python dictionary: you pass the key and you get the value. So again, in
summary, the advantage of a series is that it's an ordered sequence of elements backed by a
NumPy array, very efficient and very fast. But it also has an index that can take any labels
we pass, so it's going to make it a lot better for indexing. When you have a series, you can
still get the elements by their sequential order. After all, it's a sequential data
structure, and it doesn't matter that you have an index. You can still say, hey, I know we
have an index, but I want to get the last element, or the first element, or the second
element. You do that by using the iloc attribute: from this series, I'm going to locate by
sequential position the element in position zero, or the last element. And that still works
as expected. Series also support multiple indices, as we saw with NumPy. In this case, we
can get two elements, or three, or n elements; you can pass multiple indices. And the same
thing happens with sequential multi-indexing. Series also support range selection, or
slices. But pay attention, this is very important: there's a fundamental difference with
Python. In Python, the upper limit of a slice is not returned. From the list that we created
before, if I do l up to two, I don't get the element 'c', right? This is zero, this is one,
this is two, and two is not included. In our pandas series, the upper limit is indeed
included. So when you ask from Canada up to Italy, Italy is in the result. Okay, so this is
something to consider when using index selection in pandas. I think this is still valid, and
I understand the reasoning behind it; it's just different from Python. So, you should
remember Boolean arrays, which was a topic
we discussed in our previous lesson on NumPy. Boolean arrays are still a thing in pandas;
the difference is that instead of saying Boolean arrays, we should say Boolean series. The
idea is that we are able to perform operations on top of series. For example, right here we
have mathematical operations on top of a series. In this case, we have the series g7_pop,
which, as I told you at the beginning, is in millions of inhabitants. If we want to get the
series in units, we need to do g7_pop times 1 million, and there we go, now it's in terms of
units. These vectorized operations, these broadcasting operations, can also be performed
with Boolean operators. So instead of a multiplication, a sum, a subtraction, etc., we can
use Boolean operators. In this case, we ask, what are the countries that have more than 70
million inhabitants? What we receive as a result is a Boolean array, or a Boolean series,
right? Basically, it's the same concept as with a NumPy Boolean array. Canada and France do
not have more than 70 million inhabitants. Germany does have more than 70 million
inhabitants, here it's 80, and the same goes for Japan, and the same for the US; the US also
has more than 70 million inhabitants. So again, the Boolean array, or Boolean series in this
case, works in the same way as with NumPy. And selection also applies. I can now say, give
me from this series, g7_pop, all the countries that have more than 70 million inhabitants,
where the value is more than 70. So now, again, we are building filtering, we're building a
query language, if you want, on top of pandas: we're selecting data based on this condition.
If you ever have trouble remembering all this, the idea is that you can always track down
the way this index is being built. It's not that the selection knows anything about how to
select countries with more than 70. This operation was performed first, which resulted in
this series, and now the series will be indexed by this Boolean array. And the result is as
you can see. And again, these operations can be run with calculated values, and with all the
operators we saw in our previous lesson: the not, the or, which is this regular pipe, and
the ampersand, which is the and. All these can be applied in any order you want. So if we
read this thing, which is complicated on purpose, it's saying give me all the elements that
are above the mean minus two standard deviations, or below the mean... actually, above the
mean, and here it was below the mean. I'm not sure this is correct, but it doesn't matter.
It's just an or operation between two conditions. It's actually above the mean minus the
standard deviation. So we're applying this or operation, the one we had before. The not, the
or and the and, they all work with Boolean selection as well. The mathematical and
statistical operations we saw in NumPy, sum, mean, average, standard deviation (we were
actually using standard deviation just before), are all still relevant in this case. But you
can also use traditional NumPy functions with our pandas series, because again, a pandas
series is internally backed by a NumPy array.
So this is all the same, as you can see. Here is an example that's a little bit clearer:
we're getting all the countries that have more than 80 million inhabitants and less than 200
million inhabitants. So it has to be above 80, but it also has to be below 200. Or in this
case, we say either above 80 or below 40. So that's with the or operator, and there's also
the not operator. Modifying a series is relatively simple. Whenever you select a value, you
can just assign to it directly. In this case, we're saying Canada is now 40.5. I don't know
why, we just wanted to do it. This is by index, and you can also do it by sequential
position. In this case, we're going to say the last country should have 500 now, and we see
right here that the last country has 500 now. Or you can also modify elements based on
Boolean selection. You can say all the countries that have less than 70 million inhabitants,
all these from our previous query, will now be 99.9. As you can see, it has changed all
these countries. So assignment works by direct indexing, and it also works by Boolean
indexing. And this is going to be extremely important when we are cleaning data. So let's
move forward and start talking about data frames now. Before that, you have exercises for
series, and also for data frames, so I recommend you check them out. So talking about data
frames, this is what a data frame is going to look like. It's pretty much the same thing as
an Excel table. So this was our series, and this is going to be our data frame. It's a
table, so it looks a lot like an Excel spreadsheet. And actually, it's very common to create
pandas data frames out of CSV files, which are basically tables. We created it with this
DataFrame object. There you go, this is our data frame. And as you can see, it has columns
that we have assigned, in this case we were defining the columns, and we have rows of values
right below each one of these columns. What's the similarity with series? It's that a data
frame column is basically a series. So we can think of a data frame as a combination of
multiple series, one per column. We're going to assign an index to the data frame the same
way that we did with our series. So in this case, this is our data frame. Sorry, right here.
This is our data frame that has the index, right?
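As a sketch of the two steps just covered, here is assignment on a series (by label, by position and through a Boolean series) followed by building a data frame with an index. The column names follow the ones mentioned in the discussion; every numeric value is an illustrative assumption:

```python
import pandas as pd

countries = ["Canada", "France", "Germany", "Italy", "Japan",
             "United Kingdom", "United States"]

# --- Modifying a series (values are assumed, in millions) ---
g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=countries,
)
g7_pop["Canada"] = 40.5        # assign by label
g7_pop.iloc[-1] = 500          # assign by sequential position
g7_pop[g7_pop < 70] = 99.9     # assign through a Boolean series

# --- Building a data frame: one column per key, one shared index ---
df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
        "GDP": [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
        "Surface Area": [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
        "HDI": [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
        "Continent": ["America", "Europe", "Europe", "Europe", "Asia",
                      "Europe", "America"],
    },
    index=countries,
)

population_column = df["Population"]   # each column is itself a series
print(df.columns)
print(df.index)
```

Note that the Boolean assignment runs after Canada was set to 40.5, so Canada ends up at 99.9 as well.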
And it has the columns as we had before. What columns do we have? What's the index of the
data frame? These are all attributes that you can consult. There are a couple of very
interesting methods on data frames that we use all the time. The first one is the info
method, which is going to give you quick information about the structure of your data frame.
It's going to tell you what columns you have: population, GDP, surface area, HDI, continent.
And it's also going to tell you the types and how many null values you have. It's actually
telling you how many non-null values you have, but we use this when we're cleaning data to
quickly identify the columns that have missing values. We can check the size of the data
frame, and we can check the shape. This is similar to a matrix, right? A data frame is
pretty much a two-dimensional array in NumPy. And similar to info, which again is a way to
check a summary of the structure of the data frame, we can also use describe, which is going
to give you a summary of the statistics of the data frame. In this case, what we see is that
it only includes the columns that are numeric. Continent is not here, for example; you can
see its type is object, which is basically a string. For all the numeric columns, we're
going to have summary statistics. For example, for population: how many elements we have,
what the mean is, right, the average, what the standard deviation is, the minimum, the
maximum, and in between a few percentiles, the 25th, 50th and 75th.
So these are quick summary statistics. And we do this a lot, so keep in mind that the
describe method is very popular.
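A short sketch of these inspection helpers on a small data frame. The table contents are illustrative assumptions; `info`, `shape`, `size`, `describe` and `dtypes` are the standard pandas attributes and methods:

```python
import pandas as pd

# Illustrative data frame (all numeric values are assumptions).
df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
        "GDP": [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
        "Surface Area": [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
        "HDI": [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
        "Continent": ["America", "Europe", "Europe", "Europe", "Asia",
                      "Europe", "America"],
    },
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
)

df.info()                  # columns, non-null counts and types, printed to stdout
print(df.shape)            # (rows, columns)
print(df.size)             # total number of cells

summary = df.describe()    # statistics for the numeric columns only
print(summary)

print(df.dtypes)           # one data type per column
```

Comparing the non-null count reported by `info` against the number of rows is a quick way to spot columns with missing values.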
As you could see in the info method, the columns have associated types, and this is very
important. Continent is an object, which means that it's basically a string, HDI is a float,
and surface area is an integer. And that's because pandas, through NumPy, is automatically
recognizing the correct type to assign to each one of the columns. This is similar to what
we saw with a series, in which the series had an associated data type; a series was of a
given data type. So that's something you cannot change. And in this case, by checking the
value counts of the types, you can have a quick reference of the types of your columns. So
moving forward, how will we be selecting data from data frames? Well, there are a couple of
methods. And this might be a little bit confusing.
So what I'm going to do is skip ahead and just give you a quick reference first, and then
you can read, if you want, through the process we follow here. Given a data frame, and these
are just a few quick rules, you're going to select by index using the loc attribute. The loc
attribute will let you select individual rows. For example, I want to get Canada, and those
are the values for Canada. The iloc attribute will let you select, similar to the series,
the row by sequential position. Let's say we want to select the last row; in this case, it's
the United States of America. So again, loc lets you select rows by index: give me the row
under this index. iloc will let you select rows by sequential position: give me the last
row, the first row, the second row, etc. And finally, without using loc and without using
iloc, just by saying df of something, you are selecting that column: give me the entire
population column. Right here, the entire population column. So what you're seeing here,
first of all, is a quick reference: .loc will give you a row by index, .iloc will give you a
row by position, and just doing df of something is going to give you the column that you are
passing. So both loc and iloc work in a horizontal manner, give me this row, while df of
whatever works in a vertical manner, which is getting you a given column.
these one, they're all series, what are being returned our series. So that's what we saw
before. And the way it works is first, if we focus in this last example, we're going
to see that it's pretty standard, just these series right here was is a one return I remember
it has a type and everything. So that's, that's fine. If If we ask for a row, like in this
case, we can get for example, here easily. There you go. The result is also series. But
what you can see here is that this thing is kind of transposed in a way dot here was the
volume of this year is population is here, and GDP is here and surface area is here HDI
on continent on here you have volleys. So it's it's again, it's it's being transposed,
right from vertical to horizontal, in our regular series manner on the index of this
series is extracted as the name
that the column hot. So in this case, the name right there is the value of the index
that it had. So you can read more about it right here. But I just want you to remember
these rules don't lock you select by index dot I lock you select by sequential position,
the F at something you go by column, there are times when these might not apply. So or
not want to apply, there will be some issues. So for example, if your rows if your index
is numeric, you might have issues with these form or dot form, just respecting these three.
For now, it's gonna get you any element you want to get either by row or column. So from
what we've seen, slicing also works as expected. We can get, for example, Italy, or we can
get from France up to Italy, and the upper limit is included. But again, this is with loc,
so we select by index, from France to Italy. We can also use the second dimension, similar
to the way we worked with NumPy. We can get all the countries from France up to Italy,
including Italy, but only the population column, or population and GDP. So here you can see
the concept of multiple dimensions in selection also being applied to data frames. As for
iloc, it works in the same way, with multi-indexing and slicing. So we get, for example,
from one to three, right, in sequential positions. In this case, the upper limit is not
included, so that's another difference from what we had with loc. And we can also use
multiple dimensions: we can say, give me the countries from one to three, and the column
should be, zero, one, two, three, the fourth column, the column under index three, which is
HDI. So that also works as expected. And again, my recommendation: always use loc and iloc
to select rows, and just use the naked data frame to select columns, as we saw before.
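The three selection rules can be put side by side in a short sketch. The data frame contents are illustrative assumptions; the selections mirror the ones described above:

```python
import pandas as pd

# Illustrative data frame (all numeric values are assumptions).
df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
        "GDP": [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
        "Surface Area": [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
        "HDI": [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
        "Continent": ["America", "Europe", "Europe", "Europe", "Asia",
                      "Europe", "America"],
    },
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
)

canada = df.loc["Canada"]         # row by index label, returned as a series
last_row = df.iloc[-1]            # row by sequential position
population = df["Population"]     # column by name

# loc slices by label and includes the upper limit.
france_to_italy = df.loc["France":"Italy"]
two_columns = df.loc["France":"Italy", ["Population", "GDP"]]

# iloc slices by position and excludes the upper limit.
rows_one_and_two = df.iloc[1:3]
hdi_by_position = df.iloc[1:3, 3]     # column under position 3 is HDI

print(canada)
```

A row comes back as a series whose index holds the column names and whose name is the row's label, which is the "transposed" view mentioned in the discussion.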
Now moving forward: conditional selection, Boolean arrays, Boolean series, whatever you want
to call it. This also works for data frames. And it's very important: it's a way to filter
data, a way for us to query the data. In this case, we want to select all the countries in
which the population is greater than 70, so all the countries that have more than 70 million
inhabitants, similar to what we did with a series, but in this case we want to do it with a
data frame. What you're going to see here is that we construct a Boolean series as we did in
our previous video, right? For every country, does it have more than 70? False, false, true,
false. And we're going to inject that result, that Boolean series, into a .loc selection:
give me all the countries that have a True value in it. And remember, this is kind of a
mnemonic, a way to remember: the way pandas knows how to filter things is by matching this
index, from the resulting series, with the index of the data frame. These are two completely
different objects, but their indexes match. Here, Japan matches, Germany matches. Germany
and Japan are the same in both, and that's why it's working as expected. This is just the
first dimension, which is give me these rows. You can also add the second dimension, saying
give me this column, or these columns, right? So that still works as desired. So what about
dropping stuff? Whenever you have a data frame, you can say give me just these pieces, or
you can say drop the others; it's pretty much the same. Dropping is very simple. You can
drop by index: drop this value, drop Canada altogether, period, or drop these indices,
Canada and Japan. Or you can also drop columns: drop population and HDI as columns. These
also have a more advanced usage, which is with axis, similar to NumPy. I don't recommend it
so much, but you can still use it, and you can see the examples here.
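A sketch of conditional selection and dropping on a data frame. The table values are illustrative assumptions; the 70-million threshold and the dropped labels follow the discussion:

```python
import pandas as pd

# Illustrative data frame (all numeric values are assumptions).
df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
        "GDP": [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
        "Surface Area": [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
        "HDI": [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
        "Continent": ["America", "Europe", "Europe", "Europe", "Asia",
                      "Europe", "America"],
    },
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
)

# Step 1: the comparison builds a Boolean series that shares df's index.
over_70 = df["Population"] > 70

# Step 2: .loc keeps the rows whose label is True in that series.
big = df.loc[over_70]
big_two_columns = df.loc[over_70, ["Population", "GDP"]]   # second dimension

# Dropping returns a new data frame; df itself is left untouched.
without_canada = df.drop("Canada")
without_two = df.drop(["Canada", "Japan"])
without_columns = df.drop(columns=["Population", "HDI"])
same_with_axis = df.drop(["Population", "HDI"], axis=1)    # the axis form

print(big)
```

The `columns=` keyword and the `axis=1` form are equivalent; the keyword version tends to be easier to read.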
So all the operations we've seen so far, they all still work. The most important part here
is the broadcasting operation that we're going to do between a series and a data frame.
We're going to create a new series, crisis, and I'm going to show you what it looks like. So
we have crisis here. And we're going to perform a broadcasting operation, I'm going to show
you what this thing looks like first, between the two, this data frame and crisis. The
result will be that we subtract, I don't know what this number is, 1 million, we subtract 1
million from each value in here, and we subtract 0.3 from the HDI of each one of those. What
you can see here is, again, this alignment between columns and indices: the GDP here is
matched with this GDP, and the HDI is matched with this HDI. They are two different,
independent objects, this series and this data frame here. But when we combine them with an
operation like this, the columns, in this case GDP and HDI, are aligned and they work
together. So you're going to subtract this value from all the values in this column, let me
remove this, and you're going to subtract this other value from all the values in this other
column. That's the way it's going to work. So moving forward, what about modifying data
frames? Now I want to show you something.
And that's that when we were dropping stuff before, we were not actually modifying the data
frame. Here we did df.drop of Canada, but df still has Canada in it. And that's because,
similar to what happened with NumPy, these operations are all immutable. They are not
changing the underlying data frame; we're creating new data frames that store the result of
the given operation. In this case, we drop Canada and the result is this new data frame, but
the underlying data frame is not changed. That's because, again, they are immutable
operations. 99.9% of the operations in pandas are immutable. There are ways to change that,
there are ways to make the changes permanent, but for now I want you just to think that
everything is immutable. Whenever you perform an operation, it's going to create a new
series or data frame. If you want to keep the result, you will just need to do something
like df2 equals that, or even df equals that, you know, to replace the current data frame.
Again, there will be a way to not do that.
But we're going to see it in a sec. So, modifying data frames more explicitly: how can you
create a new column? Well, very simple: assign a column. Let's say here, in this column
right here, I want to say language. Oh, it's read-only. But I'd say language equals, and I
could just write whatever I want. In this case, what we've done is, let me show you what
langs had: it was a tiny series, and it didn't have elements for all the indices in the data
frame, but that doesn't matter. pandas will match all the indices that actually exist, and
it will leave the rest blank. This NaN is what we use for a blank. It's "not a number", from
NumPy. We're going to talk more about it when we start doing data cleaning. So again, langs
has France, Germany, Italy, and you can see the values are all up there. What happens if the
language column already exists and you want to change it? In this case, we're going to say
df language equals English. So we're going to change it altogether: df will now be affected,
and all the values of language will be English. How can you realize when an operation is
changing the underlying data frame, the underlying series or the underlying NumPy array?
It's usually when you have an equals symbol. Remember, in NumPy we saw something like plus
equals. Whenever you have an equals symbol, you're modifying the underlying data frame.
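A sketch of the three ideas from this part: broadcasting a series against a data frame, operations that return new objects, and column assignment. The table values, the contents of `langs` and the exact figures in `crisis` are assumptions chosen to match the description (1 million off GDP, 0.3 off HDI):

```python
import pandas as pd

# Illustrative data frame (all numeric values are assumptions).
df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
        "GDP": [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
        "HDI": [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
    },
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
)

# Broadcasting: the series index (GDP, HDI) is aligned with the column names.
crisis = pd.Series([-1_000_000, -0.3], index=["GDP", "HDI"])
after_crisis = df[["GDP", "HDI"]] + crisis

# drop returns a new data frame; keep it by assigning it to a name.
df2 = df.drop("Canada")

# Assigning a shorter series creates the column and fills the gaps with NaN.
langs = pd.Series(
    ["French", "German", "Italian"],
    index=["France", "Germany", "Italy"],
    name="Language",
)
df["Language"] = langs
missing_languages = int(df["Language"].isna().sum())

# Assigning a single value overwrites the whole column in place.
df["Language"] = "English"

print(after_crisis)
print(df)
```

Only the statements with an equals sign on `df` change it; the arithmetic and the `drop` call leave the original data frame as it was.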
So for example, check this out: the rename function, or method, of a data frame will let you
pass columns and indices to rename. In this case, we want to change United States to USA,
United Kingdom to UK, and Argentina to AR. Argentina doesn't exist in this data frame, but
that doesn't cause a problem, and that's what we wanted to show you. US and UK were renamed
correctly, and HDI was renamed correctly, and APC, which doesn't exist, didn't cause any
problems. Now, why am I showing you this? Because remember, these operations are immutable.
If I check the state of the data frame, we see that the original data frame has not been
changed. HDI is still HDI. It doesn't matter that we renamed it before; it's still the same
data frame. The same thing goes for the indices. All these operations are immutable. A few
more examples of modifying data, just for you
to look at. And something that is very common for us is creating columns that are combinations
of other columns. So again, this is read only, but you can you can imagine, that I could
do is hear something like for example, GDP per capita, right? If I go here, and I do
GDP per capita, GDP, p per capita, per capita, and here I say is equals to the GDP, this
column divided by this column, right? So I do something like B, B three, actually, C
three, C three, divided by b three, right. And then we would extend the values all the
way along here. In pen this, we could do something very similar. We can do just any column, we
can just perform operations, broadcasting operations between them, in this case, GDP
by population. And we can assign that series which is a result right there. So it's a series
we are going to assign that series to a new column. So GDP per capita Now, there you go
is now a column of our data for. Again, all these broadcasting operations are extremely
fast, they are backed by their NumPy array, and they result in a series. So very quick
statistical information, a few methods, right to do summary statistics. We saw them with
this crime method. But minimum maximums mean, median, all that works as expected. Something
that I want you to note here, if possible, is that with pandas, we have, I'm going to
change colors here, we're going to use red. With pandas, you have this concept of a data
frame, right data frame that has multiple columns, multiple rows. And these operations
are resulting operations are resulting in just one series. So in pandas, you have your
data frame, and you have your series. And we could say we have individual numbers. And
it's like always, the data frame is always resorting back to this, it's like some operations
will just return a series. And the series can be used in a data frame, right. So in
this case, these resulted in a series, but then we merely use the series to set the value
of a column. Right. So that's why understanding series is so important. So there are a few
more assignment exercises for you here. So you can check them out and complete them if
it's going to make a little bit more sense once you're working with it.
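The rename and column-creation steps described above can be sketched roughly as follows.
The figures are made up for illustration and the data frame is a hypothetical stand-in
for the one in the tutorial's notebook; only the method calls are the point.

```python
import pandas as pd

# Hypothetical figures, made up for this illustration.
df = pd.DataFrame(
    {
        "Population": [10.0, 20.0, 30.0],
        "GDP": [1000.0, 4000.0, 1500.0],
        "HDI": [0.9, 0.8, 0.7],
    },
    index=["United States", "United Kingdom", "Italy"],
)

# rename returns a new data frame; labels that don't exist (Argentina) are ignored.
renamed = df.rename(
    columns={"HDI": "Human Development Index"},
    index={"United States": "USA", "United Kingdom": "UK", "Argentina": "AR"},
)

# The original is untouched: HDI is still HDI.
print(df.columns.tolist())

# Dividing two columns gives a Series, which we assign as a new column.
df["GDP Per Capita"] = df["GDP"] / df["Population"]

# Summary statistics reduce the data frame to one Series (one value per column).
means = df.mean()
print(means)
```

Because rename does not change df, the renamed labels only exist in the returned object;
assigning the result back to a name is what keeps them.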
Finally, I want to give you a very quick introduction to reading external data and plotting.
To do that, we're going to use a few functions that are very popular. We can look them up
very quickly here: we're going to use the read_csv function from pandas. And just as we
have read_csv, we have a few others: read_sql, read_excel, read_xml, read_json and more.
read_html will even be able to automatically parse an HTML page and read it. A few
functions like these will let us import data from an external source into our pandas
workflow. In this case, what we're going to read is this BTC market price file. It's
right here; if I open the CSV, this is what it looks like: the timestamp at which the
price was read, and the value. As a side note, this is the price of Bitcoin in 2017; now
it's close to $9,000, I think. But that's just a side note. Again, this is a CSV, and it's
the CSV that we're going to be reading. To do that, we're going to use the read_csv
function, which will automatically parse the CSV, as expected. And there you go. The
process now will be for us to start tuning it to get to the right point, so I'm going to
show you a few customizations we can do with the read_csv function. Sorry, let me tell you
first: we have a ton of parameters here, so there is a ton of customization we can do with
read_csv. You will not remember all of this off the top of your head, so don't worry, you
can always go back to the documentation and just practice; it's going to come naturally.
So, the first thing: the first row of the CSV was considered to be the column names. In
this case, the file doesn't have column names. Let's say I add them: I'm going to type
Timestamp, Price, save the file and re-read it. There you go. By default, pandas
assumes that the first line of the CSV holds the column names. I'm going to put the file
back to what it was, and I'm going to show you again that that's the assumption pandas is
making. We are of course going to change that assumption, because in this case our CSV
file does not have column names. So we're going to just say header=None, and this is when
we start seeing the parameters of the read_csv function. When I pass header=None, that
means: don't read a header, don't try to infer a header from the CSV file. Now the columns
are 0 and 1, so I'm going to change the columns and say they are actually going to be
Timestamp and Price. What I'm going to do now is show you the first rows. You're seeing
here that I'm calling the df.head() method. That's because this is a significantly large
file; well, not that long, but at least it doesn't fit on my screen. What's the shape of
the data frame? It has 365 rows and two columns. We can call df.info(), for example, to
get a little more detail: we have 365 values, there are no null values, Price is actually
a float, and Timestamp is an object, which we're going to fix in a second. df.head() and
df.tail() are the methods we use to get either the first n rows or the last n rows, which
is five rows by default. You can change that and say, show me the last three rows, for
example. And again, the dtypes: in this case, the Timestamp column was not properly parsed
as a date. It was parsed as an object, as a string, which we don't want. So we're going to
use the function pd.to_datetime, something we're going to explore in more detail in the
data cleaning part of this tutorial. We're going to use the to_datetime function to turn
the column df['Timestamp'] into an actual date. Now we say df['Timestamp'] equals the
result of this function, and everything looks as expected. There is one more change that
we want to make: we want to set the index of the data frame to be the timestamp, because
by doing so we can quickly access price information. Let me see what the price of Bitcoin
was on 2017-09-29. And I made a mistake here, I forgot to use .loc. There you go, now we
have the value of Bitcoin on this particular date. Remember that to get the values from a
particular row, you have to use .loc. There we go.
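The step-by-step process above can be sketched as follows. This is only an illustration:
an in-memory string stands in for the CSV file, and the three prices in it are made up,
not taken from the tutorial's data.

```python
import io

import pandas as pd

# Stand-in for the CSV file: two columns, no header row. Values are made up.
csv_text = """2017-09-28 00:00:00,4201.98
2017-09-29 00:00:00,4193.57
2017-09-30 00:00:00,4335.36
"""

# 1. Read without inferring a header; the columns are named 0 and 1.
df = pd.read_csv(io.StringIO(csv_text), header=None)

# 2. Give the columns meaningful names.
df.columns = ["Timestamp", "Price"]

# 3. Turn the Timestamp strings into real datetimes.
df["Timestamp"] = pd.to_datetime(df["Timestamp"])

# 4. Use Timestamp as the index so rows can be looked up by date.
df = df.set_index("Timestamp")

# Row lookup by label goes through .loc.
print(df.loc["2017-09-29"])
```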
So we are getting that particular value. Because we've made Timestamp the index, we get
this value directly from the index. So what happens if you want to turn this thing into an
automated script? For example, I want to run this process every day at 5am, or whatever:
we want to read the CSV, rename the columns, turn them into timestamps, etc. This is what
we've done so far: read the CSV without a header, create the columns, turn Timestamp into
a datetime and assign it to the index. And that's the result again. Well, actually, the
read_csv function is so powerful that it will let us do all these actions in just one
call. There are parameters that will let you customize the behavior to achieve the same
result that we got with four lines of code right here. So in this case, we're going to
say: read this CSV; don't infer a header from the first line; these are the column names,
so we don't need an extra line for them. Oh, and by the way, the first column is going to
be the index of the data frame. Oh, and also parse the dates: the index is a date, so
parse it. And we have the same result as before. Now I'm going to try the same thing.
There we go, you can see it works. So, very quickly, pandas plotting. (I don't know what
this thing is, there's a vertical scroll here.) I want to show you very quickly that
creating plots with pandas is just a breeze. It's so simple to create a plot. Given a data
frame, you can always invoke the plot method, and what the plot method is doing is using
the matplotlib library, something you can check in the docs if you want. For now it's not
necessary; this is going to be more than enough. What it's doing is just using the regular
matplotlib library, as you can see, which is part of the standard PyData stack. And for
us, accessing it through pandas is extremely simple: just df.plot() and you're done. You
can style the plot as you want. We're going to see more details of matplotlib later, so
don't worry too much about that. There is a more challenging example here that I can just
run very quickly; you can inspect the process we followed to fix the data. But this is
what we have, there we go.
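Going back to the single read_csv call described a little earlier, a rough sketch could
look like this. As before, an in-memory string with made-up prices stands in for the real
file; the parameters (header, names, index_col, parse_dates) are the standard read_csv
ones.

```python
import io

import pandas as pd

# Stand-in for the CSV file: two columns, no header row. Values are made up.
csv_text = """2017-09-28 00:00:00,4201.98
2017-09-29 00:00:00,4193.57
2017-09-30 00:00:00,4335.36
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,                   # don't infer a header from the first line
    names=["Timestamp", "Price"],  # these are the column names
    index_col=0,                   # the first column becomes the index
    parse_dates=True,              # parse the index as dates
)

print(df)

# With matplotlib installed, df.plot() would chart Price against the date index.
```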
What you can see right here is the difference between Bitcoin and Ether in this period
of time, and they are both plotted in the same chart. That's because this is the resulting
data frame: we have Bitcoin on one side and Ether on the other side, and we are creating
one plot with all of it. We are also noticing this empty value right here. So what we can
do is go from December 1st up to January 1st: we can select that period using .loc and
just go ahead and plot it again. And this is what you see right here, the gap that we were
seeing. So again, this was the introduction to pandas. We have a real-life example of
pandas following up. We also have a little bit more data cleaning, and reading all the
interesting files and sources of data to get more data into the pipeline. The idea is
going to be showing you how you can import data from Excel and from SQL, and then do the
actual processing and analysis.
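Selecting a period with .loc, as described above for the Bitcoin and Ether chart, can be
sketched like this. The prices and dates are hypothetical, chosen only so that one value
is missing inside the selected window.

```python
import numpy as np
import pandas as pd

# Hypothetical prices; one Bitcoin value is missing on purpose.
dates = pd.to_datetime(
    ["2017-11-15", "2017-12-01", "2017-12-15", "2018-01-01", "2018-01-15"]
)
prices = pd.DataFrame(
    {
        "Bitcoin": [7000.0, 10800.0, np.nan, 13400.0, 13600.0],
        "Ether": [330.0, 460.0, 690.0, 770.0, 1280.0],
    },
    index=dates,
)

# Label-based slicing on a date index includes both end points.
window = prices.loc["2017-12-01":"2018-01-01"]
print(window)

# With matplotlib installed, window.plot() would draw both columns in one chart,
# and the missing Bitcoin value would show up as a gap in its line.
```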
Now it's time to talk about data cleaning. We have arrived at the point in our tutorial
where we have pulled the data, I've shown you how to manipulate it with pandas, at least
an introduction to data manipulation with pandas, and now it's time to properly fix it.
For the sake of brevity, we are skipping a few parts of the data cleaning process. In
particular, you're going to find them in this first notebook, which talks about the
basic concepts of missing data with Python and NumPy. We're going to skip a few other
things too, but I'm going to mention them in a pretty general form, and then you can of
course dig deeper; you can check our courses if you want to know more about it. Usually,
when we talk about data cleaning from a more conceptual level, we talk about a four-step
process. The first step is usually finding missing data, which is the simplest problem to
identify in a data set: something is missing. Say you have car sales data, and there is a
car that has no name, or there is a car that has no price. So there is a number missing,
or a category missing, or a string missing, and of course each one of those is going to
have a different meaning. How to fix a data set that is missing data can be very simple,
if you can just, for example, drop the record, or if you can fill the value, right.
So for example, if the price is missing, you can fill it with the average value of the
sales data or something like that. Or it can be very complicated, if the value is
important and you can't move forward until you actually find that missing value. It can
involve something like picking up the phone and calling your ETL team to ask what's going
on, why the data is missing. Or, if you're buying the data, you have to call the vendor
and ask them why there is data missing when you're paying for it. So it can be a very
political process; it depends on your use case. But again, from a technical perspective,
identifying missing data and fixing it is going to be extremely simple. Once you have
fixed the missing values, you keep looking, assuming the data is not clean yet. The second
step in this process of data cleaning is when there are invalid values. You have, for
example, a column that is price, and there is a string in it. We're expecting only
numbers, and there are strings in it. That's not going to be complicated to identify, and
it's not going to be too complicated to fix. But again, we're increasing the complexity
until, at the end of this data cleaning process, we reach problems that have to do with
the domain of the data you're looking at. For example, you have a column that is customer
age, and there is a value that is 170. That is not an invalid value; it's a perfectly
valid integer. The problem is that, given the domain, speaking about customer age, it is
highly unlikely that a customer is 170 years old. In that case, the value is completely
valid, there is no missing data, there are no invalid values; it's just about the domain.
And this is when things get very complicated. The age example is something that resonates
with all of us, because we know about the age of humans. But if you're working as a data
analyst in a domain that you don't know much about, you might not be able to judge
whether a value is invalid or not. If I am working in a biology lab, and I have something
like white cell count per milliliter of blood, I don't know what's a good value and what's
an invalid value. You need to know the domain. So that's usually the most complicated part
of data cleaning: when you reach the point where everything is valid, everything checks
out, and now you need to make sure that each value is valid for the domain you're working
in. Again, this is the spectrum that we're going to be revisiting today. To get things
started: the way pandas works with null values is that it has four functions, some of
which are actually synonyms. It's going to be relatively simple, just trust me on that.
A few things first. Everything that pandas does when handling missing values is related to
the way NumPy works. Again, we're skipping that; you can go to that notebook and check it
out by yourself, but it's extremely simple. NumPy has this object, nan, "not a number", to
identify a missing or null value. In the Python world we have the None value, and in
pandas and NumPy we're going to use nan and None. To begin with, we have these two
functions, isnull and isna, which are complete synonyms. We also have notnull and notna,
and they're also complete synonyms. So null and na, for pandas, are the same; you can use
the one you prefer. Sadly, I like isna, because it's the way I learned it. To my students
I usually recommend isnull, because it feels more correct and more self-explanatory. So
you can use the one you prefer. If you can use isnull, I think that's going to be better;
if you get used to isna, then you're going to be on my side. Just do whatever you prefer.
So again, isnull is going to say True or False depending on whether the value is null or
not. And of course notnull, or notna, is going to be the opposite: notna of nan is False,
and notna of 3 is True. If you go to this first notebook, you're going to see all the
falsy and truthy values in detail. In terms of Python, anything that is not empty, None,
etc. is considered to be truthy. Here, though, the question is only whether the value is
null, so anything you pass that is not nan or None is going to come back as True for
notnull.
So isnull and notnull, or isna and notna, also work with entire series or entire data
frames. It's not just for one value: you can pass an entire series.
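A minimal sketch of these four functions, first on single values and then on a whole
series (the series values are made up for illustration):

```python
import numpy as np
import pandas as pd

# isnull and isna are synonyms; both nan and None count as null.
print(pd.isnull(np.nan), pd.isna(np.nan))
print(pd.isnull(None))

# notnull and notna are the opposite question.
print(pd.notnull(3), pd.notna(3))
print(pd.notnull(np.nan))

# An empty string is falsy in plain Python, but it is not a null value for pandas.
print(pd.notnull(""))

# The same functions accept an entire series and return a Boolean series.
s = pd.Series([1, np.nan, 7])
null_mask = pd.isnull(s)
valid_mask = pd.notnull(s)
print(null_mask)
```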
And the result you get back tells you which values in the series are null or not null,
depending on the question you're asking, isnull or notnull. In this case we ask which
values of the series are null: this one is not null, this one is not null, this one is
null, so only this one is True. And it's the opposite for the following function we are
applying. Again, the same thing works with an entire data frame. Something we usually do
when working with notnull and isnull, a hack we usually apply, is taking the count,
actually the sum, of all the null or not-null values. We have this entire series, and we
can ask how many not-null values we have: if we sum the result of notnull, we get the
number of not-null values in the series. And the same thing is going to happen if we say
isnull: if I do isnull().sum() here, we get how many nulls we have, which is pretty much
the opposite question. The way it works is that in Python, Booleans are pretty much
integers: they're ones and zeros. Every True value is going to count as one and every
False is going to count as zero. So if you ask for the sum of a Boolean series, you get
back the number of True values in that series. In this case we have two null values: we
ask how many null values we have with isnull().sum(), and we get 2. You can use these
tricks to filter the data in a series. In this case we can say: give me all the values
that are not null. Something else that is interesting is that, both for data frames and
for series, the notnull, isnull, isna and notna functions also work as methods. So instead
of pd.isnull(s), we can say s.isnull(). Now it gets a little bit simpler. But if the final
objective of this code, selecting only the values that are not null, was to drop the null
values, then there is a simpler form, which is dropna. In this case we can say s.dropna(),
and we're basically doing the same thing that is happening here: we're just excluding all
the missing values in the series, or the data frame, because this also works for data
frames. One important thing to remember here is that all these methods are immutable. We
are not actually modifying the original series; the underlying series is not being
modified, and a new series is returned. So if I check s again, this thing has not
modified the series: you're creating a new series, and that's the one that doesn't have
the missing values. Everything we've said
also works for data frames. Right here, with this data frame, the first thing is usually
to start with the info method. We call info, and we see that there are in total four
entries, four rows. We can also check shape if we need more information about the
structure of our data frame. So there are four rows, four entries in our index. Column A
has only two non-null values, which means the other two values are null. Column B has
three non-null values, so one value must be null. Usually info gets you very close to
understanding the structure of your data frame and how many values are missing. The same
thing happens with sum: we can just do df.isnull() or df.isna() and then sum(), and we get
a quick reference of how many null values we have in each column of that data frame.
dropna works in the same way, but there is a significant difference. The way dropna works
in a data frame by default is by dropping any row that has at least one null value. This
row has a null value: dropped. This row has a null value: dropped. This row has two null
values: dropped. This is the only one that is not being dropped. So it's very harsh in
that respect. You can change that to work on columns instead, keeping only the columns
that have no null values, by passing axis=1. There is also a way to set a strategy or a
threshold, for example to only delete rows that have fewer than three valid values. With
the strategy parameter of dropna, how, you can say: drop the rows (or the columns, because
this also works for columns) where all the values are null; or drop all the rows that have
any null value, which is the default behavior. Or you specify a threshold, thresh, which
basically says: I need this number of valid values in order to keep the row. That's the
way it works: which rows to drop and which ones to keep is based on the threshold. Once
you have identified the null values, it's extremely simple to fix them. The first method
we're going to see is fillna with a particular value. We say: from this series, I want you
to fill the missing values, fill the NaNs, with the number zero in this case. So these two
are now zero. Or, of course, you can use any statistical method you want; in this case we
use the mean. Remember, this is not altering the series. The original series is still the
same; we're not changing it, it's creating a new series, because all these methods are
immutable.
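The counting, dropping and filling steps above can be sketched as follows. The series and
data frame are made up for illustration; the data frame was chosen so that column A has
two non-null values and column B has three, as in the description.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

# True counts as 1 and False as 0, so summing a Boolean series counts the Trues.
n_null = s.isnull().sum()
n_valid = s.notnull().sum()

# Boolean filtering and dropna give the same result; neither changes s.
only_valid = s[s.notnull()]
dropped = s.dropna()

# fillna with a fixed value, or with a statistic such as the mean.
filled_zero = s.fillna(0)
filled_mean = s.fillna(s.mean())

df = pd.DataFrame(
    {
        "A": [1, np.nan, 30, np.nan],
        "B": [2, 8, 31, np.nan],
        "C": [np.nan, 9, 32, 100],
        "D": [5, 8, 34, 110],
    }
)

nulls_per_column = df.isnull().sum()

rows_complete = df.dropna()             # default: drop rows with any null
cols_complete = df.dropna(axis=1)       # keep only columns without nulls
rows_not_empty = df.dropna(how="all")   # drop rows only if every value is null
rows_thresh = df.dropna(thresh=3)       # keep rows with at least 3 valid values

print(nulls_per_column)
print(rows_thresh)
```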
The next way fillna works is by passing a method, which is either forward fill (ffill) or
backward fill (bfill); those are the possibilities. Basically, forward fill carries the
values forward, top down. Starting here, it copies this value down, then this value
down, and then the 3 comes down: as this one is a NaN, it gets replaced, so it's 3 now,
which gets carried down again, and now this one is 3 as well. That's what we have right
there. And of course backward fill works the other way: it starts with the 4 and moves it
up here, then up here, etc. You have to be careful when using these, because if you have
null values at the beginning (for forward fill) or at the end (for backward fill), you're
going to end up with null values again: there is nothing to fill forward from, this is the
first value you have in there. All we've seen also works for data frames. Both backward
fill and forward fill work either row by row or column by column. We have this data set:
if we do a row-based forward fill, this row is going to be 1, 2, here 2 again, and then 5.
That's forward fill with axis=1. If you use forward fill with axis=0, it's a vertical
fill: it's going to go 1, 1, 30, 30, so that's down the column. So the forward fill goes
either in this direction or in this direction, depending on the axis you are passing.
Let me put it in the correct form: with axis=0 it's going to be column based, in this
direction; with axis=1 it's going to be row based, in this direction. So we had a null
value here, and it got filled this way. Okay, moving forward, what else do we have? We
have checking for null values, and we've pretty much seen this already: you can use isnull
and the sum method to get how many null values you have. There are also any and all, which
are usually called Boolean tests: you can ask if any values are valid, or if all the
values are valid. They're there to build more complicated queries. So far, so good. The
process we described at the beginning was fixing missing data, missing values: there is
nothing in there. We have read our data frame from a CSV, from a database, and a value is
missing; there is a hole in it. We quickly identified it with isna or isnull, and we were
able to drop the ones we didn't want to keep with dropna, or to fill the values we wanted
to fill with fillna. That was simple: isna, dropna, fillna. What happens when you're
cleaning data that actually has a value, so nothing is missing?
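Before moving on to invalid values, the forward and backward fills described above can be
sketched as follows. Note one substitution of mine: the sketch calls the ffill and bfill
methods directly rather than passing method= to fillna, since recent pandas versions
deprecate the method argument of fillna. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

forward = s.ffill()    # carry the last valid value down
backward = s.bfill()   # carry the next valid value up

# A null at the very start has nothing before it, so forward fill leaves it null.
leading = pd.Series([np.nan, 3, np.nan, 9]).ffill()

df = pd.DataFrame(
    {
        "A": [1, np.nan, 30, np.nan],
        "B": [2, 8, 31, np.nan],
        "C": [np.nan, 9, 32, 100],
        "D": [5, 8, 34, 110],
    }
)

down_columns = df.ffill(axis=0)   # vertical: fill down each column
across_rows = df.ffill(axis=1)    # horizontal: fill along each row

# Boolean tests: is any value null, are all values null?
has_nulls = s.isnull().any()
all_nulls = s.isnull().all()

print(down_columns)
print(across_rows)
```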
But those values are invalid. For example, here, the Sex column is a categorical column
that only accepts M and F; D and the question mark are invalid. It's very simple to see
an invalid value here because it's completely out of scope. The same thing goes for, for
example, a question mark in the Age column, where we would have a string in the Age
column; it's very simple to identify that. How are we going to clean those? Let's start
with Sex, because it's simpler. In this case, the first check we can do is with either
unique or value_counts. I'm going to use value_counts; we've seen this method before. It's
a quick summary of all the unique values you have, and value_counts also gives you a total
count for each of those values. How can you fix them? Well, there is a replace method
which is extremely intuitive. You can just replace: in this case we're changing all the Ds
to Fs and the Ns to Ms, and it can work on multiple columns. Then there are those values
that, as we said, are more complicated to fix. In this case we have an age of 290, and we
know, because we know the domain, that 290 is an invalid age for a human. In those cases
we're usually going to need more complicated fixing, and it will involve more programming.
That's the reality: you have to be better at coding. In this case, we know that this value
is invalid because it probably has an extra zero. Say you're pulling a CSV with ages, and
there are values like 180, 290 or 320, for example: invalid values over 100, in the
hundreds. And that's because there were typos when they were typing in the ages. So how
are you going to fix that? Well, in this case it involves a little bit more programming:
we're dividing every value over 100 by 10.
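A rough sketch of both fixes, using a small made-up data frame. The assumption that an
age over 100 is a typo with an extra trailing zero comes from the discussion above; the
sketch uses integer division so the column stays an integer column.

```python
import pandas as pd

# Hypothetical data with one invalid category, one unknown and one impossible age.
df = pd.DataFrame(
    {
        "Sex": ["M", "F", "F", "D", "?"],
        "Age": [29, 30, 24, 290, 25],
    }
)

# Quick summary of the unique values and how often each appears.
counts = df["Sex"].value_counts()
print(counts)

# Map known typos to the valid categories; values not in the mapping stay as they are.
df["Sex"] = df["Sex"].replace({"D": "F", "N": "M"})

# Ages over 100 are assumed to carry an extra zero, so divide only those by 10.
too_old = df["Age"] > 100
df.loc[too_old, "Age"] = df.loc[too_old, "Age"] // 10

print(df)
```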
Something else that may be useful is dealing with duplicates. We first need to define
what a duplicate value is going to be. This is usually a little bit more political, if
you want: you have to define what counts as a duplicate. In this case, we have a series
that contains ambassadors: each ambassador is the index, and the country of the ambassador
is the value. This is the important part. The premise here is that we're organizing a
party, and we want to invite one ambassador per country; we don't want to repeat
ambassadors. In this case, at least to our human eyes, we can clearly and quickly see that
these two belong to the same country, and these three belong to the same country. But
here again, we have to define which ones are the duplicates and which ones are not. For
example, maybe we can say the first one is not a duplicate and the later ones are, or we
can say the last one is the one we invite, so it's not a duplicate. So we're going to have
political rules, if you want, for each one of those. Let's see the duplicated method and
the way it works by default. By default, the duplicated method is not going to treat as a
duplicate the first instance that it sees. The method is walking top down, saying: do I
have France? No, I don't have France, so I'm going to keep it, because it's the first time
I see France. Do I have the UK? No, I don't have the UK, so it keeps it. Then it sees the
UK again and realizes the UK is already present, so this one is going to be considered a
duplicate. Italy is here, it's fine. The first occurrence of Germany is fine, but then it
sees Germany two more times and realizes that Germany was already there, so those are now
duplicates. That's the way it works by default. We can change that with keep='last', so
the last element is not considered to be a duplicate and the other two are. And the same
thing here: Kim is now the one considered a duplicate. So it's either top down or bottom
up depending on the parameter you're passing, either the default keep='first' or
keep='last'. Or you can be a little bit more harsh and say, with keep=False, that
everything duplicated needs to be considered a duplicate. So these two are duplicates, and
these three are all duplicates, as you can see right there. Similar to the duplicated
method, which pretty much tells you which values are duplicated and helps you identify
them, you also have drop_duplicates. What this method is going to do is basically the same
thing as before, but dropping all the values that are flagged True: if the value is
marked as a duplicate, it just drops it. And the same rules apply: first, last and False.
As for subsets: in this case, we have multiple players in the data frame. What happens is
that this player, Kobe, is present three times; to our human eyes, we see Kobe three
times. The way we're going to think about duplicates here is by understanding the correct
subset of columns that we should check. In this case, Kobe playing as SG is there two
times, but Kobe playing as SF could be considered a different player, if you want, because
maybe it's a different season, or it's a different position they played. So we need to
pass the subset that we are going to consider for duplicates: only check the Name column,
or don't pass a subset, which is the default, and it is going to check the entire row.
When that happens, only these two are considered to be duplicates, and this one is the
duplicate with the default rule. If we pass keep=False, both are going to be considered
duplicates. So this second occurrence is the duplicate one, and the last one is a
completely different row, because the value in position is different. That's the way it
works here.
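The keep and subset rules above can be sketched as follows. The ambassador and player
names are placeholders made up for this illustration; only the pattern of repeated
countries and repeated players follows the description.

```python
import pandas as pd

# Index: ambassador (placeholder names). Value: the country they represent.
ambassadors = pd.Series(
    ["France", "United Kingdom", "United Kingdom", "Italy",
     "Germany", "Germany", "Germany"],
    index=["Gerard", "Kim", "Peter", "Armando", "Pia", "Paul", "Klaus"],
)

dup_first = ambassadors.duplicated()            # first occurrence is not a duplicate
dup_last = ambassadors.duplicated(keep="last")  # last occurrence is not a duplicate
dup_all = ambassadors.duplicated(keep=False)    # every repeated value is flagged

invite_first = ambassadors.drop_duplicates()
invite_last = ambassadors.drop_duplicates(keep="last")

players = pd.DataFrame(
    {
        "Name": ["Kobe Bryant", "LeBron James", "Kobe Bryant",
                 "Carmelo Anthony", "Kobe Bryant"],
        "Pos": ["SG", "SF", "SG", "SF", "SF"],
    }
)

whole_row = players.duplicated()                # compares entire rows
by_name = players.duplicated(subset=["Name"])   # compares the Name column only
whole_row_all = players.duplicated(keep=False)
last_per_name = players.drop_duplicates(subset=["Name"], keep="last")

print(invite_first)
print(last_per_name)
```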
Moving forward with more cleaning of values, we're going to talk about string handling.
This is a very neat feature of pandas: special types of columns will have special
attributes, given the column type. Take a column that df.info() shows as object, which
is a string. In pandas, all the string columns are going to have this special attribute,
which is str. All the datetime columns, something we're not going to cover but you need
to know about, have a dt attribute. All the categorical columns have a cat attribute. And
those attributes, str, dt and cat, have special methods associated with the domain of
that column. The methods associated with str are of course for string handling, and the
methods associated with dt are for date handling. In this case, we're going to review, not
all of them, but a very good subset of the string methods we can apply.
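As a preview of the string methods reviewed next, here is a rough sketch of the str
attribute on a small made-up series. The values and the column names given to the split
result are hypothetical; the methods (split with expand=True, contains, strip, replace)
are the ones discussed in this part.

```python
import pandas as pd

# Plain Python strings have the same kind of tools.
print("1987_M_US".split("_"), "?" in "1990?", " US ".strip())

# Hypothetical raw values: year, sex, country and a count, joined by underscores.
data = pd.Series(["1987_M_US _1", "1990?_M_UK_1", "1992_F_   IT_2"])

# Split on the underscore; expand=True returns a data frame, one column per piece.
parts = data.str.split("_", expand=True)
parts.columns = ["Year", "Sex", "Country", "Children"]

# contains takes a regular expression by default, so the question mark is escaped.
has_question_mark = parts["Year"].str.contains(r"\?")

# strip removes surrounding whitespace; replace can also work with regular expressions.
parts["Country"] = parts["Country"].str.strip()
parts["Year"] = parts["Year"].str.replace(r"\?", "", regex=True)

print(parts)
```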
And something interesting is that all these methods have a lot of resemblance to,
and are related to, the ones in pure Python. So if you have a pure Python string, there's
a split method. There is a contains method, or I don't know if there is an actual contains,
I think it's actually the in operator, but there is a strip, and there is a replace,
right, so most of the methods under the str attribute in pandas
have an analog in the standard string handling of Python. So starting at
the beginning with this data we have, I'm going to delete this, what we
are going to do is split the values by an underscore. So in this case, that's
what we have: we have split all the values by that underscore, and we're going to use
the special argument expand, expand, sorry, equals True. And what it's going to do is
create a data frame out of that. So we create a data frame with those columns.
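A small sketch of the str accessor, covering the split with expand=True described above and the contains, strip and replace methods the instructor mentions next. The sample strings and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical data: fields joined by "_", with a stray "?" and a stray space.
df = pd.DataFrame({"Data": ["1987_M_US _1", "1990?_M_UK_1", "1992_F_US_2"]})

# .str.split with expand=True returns a DataFrame, one column per field.
parts = df["Data"].str.split("_", expand=True)
parts.columns = ["Year", "Sex", "Country", "Children"]

# .str.contains treats its argument as a regular expression by default,
# so the question mark has to be escaped to match it literally.
has_question_mark = df["Data"].str.contains(r"\?")

# .str.strip removes surrounding whitespace, like str.strip in plain Python.
parts["Country"] = parts["Country"].str.strip()

# .str.replace with regex=True: remove every character that is not a digit.
parts["Year"] = parts["Year"].str.replace(r"\D", "", regex=True)
```

Each call returns a new Series or DataFrame, so the methods can be chained or assigned back to a column as shown.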
And this is what we have now. So we can keep applying methods. So for example, contains,
or contains with regular expressions, right, for you to see
the power of it. We can also strip and replace, and we can even use regular expressions when
replacing, so we could fix something like this question mark in a string; we could fix it
with regular expressions if you know how to handle them. And finally, something that is
going to be very helpful when you're doing data cleaning is looking at the data from
a visualization perspective. Data cleaning has a ton to do with statistical understanding
of your data, to know when a value is considered an outlier. For example, it might be invalid,
and you want to clean it. But that's a lot more about statistics. In this case,
I want to show you very quickly the matplotlib library I've been promising for some
time now. So far, we've accessed it directly from
pandas when we're doing data frame dot plot. This library, matplotlib, is the one backing
all those methods, and we're going to see how to use it directly now. The matplotlib library
has two important APIs, we're going to call them. One is the one that I don't prefer, which
is the global API, but it's the most common one. It's the one you're going to find around.
And the second one is the object oriented API. So it's around here.
And really, they are just two different ways of doing the same
thing. Okay. The global API is an API that is in part inspired by MATLAB. It's been
around for a long time, and sadly most of the answers you find on Stack Overflow, in tutorials
and in books will be using this global API. The one I prefer the most,
and I'm going to explain why in a second, is the object oriented API.
But I want to show you both, so you have a reference. If you follow me in this feeling
of preferring the object oriented API, you will always have to translate from global to object oriented.
Why is it considered a global API? Well, we have imported matplotlib.pyplot as plt. So
we have imported the whole module, the whole Python module; depending on how much you
know about Python programming, this is going to make sense or not. We have imported the whole
module, and now what we're doing is invoking plt.figure, and
then we're going to set a title. And then finally, we're plotting two things: we're plotting
x squared and minus x squared. And why is this global? Because we're invoking
functions that are at module level. And there is an object, the final plot, that is being
modified by these very generic, global calls, right. So by doing this call right
here, I'm modifying the final result of the plot. Let me show you a more complicated example,
so you see the problems with the global API.
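A minimal sketch of the global (pyplot) API. The first plot mirrors the x squared and minus x squared example described above; the subplot calls preview the switching of the active plot that the instructor walks through next. The data, figure size, title and the Agg backend are assumptions made so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed so this runs headless
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-10, 11)

fig = plt.figure(figsize=(12, 4))  # this figure becomes the "current" one
plt.suptitle("Global API sketch")

plt.subplot(1, 2, 1)  # 1 row, 2 columns: activate the first subplot
plt.plot(x, x ** 2)  # drawn on whichever subplot is currently active
plt.plot(x, -1 * (x ** 2))

plt.subplot(1, 2, 2)  # switch the active subplot to the second one
plt.plot(x, x ** 3)  # same module-level call, different target
```

Nothing in a plt.plot line says which subplot it draws on; that depends entirely on which plt.subplot call ran most recently.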
If you look at this line, if you could delete everything, let's actually delete everything:
what is this line doing, which plot is it affecting? You do not know. There is no object oriented
way of saying, in this second plot, the plot on the right, or the figure on the right, or
actually the subplot on the right, I want you to plot this thing. You're just saying
it to the entire module. And the order in which you say it determines where that particular
figure is going to land, in which plot it's going to
land. Again, it's a global API. So we start by saying: I'm going to create a figure,
and from now on I'm going to start drawing on it. This is going to be the title.
And hey, by the way, it's going to have one row, it's going to have two columns. And I'm
going to start drawing in the first plot, this one right here on the
left, okay. So now I have kind of activated, if you want, that plot; it's active. So now
I'm going to start drawing on it. Every action that happens after this line is going
to be affecting this plot, right. So then I plot x and x squared, I plot
this vertical line, I put a legend, I set labels, etc. And at some point, I just stop
and say: hey, now I want to switch the plot. Sorry, I want
to start plotting here, in this second one, because I have just changed the first
line, this one. Oh, sorry, the way it works is by saying: one row, two columns,
but second plot. So now I want to start plotting in here, and every successive line will affect
that plot. And again, you can see that understanding code that depends on the order, on
the sequence of lines, is very hard. If you have to debug a report that has a plot that
takes 100 lines, then you have to keep in your brain what's happening top down. A different
approach is going to be the object oriented approach, in which we're creating a figure.
And we're creating axes. So in this case we have, right here,
one entire figure in red. And we have in here, in purple, two axes. So this is axes one,
and this is axes two, so we have two axes. We're going to create those using an object
oriented approach, and we're going to keep references to them. So we're going to say
later: to these plots, to these axes, sorry, I want to plot something. And that will be
very explicit; it's going to be an object oriented way. So the first thing is creating
the figure and the axes. In this case we have just one axes object, that's it, but you
can have more. And then you say: in this axes I want to plot this thing, in this axes I
want to plot that thing, etc. When you have multiple axes, I could show you, I'm
going to go back again to that in a second. But in this case, in which we have four axes,
right, we create one figure, and it has four axes. We do it with this subplots method,
saying nrows and ncols. Now we say: to axes number one, I want to plot this thing;
to axes number two, I want to plot that thing, right. So it's one, two, three, four. And now it's a
lot more explicit. It's not depending on the order; I could change this order, and that doesn't
matter.
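The object oriented approach described above could look roughly like this. The curves and colors are placeholders (only "axes number four is yellow" comes from the narration), and the Agg backend is assumed so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed so this runs headless
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-10, 11)

# One figure holding a 2x2 grid of axes; keep a reference to each of them.
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))

# The order of these calls does not matter: each one names its target axes.
ax4.plot(x, -(x ** 2), color="yellow")
ax1.plot(x, x ** 2, color="red")
ax3.plot(x, x ** 3, color="green")
ax2.plot(x, -x, color="blue")
ax1.set_title("axes 1")
```

Reordering the four plot calls produces the same figure, which is the property the global API lacks.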
The results are going to be the same: axes number four has yellow, regardless
of the order that we're following. So now that we have cleared
up the differences between both APIs: matplotlib has this very simple plot function, or method,
depending on whether you're using object oriented or global, that will plot something you specify. In this case,
we're passing all the values in x and all the values in y. And in this case, we're passing
a given line style. This can change. With this type of syntax you're saying: I'm plotting
this thing in x, I'm plotting this thing in y, the second parameter. And I want you
to use a straight line, it's a straight line, yes, with this marker, the dot, and in green.
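A short sketch of the format-string shorthand next to the equivalent keyword arguments. The exact string used in the video is not in the transcript, so "o-g" (round marker, solid line, green) is an assumption, as are the data values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed so this runs headless
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 11)
fig, ax = plt.subplots()

# Format-string shorthand: "o" marker, "-" solid line, "g" green.
(short_line,) = ax.plot(x, x, "o-g", label="shorthand")

# The same styling spelled out with one keyword argument per property.
(long_line,) = ax.plot(
    x, x + 1, linestyle="-", marker="o", color="green", label="keywords"
)

ax.legend()
```

The shorthand is compact once memorized; the keyword form is longer but easier to read back later.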
So this is for when you are very familiar with it. If you're very familiar with matplotlib, you
can use this syntax; in other cases, you can just say linestyle, marker,
color, specific keyword arguments for each one of those. So do we only have line plots
in matplotlib? No, of course not. We have a huge variety of plots. And by the way, there
is another one here, if you want to see something more advanced: grids. You can create these grids
and put different things in them. And again, not only line plots. One good example is a
nice scatter plot. So basically, we're plotting X and Y, a correlation. And there is also a
value, a color map, right. So given the value, there is going to be a change in color.
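A sketch of a scatter plot that encodes size and color in addition to the x and y positions. The random data, the "viridis" color map and the scaling factor are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed so this runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable
n = 50
x = rng.random(n)  # first dimension: horizontal position
y = rng.random(n)  # second dimension: vertical position
sizes = 200 * rng.random(n)  # third dimension: marker area
values = rng.random(n)  # fourth dimension: mapped to a color

fig, ax = plt.subplots()
points = ax.scatter(x, y, s=sizes, c=values, cmap="viridis", alpha=0.6)
fig.colorbar(points, ax=ax, label="value")  # legend for the color dimension
```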
So this kind of lets you plot three to four dimensions of your data: the value x, the
value y, the size of the bubble, and the color of the bubble. So you're pretty
much encoding four dimensions in just one figure, right. In this case, we're just
using two different scatter plots; there's more information here. We can also plot histograms.
We've very quickly seen that with pandas; with pandas it's very simple, with just plot,
kind histogram, kind equals histogram, hist, actually; you can look it up in our previous lessons.
So just go back to the index in the video. And the histogram is extremely simple: it just
takes the values you're plotting and how many bins you want, or some more advanced
arguments, like the alpha level, etc. But it's simple. And similar to the histogram,
you can also create kernel density estimate diagrams, which are very similar to histograms
and simulate, if you want, a continuous distribution. You can combine these plots if you want. In
this case, we are creating the plots: we're plotting a histogram, and we're plotting
the lines, and we're changing limits. But that's pretty much it. And you
can also create bar plots, right? So in this case we have plt.bar, or here we have
two bars that are stacked, right? That's a different way to look at it. And finally, checking outliers.
You can always plot histograms or box plots, right? So box plots are also a nice feature
to have in here. So this was all for data cleaning. We're going to keep moving forward with
this tutorial. I want to mention one more thing here: there are notes here
for kind of a task list that you can follow for data cleaning, where we are identifying
missing values in given positions with isnull and isna. And right
here, we're looking in more detail at some statistical properties of the data, in
case we need to clean it. Okay, so this is a little bit more advanced. And it's related
to the concept of cleaning data given the domain. So the statistical analysis can tell
you that this value is an outlier for this distribution, even though the value itself might be valid. So
for example, a human being who is 90 years old: that's valid, that's a valid age.
But if you're analyzing data about high school students, a human that is 90 years
old is going to be completely invalid, or it's going to be an outlier in that distribution.
And you can treat it as such, as invalid, and clean it out, remove it, for example.
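A sketch of the high school example above: a histogram and a box plot to spot the suspicious age, followed by one common statistical rule for flagging it. The ages are invented, and the 1.5 times IQR rule is an assumption swapped in for illustration; the video does not say which rule it has in mind.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ages of high school students, with one suspicious entry.
ages = pd.Series([14, 15, 15, 16, 16, 17, 17, 18, 90])

# A histogram and a box plot both make the isolated value easy to spot.
fig, (ax_hist, ax_box) = plt.subplots(ncols=2, figsize=(10, 4))
ax_hist.hist(ages, bins=10)
ax_box.boxplot(ages.to_numpy())

# Assumed rule (1.5 * IQR): flag values far outside the middle 50% of the data.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
is_outlier = (ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)

outliers = ages[is_outlier]  # values to review
cleaned = ages[~is_outlier]  # values kept after removing the flagged ones
```

Whether a flagged value is actually removed is a domain decision: 90 is a valid human age, just not a plausible one for this population.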
So that deals a little bit more with the whole statistical analysis you can follow
here; it's a little bit more advanced for this scenario. So let's move forward with the
rest of the videos.
Now it's time to get into more advanced features of pandas to import external data. So we've
seen already in our real life example the way we can import data from CSV files and
from SQL databases, right; we had actually those two lessons. The objective of this
part of the tutorial is to show you how you can get into more advanced use
cases of importing data. So we're going to start, for example, with CSVs and text files.
And again, you've seen it already, but here we're going to give it an extra twist. So I'm
going to show you more advanced features for special use cases. Text files and CSV
files: conceptually speaking, a CSV file is a text file. It's just human readable text,
right, that is encoding information. The idea of a CSV file is that it's tabular. Right?
So it's a plain text file that contains tabular data in it, and it's separated. CSV stands
for comma separated values, but the separator can be anything; we will see more examples
later. But basically, the idea is that it's a text file in a tabular
format. So both CSV files and text files will be read with the same method. So
to get things started, I want to show you the basic way we read data
from external sources using Python, without even starting yet with pandas. You don't
need to know this, but it's usually productive, if you want, for data scientists
or data analysts to understand a little bit more how file reading and writing works in
computers, because there are multiple concepts involved here: operating
systems, processes, your language, right. It's not the same thing to read a file with R or
with Python or with another language. So there are multiple concepts here. And even though
pandas in this case can make it very simple to read and write data, you can handle
more advanced use cases if you know the internals of, again, both the
operating system processes and your language. So this is the way we read a
file using pure Python: we use the function open. And in this case, we're using a context
manager, just a safety feature, again related to the advanced usage of reading and writing
files. It creates a file pointer, right. And with a file pointer, you can then use
the very simple API exposed by that pointer, which is something like
readline, readlines, read a number of bytes or characters, or you can even treat
fp as an iterator and just do for line in fp. But basically, we're going to do something
like this: we start reading data from top to bottom, until, I don't know,
we hit a given point; in this case, we're doing it just for a couple of lines. What else?
It gets very difficult to process text files when you're reading them, because it's usually
hard to parse the structure of the file. It's not the same thing to have a file that
is separated by commas, by colons, by pipes, spaces, etc. So you're
going to see that once you want to get a little bit more
advanced, right, or a little bit more fancy with your calculations and the way
you parse the data, it's going to get harder. So that's why we're going to use
pandas. But first I'm going to show you, in a second, the csv module, which is part of Python.
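The pure-Python file reading described above can be sketched as follows. The file name and contents are made up, and the file is written to a temporary directory only so the sketch is self-contained.

```python
import os
import tempfile

# Hypothetical contents, written to a temporary file for the sake of the example.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as fp:
    fp.write("first line\nsecond line\nthird line\nfourth line\n")

# The context manager closes the file when the block ends, even on errors.
with open(path, "r") as fp:
    first = fp.readline()  # read a single line, newline included
    rest = fp.readlines()  # read all remaining lines into a list

# A file object is also an iterator over its lines.
first_two = []
with open(path, "r") as fp:
    for index, line in enumerate(fp):
        if index >= 2:
            break  # stop after a couple of lines
        first_two.append(line.rstrip("\n"))
```

Everything that comes back is plain text; splitting lines into fields and converting types is left to the caller, which is the work the csv module and pandas take over.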
So this is the file that we're going to be reading. It's the exam_review file, and I'm
going to open it. And even though it doesn't look like a CSV, it is in fact a CSV. The
difference is that here the separator is the greater-than sign; it's not the comma, it's a greater-than
sign. That's going to be what marks the delimitation between different fields in our CSV file.
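A sketch of parsing a file separated by the greater-than sign, first with the standard-library csv module and then with pandas, whose read_csv options (sep, na_values, names, dtype, header) and to_csv counterpart are covered later in this section. The sample rows and the column names are invented; they are not the contents of the file used in the video.

```python
import csv
import io

import pandas as pd

# Hypothetical data using ">" as the field separator and "?" for a missing value.
raw = "name>first>second\nAnna>95>80.5\nBen>?>91\n"

# Standard library: csv.reader splits each line on the given delimiter.
rows = list(csv.reader(io.StringIO(raw), delimiter=">"))

# pandas: the equivalent option is called sep; na_values marks "?" as missing.
df = pd.read_csv(io.StringIO(raw), sep=">", na_values=["?"])

# Supplying column names and types explicitly.
# header=0 tells pandas the file has a header row that names= should replace.
typed = pd.read_csv(
    io.StringIO(raw),
    sep=">",
    header=0,
    names=["student", "exam_1", "exam_2"],
    na_values=["?"],
    dtype={"exam_1": "float64", "exam_2": "float64"},
)

# Writing back out: without a path, to_csv returns the text instead of saving it.
text = typed.to_csv(sep=">", index=False)
```

The csv module returns every field as a string, while read_csv also infers types and handles missing values, which is why the rest of the section relies on pandas.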
So we're going to use the csv module. And
the way right here to parse the data using that module is by passing a special delimiter,
right? So that's going to be the type of work you might need to do when you're parsing the
data. It's not the same thing to have a delimiter that's a greater-than sign. It's not the
same thing to have numbers, for example, that are enclosed in quotes. All those things, right,
will change the way you work, and all of this is going to be abstracted away by the pandas
module. So to get things started, again, with pandas at last: pandas has multiple
read_something methods that will work for different sources, right. So we
already saw read_sql, we've seen read_csv, and there's also a read_html to directly parse information
from a table; literally, you can just pass a website and it's going to read information
from a table. Or read_json, or more advanced formats like Parquet, or Stata, etc. And, again,
each file format will usually have a counterpart in pandas; I've never had the need
to write my own, to be honest. The same thing is going to happen for something
like Excel, which might need external modules; it's not directly provided by pandas, but
by installing those modules, you can easily incorporate Excel files in your day to day
work. So the read_csv method already has a ton of parameters. That's the
main characteristic of all these read_something methods: given the amount of possibilities
you're going to have with these files, there exist a ton of different ways to customize
the method invocation. Alright, so again, with CSV files, we saw, there are multiple things that
can happen: CSVs that have a header or don't have a header, different delimiters,
different enclosing of strings or numbers, blank lines, etc. Multiple
things are going to happen, and you're able to customize all that with the
read_csv method. So this is the reference of all the
parameters you can pass to it. Usually, something that I do, and I do this very often, and I
use pandas a lot, and I still do it, is something like read_csv, and I get the documentation
right here, to look into the parameters that I think I need to pass for my particular
use case. So always keep an eye on the docs, because it's impossible to remember all the
parameters of read_csv. So in this case, what we're going to do is something very interesting:
we're going to parse a CSV file, but it's not located in this computer; it's not locally
available in the computer. The CSV file is this one right here, which actually is the
source; if I get the raw version, it's this thing. So this is the CSV file. What I could do here
is download the file, right, so just do File, Save, get the CSV file on my computer, and upload it
here, right, so just copy and paste here, drag and drop it here. But actually pandas
has this nice characteristic that it will read a CSV that is either local, as we
did with BTC market price, or remote: it's automatically going to
download the content of those files, and it's going to keep it in
memory for further usage. So that's a very neat feature. And again, this is the CSV
file that we are using. And again, the same thing: if it's a local file, it works in the
same way. So, a few features you've seen already. In this case, we can do header=None, if
you don't want to treat the first row as a header. Or what about missing values? We can
treat some of these values, like a question mark, or an exclamation mark, or a dash,
etc., as not a number, not a value, right, so as a missing value. And now any of these
values we have passed will be transformed into not a number, for an easier
cleaning process. We can pass names, which are going to be basically the column names
for each one. And we can also specify column types, as you can see right there. So now
the types are going to be float and object. We've done this already in one of our lessons:
we are parsing the dates, and there you go. So putting it all together, we get to these advanced
forms of reading CSVs where we're passing column names, we're passing types, we're asking
to parse dates, we're passing null values, headers, etc. So this is a pretty common thing we are
doing. So what about exam_review? If we try parsing this thing, we get this very ugly
format. In this case, we pass the parameter to specify what we used to call delimiter
in the csv module, which here is called sep, from separator. So the separator is going to be
the greater-than sign, and that
just works as it needs to. So, there are a few more examples you can check in here. The most important part
is following, right, the documentation to find those particular use cases that you are having,
so for example, something like skip blank lines, or whenever there are empty rows at the
beginning, right. So if you have empty rows at the beginning, that's something where you can also
say skiprows. So you don't need to parse those out, it's not going to break, etc. So
that is all part of the read_csv method. And to finalize this part, at least for CSVs, I'm
going to tell you something that applies to pretty much every other data format. As you
have a read_something method, there's going to be a to_something method, which is basically
the process of writing. So you can do read_csv, or you can do to_csv. So this CSV that
we imported from the external source, the remote source, we can just do to_csv and it's
going to store it locally. Alright, and there are multiple options also to pass to to_csv: the
delimiter, or actually the separator, whether you want to include a header, whether you want to include
an index, etc. They're pretty much the same as the other one. But the idea is that for
every read_something method, there's going to exist a to_something method that is basically
the process of writing. So let's move forward with a few more data formats. And interestingly,
we're going to get to read HTML pages directly in just a couple of minutes. And now it's
time to read data from databases. We have already done that in our real example in the
pandas part of the tutorial. But I want to show you a few more details
for you to understand how data is being processed. This is a common scenario for me:
importing data from databases. So, the libraries you will need. First thing: depending on what
database engine you're using, Postgres, MySQL, Oracle, etc., you will need to install different
libraries. But the APIs, once you have installed those libraries, are going to be the same.
There's actually a PEP in Python (PEP 249, the DB-API) that defines the interface for database libraries,
and pandas can work with pretty much any of these common SQL databases
that comply with that interface. In this example, we're going to use SQLite, because the database is
right here; there's nothing, no server to connect to, etc. It is extremely simple to get started.
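The manual workflow the instructor walks through next (connection, cursor, execute, fetchall) can be sketched as follows. An in-memory SQLite database with an invented employees table stands in for the tutorial's database file, and reading the column names from cursor.description is an addition that the video does not show.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for the tutorial's database file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (EmployeeId INTEGER PRIMARY KEY, LastName TEXT, HireDate TEXT)"
)
conn.executemany(
    "INSERT INTO employees (LastName, HireDate) VALUES (?, ?)",
    [
        ("Adams", "2002-08-14"),
        ("Edwards", "2002-05-01"),
        ("Peacock", "2002-04-01"),
        ("Park", "2003-05-03"),
        ("Johnson", "2003-10-17"),
        ("Mitchell", "2003-10-17"),
    ],
)

# Standard DB-API flow: create a cursor, execute a query, fetch the rows.
cur = conn.cursor()
cur.execute("SELECT * FROM employees LIMIT 5;")
results = cur.fetchall()  # a list of tuples, one per row

# The column names live on the cursor, not in the fetched rows.
columns = [description[0] for description in cur.description]
df = pd.DataFrame(results, columns=columns)

# Release the resources once the data has been copied out.
cur.close()
conn.close()
```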
And the database example we're going to use is actually
a different one from our previous video. In the previous one, we were using Sakila;
in this case, we're going to be using Chinook, which is smaller both in structure and in
size. So it's going to be a little bit simpler. So to get things going here, I'll do the same thing
that we did in our previous part, which was how to read data from files, where I showed you how
to actually read data using plain Python. So forget about pandas for a second. I told you, if
we go back to the beginning of time, when there was no pandas, this was the way we
were reading files: open, fp, fp.readlines, etc. So I now want to show you what
predates pandas, what was the default way to read data before pandas, which
is with the regular, again, interface from Python. So the way it works is we're going to
import sqlite3 and create a connection. And now with this connection,
we have this common interface that, again, is common for pretty much any other database
that you're used to. And the default behavior is we're going to create a cursor, and we're
going to execute queries using that cursor. In this case, we're going to execute a regular
SELECT * FROM employees LIMIT 5, because we want to have five records out of the
table employees. Once you have executed a query, the results are there waiting, and you can
do a fetchall to get all the results of that query. And here are all these results. As
you are noticing, the result is a list of tuples, so it's not extremely useful. Now,
if you combine it with pandas, you can just create a data frame out of that info.
And we're close. It's not perfect, but we're close. So, before that, we're
going to close the cursor and the connection. Let me show you now how we work with pandas.
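With pandas, the same query becomes a single call. The sketch below also previews the index_col and parse_dates options and the to_sql method with its if_exists parameter, all of which are discussed shortly. The table, its rows and the employees_copy table name are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database with a small employees table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (EmployeeId INTEGER PRIMARY KEY, LastName TEXT, HireDate TEXT)"
)
conn.executemany(
    "INSERT INTO employees (LastName, HireDate) VALUES (?, ?)",
    [("Adams", "2002-08-14"), ("Edwards", "2002-05-01"), ("Peacock", "2002-04-01")],
)

# Query first, connection second; column names come back automatically.
df = pd.read_sql(
    "SELECT * FROM employees;",
    conn,
    index_col="EmployeeId",  # use this column as the index
    parse_dates=["HireDate"],  # convert this column to datetimes
)

# to_sql writes a data frame to a table. Only the text column is written here
# to keep the sketch simple.
names = df[["LastName"]]
names.to_sql("employees_copy", conn)

# The default if_exists="fail" raises when the table already exists.
try:
    names.to_sql("employees_copy", conn)
    second_write_failed = False
except ValueError:
    second_write_failed = True

# if_exists="append" adds rows; if_exists="replace" would drop and recreate the table.
names.to_sql("employees_copy", conn, if_exists="append")
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM employees_copy;", conn)["n"].iloc[0]

conn.close()
```

read_sql_table is left out of this sketch because it needs a SQLAlchemy connection rather than a plain sqlite3 one.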
With pandas, as we have a read_csv
method, we also have a read_sql
method. And in this case, what this method is going to receive: the first parameter
is going to be the query that we're passing, and the second parameter is going to be the
connection, that is, the connection object pandas uses to actually issue the
query. So it gets as simple as writing the query. And now everything has been imported
into a data frame, including column names and all that. If you want to get a little bit
fancier, you can also specify the index column, which is going to be used, of course,
as the index, and also which columns to parse as a specific type, such as dates. So now we have pretty much
all the work done. So we're going from something very manual, processing things with a cursor,
etc., which might also be slow, to using pandas to directly import the data from the
database. There is actually a caveat here that I'm going to tell you about; it's kind of a very
deep detail of the way pandas works, and it is that the read_sql method is actually a wrapper
for two other methods, read_sql_query and read_sql_table. Alright, so read_sql_table
and read_sql_query: when you're using read_sql, it's actually forwarding the
work to either query or table. read_sql_query is the default behavior, what we've done so
far; in this case, it's just going to issue a query over the connection and read
the results for you. In contrast, read_sql_table is going to read an entire table: you just pass
a name, and it's going to automatically give you all the information for it, in this
case all the column names, etc. So it's a lot simpler to read an entire table. The only
thing to keep in mind is that to use this method, you need to install this library,
SQLAlchemy, and pass a connection generated from it. So in this case, we create an engine
and we create a connection object, and now we can pass an actual connection object
for pandas to do it. So again, it's pretty much the same. If you find yourself doing
SELECT * FROM this table, SELECT * FROM that table, it's a lot easier just to use read_sql_table,
and that's just going to do it. As we saw that the read_csv
method had a to_csv counterpart, the same thing happens with read_sql: there
is a read_sql and there is also a to_sql. What it's going to let you do is take
the data frame and write it into a database table directly. So it's going to
also receive the connection, right? So to_sql is going to receive the name the table is going to have
for this data frame, and a connection object. Now something
to keep in mind is that to_sql has an important parameter, which is what happens if the table
already exists. By default, it's going to fail; it's just going to throw an error
when you are trying to save data to a table that exists. And this makes sense, because as data analysts
we're usually reading data and processing it, we're not so much writing it. So we want to
make sure that we don't overwrite it by mistake. But if you do actually want to write data, you
can just change this parameter, if_exists, to something like replace or append. Usually, we're writing
to intermediary tables. Again, you can choose either to replace the
whole content of the table, be careful here, or to append, right, just write it at the end of
the current table. So that's it for to_sql. So this was the way to read data from databases.
Of course, we're not touching on anything like SQL and all that; that is a lot more
advanced. It's just for you: if you already know SQL, if you're already working with databases,
you can pretty much copy and paste what we're doing here, and you're going to
get your data imported into Python. So let's move forward to read some HTML files. And
now very quickly, I'm going to show you how to read tables, as data frames, directly from
HTML web pages. To be honest, this is a simple method, it's going to be just read_html, but
it depends a lot on the structure of the web page. So if it's not well structured, or the
tables are not correctly created, you're going to have issues and you will have to do a ton
of data cleaning. In my experience, whenever I try to parse a table from a well structured
site like Wikipedia, or some stats site, it usually works very well. And it's a very quick
way of hacking, you know, whenever you have questions, you know, like, I don't know, I
need to know the GDP of countries. Instead of looking for a GDP data set, you can just
go to the Wikipedia page; there is usually a table there, you can directly parse it and you are
done. So again, it's a relatively simple way to get some data for quick hacking and
exploration. The way it's going to work is: we have this HTML we created. It's just for
testing purposes, to get started. Usually, of course, you will try to read something
from a live website, so you're going to pass the URL to the read_html method, and the read_html
method will download the content of the page and parse it. Let's suppose we have
the content already, the HTML, and this is what it looks like. This is exactly the
same HTML we have on top; I'm just displaying it here in the notebook. And what we're going to do
is invoke the read_html method. And the read_html method is going to
parse the entire HTML and look for multiple tables, not just one; a site will potentially
have multiple tables, even if you don't see them. It is a common way to structure things
in HTML, to use tables. That's why it's going to parse multiple tables. In this case, we
stored them all in dfs, plural, like multiple data frames. And we see that
there is only one. So in that case, we're just going to get the first data frame. And
it has correctly parsed what we had before, just working in the same way.
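A minimal sketch of read_html on an HTML string. The table contents are invented, and the call assumes an HTML parser that pandas can use (lxml, or BeautifulSoup with html5lib) is installed.

```python
import io

import pandas as pd

# Hypothetical page content; a real call could pass a URL instead.
html = """
<table>
  <thead>
    <tr><th>Order date</th><th>Region</th><th>Units</th></tr>
  </thead>
  <tbody>
    <tr><td>1/6/2018</td><td>East</td><td>95</td></tr>
    <tr><td>1/23/2018</td><td>Central</td><td>50</td></tr>
  </tbody>
</table>
"""

# read_html always returns a list of data frames, one per <table> it finds.
dfs = pd.read_html(io.StringIO(html))
df = dfs[0]  # only one table here, so take the first
```

Because the header row sits inside a thead element, pandas picks it up as the column names without any extra arguments.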
The same is going to happen with, for example, things like headings and all that: if the table
doesn't have a header, it's going to automatically understand that, right, in that case. So that's pretty
much what we know already. In this case, what you're going to see is what I told you
before about the data cleaning process: this table does not have a header like the previous
one, which has a thead element; in this case, the header is just another row. So
that's why read_html is going to have issues, and you have to provide a little bit of extra
information. So let's see another, more realistic example, and we're going to parse data directly
from a website. Let me tell you here, this is just for educational purposes: you always
need to understand if the data is public, so you can actually parse it. Again,
for Wikipedia, at least, which is what I use, the content is Creative Commons, so you can get your hands
on it. What we want to show you here is a very complicated table that has
multiple headers, etc. So that's why we're using this example. So we're going to get the
URL, and we're going to directly do nba_tables equals read_html. The only table in this page
is this one, the large one, so that works. And now we're going to get nba, which is going to be
that table, and we see that all the players in this case have been parsed. What about something
else? Let's actually open this page right here on Wikipedia, for The Simpsons. And here,
we will probably find several tables. See, we have one right here, this one. So I'm going
to import it. We have 27 tables. Again, you don't see them, sorry, but
they are there. And the most important one, the one we care about, is this one right here.
So the problem you're going to have with this
table is that it's using both colspans and rowspans. So in this case, this column
here spans one, two, three columns. And
this row here spans one, two, three, at least three rows. So those column spans result in this
very ugly data frame, and you will need a little bit of extra cleaning. So the problem
you're going to find with HTML tables is that usually there are things that are not well
formatted for machines, that are formatted for humans. So for example, in this case,
we have this header repeated. When you parse this data, you're going to find that every
20 rows there is going to be a header row, and you will have to clean it. Every, in this
case, 20 rows, you will need to drop it; you will do something like df.drop.
Let's see, actually, if we can see it. I haven't tried this, but let's just do it like that:
head, and you're going to find 25 records now. So here, at record 22, we find that header.
So what you will need to do is something like df.drop with a
range starting at 22, up to df.shape[0], right, this many rows, plus one,
and every 20 rows. I don't know if this is going to work, just run it. Nope,
it didn't even work. It didn't run. Oh, this is nba, actually. There you go.
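Dropping header rows that repeat inside the data, as attempted above, could be sketched in two ways. The tiny table below is invented (the repeat interval is 3 rather than 20 to keep it short), and the second option is an alternative the video does not show.

```python
import pandas as pd

# Hypothetical parsed table in which the header row repeats inside the data.
header = ["Player", "Team"]
rows = [["A", "X"], ["B", "Y"], header, ["C", "X"], ["D", "Z"], header, ["E", "Y"]]
df = pd.DataFrame(rows, columns=header)

# Option 1: drop by position when the repeats fall at a regular interval
# (here the first repeat is at position 2, then every 3 rows).
by_position = df.drop(df.index[2::3])

# Option 2: drop any row whose first cell equals the column's own name,
# which works wherever the repeated headers happen to fall.
is_header_row = df["Player"] == "Player"
by_content = df[~is_header_row].reset_index(drop=True)
```

The positional version depends on knowing the exact start and interval; the content-based version is usually safer when the spacing is irregular.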
So maybe it works, you can check it. But what I'm going to say is, again, there is some
cleaning to do, because HTML pages are optimized for humans, not for machines. So usually,
it's going to take a little bit more time. The good news is that there is usually a service
associated that you can consult. So for example, there is a Wikipedia API that you can use
instead of a page. But again, sometimes it's just easier to pull the data directly from Wikipedia.
So that's it. You can also write data to CSV or, of course, to HTML. That's pretty much the
standard. As we've said, this is all we had for the reading data portion. And we're going to
move forward now with a few other methods, especially what we call data wrangling. We're
going to do a little bit of grouping and keep moving forward with our tutorial. We
decided, kind of last minute, to add our final source of external data, which is going to be an
Excel file. It's just a common Excel file, you know it. We imagine that you might
come from an Excel background, so you can just export the data you have in your
Excel spreadsheets, load it into a Jupyter notebook and start working with it,
so you can try things out and draw parallels between Excel and what
you do with pandas and Python. So the first thing is, an Excel file is not a text file.
So if you try getting the content of it, it's not so simple to parse.
That's why it's going to require external tools. They are already installed in
Notebooks.ai, and they may be available in Google Colab as well, but on your own computer it
depends on how you install them. So just keep in mind that there might be issues when importing
data from Excel if there is low compatibility between the library you're using and the spreadsheet
version you're using. But without getting into those details, there is the
read_excel method, which pretty much takes care of everything for you. It has different parameters,
like the sheet that you're reading from, of course the path, etc. So
we're going to start by reading this file, which is a products file that has three sheets: products,
descriptions, and merchants. It's actually something we use in
our data analysis "from Excel to pandas" course, to show how to merge data and all
that. And with this file, what we're going to do is just call read_excel. What you're going to
see is that it reads the first sheet of the Excel file. I mean, a DataFrame just corresponds
to one sheet only, right? And the first one is products, so that's what we are reading.
There are different behaviors for it: you can change the way you parse headers, etc., and
you can define a specific index. That's pretty much everything we have seen
so far. Selecting specific sheets is simple: just pass the sheet name, and you
can read either products, merchants, or whatever is available in the current
Excel file. There is another way, a specific class, that is a little bit more
advanced: the ExcelFile class. So instead of what we were doing here, where read_excel
directly reads the Excel file into a DataFrame, you instantiate
the ExcelFile class, with the parameter being the file name. And now this object is going to
have a reference to everything you have. In this case we can ask, for example, for
sheet_names, and it's going to tell you: products, descriptions, merchants. This is a little bit
more exploratory data analysis. Let's say you can't use Excel to actually see the contents
of the Excel file; this is going to be helpful. You first parse the Excel file,
get the sheet names, and get a little bit more of an understanding of it. And now, from
this file we have previously instantiated right here, we can parse the
products sheet, and that's going to get you that DataFrame. And the same thing
is going to happen with all the parameters we can pass: they are the same as for read_excel.
Finally, you can save the results with to_excel. It works pretty much the
same way as to_csv: you decide if you write the index or not, and you can also decide if
you're going to pass a sheet name or just use the default one. So, as
you can see, getting your data from an Excel file into a DataFrame
is extremely simple. There are more customizations: let's say your whole table is shifted,
either by rows or by columns; you can change that with startrow or startcol, and that's going
to work too. So that's pretty much the only thing we need. If your writing process is
a little bit more complicated, for example if you want to write specific sheets in a
multi-sheet Excel file, you can use what we call an ExcelWriter, which is also part of pandas.
You instantiate the writer, and then you start the writing process, saying which sheets
you want to write and, for each one of those, which DataFrame. So again, reading and writing data
from and to Excel files is relatively simple. It all depends on the libraries that are installed.
It depends on what libraries you have in your current environment, whether it's Windows
or Linux or Mac. The documentation of pd.read_excel
might have more details for the platform that you have. Let's see if it lists them;
if it's not here, it's going to be in the pandas documentation. But there might
be a requirement for each one of the platforms that pandas
supports. So just check it out for your own platform, whether you're
on Windows, Mac or Linux, to see how to get those libraries installed.
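To make the Excel round trip described in this section concrete, here is a minimal sketch. It assumes pandas plus the openpyxl engine are installed, and it uses two small made-up DataFrames and an in-memory buffer in place of the course's products file, whose path and contents are not shown in the transcript.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two of the sheets mentioned in the tutorial.
products = pd.DataFrame({"product_id": [1, 2], "title": ["Laptop", "Phone"]})
merchants = pd.DataFrame({"merchant_id": [10, 20], "merchant": ["Acme", "Globex"]})

# ExcelWriter lets us write several DataFrames into one workbook, one sheet each.
# A BytesIO buffer is used instead of a file path so nothing is left on disk.
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    products.to_excel(writer, sheet_name="Products", index=False)
    merchants.to_excel(writer, sheet_name="Merchants", index=False)
data = buffer.getvalue()

# read_excel returns the first sheet unless sheet_name says otherwise.
first_sheet = pd.read_excel(io.BytesIO(data))
merchants_back = pd.read_excel(io.BytesIO(data), sheet_name="Merchants")

# ExcelFile opens the workbook once so we can inspect it before parsing a sheet.
with pd.ExcelFile(io.BytesIO(data)) as workbook:
    sheet_names = workbook.sheet_names
    products_back = workbook.parse("Products")

print(sheet_names)
print(products_back)
```

With a real workbook, the buffer would simply be replaced by the file path in each call.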
So in case you're just getting started with Python, and you might come from another language,
the objective of this quick section is to show you Python, ideally in under 10 minutes.
I think it's going to take a little bit more. But this is a very, very quick reference
of Python: again, just the high-level features of the language, how to use it, how to write
functions, how to import modules, variables, data types, collections, etc. You can just
scroll through this notebook if you want to take less time. I will be providing an
explanation on top of all the topics, but this is a very good reference of the entire
language. So, to get things started: Python is an old language, period. It
has caught more attention in the past five to 10 years, but it's a very old language.
It's even older than Java: it appeared in the early 1990s. It was created by this person,
Guido van Rossum, and he's an important actor in our ecosystem. For a long time he was
the one settling discussions when it came to defining features of
the language. Python is a high-level, interpreted, dynamic language. And this means
a ton, actually, if we read this entire sentence: interpreted, high level, general purpose.
It's basically a high-level programming language, it's object oriented, and it also includes
functional features, like functions as first-class objects, etc.
And of course it also supports imperative programming. It has a wide variety of
applications: you can do web development with Python, you can do scripting, it's used a lot
for systems work and for configuring machines in general, and of course you can also do
data science. It has multiple applications, and it has a couple of interesting features, like
indentation for defining blocks, that make it a very good language to get started
with programming. So if Python is your first language, you should be comfortable with it.
It's a very good idea. For me, it wasn't my first language. I wish it had been, but it wasn't.
But I have taught people programming with Python as their first language and, seriously,
it's always been very good for them, because Python doesn't have weird things like you might
have in JavaScript or Java. It's a very concise and consistent language, to be honest.
So let's get started very quickly. First of all, you can install Python on
your own computer, or you can use Notebooks.ai or Google Colab. But if you're installing it
on your own computer, you might see that you can install either Python 2 or Python 3.
Or, if you're reading tutorials online, you might see Python 2 and Python 3.
The reality is that Python 2 reached its end of life in 2020, so you should not
use it anymore. There are still ways to install Python 2, but it is no longer supported. So you
shouldn't use Python 2; you should stick with Python 3, which is the evolution
of the language. There are a ton of fixes over Python 2 in the way things happen in the
language, things that used to confuse beginners. That's no longer a problem. Python 3,
again, is what you should use. You will find multiple tutorials where they are
using Python 2. You should try using Python 3, and sometimes the code will break,
but the changes to fix it are not very hard. So, to get things started here, I will be drawing
parallels with the syntax of other languages. For example, this is the way you would define
a function in, for example, JavaScript. It's also very similar to something like C
or Java-based languages: the function keyword, curly braces, etc. So I will be drawing
parallels with these sorts of languages. To get things out of the way, the way to define a
function in Python is this. And the main characteristic of this language is that
the way we define blocks is by using different indentation levels. So this
is a valid function in Python: def is the keyword we use, then the name of the function and the
parameters it receives. And the way to define the body of the function is by just indenting
everything one level to the right. Usually, this is just four spaces.
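The function shown on screen is not part of the transcript, so the following is only a minimal sketch of the shape just described: the def keyword, a name, parameters, and a body indented four spaces. The name add and its parameters are hypothetical.

```python
def add(x, y):
    # Everything indented one level (four spaces) belongs to the function body.
    result = x + y
    return result


# Back at the left margin we are outside the function again.
total = add(2, 3)
print(total)
```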
Another example is an if/else statement: if this thing happens, do that; else, do
something else, right? This is JavaScript. In Python, again, it's defined by indentation.
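As a rough sketch of the indentation-based if/else being described (the exact code on screen is not in the transcript, so the variable language and its value are assumptions):

```python
language = "Python 3"  # hypothetical value for illustration

if language.startswith("Python"):
    # First block: one indentation level to the right.
    message = "do this"
    if language.endswith("3"):
        # Nested block: one more level to the right.
        message = "Py3"
else:
    message = "do something else"

# Back at the left margin: both blocks have finished.
print(message)
```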
If this thing happens, we indent one level to the right and do this; else, do something else.
If there was another if statement here, say, if language ends with something
like, I don't know, 3, then do something else: print "Py3", for example. So we're
indenting everything to the right every time we start a new block. The block finishes
just when you go back a level again: this print belongs to the first block, right? That's the
way it works, by indenting our blocks. This is very good because, first, we don't
have debates about where we should place the curly braces, and also because it makes the code
a lot more readable. It's a lot easier to read this code because indentation is
obligatory for the code to even work. So you can see that's just
how it works. The way we make comments in Python is just by using the number sign (hash) symbol,
there we go. And the way to define variables is just by specifying the name. Python
is a language where you don't need to declare variables beforehand: you declare and define
everything in just one pass, you define a variable as you go. Python is dynamically
typed, but it's also strongly typed. This might cause some confusion, but basically
you can assign any value you want to a variable, and you will see that collections, etc.,
are heterogeneous in terms of types. It is a very dynamic language. Talking about
types, I'm going to show you the most important types that we have in Python. First, we
have numbers: of course, integers. We don't have so many as you might find in
other languages, like different precisions, etc. We have integers. There is also the
concept of longs, which changed from Python 2 to Python 3. To be
honest, in Python 3 we just use integers, that's the way we work. It's a smart enough type
to save storage when needed, so that's good. And we also have floats, right,
which is the regular float type for floating point arithmetic in other languages. And of
course it suffers, if you want, from the strange behavior of floating point arithmetic,
like in this case. You can prevent that by using the decimal module, which, as you can
see, doesn't suffer from this issue. So for numbers we have integers, floats, and we
also have decimals. Strings are just the type str, and they are defined with literals,
as in the string you can see right here: you can just type the string as it goes. There
was a difference in Python 2 between Unicode and
strings, etc. In Python 3, that has all been fixed: in Python 3, this is all
Unicode. There is a conceptual difference between the text,
that is, the Unicode code points that make up this string, and the underlying encoding that
turns it into binary. So in Python 3 we still have a way to differentiate between
a binary string and a text-based string. For now you shouldn't worry
about it. I just want you to know that if you're reading a Python 2 tutorial, for example, you
might find a difference between Unicode strings and regular strings, which is no longer
something that we should be worrying about. If you have a string that is too long and
spans multiple lines, you can always write it using triple quotes; they can be double
quotes or single quotes.
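A small sketch of the number and string behavior just described. The specific values are illustrative (the notebook's own cells are not in the transcript); 0.1 + 0.2 is the usual example of floating point surprise, and the decimal module is the one named above.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)  # False

# Decimal does exact decimal arithmetic when built from strings.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))  # True

# Triple quotes (double or single) allow a string to span several lines.
poem = """Roses are red,
violets are blue."""
line_count = len(poem.splitlines())
print(type(poem), line_count)
```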
So creating multi-line strings is extremely simple. Booleans: there are two Boolean
objects, and they are unique, right? Each is kind of a singleton: the True and
False objects. They are of type bool. There is also the concept of null in Python,
which is None. We don't have null, we have None, but it serves pretty much the same purpose.
In Python, everything is an object. So even strange objects like None
have an associated class, if you want. Everything in Python is an object, including all
these types you have seen. For example, we have this string, and its type
is str. You can use the int, str, float and bool types, the ones returned by type,
also as functions. So in order to cast a string
into an integer, you do it using the int function, which is the
same thing that you get with this, for example: this is the same as this, as you can see.
Functions: again, def is the keyword we use. We don't use
function, we use def; you can use "define" as a mnemonic. Then comes the name of the function;
parameters are optional; and finally we have the return keyword. You should usually include
a return: 99% of the time, the function should return something, because that's going
to be the result assigned once we invoke the function. This is pretty regular. If your
function doesn't return anything explicitly, that is, if you haven't written down
a return statement anywhere in your function, the function will still return something.
The fact that you haven't included a return statement explicitly doesn't mean that
the function is not returning anything. Implicitly, it is returning something: it's
returning None, right. So by default, if you don't include a return, Python will do this.
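The points above (types used as casting functions, and the implicit None return) can be sketched as follows. The variable names and the greet function are hypothetical, chosen only for illustration.

```python
text = "13"
print(type(text))      # <class 'str'>

# The int type doubles as a function that converts a string to an integer.
number = int(text)
print(type(number))    # <class 'int'>


def greet(name):
    # No return statement anywhere in this function.
    print("Hello,", name)


# The call still evaluates to something: None.
result = greet("Ana")
print(result is None)  # True
```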
Just so you know: a function always returns something. Specifying parameters and passing
arguments is pretty standard. Python has some advanced features with parameters, like,
for example, variable-length arguments (we can pass as many arguments as we want, which makes
it very dynamic), keyword arguments, named arguments, etc. All the arithmetic operators you know
already: division, modulus; in this case we're doing a power operation. All this is pretty
standard. And the same thing happens with all our comparison operators: greater than, greater
than or equal to, etc. There is type checking: this is where we see the strongly typed
feature. Even though Python is dynamically typed, the types are enforced. In this
case, you cannot compare a 2 with this; it doesn't make any sense, and Python is going to complain
about that. This is an example of an error in Python: the exception TypeError was raised.
And the same thing goes for Booleans and the not, and, or operators. As we saw before, control flow is
defined by the indentation, so every new block is defined with an indentation level. Python
includes if, else and also elif, which is very convenient. This is an example: if this
happens, elif, elif, etc. Python does not have a switch statement, for example. Loops:
how are you going to loop through something in Python? Loops and lists,
or collections in general, are very interconnected, because in reality, when you're looping in
Python, you're not doing a regular counter loop. In Python we don't have something like what
you have in Java, where you're going to have something like int i equals zero,
what else, i...
It's been ages since I last coded in Java, so, I don't know,
i less than 10, and here we do i plus plus. There you go. So we
don't have this in Python. We have a way to mimic it, but in Python we always
iterate over a collection. So what we're going to do is create a range of elements
and iterate over it. The way it works is very close to what in other languages
is a for-each. All right, so in this case we have all these elements and
we do for name in names, that's it. And at any moment, name is going to
be associated with an element in the list. While loops are part of the language; they
are usually discouraged in favor of for loops. If something can be coded with a for loop,
it should be coded with a for loop and not a while loop because, as you might know already,
a while loop might result in an infinite loop if you're not checking the
conditions correctly. The collections we have in Python, the fundamental ones,
the primitive ones, the most important ones, are, first, the list. In Python we make heavy
use of lists. A list is just a heterogeneous data structure, so you can put anything in
it. Actually, all these collections are heterogeneous: you can mix values as you
want. In this case, we have the elements that we have added: one string, one integer,
one string, and one Boolean. And let me say something here. Even though Python
supports mixed types in collections, it doesn't mean that you should do it. To be
honest, you should usually avoid mixing types in collections, because that
means we don't know what we're putting in them, right? We should be consistent.
So, if possible, revisit your code if you have too many different types in it. Here I'm checking
the length with the len function. Accessing elements is zero-indexed, and we use square brackets.
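A minimal sketch of the list basics mentioned here: a mixed-type list, the len function, and zero-based indexing with square brackets. The element values are made up, and the negative index on the last line is an extra illustration of counting from the end.

```python
# Hypothetical mixed-type list: one string, one integer, one string, one Boolean.
items = ["hello", 3, "world", True]

count = len(items)   # number of elements
first = items[0]     # indexing starts at zero
second = items[1]
last = items[-1]     # negative indexes count from the end

print(count, first, second, last)
```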
So in this case: give me the first element, give me the second element. We can also
index starting from the end, in this case with minus one, minus two,
minus three. Minus one and minus two, again, give you different elements. You
can check the operations associated with all these collections. Very quickly, for a list there is
l.append: we append the new element, so the list now has that element
at the end. And we can check if an element is part of the list: in this case it's True, in
this case it's False. Tuples are similar to lists; they are also sequences, but the main
difference is that they are immutable. There is no way to add new elements to a tuple
or remove elements from a tuple once it has been created. So in this case, we have created
a list, no, a tuple, sorry, with three elements. We can access it, and we
can check if something is in it, in the same way that we did with a list. But in this case,
with a tuple, again, you cannot modify it: a tuple never changes, you can't add elements
to it. Another important data structure is the dictionary. In Python, a dictionary is a
key-value mapping. It's similar to an object in JavaScript or a hash map in
Java. It's a key-value mapping type. In this case, we are going to associate
values with names. The way I like to explain it is by first creating
a tuple or a list, right? So let's say we're going to create a list out of all these elements.
Give me one second, we're going to create a list. There we go, we're going to copy these
elements and assign them to our list. There you go. So this is a list.
We could very well store the information about our customers in a list, right? That works.
I mean, I can get it done. The problem is that whenever I need to access information
in this list, say, for example, give me the email for this
customer, I have to remember the position where the email is located. In this case it
is going to be position number one. If this information grows and, instead of having one, two,
three, four pieces of information for our user, we have 100, right, then it's going to
be very hard to access those individual values. That's why we create dictionaries. Dictionaries
are collections of values. The important part is on the right; the important part is the
value. But instead of being indexed just by position, we give them arbitrary names,
very explicit names. This is the name, this is the email. This is the age.
And this is whether they are subscribed or not. Once we create this dictionary, we can
access those values by name: give me the email of this user. Or we can check: is the age present
in the user dictionary? Is the last name present in the user dictionary?
So again, it's a way to store information, associating it with a name in order to make it
simpler for us later. Let me delete this, and I'll move on to sets. Sets are a very common data
structure when you're learning about collections and data structures in general, but they are
not so common in many languages; I mean, they're not very popular. In Python we use them
often, because they have a very interesting feature. First of all, and it's something that I forgot
to tell you about dictionaries: both sets and dictionaries are what we call unordered
data structures; you never know the order of the elements. In recent versions of Python
(3.7 and later) there have been changes that make dictionaries keep insertion order. But for
now, I'm going to say you shouldn't rely on it; you should think of your dictionaries as
completely unordered data structures, and the same thing for sets. A set is a bag that contains
elements, you know, it's a big bag. You keep throwing elements inside the set, and there
is no order in it. And what's going to happen with it? You're going to add elements, for example,
to the set, or you're going to remove elements from the set. And there is one important thing
that makes the set so useful, and it's the membership operation. I'm going to
write it down here: membership operation, there you go. So you can access these notebooks
later.
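Here is a small sketch of the set operations mentioned so far: creating a set (with some repeated values on purpose), adding, removing, and testing membership. The numbers are illustrative and not necessarily the ones in the notebook.

```python
# Repeated values are written on purpose; a set keeps only one copy of each.
s = {3, 1, 3, 7, 9, 1, 3, 1}
unique_count = len(s)

s.add(5)      # add an element
s.remove(7)   # remove an element (raises KeyError if it is missing)

# Membership operation: is this value in the set?
has_nine = 9 in s
has_seven = 7 in s

print(unique_count, has_nine, has_seven)
```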
So, the membership operation: the process of checking if something, say 9, is
in s is extremely fast; it's what we call O(1). And
as you might have seen here, when I created this set I included a couple of
repeated elements: 3, 3, 3, right, 1, 1, 1, 7, 9. The resulting set doesn't have those repeated
elements. These are two features of the set: the set will only contain unique values, and
the way it's implemented behind the scenes makes those unique values extremely
simple to check. Whenever you run this membership operation it's extremely simple, or sorry,
extremely performant. It's very fast, unlike, for example, a list. So keep it in mind:
sets are very, very useful when you're checking for members. So again, as I told you before,
we're going to iterate over collections with the for loop. In this case, if we have
a list, it's going to be for element in list. There you go. If you have a
dictionary, in this case user, the default iteration is by
key: we're going to get name, email, age, subscribed, and we have to extract the value
out of the dictionary. We could also do for value in user.values(). There
you go. Or you can iterate over both key and value with items: key and value. There you
go. So iteration in Python is very readable, to put it in a way. And again, remember,
we're always using the for loop, which assumes that you're iterating over a collection. We
don't have the for i equals zero, i less than 10, i plus plus;
we don't have that in Python. We can simulate it with for i in range(5), for
example, and print. We simulate it with the range function, which generates pretty much
those elements. Something that you might have heard about Python is that it has a huge library
of built-in modules that you can just import, and it's just going to work. There are so many
things already coded in Python that it makes it very simple for you to create something
on top. Do you want a library for, I don't know, security, cryptography, math, numeric processing
(NumPy, right?), machine learning, web development, creating games (there is Pygame)? Do you
want to create a graphical user interface? Whatever you want to do, there is usually
a library that has already been coded and will make your job easier. On top of that, there is
the built-in standard library, which is already included with Python. It's
not third party: it's
created by the Python core team. It's a huge
library, with so many modules. And the way it works is by importing a module; this is the
way we work with packages and modules. There are differences between modules, packages
and third-party libraries, but that is a little bit more advanced. Again, this gives us
a random number generator that's already built in. And you can check the docs
right here.
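A minimal sketch of importing a standard-library module, using the built-in random number generator mentioned above. The seed value and the die-roll range are arbitrary choices for illustration.

```python
import random

# Seeding makes the sequence repeatable between runs (the value is arbitrary).
random.seed(42)

# randint returns an integer between the two bounds, both included.
roll = random.randint(1, 6)

# choice picks one element from a sequence.
colors = ["red", "green", "blue"]
color = random.choice(colors)

print(roll, color)
```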
Exceptions are raised whenever you do something that doesn't work. In this case, we ask if
the age is greater than 21, but age is a string, it's not an integer, so this is going
to fail. We can catch exceptions so they don't stop the program; that's done with a try/except
block. In that case, if anything in here fails, this block is
going to kick in, and you can catch the exception without the program failing.
And you can be more explicit about the error you expect. So again, this is just an introduction.
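Going back to the try/except block just described, here is a minimal sketch of the failing comparison and how to catch it. The variable names and messages are assumptions; naming TypeError in the except clause is the "more explicit" form.

```python
age = "30"  # a string, not an integer

try:
    # Comparing a str with an int raises TypeError in Python 3.
    if age > 21:
        message = "You may enter"
    else:
        message = "Too young"
except TypeError as error:
    # This block kicks in instead of the program crashing.
    message = f"Could not compare age: {error}"

print(message)
```

Catching a specific exception type keeps unrelated errors visible instead of silently swallowing them.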
It might be useful if you're coming from another language, especially to keep this notebook
as a reference. We're going to be using Python a lot, of course, and it's a great language
if you want to do scripting, web development and, of course, data processing, data analysis,
visualizations, machine learning, etc. Python is just great. So I hope this tiny, tiny review
lesson helps you port your knowledge from other languages into Python. And that's it.
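As a short recap of the dictionary and iteration material in this section, here is a hedged sketch. The user record and its values are made up; only the key names follow the ones mentioned in the lesson.

```python
# Hypothetical user record; keys follow the names used in the lesson.
user = {"name": "Ana", "email": "ana@example.com", "age": 30, "subscribed": True}

email = user["email"]                # access a value by its key
has_last_name = "last_name" in user  # membership is checked against the keys

# Default iteration over a dictionary is by key.
keys = []
for key in user:
    keys.append(key)

# items() yields key and value together.
for key, value in user.items():
    print(key, value)

# Lists can grow; tuples cannot be changed after creation.
names = ["Ana", "Luis"]
names.append("Mia")
point = (1, 2, 3)

# range() plays the role of the classic counter-based for loop.
squares = []
for i in range(5):
    squares.append(i * i)

print(email, has_last_name, names, point, squares)
```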