Data Analysis with Python - Full Course for Beginners (Numpy, Pandas, Matplotlib, Seaborn)

4h 22m 10s | 40,890 words | 2,491 segments | English

FULL TRANSCRIPT

0:00

Welcome to our data analysis with Python tutorial. My name is Santiago and I will be your instructor.

0:04

This is a joint initiative between freeCodeCamp and RMOTR. In this tutorial, we'll

0:09

explore the capabilities of Python and the entire PyData stack to perform data analysis.

0:13

We'll learn how to read data from multiple sources such as databases, CSV and Excel files,

0:19

how to clean and transform it by applying statistical functions and how to create beautiful

0:23

visualizations. We'll show you all the important tools of the PyData stack: pandas, Matplotlib,

0:29

Seaborn and many others. This tutorial is going to be useful both for Python beginners

0:33

that want to learn how to manage data with Python, and also traditional data analysts

0:38

coming from Excel, Tableau, etc. You'll learn how programming can power up your day-to-day

0:43

analysis. So let's get started.

0:57

Welcome to our data analysis with Python tutorial. My name is Santiago and I am an instructor at RMOTR.com,

1:03

an online Data Science Academy. This tutorial is the result of a joint effort by RMOTR and

1:09

freeCodeCamp, and it's totally free. It includes slides, Jupyter notebooks and coding

1:15

exercises. Let me tell you a little bit more about RMOTR. We're an online, hands-on Data

1:20

Science Academy. We specialize in data science, including data analysis, programming and machine

1:26

learning. We have a complete course catalog and we're adding more content every month.

1:31

If you're interested in learning data science or data analysis, check us out. As part of

1:36

this joint effort between freeCodeCamp and RMOTR, you can get a 10% discount on your

1:42

first month by using the following discount coupon. Let's quickly review the contents

1:47

of this tutorial. In the description of this video, we have included direct links to each

1:52

section, so you can jump between them. This is the first section and we are going to discuss

1:58

what is data analysis. We'll also talk about data analysis with Python and why programming

2:03

tools like Python, SQL and pandas are important. In the following section, we'll show you a real

2:09

example of data analysis using Python, so you can see the power of it. We will not explain

2:15

the tools in detail. It's just a quick demonstration for you to understand what this tutorial is

2:20

about. The following sections will be the ones explaining each tool in detail, there

2:25

are two more sections that I want to especially point out. The first one is section number

2:30

three, the Jupyter tutorial. This is not mandatory, and you can skip it if you already know how

2:35

to use Jupyter notebooks. Also the last section, Python in under 10 minutes. This is just a

2:41

recap of Python. If you're coming from other languages, you might want to take this first.

2:46

If that's the case, again, you can use the links in the video description to jump straight

2:51

to it. All right, now let's define what data analysis is. I think the Wikipedia article

2:57

summarizes it perfectly: a process of inspecting, cleansing, transforming and modeling data

3:03

with the goal of discovering useful information, informing conclusions and supporting decision

3:09

making. Let's analyze this definition piece by piece. The first part of the process of

3:15

data analysis is usually tedious. It starts by gathering the data and cleaning it and

3:20

transforming it for further analysis. This is where Python and the PyData tools excel.

3:26

We're going to be using pandas to read, clean and transform our data. Modeling data means

3:32

adapting real life scenarios to information systems using inferential statistics to see

3:37

if any pattern or model arises. For this we're going to be using the statistical analysis

3:44

features of pandas and visualizations from Matplotlib and Seaborn. Once we have processed

3:48

the data and created models out of it, we'll try to draw conclusions from it, finding interesting

3:55

patterns or anomalies that might arise. The word information here is key. We're trying

4:00

to transform data into information. Our data might be a huge list of all the purchases

4:07

made in Walmart in the last year. The information will be something like Pop-Tarts sell better

4:13

on Tuesdays. This is the final objective of data analysis. We need to provide evidence of our

4:19

findings, create readable reports and dashboards, and aid other departments with the information

4:24

we've gathered. Multiple actors will use your analysis: marketing, sales, accounting, executives,

4:29

etc. They might need to see a different view of the same information. They might all need

4:35

different reports or levels of detail. What tools are available today for data analysis?

4:41

We've broken these down into two main categories. Managed tools are closed products, tools

4:47

you can buy and start using right out of the box. Excel is a good example. Tableau and

4:52

Looker are probably the most popular ones for data analysis. At the other extreme, we

4:57

have what we call programming languages, or we call them open tools. These are not sold

5:03

by an individual vendor, but they are a combination of languages, open source libraries and products.

5:09

Python, R and Julia are the most popular ones in this category. Let's explore the advantages

5:15

and disadvantages of them. The main advantage of closed tools like Tableau or Excel is that

5:21

they are generally easy to learn. There is a company writing documentation providing

5:25

support and driving the creation of the product. The biggest disadvantage is that the scope

5:31

of the tool is limited; you can't cross the boundaries of it. In contrast, using Python

5:36

and the universe of PyData tools gives you amazing flexibility. Do you need to read data

5:42

from a closed API using secret key authentication, for example? You can do it. Do you need to

5:48

consume data directly from AWS Kinesis? You can do it. A programming language is the most

5:54

powerful tool you can learn. Another important advantage is the general scope of a programming

5:59

language. What happens if Tableau, for example, goes out of business? Or if you just get bored

6:04

of it and feel like your career is stuck and you need a career change? Learning how to

6:08

process data using a programming language gives you freedom. The main disadvantage of

6:14

a programming language is that it's not as simple to learn as a tool. You need to

6:20

learn the basics of coding first, and it takes time. Why are we choosing Python to do data

6:26

analysis? Python is the best programming language to learn to code. It's simple, intuitive,

6:32

and readable. It includes thousands of libraries to do virtually anything, from cryptography

6:37

to IoT. Python is free and open source. That means that there are thousands of eyes, very smart

6:44

people, seeing the internals of the language and the libraries. From Google to Bank of America,

6:49

major institutions rely on Python every day, which means that it's very hard for it to

6:54

go away. Finally, Python has a great open source spirit. The community is amazing, the

7:00

documentation is so exhaustive, and there are a lot of free tutorials around. Check out

7:05

conferences in your area; it's very likely that there is a local group of Python developers

7:10

in your city. We couldn't be talking about data analysis without mentioning R. R is also

7:17

a great programming language. We prefer Python because it's easier to get started and more

7:22

general in the libraries and tools it includes. R has a huge library of statistical functions.

7:27

And if you're in a highly technical discipline, you should check it out. Let's quickly review

7:32

the data analysis process. The process starts by getting the data. Where is your data coming

7:38

from? Usually it's in your own database, but it could also come from files stored in a

7:44

different format, or a web API. Once you've collected the data, you will need to clean

7:49

it. If the source of the data is your own database, then it's probably in the right shape.

7:54

If you're using more extreme sources like web scraping, then the process will be more

7:59

tedious. With your data clean, you'll now need to rearrange and reshape the data for

8:05

better analysis: transforming fields, merging tables, combining data from multiple sources,

8:11

etc. The objective of this process is to get the data ready for the next step. The process

8:17

of analysis involves extracting patterns from the data that is now clean and in shape, capturing

8:22

trends or anomalies. Statistical analysis will be fundamental in this process. Finally,

8:28

it's time to do something with that analysis. If this was a data science project, we could

8:33

be ready to implement machine learning models. If we focus strictly on data analysis, we'll

8:39

probably need to build reports, communicate our results, and support decision making.

8:44

Let's finish by saying that in real life, this process isn't so linear, we're usually

8:49

jumping back and forth between the steps, and it looks more like a cycle than a straight

8:55

line. What is the difference between data analysis and data science? The boundaries

9:00

between data analysis and data science are not very clear. The main differences are that

9:05

data scientists usually have more programming and math skills, they can then apply these

9:11

skills in machine learning and ETL processes. The analysts, on the other hand, have better

9:16

communication skills, creating better reports with stronger storytelling abilities. By the

9:22

way, this radar chart you're seeing right here is available in the notes in case you

9:26

want to check out the source code. Let's explore the Python and PyData ecosystem, all the

9:31

tools and libraries that we will be using. The most important libraries that we will

9:36

be using are pandas for data analysis, and Matplotlib and Seaborn for visualizations.

9:41

But the ecosystem is large and there are many useful libraries for specific use cases. How

9:46

do Python data analysts think? If you're coming from a traditional data analysis place using

9:52

tools like Excel and Tableau, you're probably used to having a constant visual reference of

9:57

your data. All these tools are point and click. This works great for a small amount of data.

10:03

But it's less useful when the amount of records grows. It's just impossible for humans to visually

10:09

reference too much data, and the processing gets incredibly slow. In contrast, when we

10:14

work with Python, we don't have a constant visual reference of the data we're working

10:19

with. We know it's there. We know what it looks like. We know the main statistical properties

10:24

of it, but we're not constantly looking at it. This allows us to work with millions

10:29

of records incredibly fast. This also means you can move your data analysis processes

10:34

from one computer to the other, and for example, to the cloud without much overhead. And finally,

10:41

why would you like to add Python to your data analysis skills aside from the advantages

10:45

of freedom and power? There is another important reason: according to PayScale, data analysts

10:52

that know Python and SQL are better paid than the ones that don't know how to use programming

10:57

tools. So that's it. Let's get started. In our following section, we'll show you a real

11:02

world example of data analysis with Python. We want you to see right away what you will

11:07

be able to do after this tutorial.

11:15

We're gonna start this tutorial by working with a real example of data analysis and data

11:20

processing with Python. We're not going to get into the details yet; the following sections

11:26

will explain what each one of the tools does, and what is the best way to apply them combining

11:34

and the details of them. In general, this is just for you to have a quick and high-level

11:39

reference of our day-to-day processes as data analysts, data managers, data scientists using

11:46

Python. So the first data set that we're going to use is a CSV file that has this form, you

11:53

can find it right here, under the data directory. The data we're going to be using is this. I

12:00

have just transformed it into a spreadsheet. So we can pretty much look at it from a more

12:05

visual perspective. But remember, as we said in the introduction, as data analysts we are

12:11

not constantly looking at the data, right, we don't have a constant visual reference,

12:17

we are more driven by the understanding of the data right in the back of our head, and

12:22

we understand what the data looks like, what's the shape of it. And that's what is

12:26

conducting our analysis. So the first thing we're going to do is we're going to read

12:30

this CSV into Python, and you can see how simple it is just one line of code gets us

12:36

the CSV read into Python. Then we're going to give it a quick reference. And this is what the

12:41

data frame that we have created looks like. Data frame is a special word; it's a special

12:48

data structure we use in pandas. And again, we're going to see that in detail in

12:53

the pandas part of this tutorial. The data frame is pretty much the CSV representation,

12:58

but it has a few more enforced things like for example, each column has a strict data

13:06

type. And we will not be able to change it, etc. It's a better way to conduct our

13:11

analysis, the shape of our data frame tells us how many rows and how many columns we have.
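As a rough sketch of the steps just described (reading a CSV in one line and checking the shape of the resulting data frame), something like the following could be used. The inline CSV text and the column names are assumptions standing in for the course's actual sales file, which is not reproduced here.

```python
import io

import pandas as pd

# Hypothetical stand-in for the course's sales CSV; the column names are
# assumptions modeled on the fields mentioned in the lecture.
csv_text = """Date,Customer_Age,Age_Group,Country,Order_Quantity,Unit_Cost,Unit_Price
2013-11-26,19,Youth (<25),Canada,8,45,120
2015-11-26,19,Youth (<25),Canada,8,45,120
2014-03-23,49,Adults (35-64),Australia,23,45,120
"""

# One call reads the CSV into a DataFrame; parse_dates converts Date to datetimes.
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(sales.head())   # first rows, as a quick visual reference
print(sales.shape)    # (number of rows, number of columns)
print(sales.dtypes)   # each column carries a single data type
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV.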

13:17

So you can imagine that with this amount of rows, it's not so simple to again, follow

13:23

a visual representation of it. It's pretty much infinite scrolling at this point,

13:30

100,000 rows. But the way we work is that, immediately after we load our data, we want to

13:40

find some sort of reference in the shape and the properties of the data we're working

13:45

with. And for that we're going to do first an info to quickly understand the columns

13:50

we're working with. In this case, we have date, which is a date time field, we have

13:54

day, month, year, which are just complementary to date. We have the customer age, which is

14:00

an integer, which makes sense, right? Age group, you can see it's right here. It's age group,

14:07

youth. Customer gender. We have an idea, again, of the entire data set. We know the

14:13

columns we have, but we also know how large it is. And we don't care what's in between,

14:19

we will be cleaning it probably, but we don't need to actually start looking row by row,

14:26

right just with our very limited eyes, we have a better understanding of the structure

14:32

of our data in this way. And we're going one step further, we will also have a better understanding

14:38

of the statistical properties of this data frame with a describe method. For all those

14:43

numeric fields, I can have an idea

14:45

of the statistical properties of those. So for example, I know that the average age of

14:52

this data set is 35 years. I also know that the maximum age in this case, in these

14:59

orders, the sales data, is 87 years old. I know the minimum is 17 years old. And again, I

15:07

can start building my understanding of the statistical properties of it. So

15:13

in this case, the median of my age is very close to the mean. So this is telling me,

15:18

it's all telling me something, and the same thing is going to happen for each one of the

15:22

columns that we are using.
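A minimal sketch of the `info` and `describe` step follows. The small data frame is made up for illustration (only the 17 and 87 age extremes echo the lecture), and the column names `Customer_Age` and `Profit` are assumptions.

```python
import pandas as pd

# Made-up sample; only the idea (structure first, then statistics) follows the lecture.
sales = pd.DataFrame({
    "Customer_Age": [17, 28, 35, 35, 41, 54, 87],
    "Profit": [-30, 120, 85, 400, 15, 260, 1500],
})

# Structure: column names, non-null counts and data types.
sales.info()

# Statistics for every numeric column: count, mean, std, min, quartiles, max.
summary = sales.describe()
print(summary)

# The same numbers are available for a single column.
age = sales["Customer_Age"]
print(age.mean(), age.median(), age.min(), age.max())
```

Comparing the mean with the median (the `50%` row) is a quick way to spot skew, and a negative `min` in a column like profit is the kind of value worth a second look.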

15:26

For example, we have a negative profit here, and we have very large values here. Are these

15:33

correct? Maybe there's a mistake. Again, it's by having a quick statistical view of

15:39

our data, we're going to be driving the process of analysis without the need of constantly

15:45

looking at all the rows that we have. It's a more general, holistic overview.

15:50

So we're gonna start with unit cost. Let's see what it looks like. And we're going

15:54

to do a describe only of unit cost, which is pretty much what we had right here. In

15:58

the previous line, what we did was for the entire data frame, for the entire data;

16:03

in this case, we're just focusing on the unit cost column. The mean, the

16:09

median, all fields, we know already pretty much from this, and we're gonna quickly plot

16:14

them. We're going to use these tools to visualize them. And it's the same tool, it's pandas

16:18

that we're using on top, right? It's using Matplotlib. So the visualization is created

16:25

with matplotlib. But we're doing it directly from pandas. And again, don't worry, this

16:29

is all explained in the pandas lessons. So this is unit cost, right? This is the box

16:34

plot we have just created. We have the whiskers, and the box that shows us the first and

16:41

third quartile, the median. And then we see all the outliers that we have right here.
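To make the box plot description concrete, here is a hedged sketch that computes the quartiles and outliers by hand and then draws the plot from pandas. The `Unit_Cost` values are invented, and the 1.5 × IQR rule shown is Matplotlib's default whisker rule, not something stated in the lecture.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display

import pandas as pd

# Invented unit costs; the last two values are deliberately extreme.
sales = pd.DataFrame({"Unit_Cost": [2, 9, 9, 13, 42, 42, 45, 100, 1200]})
unit_cost = sales["Unit_Cost"]

# describe() on a single column gives the same summary as on the whole frame.
print(unit_cost.describe())

# The box spans the first and third quartiles, with the median inside it.
q1, median, q3 = unit_cost.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers.
is_outlier = (unit_cost < q1 - 1.5 * iqr) | (unit_cost > q3 + 1.5 * iqr)
outliers = unit_cost[is_outlier]
print(outliers.tolist())

# pandas delegates the drawing to Matplotlib and returns the Axes object.
ax = unit_cost.plot(kind="box")
```

In a notebook the `matplotlib.use("Agg")` line is unnecessary, since the chart renders inline.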

16:48

So we see that a product that is around $500 is considered to be an outlier. And the

16:55

same thing if we do a density plot, right. So this is what it looks like. We're going

17:00

to draw two more charts, right, in which we're going to pretty much point out the mean and

17:05

the median, right in the distribution charts. And we're going to do a quick histogram of

17:11

the costs of our products. Moving forward, we're going to talk about age groups with

17:17

the age of a customer. And at any moment, we can always do something like sales sort

17:23

here to get a quick reference. We know that the age of the customer is expressed in

17:30

actual years, but they have also been categorized into three, four, actually

17:40

four age groups: seniors, youth, young adults and adults, right? So we have given categories

17:48

that were created, right, to better understand these groups, and we do that with value counts.
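A small sketch of counting rows per category is shown here. The `Age_Group` labels and the sample rows are assumptions; only the four group names come from the lecture.

```python
import pandas as pd

# Invented sample rows; the label format is an assumption.
sales = pd.DataFrame({
    "Age_Group": [
        "Adults (35-64)", "Young Adults (25-34)", "Adults (35-64)",
        "Youth (<25)", "Seniors (64+)", "Adults (35-64)",
        "Young Adults (25-34)",
    ]
})

# value_counts() returns one row per category, most frequent first.
counts = sales["Age_Group"].value_counts()
print(counts)

# normalize=True gives proportions instead of absolute counts.
shares = sales["Age_Group"].value_counts(normalize=True)
print(shares)
```

The resulting Series can be plotted directly, for example with `counts.plot(kind="pie")` or `counts.plot(kind="bar")`.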

17:53

With value counts, we can quickly get a pie chart out of it, or we could get a bar chart out

17:59

of it. As you can see, right here, we're doing an analysis of our data, we see that adults

18:06

right here are the largest group in our data at least. So moving forward, what

18:14

about a correlation analysis? What is the correlation between some of our properties? We will probably

18:22

have high correlation, for example, between profit and unit cost, or order

18:30

quantity. That's kind of expected, but that's all something that we can do right here. This

18:35

is a correlation matrix, showing high correlation in red. So order quantity, and

18:43

unit cost... or where is profit? Right here. Profit is right here. So we see high correlation

18:50

with unit cost, with profit. No, with profit, actually, it's the opposite: blue is

18:57

high correlation, I'm sorry. The diagonal, which is blue, is correlation equal to one.

19:03

So high correlation is blue. And we see that profit has a lot of correlation,

19:08

positive correlation with unit cost and unit price. And negative correlation is in dark

19:15

red. So we again can have a quick idea. Let's see, for example, here profit, it has negative

19:23

correlation with order quantity, which is interesting, right? We would dig deeper

19:30

into that. Of course, profit has a high positive correlation with revenue, right?

19:35

And again, it's just a quick correlation analysis. We can also do a quick scatterplot to analyze

19:41

the customer age and the revenue right to see if there is any, any correlation there.
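The correlation analysis can be sketched with `DataFrame.corr`. The numbers below are fabricated so that the relationships are easy to see; the column names are assumptions.

```python
import pandas as pd

# Fabricated sample: revenue is quantity times price, profit is 40% of revenue.
sales = pd.DataFrame({
    "Order_Quantity": [1, 2, 3, 4, 5],
    "Unit_Price": [800, 500, 200, 90, 20],
})
sales["Revenue"] = sales["Order_Quantity"] * sales["Unit_Price"]
sales["Profit"] = sales["Revenue"] * 0.4

# Pairwise Pearson correlation between all numeric columns.
corr = sales.corr()
print(corr)

# A single pair can be read straight from the matrix.
print(corr.loc["Revenue", "Profit"])
```

Every value on the diagonal is 1, because each column is perfectly correlated with itself. A heat map of this matrix, or a scatter plot such as `sales.plot(kind="scatter", x="Revenue", y="Profit")`, gives the visual version.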

19:48

Right? And the same thing for revenue and profit. This is obvious, right? We

19:54

can quickly draw a diagonal here, right? So there is a linear dependency

20:02

between these variables. So, a few more box plots, in this case understanding the

20:07

profit per age group, right, so we can see how the profit will change depending

20:17

on the customer's age, and a few more box plots. And we're creating this grid

20:24

of year customer age, unit costs, etc, for multiple things. So moving forward, something

20:33

that we can quickly do when we're working with Python, especially with pandas, is to

20:38

reshape our data or derive it from other columns, right? So this is pretty common in Excel.

20:43

We can create this revenue per age column. If you're here in Google spreadsheets, you're

20:47

going to do something like revenue, per age, and you're going to do something like equals,

20:56

right? Equals revenue, divided, I don't remember if this is the correct formula we're using, but just

21:04

for, for you to have a reference. And we're going to pretty much extend this whole thing.

21:10

There we go. Oh, well, it's processing, and I have 100,000 rows. So you can see how slow

21:19

it is. Let's compare that to the way Python works. I'm gonna execute this thing.

21:24

It was instant, you know, extremely fast. And it was all calculated. It seems that we have

21:30

the same results, as expected. And we can quickly plot both

21:38

in a density plot and in a histogram, as you can see right there. Not that revenue

21:43

per age is going to be relevant in any case; it's just to show you the capabilities of

21:48

what we can do. Let's analyze... well, we're gonna create a new column, which is

21:53

calculated cost: the total quantity of the order times

22:02

the cost, right, extremely simple formula, very fast process. And we're gonna get right

22:10

here, how many rows had a different value than what was provided by cost. So what we're

22:18

doing right here is like, we're quickly checking if the cost provided by the data set, at some

22:25

point doesn't align with the actual cost we are calculating. So are there any mistakes

22:30

that were made by, I don't know, the original system, or people doing data entry? If this

22:37

new column is different from cost, we want to know about that. And that doesn't happen.
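The derived-column check described above might look like this in pandas. The column names (`Order_Quantity`, `Unit_Cost`, `Cost`, `Calculated_Cost`) and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Invented rows in which the provided Cost already equals quantity * unit cost.
sales = pd.DataFrame({
    "Order_Quantity": [8, 23, 20, 4],
    "Unit_Cost": [45, 45, 45, 45],
    "Cost": [360, 1035, 900, 180],
})

# The arithmetic is applied to whole columns at once; no explicit loop is needed.
sales["Calculated_Cost"] = sales["Order_Quantity"] * sales["Unit_Cost"]

# Comparing two columns yields booleans; summing them counts the mismatching rows.
mismatches = (sales["Calculated_Cost"] != sales["Cost"]).sum()
print(mismatches)
```

A result of zero means the cost provided by the data set agrees with the recomputed one on every row.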

22:41

So again, a quick regression plot. In this case, it's very obvious that there is

22:48

some linear dependency between calculated cost and profit. So more formulas, in this case

22:56

cost plus profit. So we're adding a little bit more. There is no difference

23:01

between the revenue and the calculated revenue that we are having. So that all makes sense.

23:06

We're going to do a quick histogram of the revenue. We can, for example, add 3% to all

23:15

the prices that we are using. We need to increase prices. How are you going to do that? Well,

23:21

it's very simple with Python. We're just going to increase everything by 0.03, that is, by 3%. And

23:28

now all the prices have changed.
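The revenue check and the 3% price increase can be sketched together as follows. The figures and column names are made up, and multiplying by 1.03 is one assumed way of expressing a 3% increase.

```python
import pandas as pd

# Invented rows in which Revenue equals Cost + Profit.
sales = pd.DataFrame({
    "Cost": [360, 1035],
    "Profit": [590, 1366],
    "Revenue": [950, 2401],
    "Unit_Price": [120.0, 35.0],
})

# Recompute revenue from its parts and count the rows that disagree.
sales["Calculated_Revenue"] = sales["Cost"] + sales["Profit"]
revenue_mismatches = (sales["Calculated_Revenue"] != sales["Revenue"]).sum()
print(revenue_mismatches)

# A 3% increase on every price is a single in-place multiplication.
sales["Unit_Price"] *= 1.03
print(sales["Unit_Price"])
```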

23:33

What else are we going to be able to do? Quick filtering. Let's get all the sales from the

23:37

state of Kentucky, right? So these are all the sales from the state of Kentucky. We can

23:42

get only the average of the sales by this age group, on only revenue, right? So these,

23:51

all these filtering options, are extremely simple to get with Python. In this case, we

23:57

say, give me all the sales from this age group, and also from this country, right,

24:04

and we're gonna get the average revenue from these groups that we are selecting. And again,

24:12

to modify the data, we can make just a few quick modifications, like in this case, we're

24:17

going to say, for all the sales from a country, the revenue, we're going to increase it by

24:23

1.1. I don't know why; I'm just doing it arbitrarily. It's just for me to show you how it works.
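The filters and the conditional update described in this part could be written roughly as below. The rows, the state and country values, and the column names are illustrative assumptions, not the lecture's actual data.

```python
import pandas as pd

# Illustrative rows only.
sales = pd.DataFrame({
    "State": ["Kentucky", "Kentucky", "California", "Essonne", "Essonne"],
    "Country": ["United States", "United States", "United States", "France", "France"],
    "Age_Group": ["Adults (35-64)", "Youth (<25)", "Adults (35-64)",
                  "Adults (35-64)", "Youth (<25)"],
    "Revenue": [100.0, 200.0, 300.0, 400.0, 500.0],
})

# Rows from one state: a boolean condition inside .loc.
kentucky = sales.loc[sales["State"] == "Kentucky"]

# Average revenue of one age group: condition for the rows, column name for the value.
adult_mean = sales.loc[sales["Age_Group"] == "Adults (35-64)", "Revenue"].mean()

# Two conditions are combined with & and each one needs its own parentheses.
youth_in_france = (sales["Age_Group"] == "Youth (<25)") & (sales["Country"] == "France")
youth_france_mean = sales.loc[youth_in_france, "Revenue"].mean()

# The same selection syntax can modify data: scale revenue by 1.1 for one country.
sales.loc[sales["Country"] == "France", "Revenue"] *= 1.1

print(kentucky)
print(adult_mean, youth_france_mean)
print(sales)
```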

24:29

So far, so good. Again, we've done a couple things, you don't need to know about the details,

24:33

we will actually go through that in the NumPy and pandas sections of this tutorial. So

24:38

just for you to have a quick reference of it. There are exercises associated with these

24:44

given lectures. So if you want to pause right now and get into the exercises, that's going

24:50

to be very helpful. We're going to move forward now with the second lecture in which we will

24:56

be using a database, the Sakila database, and we're going to be reading data. Instead of from

25:02

a CSV file, as we did before, we're going to read data now from a database. Reading

25:08

data from a SQL database is as simple as it is from an Excel file or a CSV file, as we

25:14

were doing with our previous example. And once you've read the data, that's what we're going

25:18

to do now, the process is the same. So what we have right here is a query, a SQL query.

25:24

If you don't know about SQL, you can check our courses or other courses online. Basically,

25:29

we're pulling the data from the database. This is one of the advantages of Python:

25:33

there are connectors for pretty much every database provider out there: Oracle,

25:39

Postgres, MySQL, SQL Server, etc. In this particular example, we're going to be using

25:43

MySQL. So once you construct the query, and you pull the data from the database, then

25:49

the process is the same. We have just converted this outside data into a data frame that

25:54

we can use with our Python skills. The first step, as usual, is to check the shape, information and

26:01

description of our data, of our data frame. In this case, we want to, again, understand

26:05

the structure of it. So we want to know how many rows we have 16,000, we want to know

26:10

a little bit more about our rows, we want to know a little bit more about our

26:14

columns, and how many rows how many records we have for each one of them and the type

26:19

of each one of these columns. And we also want to have a better statistical understanding

26:24

of our data. So we do a quick describe, and we have more details about it. If we want

26:30

to focus on individual columns, right, we can just do that. In this case, we're gonna

26:35

focus on film rental rate, right, pretty much how much you pay to rent a film. We're

26:40

gonna see the kind of distribution we have, we can call it distribution, it's pretty much

26:45

a categorical field in this case, but basically, the rentals are divided into three main categories

26:51

of prices: 0.99, 2.99 and 4.99. So that's why this box plot, this pretty much perfect, never

26:58

seen in real life box plot, gives you those prices. Moving forward, we can also

27:05

check very quickly a categorical analysis, understanding the distribution of rentals

27:11

between cities, so we have two cities. And it's pretty much even as you can see right

27:15

here, creating new columns and reshaping the data for further analysis, etc, is relatively

27:22

simple. In this case, we're going to analyze the return on rentals, right, which

27:28

films are going to be more profitable for the company, dividing the rental rate,

27:33

how much we charge, divided by the cost, how much it costs us to acquire the film. So in

27:41

this case, we can see the distribution of that, right. So most rentals are here in the

27:47

beginning. And then we have more profitable rentals, where we're making up to 60% above the rental.

27:55

And we can quickly analyze the mean and the median of it, right, to have a quick idea of all

28:05

that.
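The database workflow in this lecture, from SQL query to derived column, can be sketched as follows. The lecture uses MySQL and the Sakila database; to stay self-contained, this sketch swaps in an in-memory SQLite database with two tiny invented tables, so the schema, rows and column aliases are all assumptions.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MySQL database used in the lecture.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (
    film_id INTEGER PRIMARY KEY, title TEXT,
    rental_rate REAL, replacement_cost REAL
);
CREATE TABLE rental (
    rental_id INTEGER PRIMARY KEY, film_id INTEGER, rental_date TEXT
);
INSERT INTO film VALUES (1, 'Film A', 0.99, 20.99), (2, 'Film B', 4.99, 9.99);
INSERT INTO rental VALUES
    (10, 1, '2005-05-24'), (11, 2, '2005-05-25'), (12, 2, '2005-05-26');
""")

query = """
SELECT rental.rental_id, rental.rental_date,
       film.title AS film_title,
       film.rental_rate AS film_rental_rate,
       film.replacement_cost AS film_replacement_cost
FROM rental
JOIN film ON film.film_id = rental.film_id
ORDER BY rental.rental_id
"""

# read_sql runs the query and returns the result as a DataFrame.
df = pd.read_sql(query, conn, index_col="rental_id", parse_dates=["rental_date"])
conn.close()

# From here on, the workflow is the same as with a CSV.
print(df.shape)
print(df["film_rental_rate"].value_counts())

# Derived column: the rental rate as a percentage of the replacement cost.
df["rental_gain_return"] = df["film_rental_rate"] / df["film_replacement_cost"] * 100
print(df["rental_gain_return"].mean(), df["rental_gain_return"].median())
```

With another database, only the connection object changes; the `pd.read_sql` call and everything after it stay the same.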

28:06

Finally, selection and indexing, if you want to start focusing, if you want to go into

28:10

data, right, you want to zoom in, you want to have a better understanding. So you start

28:14

filtering, in this case, we can filter by customer, but if you want to do it per city,

28:19

if you want to do it per state, if you want to do it per film, per price category, etc.

28:24

It's very simple to filter and zoom in on one particular characteristic of

28:30

your data. So you can perform a more detailed analysis. So in this case, we have all the

28:36

films rented by customers with the last name Hanson, which doesn't mean it's the

28:41

same person. But again, it's very simple to filter that. And here, we can very

28:50

quickly see which ones are the films that have the highest replacement

29:01

cost, right. So basically, what we're doing is we're going to isolate those films that

29:07

have the highest replacement cost. And also we can see right here just for you to have

29:14

an idea, all the films that are in the category PG or PG-13. It's very simple to filter

29:22

that data. So this is the process we usually follow. We import the data, we reshape it

29:26

somehow, create columns. There is an important process of cleaning that we're not highlighting

29:31

in this part of the tutorial; we're going to talk about it in the tutorial itself. There's

29:35

the process of cleaning, then reshaping, creating new columns, combining data and creating visualizations.
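The selections mentioned a moment ago (rentals by a customer's last name, the films with the highest replacement cost, and films rated PG or PG-13) can be sketched like this. All rows and column names are hypothetical.

```python
import pandas as pd

# Hypothetical rental rows.
df = pd.DataFrame({
    "customer_lastname": ["HANSON", "SMITH", "HANSON", "JONES"],
    "film_title": ["Film A", "Film B", "Film C", "Film D"],
    "film_rating": ["PG", "R", "PG-13", "G"],
    "film_replacement_cost": [29.99, 12.99, 29.99, 19.99],
})

# All rentals by customers with a given last name.
hanson = df.loc[df["customer_lastname"] == "HANSON"]

# Titles of the films whose replacement cost equals the maximum.
max_cost = df["film_replacement_cost"].max()
priciest = df.loc[df["film_replacement_cost"] == max_cost, "film_title"].unique()

# isin() keeps rows whose value matches any item in the list.
pg_films = df.loc[df["film_rating"].isin(["PG", "PG-13"])]

print(hanson)
print(priciest)
print(pg_films)
```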

29:40

This is the process, right? We're following it here with our Python skills, but there's a ton

29:46

more to it, as you might imagine, from creating reports to running machine learning processes,

29:53

creating linear regressions, etc. For now, this is just a quick understanding of the

29:59

process we follow. Starting now, we're gonna move forward with more details of each

30:05

one of the individual tools we're going to talk about. We're going to talk about Jupyter

30:10

notebooks. We're going to talk about NumPy. We're going to talk about pandas, we're going

30:14

to talk about Matplotlib, Seaborn, etc. Starting now, right? The first thing we're going to

30:20

see is, what is this whole thing that I've been using this Jupyter Notebook, I want you

30:24

to know too. If you don't have experience with it, I want you to have

30:29

an idea of how it works. And then we're going to move forward to the individual tools, NumPy,

30:33

pandas, etc. Remember, there are exercises also associated with this particular lecture.

30:39

So you can always go back again, and work with them, once you get a better understanding

30:45

of the tools we are using.

30:52

Before we jump into the actual data analysis course, and we start talking about Python,

30:58

pandas, all the tools, we're going to use import files, read data from databases, etc,

31:04

I want to show you the environment that we work with. It's our primary environment, it's

31:11

the tool that we use 99% of the time, and it's Jupyter Notebook. There are going to be different

31:18

terms here, I'm going to be referring to it as Jupyter Notebook. But as you are going

31:22

to see in this part of our tutorial, Jupyter is actually

31:27

a whole ecosystem of tools. And it's a very interesting project. Jupyter is a free and

31:36

open source, again, ecosystem of multiple tools. And primarily, we're gonna talk about

31:43

first, what is a Jupyter Notebook. What you're seeing right here, and you're gonna see live

31:48

in a second, I can actually show it to you is this thing we're going to use. And we are

31:54

also going to talk about JupyterLab. Okay, which is the evolution of the regular Jupyter

32:01

Notebook. So, I think this could be familiar to you already. Usually the

32:07

question is, what's the difference between Jupyter Notebook and JupyterLab? Well, the

32:13

difference is that JupyterLab is just a nicer interface on top of Jupyter notebooks. It's

32:19

not just the plain notebook. This is a notebook that I'm scrolling right now. It's also the

32:24

addition of a tree view, the addition of Git tools, the addition of a command palette,

32:31

and multiple other things. You can open some files with a nice preview in it, etc. So,

32:38

JupyterLab and Jupyter Notebook are similar. JupyterLab is, again, the evolution of

32:43

a Jupyter Notebook. And that's what we're using. Again, Jupyter is a free and open source

32:49

project. So anybody can install it, anybody can download it, it's very simple to get it

32:53

set up on your local computer. In this case, we're using something called Notebooks AI,

32:59

it's a project that provides a Jupyter environment for free in the cloud. So you don't need to

33:05

install things locally, you don't need to keep things in sync on your own hard drive,

33:10

right? That means you don't need to back it up, for example, because it's just a service,

33:15

it all works in the cloud. That said, I want to tell you that we have compiled a

33:23

very quick list of everything we're going to talk about in this part of the tutorial, in this

33:28

list. It's just a thread with multiple hints on how to use Jupyter notebooks.

33:35

So after the video after the course, if you forget some of these concepts, you can always

33:41

go back to it; it's a quick reference for you to have. So let's get started. Why

33:47

do we use a Jupyter Notebook? Because it's an interactive real time environment to produce

33:55

or to explore our data and do our data analysis. It's a tool where you fire

34:02

commands, and it will immediately respond with something back. It's a very interactive

34:07

tool. When we're working with data analysis, and this is the main difference with some

34:13

other tools like, for example, Excel, Tableau, etc., is that we are not constantly looking

34:19

at the data, there is no visual reference, like for example you have in Excel, right?

34:24

So in Excel, you're constantly looking at the data, you have it in front of you, there

34:29

are 100,000 cells and you can scroll and see them. The problem is that that's not scalable,

34:36

right? Nobody can work with 100,000 rows in their mind; we

34:43

will always forget something. So the way we work with Python in data analysis is by always

34:50

having a reference of what our data looks like, but always at the back of our head, and we're

34:58

not constantly looking at it. We're like this person from The Matrix, you know,

35:02

the commander of the Matrix that commands people to get in and out. We're basically

35:10

asking the data, right? Asking questions to the

35:17

data, and having a picture in our mind of how that's going to work, we're not constantly

35:23

looking at it, we're just keeping a reference in the back of our heads of what

35:29

our data looks like. So that's why this tool is very useful. This tool is also useful

35:34

if you're just training your Python skills, or other programming language skills, because

35:39

what you're gonna see is it's just a regular Python interpreter. In this case, I can execute

35:43

some code, let's do one times... actually, one plus three. There we go. And the result

35:49

is four. Right. So this is a fully featured Python interpreter. The good thing

35:56

is that, again, it's going to respond to us pretty much immediately. I type a command

36:00

and I immediately get a response. I can do a print here, "Hello world", and I

36:08

immediately get a response. I can do "Hello world" times three. Again, it's

36:17

a Python interpreter, a fully featured Python interpreter, but it's not being accessed from

36:23

a terminal, which you can do as well. This is the good thing about JupyterLab: you have a terminal,

36:29

I can run Python, right? And I can do two times three, and I get an answer back. But

36:35

this is not convenient for working with our data. We need something a little bit more interactive that

36:40

we can also mix with documents. That's going to be the advantage of a Jupyter notebook.
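
The quick checks typed into the notebook and the terminal in this part can be reproduced as plain Python. This is a minimal sketch; the variable names are only for illustration and are not taken from the original notebook.

```python
# Each of these lines can be typed into a code cell and run on its own.
total = 1 + 3                  # plain arithmetic
print(total)

greeting = "Hello world"
print(greeting)

repeated = greeting * 3        # multiplying a string by an integer repeats it
print(repeated)

product = 2 * 3                # the same kind of expression tried in the terminal
print(product)
```

In a notebook the value of the last expression in a cell is shown automatically, so the `print` calls are only needed when running this as a script.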

36:46

So what's the way we work with Jupyter notebooks? There are a few concepts, very

36:52

important concepts, that we are going to follow. A Jupyter Notebook is just a sequence of multiple

36:59

cells, okay, everything is a cell. And as you can see, when I click on these cells,

37:05

even if it doesn't look like a cell, it is. You will see that this blue

37:11

thing right here, right is pretty much following me because I'm clicking on the cell, and I'm

37:17

selecting that particular cell. Everything happens within a cell, if I want to execute

37:23

some code I can do, again, one plus five, and I get a result back, right?

37:30

that's, that's how it works. So I'm creating a cell, I'm deleting a cell, I create another

37:34

cell again. So everything happens within a cell, and I'm going to tell you how to add

37:39

the cells, how to remove them how to execute code, etc. The interesting thing about a cell

37:45

is that it can either be Python code, or any other programming language you're using; in

37:52

this case it's a Python data analysis course. It can be Python code, as we were doing

37:58

before one plus three, this is Python code, or it can be what we call markdown, okay,

38:06

which is a formatting

38:09

format, right, to create text that will be rendered as a sort of HTML at the output.

38:18

So in this case, this is what the source code of the Markdown looks like. In Markdown, any

38:26

line that starts with this pound sign is going to be a title. In this case, it's going to

38:31

be the largest: the biggest title you can have is just one pound sign, and then you keep adding

38:36

to reduce the size, in this case a level three title. And then you can have, for example,

38:43

this is a quote, this is bold, this is italics, this is a link, right? So let me actually,

38:51

I could copy the cell and open the source code. There we go. So this is a link right

38:57

here, which is created, or rather rendered, as a link. So what Markdown is, is a

39:05

text formatting tool, right, or protocol, we could say. In this case, we just specify,

39:10

us we have some

39:14

rules to use in our text, and Markdown knows how to interpret them and format

39:21

them, right, or return a formatted document from them. So for example, here, we have a green

39:26

divider, which is a picture and we know it's a picture because it starts with an exclamation

39:32

mark. And that's what you're seeing right here. So again, a cell can be either

39:36

Python code, or it can be Markdown. Markdown is an entire thing on its own. You can get

39:45

any tutorial online for free; it's fairly simple to get started with. And it's also

39:50

very important because when you're formatting your reports, right, when you're creating

39:54

your reports, you want them to look pretty. You can use Markdown for that. And what we're

39:59

going to see later is that you can export these notebooks and they will generate PDFs, right?
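
The idea that a notebook is a sequence of cells, each one either Markdown or code, can be sketched with the standard library alone. This assumes the standard notebook file layout (nbformat 4); the Markdown text and the image name `divider.png` are made up for illustration and are not the cell shown in the video.

```python
import json

# Markdown source similar in spirit to the demo cell (illustrative text only).
markdown_source = "\n".join([
    "# Biggest title",                          # one pound sign: largest title
    "### Level three title",                    # more pound signs: smaller title
    "> This is a quote",
    "**This is bold** and *this is italics*",
    "[This is a link](https://jupyter.org)",
    "![This is a picture](divider.png)",        # images start with an exclamation mark
])

notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": markdown_source},
        {
            "cell_type": "code",
            "execution_count": None,   # filled in once the cell has been run
            "metadata": {},
            "outputs": [],
            "source": "1 + 3",
        },
    ],
}

# An .ipynb file is this structure serialized as JSON.
notebook_json = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)
```

Exporting to HTML or PDF is handled by the separate nbconvert tool, for example `jupyter nbconvert --to html` followed by the notebook file name.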

40:05

So this whole thing can be a PDF or an HTML page. So after you're done with your

40:12

data analysis, you can hand over to whoever asked for the analysis, a PDF report, which

40:17

is pretty neat. So moving forward, again, any cell is going to be either markdown, or

40:22

it's going to be code, right here. So this one is code, and you can switch the modes; you

40:27

can say this cell is code, or actually, let's make it Markdown. So right now, even if it's

40:33

code, it doesn't matter; it's not executing anything, because the cell

40:38

is interpreted as Markdown. So now I switch back to code, and now it works. Again, I said,

40:45

Sure. It can also be raw, but to be honest, we don't use raw very often. So again, you

40:50

have this general cell type: this cell we're using, what type is it? Is it code, is

40:56

it Markdown? You can switch it with the selector right here. So a few more

41:04

things that I have to tell you right away, so you can start internalizing them, and it's

41:08

gonna take some time to get used to it. But once you get used to it, you're gonna move

41:13

very fast in your data analysis with Python and Jupyter notebooks. The first thing is, as

41:18

you're seeing right here, every cell has been given an execution number. So the cells

41:26

will be moved, right, they will be moving around, you will be moving them around. But

41:31

you will always know which one executed before another one. And that's because every execution,

41:38

you run will be assigned an execution number. In this case, this is the seventh time I have

41:44

executed code. If I execute code again, for example, I don't know, two times two, this

41:51

is the eighth time that I've executed code. And if I move this thing, right here, if you're

41:58

reading this thing top down, you will not be fooled, right? You will understand this thing.
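
The execution numbers can be thought of as a counter stored on each cell. The sketch below uses hypothetical cell dictionaries (not Jupyter's own objects) to show how the run order is recovered even after a cell has been moved.

```python
# Hypothetical cells, listed in the order they appear on screen (top to bottom).
cells = [
    {"source": "2 * 2", "execution_count": 8},       # moved above the older cell
    {"source": "1 + 5", "execution_count": 7},
    {"source": "# notes", "execution_count": None},  # never executed
]

# Ignore cells that were never run, then sort by the execution counter.
executed = [cell for cell in cells if cell["execution_count"] is not None]
run_order = sorted(executed, key=lambda cell: cell["execution_count"])

print([cell["source"] for cell in run_order])
```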

42:04

It was moved, the cell was moved, the structure of the notebook changed. But this thing was

42:10

executed after this other cell, right? This is eight, and this is seven. So the execution order

42:16

is always preserved. So that's an important thing. Something else: you're seeing me

42:22

change the structure, and do things with the notebook without using any menu. And that's

42:28

because I know how to use keyboard commands, keyboard shortcuts, to run most of these commands.

42:38

So for example, how can I add a new cell? This is a Markdown cell. This is a code

42:44

cell. If I need a cell before this one, what's the command that I'm going to issue

42:50

in order to create the cell? In this case, the command is going to be the letter A. I

42:55

just type A, and there is a new cell created. How can I delete the cell? It's going to be two times

43:03

the D key, two times the D key. And again, this is all in the reference we built. So

43:09

for example, right here, where is it? At some point, you can...

43:13

Here. You can press A to create a new cell, you can press B to create a new

43:19

cell below, as we call it. So let me put something here: this is a reference. And I'm going to

43:26

press the letter B here, and it's going to create a cell below the currently selected one.

43:34

So the selection is here, in blue. Let me delete this one. I hit B, and

43:39

again, it's going to create a cell below the previously selected one. If I hit A, it's

43:45

going to create a cell above the previously selected one. So these are the mnemonics

43:51

of the creation. Something else, and it's very important: why, when I'm in this cell and I

43:57

hit the letter A, literally just the letter A on my keyboard, no Control,

44:04

no Command, just A, does it create a new cell, and it doesn't type A inside the document,

44:11

right? Right here, if I type A, it's adding an actual "a" character in the cell. Why didn't

44:20

that happen before? You're going to notice that when I change what I'm going to call

44:25

a mode in a second, you're going to see that the content of the cell is grayed out. So

44:32

now, when I press the letter A, it actually creates a cell and it's not

44:39

adding content to the cell itself. If I go back again to the other mode, and I'm going

44:45

to give you a better explanation in a second, if I type anything, in this case "a", it's actually

44:52

appended to the text within it. So this is my introduction to cell modes, and this is very

44:59

important. The Jupyter Notebook is a mode-based editor, right? So there are multiple

45:06

editors like that, for example Vim or vi; those are mode-based editors, which basically means the

45:15

behavior of your keyboard will change depending on the mode that's currently activated.

45:23

So for example, in this case, I am in edit mode, because any character that I type will

45:31

be appended to the cell, A, B, C, D, etc. If I switch out of editing mode to what we're

45:41

gonna call command mode, I switch out of that mode. Now the cell is grayed out, and any

45:48

key that I hit, it's gonna do something different associated with that key. So A is going to

45:55

create a new cell above, B is going to create a new cell below, Double D is going to delete

46:02

this cell, right? So that's the important part of modes. That's one of the most important

46:07

parts in order to understand how to work with Jupyter notebooks, the mode that you're currently

46:14

working with, and there are only two modes, so it's fairly simple. This is command mode.

46:19

And we recognize command mode, because this cell is grayed out. When we get into edit

46:26

mode, there is a regular prompt, as you were seeing before, and now the cell

46:32

is actually subject to editing. So that's the way we can recognize it. How are you going

46:38

to switch between modes? In this case, I'm in editing mode. If I'm using my mouse, just pointing,

46:44

I can click outside and I'm gonna get out of edit mode into command mode. If I point

46:53

inside, I'm going back again to edit mode. But let me tell you something right away and

46:59

then say, we don't like to use our mouse, we don't like to point and click, because

47:04

that's very slow. We like to use our keyboard, we move very fast with our keyboard. So how

47:10

are you going to switch from editing mode back to command mode? That's going to

47:17

be with the Escape key. To go from editing to command, hit the Escape key; it's going

47:23

to switch out of editing into command mode. And if you actually want to make modifications

47:28

to the cell, basically, you want to get into edit mode, you're going to hit the return

47:32

key; that's going to get you into edit mode again. So we have tackled multiple things

47:39

already. Again, we said that in Jupyter notebooks we're going to use Python code very quickly

47:44

to interact with our data. We need a real time, you know, "I'm asking, you're answering"

47:49

type of editor. That's what the Jupyter Notebook is. The Jupyter Notebook has these two modes,

47:54

edit mode and command mode. And then the cell, which is pretty much everything; it's the most

48:00

important, fundamental part of the notebook. The cell is going to have two types:

48:05

it can be either code, or it's going to be Markdown, right? And now I'm going to start showing

48:11

you more features. And I'm going to show you, I'm going to show you the most important commands.

48:16

And of course, what the keyboard shortcuts for those commands are, so you can move freely.

48:23

And work with Jupyter Notebooks in the most efficient way. So let's get started.
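
The two modes recapped above can be modeled with a toy class. This is only an illustrative sketch of the idea that the same key does different things depending on the mode; `TinyNotebook` and its `press` method are hypothetical names, and this is not how Jupyter is actually implemented.

```python
class TinyNotebook:
    """Toy model of command mode and edit mode; not Jupyter's real implementation."""

    def __init__(self):
        self.cells = [{"type": "code", "text": ""}]
        self.selected = 0
        self.mode = "command"
        self._pending_d = False   # remembers the first D of a D, D sequence

    def press(self, key):
        if self.mode == "edit":
            if key == "Escape":
                self.mode = "command"
            else:
                # In edit mode every other key is typed into the cell.
                self.cells[self.selected]["text"] += key
            return
        # Command mode: keys act as shortcuts.
        if key == "d":
            if self._pending_d and len(self.cells) > 1:
                del self.cells[self.selected]
                self.selected = min(self.selected, len(self.cells) - 1)
            self._pending_d = not self._pending_d
            return
        self._pending_d = False
        if key == "Enter":
            self.mode = "edit"
        elif key == "a":      # new cell above, which becomes the selected one
            self.cells.insert(self.selected, {"type": "code", "text": ""})
        elif key == "b":      # new cell below, which becomes the selected one
            self.cells.insert(self.selected + 1, {"type": "code", "text": ""})
            self.selected += 1
        elif key == "m":
            self.cells[self.selected]["type"] = "markdown"
        elif key == "y":
            self.cells[self.selected]["type"] = "code"


nb = TinyNotebook()
nb.press("a")         # command mode: creates a cell above
nb.press("Enter")     # switch to edit mode
nb.press("a")         # edit mode: the letter is typed into the cell
nb.press("Escape")    # back to command mode
nb.press("m")         # turn the selected cell into Markdown
print(nb.mode, len(nb.cells), nb.cells[nb.selected])
```

The same key, the letter A, creates a cell in command mode and types a character in edit mode, which is the behavior described in the walkthrough.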

48:30

First of all, one of the most important commands is moving, right? So navigating: it's

48:36

very simple to navigate, just use your arrow keys, up and down, up and down. And you're

48:42

going to move around in your notebook. If you want to switch the type, right, going

48:47

from Markdown to code, etc., you can use this drop-down, or you can press the specific

48:55

keys to switch to either Markdown or Python. So for Markdown, you're gonna, sorry,

49:01

hit the M key, that's going to make it markdown. For Python, you're going to hit the Y key,

49:09

that's going to make it Python code. So M and Y are going to switch you back and forth.

49:14

Keep an eye on the selector: if you hit Y, M, Y, M, it's going to switch it from code

49:23

to markdown.

49:24

What else? How can you execute code? Once you are typing code and you want to

49:30

execute it, there are two types of executions you can run. The first one is going to keep

49:36

the selection: the currently selected and active cell is going to stay the same, in the place you are,

49:42

and that's going to be by keeping the Ctrl key pressed and hitting Return. That's going

49:49

to run the code in the cell, and the prompt, or the currently selected cell, will remain

49:55

the same. So I've run this thing a couple of times already, and the selection, or the

50:01

currently highlighted cell, stays the same. I can change that by using Shift+Return. So

50:07

I keep the Shift key pressed and I hit Return, and it's going to execute the code. But it will

50:15

immediately switch the prompt or the currently selected cell to the following one. And that's

50:21

useful when you have multiple cells, you want to execute one after the other. So you can

50:26

keep hitting Shift+Return, Return, Return, Return, and it keeps you moving, right, from

50:32

top to bottom. Alright, so Ctrl+Return or Shift+Return: the execution is the

50:38

same; the difference is just what happens to the currently selected cell. We already saw how

50:44

to create cells: with the A key we create a cell above, with the B key we create a cell below.

50:52

To delete a cell, you're going to hit the D key, the D key two times one after the other

50:59

very quickly. DD is going to delete the cell. What happens if you made a mistake,

51:05

and you want to undo the previously issued commands? Well, the mnemonic here is going

51:10

to be Ctrl+Z. You know, that's the mnemonic, it's not the command. The mnemonic is Ctrl+Z, but

51:15

you only need to press the Z key; you don't need Ctrl+Z, and it's gonna undo

51:22

whatever you did in your previous command. Alright, so A, B, DD for deletion, and then Z to

51:29

undo. All the commands we're seeing have a counterpart in this toolbar

51:37

or in this command palette. So for example, right here, I could run this code by pressing

51:43

this play button right here. You see it, the execution count is changing. There are multiple

51:50

ones and you can search them if you don't remember right here. And the neat thing about

51:55

it is that you actually have the shortcuts to issue the same command. So let's say you

52:03

don't remember how to execute and stay in the same cell, or move, whatever. You can search

52:10

for "run", and you can see what the name is, and what the actual command is that you have,

52:15

right there, right? So, at least for your first day or month working with Jupyter

52:22

notebooks, you will usually need to go back to these commands, and try to remember the

52:26

quick shortcuts. And with time and practice, those will just come naturally. So moving

52:34

forward, what else, we have a few other commands, in this case, we have something to cut and

52:41

paste the cell somewhere else, just cut and paste, that's going to be x to cut it, or

52:47

you can also use the scissors here, x to cut it. And to paste it, you can use this button

52:55

or actually this button, sorry, or you can just press the V key. V is going to paste it

53:01

wherever you're currently standing. So I'm going to cut it, I'm going to remove it

53:05

from here, and I'm going to paste it below there. Or you can also copy it. So instead

53:10

of cutting it, you can press the C key; that's going to cut, sorry, copy. And then you can

53:16

actually say where you want to paste it. In this case, we have duplicated the same cell.

53:21

And look, something interesting here: the execution count remains the same. So again,

53:26

there is like this unique identifier for your executions, which means that you know, when

53:32

and where something was executed. Moving forward, we're going to use some code here, we're going

53:38

to import some tools, so you can see some characteristics or advantages of Jupyter notebooks and why

53:45

we use it so often compared to, for example, the regular Python terminal.

53:51

One very important thing is visualizations. We, as data analysts, are constantly getting

53:56

data and expressing it through images, or animations, right? But most commonly, images.

54:04

The main library we use in Python is matplotlib. And matplotlib is a first class citizen

54:12

in Jupyter notebooks, which means that you can just run the figures from matplotlib.

54:17

And they will just show up directly in your notebook without the need of doing anything

54:23

crazy. So can you imagine showing this beautiful picture in this terminal? That's

54:30

that's very hard, of course. So again, that's one of the main advantages of a Jupyter Notebook.
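
A minimal matplotlib sketch of this point follows. The price values are made up for illustration, and the `Agg` backend line is only there so the same code also runs outside a notebook; inside Jupyter the figure is rendered directly under the cell.

```python
import io

import matplotlib
matplotlib.use("Agg")   # off-screen backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt

hours = list(range(24))
prices = [9600 - 25 * hour for hour in hours]   # made-up, steadily falling prices

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(hours, prices)
ax.set_xlabel("hour")
ax.set_ylabel("price (USD)")
ax.set_title("Illustrative price series")

# A static figure is just an image: here it is written to memory as a PNG.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```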

54:36

Moving forward, what we're going to do first is get some

54:42

data from a public API. So there is this Cryptowatch service, which basically has

54:49

crypto information, Bitcoin, ether, etc. And you can check the docs, we can actually open

54:55

them. It's gonna give you market data, etc. You can check the docs and how you can get,

55:01

in this case it's BTC, Bitcoin, to euro. Let's actually see if we can change it to USD, the USD price.

55:08

There we go. So this is the current price of Bitcoin: result, price, etc. And we're

55:17

actually going to do markets. Do we have Kraken BTC USD? Let's actually issue

55:25

the same query we're going to use, which is open, high, low, close: OHLC. And don't worry,

55:32

this looks ugly, but this is actually what we're using. There's a list of results, right,

55:36

for all the different candles, as we call them. We get the idea of the open price, close price,

55:43

high price and low price. So we're going to issue those, we're going to issue these requests

55:49

to the internet, to this API, the Cryptowatch API, so you can get information

55:54

about Bitcoin to do some analysis. I say Bitcoin, but you can actually get it for ether,

55:59

for other different types of cryptocurrencies. So the function we're defining

56:07

is get_historic_price; it's a very simple function that uses pandas, which is one

56:13

of the most important tools we're going to be using in this course, and the requests library,

56:17

which is also a very famous library for Python. And what we're going to do here is we're going

56:24

to get Bitcoin and ether prices for an entire week, right? So from

56:32

February sorry, February 25, up to today, right? So depending on when I'm shooting this

56:37

video. And we're gonna get a quick reference of the prices: open, high, low, close. So in

56:43

this case, we have information per hour. Okay, so this is something you can actually

56:50

change in the request you're making to the API: you can adjust the candle

56:57

size. In this case, we're keeping it per hour. So we have by-the-hour information about Bitcoin,

57:04

in this particular market, which is Bitstamp. Here, we have this day, this day, and this

57:09

hour, right, 1 AM in the morning: open, close, highest price and lowest price, and

57:17

also the volume that was operated within this time period. And we're gonna immediately plot

57:24

the price. So we see that in this time, which I think is an entire day, the price dropped;

57:35

it's actually a few days, like an entire week. The price dropped from $9,600 to below, right,

57:44

9,000. So it was a pretty significant drop. Let's see how ether performs. We have here

57:51

all the records, and how it moved. So this is what I tell you that when you're doing

57:58

data analysis with a programming tool like Python or R, you're not constantly looking at the

58:04

data. So what I'm showing you right here are the first five records. We actually have...
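
The habit described here, peeking at a few records and asking for summary numbers instead of scrolling, looks roughly like this in pandas. The data below is synthetic and the column name `ClosePrice` is an assumption for illustration; it is not the response of the API used in the video.

```python
import numpy as np
import pandas as pd

# Synthetic hourly close prices: 169 hourly timestamps span exactly seven days.
index = pd.date_range("2020-02-25", periods=169, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)
close = 9600 + rng.normal(0, 20, size=len(index)).cumsum()
btc = pd.DataFrame({"ClosePrice": close}, index=index)

first_rows = btc.head()      # first five records
last_rows = btc.tail()       # last five records
n_rows, n_cols = btc.shape   # how many records and columns
span_days = (btc.index[-1] - btc.index[0]).days
summary = btc["ClosePrice"].describe()   # count, mean, std, min, quartiles, max

print(n_rows, span_days)
print(summary[["mean", "std", "50%"]])
```

The same calls work unchanged on millions of rows, which is the point being made: the shape and the statistics are the reference, not the raw table.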

58:09

Let's do that. We actually have 169 records, okay, 169 records. And this is per hour. So

58:24

if we do 169 hours divided by 24 hours, we have seven days, right? So we have seven days

58:31

of data, 169 records, and then we have a little bit more information. I'm

58:36

gonna get to that in a second. But basically, this is what I tell you: 169 records, to be

58:42

honest, is something you could be seeing in a spreadsheet. But I want you to get the concept

58:47

here. We're not just looking at our data, we have it in our brain. We know what it is,

58:55

we know what shape it has. We know how many records it has, we know information like the standard

59:01

deviation, what's the mean of that, right? So for the close price, what's the standard deviation,

59:06

right? What's the the average, the mean, the median, right? So we have information about

59:11

our data. It's sitting behind, you know, in our brain, but we're not looking at it. And

59:16

that's because, well, this is a very simple example with only 169 records, but in real life, we're

59:23

dealing with millions of records, so it's impossible to see it. Have you ever tried

59:27

scrolling in an Excel spreadsheet through millions of records? It's crazy. It's not

59:33

possible. It's just unusable. So that's again, the way we work with data analysis in Python

59:39

and R and other tools. We don't constantly keep an eye on the data. We know the shape

59:46

of it. And we just have these quick references, like show me the first five records, show me

59:50

the last five records, show me this chunk here down there, but that's it. So again,

59:57

these are the visualizations we're creating on Jupyter notebooks. Again, it's just very

60:01

simple to get the plot done right there. We're going to also see in Jupyter notebooks, a

60:08

few other pretty neat things. The first one is that we can use another library, which

60:12

is called Bokeh. And the difference is that Bokeh will have charts that are interactive.

60:18

So I'm moving it right here, it has JavaScript, and it's interactive. If you look back again

60:23

to what we had here. This is a static chart, it's just a PNG, you can actually export it

60:29

as a PNG, there is nothing you can do with it. With Bokeh, it's actually a dynamic,

60:37

dynamically generated interactive chart. So I can zoom in on a piece of data, right,

60:42

I can move it around, I can just do whatever I want with it. I can refresh and reset it

60:49

to whatever it was. And it's a dynamically generated chart. The difference is, if you're

60:54

working with data dynamically in your analysis, sort of in your exploration, then Bokeh is

61:00

a good tool because you can zoom in, right? So what's going on here? Let's look

61:06

at these things. If we're working on a mean-reverting strategy, for example, we see a

61:11

high volume, we see a low volume, the mean is going to be here. So we see some mean reversion

61:16

in there. It's very interesting. If you need to, for example, export a PDF, export a huge

61:23

HTML file, then static images are going to be probably better. So that's the difference

61:30

between them. To be honest, matplotlib is a lot more popular than Bokeh. We use matplotlib

61:36

a lot more because we actually have a few other tools, like seaborn, that make

61:41

it very easy to access and use. What else? Jupyter Notebooks work very well with

61:48

Excel files, with all the file formats: CSVs, XML, Excel files, etc. And that's also

61:57

the capability of JupyterLab. So JupyterLab can immediately interpret and open CSV

62:05

files; it can open, with some extensions, XLS files, XML files, JSON files; it has a very nice

62:11

editor and tree view for JSON. So the JupyterLab environment combined with Python Jupyter

62:17

Notebooks will give you a good idea of Jupyter in general. So in this case, we have just

62:23

saved... I'm not going to execute this, you can try it out. But you can execute and run what

62:29

we have just done and export this crypto file as an Excel spreadsheet. So you can just click

62:37

on here and you can basically download it. You're going to open it and see what it has.

62:43

There we go.

62:51

So let me reduce the size of this thing. There we go. So you can see that we have just exported

63:00

two sheets, in this case Bitcoin and ether, right? With the data that we had

63:07

in our previous notebook, right? So that's all, again, the combination of Jupyter, the

63:13

combination of Python and the combination of JupyterLab, which are tools that just work

63:17

very well together. So we're gonna keep moving forward, in this video, this tutorial, I'm

63:23

talking about more data analysis, in general, we're going to talk about Python, we're going

63:27

to do a quick review of Python. Maybe when I was running these commands, you

63:32

felt a little bit lost about what I was doing with it. So we're gonna do a quick review

63:37

of Python and all that. And of course, we're gonna get directly deep into data analysis

63:42

with pandas and some other tools. I want to tell you something before we finish this

63:47

chapter. And it's that it's very important for you to get familiar with data analysis,

63:52

with sorry, with Jupyter notebooks, because you're going to spend a ton of time with it.

63:56

And it's a very, very valuable skill that you can get if you get proficient, comfortable

64:03

with Jupyter notebooks, you know, like creating cells, deleting cells, cutting, pasting, moving

64:09

things around, etc. For you to generate reports Jupyter notebooks are going to be excellent.

64:16

So keep an eye on it. Keep practicing; it's the only way to learn to do the analysis.

64:21

Keep practicing it, keep the command palette open. So you can always check it if you forgot:

64:26

how can I cut a cell? Well, here it is, command X, right? It's gonna just tell

64:32

you upfront, keep an eye on it, keep working with it and practicing it. And once you get

64:38

familiar with Jupyter notebooks, you're going to move very, very fast. Remember, you have

64:44

this nice list of compiled commands and references you can always access if you need extra help.

64:52

And we're going to keep moving forward now with more data analysis.

65:01

Now it's time to talk about NumPy, one of the most important libraries in the Python

65:07

ecosystem for data processing. In general, it's the one that got pretty much everything

65:14

started. And if you trace NumPy back, it's a very old, well-developed library, 20 years old,

65:23

maybe. It's an extremely popular library, well, important library; I'm not gonna say popular.

65:28

And I'm going to explain why in just a second. But it's a very, very important library in

65:35

the Python ecosystem for data processing. NumPy is a library that will, well, it's

65:41

a numeric computing library. It's just to process numbers, to calculate things with numbers.

65:48

And that's it. So NumPy has a very limited scope, we could say, and this is, on purpose,

65:55

a very simple library, when you look at it, and when you look at the API, which is very

66:01

consistent, by the way. Why is NumPy so important? Well, in Python, numeric processing, in just

66:09

pure Python, processing numbers is very slow. Okay, Python is not slow in itself compared

66:15

to other programming languages. But when you go down, right to very deep levels of performance,

66:23

when you are processing large amounts of data, right, and you need to squeeze, even, you

66:29

know, that tiny byte at the end of your pipeline, you need to squeeze every FLOP from your

66:36

CPU, then Python is not the right tool for it, not Python as a pure Python programming

66:42

language. NumPy is actually solving that. NumPy is a very efficient numeric processing library

66:50

that sits on top of Python, and gives you the same API as you're going to work with

66:56

just writing Python code, as you're seeing here. But at a low level, it's going to be using

67:03

high-performance numeric computations and arrays of numbers and representations,

67:12

etc. That's it. That's it for NumPy. It's extremely simple from an API perspective,

67:18

but it's extremely powerful. Why did I say that it's not so popular, but yes, it's so

67:24

important? Well, because in reality, we don't usually employ NumPy directly. You will not

67:32

see yourself using NumPy so often, but you will be using other tools in Python, like

67:38

for example, pandas, and matplotlib. And they are all working on top of NumPy. They're all

67:43

relying on NumPy for their numeric processing. So that's why NumPy is so important.

67:52

So, at least for this part of the tutorial, NumPy, I'm going to divide it into two

67:59

pieces. The first one is going to be a very detailed, low-level explanation of how NumPy

68:05

works, why we need to use NumPy, and what the differences are between different byte

68:12

sizes for numbers. We're going to talk about integers, but this is going to apply to decimals

68:17

and other data types also, and why you need a very low-level, optimized way to store numbers. Now you

68:26

can, you can skip this part, you're going to find in the description of this tutorial,

68:30

the precise moment in time. So you can just skip and go directly to the second part, which

68:35

is when we actually start using NumPy. And I show you how to create arrays, how to make

68:40

computations, etc. So for now, we're going to divide it in two parts, we're going to

68:44

start first with the low-level explanation, which you can skip if you want, because

68:50

it's not going to be crucial; you can easily use NumPy without it. We have found that

68:56

for some of our students, it's important to understand the low-level basics of it,

69:00

especially if you don't have a computer science background. It can help you, you know,

69:06

raise your level of understanding of computers and how to make your computations

69:12

more efficient. But don't worry: if you don't want to go through that now, it's

69:17

fine. You can skip this part and come back later, at any other moment. You

69:22

don't need this to use NumPy; seriously, you don't need it. It's going to be beneficial,

69:28

but you don't absolutely need it, so you can just skip it and come back later. So with that said, let's

69:33

actually go into a deep

69:37

understanding and explanation of how computers store integers, numbers in memory and what

69:43

are bytes bits etc. In order to understand why NumPy is so important. We have to go back

69:48

again to the basics. What are numbers, how they are represented in computers, etc. As

69:56

you might know already, a computer can only process ones and zeros, bits. It can't process

70:02

numbers, or decimal numbers, to be more correct, sorry; it can only process ones and

70:08

zeros. A computer is just always storing and processing ones and zeros. It's a binary machine.

70:17

Your memory, the RAM, the random access memory in your computer, is the

70:23

central place where your computer is storing the data that it's actively processing, right.

70:30

So you have, for example, a hard drive, which stores long-term data. But the computer can't

70:37

process data directly from your hard drive. Before doing that, it has to load it into

70:42

your RAM, into your random access memory. Again, usually a computer is going to have,

70:50

what, eight gigabytes, 16, 32? It doesn't matter. Let's say you have eight gigabytes of memory;

70:56

that at some point is going to translate to a number of bits that your computer can store.
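
To make the gigabytes-to-bits translation concrete, here is a minimal sketch in plain Python. It assumes that "8 gigabytes" means 8 GiB (8 × 1024³ bytes); the variable names are only illustrative.

```python
# Rough capacity of 8 GB of RAM, expressed in bits.
# Assumption: "8 gigabytes" is read as 8 GiB, i.e. 8 * 1024**3 bytes.
BITS_PER_BYTE = 8

ram_bytes = 8 * 1024**3
ram_bits = ram_bytes * BITS_PER_BYTE

print(f"{ram_bytes:,} bytes")  # 8,589,934,592 bytes
print(f"{ram_bits:,} bits")    # 68,719,476,736 bits
```

So a machine like this has on the order of 68 billion bits to work with, which sounds like a lot until every value in a large dataset takes far more bits than it needs.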

71:03

So if you follow what we have right here, you can see the total number of

71:10

bits available in a regular computer with eight gigabytes of memory. Why is this important?

71:18

Because again, the objective of this tutorial, or the objective of this part, at

71:23

least, is to explain how you can squeeze everything out of every single bit in your computer,

71:31

right? How can you make it more efficient for your numeric processing, both in storage,

71:37

using less memory for the same data, and also in speed, making your calculations faster?

71:45

So in terms of physical storage, or actually memory storage, right, how can we

71:51

optimize to use the least amount of memory for a given problem? That's the

71:58

objective. To optimize it, we need to understand how numbers, decimals, or sorry, integers in the

72:04

decimal numeric system, are represented in binary, right. So this table right here shows

72:11

you the first nine numbers, 0, 1, 2, 3, 4, etc., and their binary representation in your computer.

72:21

Let's say you want to store the age of a user, which is 32. You can't store

72:28

32 in here, because your computer again doesn't know about decimals, it only knows about binary.
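
As a small illustration of that decimal-to-binary correspondence, Python's built-in `format` and `int` can do the conversion in both directions. This is just a sketch; the variable names are made up for the example.

```python
# Decimal 32 written with only ones and zeros, and converted back again.
age = 32
binary_text = format(age, "b")    # binary digits as a string
round_trip = int(binary_text, 2)  # parse the string as a base-2 number

print(binary_text)  # 100000
print(round_trip)   # 32

# A table like the one on screen: the numbers 0 to 8 next to their binary form.
for number in range(9):
    print(number, format(number, "04b"))
```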

72:36

To do that, you will need to find the correct representation in ones and zeros of 32,

72:42

which is not this one, to be honest; I'm just making it up as we

72:48

go. But again, you need to know the correct binary representation of this number in order

72:56

to store that data. How can you know that? Well, there is this whole binary arithmetic,

73:03

right? There's a whole part of math dedicated to binary. It doesn't matter for now, but I'm

73:11

going to just drive the intuition of it so you can have a better understanding. And if

73:16

you're interested, you can dig deeper later. So basically, any decimal number needs to

73:23

be stored in a binary format, which of course only takes ones and zeros. And what we usually

73:28

do is just keep increasing zeros and ones in positions, right. So in this case, we have

73:34

the number zero, the number one, that's fine. Once we need to store the number two, we need

73:39

now to increase the number of positions right here, right, so we need

73:44

to go from 1 to 10. When we go to the number three, it's 11, and then, when we need

73:51

to go to number four, we need to increase positions again, because we only have two

73:56

symbols, zero and one. So as you're seeing right here, up to this level, we need only

74:01

one position. Up to this level, we need two positions. This level, we need three positions.

74:11

And this level is going to need four positions. And you'll see how the size of each of these

74:19

is increasing. And it has

74:21

an explanation behind it that we're going to see in a second. So the question is how

74:27

many decimal numbers you can store with n bytes and bits, sorry, bits. So let's say

74:34

we have n bits. And let's say n is equal to three. That means that you only have three

74:41

positions, right, three bits. How many total decimal numbers can you store with it? Well,

74:48

we can store 000, which is zero, we can store 001, we can store 010, and so

74:58

on, right? So with this size, we can store up to here, we can store up to the number seven:

75:08

111 is equal to seven. Once we've filled all the positions, right, we've reached the

75:17

limit, right? The largest number, the largest binary for this amount of symbols or positions.

75:24

That's the number seven. So this means that with three bits, you can go from zero,

75:31

from 000 here, up to 111. In total, you can store eight decimal numbers; here

75:42

you have eight decimal numbers, 0, 1, 2, 3, 4, 5, 6, 7: eight total decimal numbers, from zero to seven.
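
The three-bit counting exercise can be reproduced in a few lines of Python. This is only a sketch of the idea; `n_bits` and `patterns` are names chosen for the example.

```python
# Enumerate every pattern that fits in n_bits positions.
n_bits = 3
patterns = [format(value, f"0{n_bits}b") for value in range(2 ** n_bits)]

print(patterns)              # ['000', '001', '010', '011', '100', '101', '110', '111']
print(len(patterns))         # 8 distinct values
print(int(patterns[-1], 2))  # 7, the largest value that fits in three bits
```

Eight patterns, covering the numbers zero through seven, which matches the count made on the drawing.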

75:51

The

75:52

equation behind this, if you want it, is as follows. If you have n equals three bits, in order

76:03

to know how many decimal numbers you can store with those bits, it's two to the power of

76:10

n; in this case, the total is eight. So if we go back to our drawings, we said that with

76:17

three bits, we can store up to eight decimal numbers. And again, the equation is: two to

76:25

the power of n is going to give you how many decimal numbers you can store. You can always do

76:32

the opposite process using a logarithm and get how many bits you're going to need

76:40

to store a given decimal number. I'm not going to get into that, so we don't complicate

76:45

it. But again, the math behind it is extremely simple. So now, moving forward, we're going

76:53

to delete this whole thing. Moving forward. Why is this important? When you're working

76:59

with your data, when you're doing your data analysis, you know what

77:04

type of data you're working with. They're all numbers, but numbers usually have

77:12

a connotation behind them, right? So let's say that you have here a table of people,

77:20

and you have the total net worth of the person. And also you have the age of the person. The

77:30

age is a value that will range between what zero, right? Just born

77:36

to,

77:37

I don't know, 120, we can say I don't know, what's the maximum age registered right now,

77:44

the oldest human being but zero to 120, it seems, seems reasonable. In your other column

77:52

net worth for this person, the range is completely different. We can go from something

77:58

like $0 up to, I don't know, $60 billion, I think Mark Zuckerberg or Jeff Bezos or one

78:07

of those. So we go from zero to 60 billion in this case, if these are dollars. What happens

78:12

if this is a highly devalued currency? We would have to go to trillions, right? So these

78:20

two, even though they're just plain numbers, and we can say they are integers, even though

78:25

these are plain numbers, integers, they have a different connotation, and they

78:31

will have different requirements in terms of storage size, right? So if we say that

78:37

age goes from zero to 120, we don't need so many bits to store it in memory,

78:47

right? We can do the math, actually: how many bits do we need in order to store 120?

78:56

What did we say, 120, right? Well, if you do the math, you will see that two to the

79:04

power of seven is 128. So if you have seven bits here, seven

79:16

bits, you're going to store from zero up to 1111111, which is actually 127. Okay, this

79:32

number, all ones, seven ones in binary, is equal to 127 in decimal. In total, we can

79:40

store 128 numbers, because zero counts, from 0 up to 127. So that means that for our column,

79:53

age, here, the size of the memory we need to use is going to be seven

80:02

bits per user, or customer, or person, whatever. What about this number right here, if we

80:13

have to go up to a couple billions? Well, in that case, the number is a little bit more

80:21

complicated. We're going to need, for example, we can say 64, or 32. It's actually 64,

80:30

probably, but with 32 bits, right, you can store from zero up to this value.
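
To check these bit counts without doing the logarithm by hand, here is a hedged sketch. The helper `bits_needed` is hypothetical (it is not part of the tutorial's notebook) and only handles non-negative integers; the 60 billion figure is the rough net worth quoted above.

```python
import math

def bits_needed(max_value: int) -> int:
    """Smallest number of bits that can hold every integer from 0 to max_value."""
    return max(1, max_value.bit_length())

age_bits = bits_needed(120)                   # ages from 0 to 120
net_worth_bits = bits_needed(60_000_000_000)  # up to roughly 60 billion
largest_32_bit = 2**32 - 1                    # biggest unsigned value in 32 bits

print(age_bits)        # 7, because 2**7 = 128 covers 0..127
print(net_worth_bits)  # 36
print(largest_32_bit)  # 4294967295, so 32 bits is not enough for 60 billion

# The same answer for the age column, using the logarithm mentioned earlier.
print(math.floor(math.log2(120)) + 1)  # 7
```

Since 36 bits is not a standard integer width, in practice the next common size up, 64 bits, would be the one to pick for the net worth column.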

80:38

So again, I don't know about the currency we're using or anything, so we can assume.

80:42

But here, we need 32

80:46

bits in order to store that. And now you can do the math, how many how much memory space

80:54

you need in order to process this data? How many records do you have? If you have only

80:59

1000 records, that's not significant. You can use whatever; you can use 64 bits here

81:05

to store the age, and you're not going to have a problem. But what happens if you have

81:09

more? What happens if you have the entire population of the earth? You have

81:14

7 billion records here; then every bit that you're saving in these

81:23

columns is going to be important, because it's going to take a ton of data. And of course,

81:27

you have a ton more columns, right? What happens if you are processing trillions of records

81:34

from financial transactions, right? You want to be very efficient

81:42

and optimize every single bit you can. And that means, again, selecting the correct

81:47

number of bits for the columns you're currently processing. So, so far, so good;

81:56

again, we understand that the number in decimal that we need to store has a correspondence

82:02

with bits, right? Eight bits is one byte. And the more we can optimize that, the less

82:09

memory we're going to use for our applications. Where does NumPy come into play? Why are we

82:16

talking about data in these NumPy lessons? Well, the idea is that NumPy

82:23

is a library that has very advanced numeric processing, and it will let

82:30

you select the number of bits you want to take for an integer. Even more, let's say

82:37

you forget about NumPy; you want to process this thing with pure Python. So you write x equals

82:43

five, for example. Working with Python, you create a number; we're storing the age as a five.

82:50

How many bytes, how many bits, do you think this simple variable takes in memory? How many?

82:58

Well, in reality, even though we think it should be around, what, three bits,

83:06

eight, let's say, to be simplistic, in reality, for Python, this is going to take

83:14

around 20 bytes. Okay, so we are wasting a ton of memory in order to store this number.
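
One way to see this for yourself is `sys.getsizeof` from the standard library. A minimal sketch follows; the exact figure is an assumption that depends on the interpreter and platform (on a typical 64-bit CPython build it is commonly 28 bytes rather than 20).

```python
import sys

x = 5  # the "age" value from the example

size_in_bytes = sys.getsizeof(x)  # size of the whole Python int object
size_in_bits = size_in_bytes * 8

print(size_in_bytes)  # interpreter dependent; often 28 on 64-bit CPython
print(size_in_bits)   # far more than the 3 bits the value 5 strictly needs
```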

83:22

And why is that? Well, because Python is a high level, object oriented programming language.

83:29

The reasoning behind it is that Python is simple to write, simple also to read

83:35

and code on top of. But again, in order to create that simplicity, it

83:43

wraps all the numbers in objects, which have all these attributes that, if you know advanced

83:50

Python, you're going to recognize, and that are not necessary. So this is taking a ton of

83:56

memory. And a regular, very simple number in Python ends up consuming 100 times more

84:03

memory than it should consume. And this is where NumPy comes into play. In NumPy, you

84:11

can create numbers where you can control the size in terms of bits. You

84:17

can say, I want to create a number that has only eight bits, and that's it. Then you're

84:22

going to create a one-byte integer, and you're very precise about how much memory it takes.

84:27

You can create a number that actually needs a little bit more space. We're going

84:31

to do np.int, and if we press Tab here, you're going to get autocompletion: 16 bit,

84:41

or 8, or 32, or 64, right. So we can actually be a lot more precise in the number of bits

84:50

that we need. And this is extremely important for again, our high level processing. On top

84:57

of that, NumPy is an array processing library. NumPy is 99% about processing arrays, constantly

85:06

processing arrays. The data structures we have in Python, the built-in data structures we

85:11

have in Python, for example the list and the dictionary, are not optimized for high-performance computing.

85:18

So if you have a list of numbers in Python, let's say you have, I don't

85:24

know, l equals [3, 2, 4], right, you have three numbers in your list. In Python, it's

85:36

not guaranteed that the list is going to contain all the numbers, 3,

85:42

2, 4, in contiguous positions; it might put them in separate positions in

85:49

memory. On top of that, you can't rely on advanced CPU directives and instructions for

85:58

processing matrices, because Python, again, is wrapping these things in

86:04

objects. So there is no access to these high-performance, low-level instructions. With NumPy,

86:12

that changes, because when you create an array in NumPy, you say, I want to create an array

86:17

of three numbers, and they are all int8. And forget about this: these

86:24

are not bytes; I'm using this drawing as a general representation of memory. So

86:30

in that case, in NumPy, when you create this three-element int8 array, it's going

86:39

to create those three elements in contiguous positions in memory, 3, 2, 4, and they

86:45

are only going to take the amount of memory that we said they were going to take. And

86:51

on top of that, we can rely on a bunch of very efficient low-level instructions from

86:57

your CPU for matrix calculations. This is something that's a little bit more advanced.
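
The three-element int8 array described here can be sketched as follows, assuming NumPy is installed and imported as `np`. The values 3, 2, 4 come from the drawing; the variable names are illustrative.

```python
import numpy as np

# Three numbers stored as 8-bit integers, one byte each.
small = np.array([3, 2, 4], dtype=np.int8)

print(small.itemsize)               # 1 byte per element
print(small.nbytes)                 # 3 bytes for the whole data buffer
print(small.flags["C_CONTIGUOUS"])  # True: elements sit next to each other

# The same numbers as 64-bit integers take eight times the space.
wide = small.astype(np.int64)
print(wide.nbytes)                  # 24 bytes

# int8 is signed, so its range is -128..127 (uint8 would cover 0..255).
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127
```

`nbytes` only counts the raw data buffer; the array object itself carries a small fixed overhead, which stops mattering once the array holds more than a handful of elements.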

87:04

And it's something that has exploded in the past 10 years: CPUs with richer instruction

87:10

sets, and the same thing for GPUs, you might have heard, especially with machine learning

87:15

and all that. We need fast array processing when we are storing features and weights and

87:21

all that; that's a topic for a different story. But again, the idea is that we need,

87:29

sorry, we can use all these important and very efficient low-level directives from

87:36

our CPU, which makes our computations a lot faster. So again, as a recap, you don't need

87:43

to know all this to work with NumPy. That's the first thing. Second, you don't need to

87:48

get extremely, extremely conscious about all the numbers you use. At the beginning, you're

87:54

just going to use NumPy as it is, and you're going to use just the default types that it

87:58

picks, int8, int16 or int32, int64; that's fine.

88:04

But then, when you get into bottlenecks, when you're working with larger amounts

88:11

of data, then you might need to get into the details of the size

88:18

of the integers that you're using. And this all applies to floats; I'm just using integers

88:24

because it's simpler. But it all applies to floats. So again, with NumPy, the main advantage

88:31

is that it has built-in, very fast arrays that take advantage of CPU instructions

88:44

for matrices and arrays and all that. And it also has very efficient representations

88:49

of numbers, right, which are not the regular objects of Python. Again, recap: you don't need all

88:56

this. If you want to get into more details, I recommend you get a little bit more understanding

89:01

of binary arithmetic and computer architecture, how numbers are

89:08

stored in memory, etc., especially for floats and all that; those are completely different representations.

89:14

So with that said, we're going to see now how we actually use NumPy without worrying

89:19

so much about the low-level details. And that's the beauty of NumPy. So we have already done

89:24

our low-level explanation of binary arithmetic, why NumPy is important, and all that. If

89:30

you skipped it, that's perfectly fine; you will not need it. The reason to include it

89:35

was that if you're in this tutorial, you're probably looking for fast and efficient options

89:42

to process large volumes of data. And that's when all those things come into play. So,

89:47

without further ado, let's just get started and start using NumPy as a library. So again,

89:53

as I told you, NumPy is a very simple library for array processing and numeric

90:00

processing. It has a few objects: numbers, floats, integers, arrays, and that's it. And

90:07

it's very simple, but it's extremely powerful. So, in NumPy, we're going to create these

90:11

arrays, which look a lot like Python lists, but there are going to be significant differences.

90:16

The first one is, of course, performance. If you go to the previous part, when we were

90:20

discussing the binary representation of an array of numbers, in Python and NumPy, you're

90:26

going to see the difference between them. So in this case, we're creating two arrays.

90:30

And you will see right that the creation is extremely simple. The only thing that changes

90:36

is that we need to add this np.array, and then we're passing, in this case, a list of numbers.
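
A sketch of what creating the two arrays might look like, together with the element access covered in this part of the lesson. The exact values are an assumption, since the notebook cells are not part of the transcript; `np` is the conventional alias for NumPy.

```python
import numpy as np

# Two arrays built from plain Python lists (values assumed for illustration).
a = np.array([1, 2, 3, 4])
b = np.array([0, 0.5, 1, 1.5, 2])

# Single elements and slices behave like a Python list (zero-indexed).
print(a[0], a[1])  # 1 2
print(a[1:3])      # [2 3]  upper limit excluded
print(a[-1])       # 4      negative indexing
print(a[::2])      # [1 3]  step of two

# Multi-indexing: pass a list of positions and get a new array back.
picked = b[[0, 2, -1]]
print(picked)        # [0. 1. 2.]
print(type(picked))  # <class 'numpy.ndarray'>
```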

90:42

This is something we will usually be reading from external sources. Now, how can you access

90:48

individual elements of a NumPy array? This works in the same way as with a Python list:

90:54

you can say give me the first element, give me the second element. And it's zero-indexed,

90:58

like, again, a Python list. Slicing works the same way. So in this case, a from zero up to

91:05

something, a from one up to three, right, just giving the lower limit and the upper limit of

91:13

the index. Negative indexing and steps all work in the same way as with a Python

91:21

list. So if you know how to use a Python list, you will know how to use a NumPy array. There

91:28

is one new thing right here, different from a Python list, and it's what's called

91:34

multi-indexing. Let's say you have an array, in this case b, and you need to extract three

91:43

elements out of it: you need the elements at the first position, third position and last

91:47

position. You can just type b[0], b[2], b[-1], and this

91:56

also works for a list. Or you can use, again, multi-indexing, which is: from b, I want

92:01

to select the elements at 0, 2 and -1: first element, third element and last element,

92:09

right, so you pass in another list containing the indices of the elements that you want

92:15

to select. And in this case, the important part is the result: it's another NumPy array,

92:25

it's not just individual elements; you're creating another NumPy array, which again,

92:31

if you're processing, is going to be a lot faster. So arrays have types associated. And

92:38

this is related to what we were speaking about before. As a NumPy array is contiguously

92:45

assigned in memory, the NumPy library needs to know the type of the object you're

92:52

storing. You can't just store, you know, anything, a string, a number, within it, because it will

92:57

not be able to

92:58

provide performance and optimizations for arrays of non-consistent sizes. So, for

93:07

example, when we created this array that only had integers, by default NumPy selected int64;

93:14

that's because of the platform, it's a 64-bit platform. You can tune this, and you can select,

93:20

as we're going to see, other sizes in a second. When we created the array b, which contains decimals

93:27

or floats, it assigned a different type, which is float64. Again, the default size is always

93:35

64, at least on this platform, which is 64 bits; it's going to be float64 and int64.

93:40

You can always change that. You can say, actually, I want these, even though these

93:46

are all integers, I want you

93:49

to

93:50

create them using a float type. Or, as we saw in our previous video, we can say it should

93:56

actually be type int8, so smaller integers, for better

94:07

performance. Alright. So moving forward, we were also going to see a few other types like

94:11

for example, strings and regular objects. But as you're going to see, there is

94:18

no point in storing these things in NumPy. NumPy stores numbers, dates, Booleans, but not

94:25

regular individual objects, as we're seeing right here. There is a way to store strings,

94:32

it's perfectly valid and it has its own type, and it's related to the

94:38

Unicode representation in memory, etc. But again, NumPy is usually used for numeric processing.
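
The type behaviour described above can be inspected through the `dtype` attribute. A minimal sketch, assuming NumPy is available as `np`; the array contents are illustrative, and the default integer width is platform dependent (int64 on most 64-bit Linux and macOS setups).

```python
import numpy as np

a = np.array([1, 2, 3, 4])         # integers: NumPy picks a default int type
b = np.array([0, 0.5, 1, 1.5, 2])  # decimals: float64

print(a.dtype)  # platform dependent, commonly int64
print(b.dtype)  # float64

# Overriding the default with the dtype parameter.
as_float = np.array([1, 2, 3, 4], dtype=np.float64)
as_int8 = np.array([1, 2, 3, 4], dtype=np.int8)
print(as_float)       # [1. 2. 3. 4.]
print(as_int8.dtype)  # int8

# Strings get their own fixed-width Unicode type.
letters = np.array(["a", "b", "c"])
print(letters.dtype.kind)  # U
```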

94:48

So the idea of NumPy arrays is that we can create multi-dimensional arrays. What we had

94:54

created before is a one-dimensional array, right? Just one dimension,

95:01

you can create matrices, which in this case are two dimensional, we have two rows and

95:08

three columns. And NumPy has a ton of attributes and functions to work with multi dimensional

95:15

arrays. So the first thing we're going to see is the shape of an array, which is two

95:20

rows by three columns, how many dimensions it has, it has one vertical and one horizontal,

95:27

we have two dimensions. And what's the total size of the array in this case, the total

95:33

size is six, the total number of elements we have, let's go one dimension. Further,

95:40

let's create a three dimensional object, a three dimensional array, which is basically

95:44

a cube. In this case, for B, we have that the shape is two by two by three, the number

95:52

of dimensions is three, and the size is the total count of elements, 12. You always have

95:59

to be careful when you're creating these multi-dimensional arrays. If the

96:05

dimensions don't match, like in this case right here, where we have this second list

96:12

that only has one list in it, then the dimensions will not match. And it will

96:17

just tell you, sorry, that the array is of type object. And the shape is

96:24

only two; it only has two elements: this one element, and this other element. So in

96:29

this case, we've done it wrong, basically. And you have to be careful when

96:33

creating these objects by hand. So how can you index and slice matrices? We've done

96:41

it for a one-dimensional array. So we were selecting individual elements: give

96:46

me the first element, give me the second element, etc. How can we do it with a matrix? With

96:52

a matrix, what we're going to do is going to be very similar to what we did before.
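
A sketch of the kind of matrix inspection and indexing being walked through in this part of the lesson. The 3-by-3 values are an assumption chosen to match the numbers read out in the walkthrough; the notebook itself is not part of the transcript.

```python
import numpy as np

A = np.array([
    [1, 2, 3],  # row 0
    [4, 5, 6],  # row 1
    [7, 8, 9],  # row 2
])

print(A.shape, A.ndim, A.size)  # (3, 3) 2 9

print(A[1])      # [4 5 6]  the second row
print(A[1][0])   # 4        second row, then its first element
print(A[1, 0])   # 4        same thing with one multi-dimensional selector

print(A[0:2])    # first two rows (upper limit excluded)
print(A[:, :2])  # every row, first two columns: 1 2 / 4 5 / 7 8

# Assignment: a whole row from an array, or a scalar expanded across a row.
B = A.copy()
B[1] = np.array([10, 10, 10])
B[2] = 99
print(B)  # rows: [1 2 3], [10 10 10], [99 99 99]
```

The comma form `A[1, 0]` is generally the one to prefer, because each position in the brackets can also take a slice.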

96:58

The difference is that now we have to account for multiple dimensions. When I do give me

97:03

a at one, is it the column at one, or is it the row at one? Well, as you can see, it's

97:10

the row. So this is going to be, right here, 0, 1, 2. Right. And there is also another dimension,

97:21

right? So this is 0, 1, 2 in terms of index positions for our slicing. So here,

97:34

how can you get the first element of this second row? In that case,

97:43

you're going to first select the second row, and then select

97:48

the first element. And that's what you get, number four. But there is a better way, which

97:52

is by using the multi-dimensional selection of NumPy. In this case, you're going to say

97:58

from this matrix, I want to select, and here you're

98:04

going to pass dimension one, dimension two, dimension three, dimension four, etc., right. And these

98:11

are selectors for each one of those dimensions that you're passing. In this case, we say,

98:17

at row level, position one, the second element, and at column level,

98:23

we want the first element in it. And it's the same thing as we did before. The advantage

98:29

of this indexing, and of keeping it in mind and remembering it, is that it will also let you add slicing,

98:34

right, so you'd say I want to select everything from dimension one, which is rows. So

98:42

in this case, you say from zero up to two, which is these two rows; the two is not included,

98:48

the upper limit, the same as in Python. And then

98:53

you can also pass other dimensions. You say, I want to select every row, that's

98:58

fine. But then at column level, I only want to select the elements

99:03

up to two. So these two, and these two, and these two, right, so 1, 2, 4, 5, 7, 8. This all works

99:15

as intuitively as it gets. Remember, this syntax is the important thing that you need to keep in

99:21

mind. Moving forward, for modification, you can say I want to assign this new array to

99:28

this entire row, right? So if the dimensions match, that is going to work; now the 10s

99:36

are assigned to the second row. Or you can just use what we usually call an expand operation:

99:43

we're just going to say, for row number two, I want to assign the number 99, and NumPy is

99:52

going to take care of expanding it into the corresponding array, given the number of dimensions

99:58

that you have. So far, that's selection; it's simple. What we're going to see also is that NumPy

100:04

has the huge advantage of containing a ton of operations you can perform on top of your

100:10

arrays and matrices, your multi-dimensional arrays in general. So the first one is

100:15

all the basic summary methods we have. So given an array, all these methods are already

100:20

built in: the sum, the mean or average, right, standard deviation, variance, etc. And that

100:25

also works for matrices. So in this case, we can get the sum, the mean, the standard deviation,

100:31

or we can do it per axis. So this is very useful. We can get, here, let's compare

100:40

these two, there we go, we can get the sum of the first column, the second

100:47

column or the third column, or we can get it for the first row, the second row and the third row.

100:52

So it's either this dimension, axis equals zero, or it's the other dimension, which is

100:58

axis equals one, right? So per column, per row. Or, if you have more dimensions, you can just

101:05

keep increasing the number of the axis. And that's just going to work as expected.
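
The summary methods and the `axis` argument can be sketched like this, assuming NumPy as `np`. The array values are chosen for the example; `axis=0` collapses the rows and gives one result per column, while `axis=1` gives one result per row.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
print(a.sum())   # 10
print(a.mean())  # 2.5
print(a.var())   # 1.25
print(a.std())   # about 1.118, the square root of the variance

A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])
print(A.sum())        # 45, over every element
print(A.mean())       # 5.0
print(A.sum(axis=0))  # [12 15 18]  one sum per column
print(A.sum(axis=1))  # [ 6 15 24]  one sum per row
```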

101:12

Broadcasting and vectorized operations: this is a fundamental topic that we're going to talk

101:17

about. And it's going to be extremely related to Boolean arrays. And these are a few new

101:23

things that you have to keep in mind when working with NumPy. And now we're going to

101:27

talk about vectorized operations and broadcasting, which can be a counterintuitive topic at the

101:33

beginning, but then you're going to understand how much sense it makes. It's one of the fundamental

101:39

pieces of NumPy. We've seen how NumPy works in a very general way we saw the multi dimensional

101:46

arrays and all those advantages. But you might be thinking, I mean, I don't need another

101:50

library just to compute the sum or the mean. When I show you the vectorized operations

101:55

and broadcasting part, it's going to make a little bit more sense why NumPy is so

102:00

important. So to get started, we're going to have this array, which is a, right, that's

102:05

just a very simple array. Vectorized operations are operations performed between

102:11

arrays and arrays, and arrays and scalars, like in this case right here, which are extremely

102:17

fast; they're optimized to be extremely fast. In this case, what we're going to do is we're

102:21

going to sum the entire array plus 10. And what that means, we're going to see with an example

102:26

of what happens without NumPy, with plain Python.
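
A sketch of the array-plus-scalar operation next to its plain Python counterpart. It assumes the array `a` holds 0, 1, 2, 3 (built here with `np.arange(4)`); the actual notebook values are not part of the transcript.

```python
import numpy as np

a = np.arange(4)  # array([0, 1, 2, 3])

plus_ten = a + 10   # the scalar is applied to every element
times_ten = a * 10

print(plus_ten)   # [10 11 12 13]
print(times_ten)  # [ 0 10 20 30]
print(a)          # [0 1 2 3]  the original array is untouched

# The plain Python equivalent spells out the loop with a list comprehension.
numbers = [0, 1, 2, 3]
plus_ten_list = [n + 10 for n in numbers]
print(plus_ten_list)  # [10, 11, 12, 13]
```

Both versions produce the same numbers; the difference is that the NumPy expression hands the loop to optimized compiled code instead of running it element by element in the interpreter.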

102:30

But what it means is, let me show you the results, that the same operation will be applied

102:35

to each one of the elements within the array. So usually, that's the concept

102:41

of vectorizing an operation: you have the number, and then this operation is applied to each

102:47

one of the elements in here, or actually in this other one, right, so here and here and

102:54

here and here, to result in this new array. The operation is expressed at an array level,

103:03

right, we say a plus 10. That's it. But then again, internally, this is broadcast

103:11

to each one of the individual elements within the array. And this gives me a plus 10.

103:16

Or a times 10, for example, in which case we're applying the times 10

103:23

operation to each one of the elements in the array, resulting in a new array with the

103:28

result of that operation. And this resulting in a new array is very important, because

103:33

as we're going to see, NumPy is an immutable-first library: any operation

103:39

you perform on an array will not modify it, but will return a new array. If we check

103:44

the state of a, you're going to see that the elements are the same; it has never changed.

103:49

We are creating a new array and returning it. There are ways to override this behavior

103:55

if you want. All these operations we're performing this way also have the

104:01

interface of plus equals, minus equals, times equals, etc., which will indeed modify the

104:08

array. In this case, we're making a broadcasting operation, adding 100 to each one of the elements

104:13

in this array. And now this operation was not immutable: a was modified, and it hasn't

104:21

returned a new array. If you remember from your pure Python skills, right, the counterpart

104:30

of vectorized operations are list comprehensions, in which you're expressing an operation for

104:35

each one of the elements in your collection. Right. So that's a list comprehension. It's

104:41

pretty similar to what we're doing with NumPy. The main difference is that this

104:45

is all optimized; it's extremely fast. So these vectorized

104:52

operations, this broadcasting, doesn't need to be only between arrays and scalars; it

104:57

can also be between arrays and arrays. So in this case, we have a and we have b, which I'm

105:02

showing you right here. And we can do something like a plus b. And what you're seeing is that

105:08

there is a correspondence, right, so zero plus 10, one plus 10, two plus 10, right?

105:16

Let me do it in this way: one and 10, two and 10, and three and 10. There we go. And that's the result

105:26

that we get right here. So for this to work, you of course need the arrays to

105:32

be aligned and to have the same shape.
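
A minimal sketch of the vectorized operations described above, assuming a is the four-element array 0 to 3 and b is an array of four tens (both are assumptions made for illustration; the variable names holding the results are also illustrative):

```python
import numpy as np

a = np.arange(4)                     # array([0, 1, 2, 3])

plus_ten = a + 10                    # the scalar is applied to every element
times_ten = a * 10
a_after_new_arrays = a.copy()        # a itself is still 0, 1, 2, 3 at this point

# In-place operators such as += modify the array instead of returning a new one.
a += 100
a_after_inplace = a.copy()

# Pure-Python counterpart: a list comprehension over each element.
l = [0, 1, 2, 3]
l_times_ten = [i * 10 for i in l]

# Array-to-array operations pair elements up by position,
# so here both arrays have the same shape.
a = np.arange(4)
b = np.array([10, 10, 10, 10])
a_plus_b = a + b
a_times_b = a * b

print(plus_ten, times_ten, a_after_inplace, a_plus_b, a_times_b)
```

The list comprehension gives the same numbers as a * 10, but it loops in Python, while the NumPy expression is carried out in optimized compiled code.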

105:36

But when that does work, then the operation is extremely fast in memory. And it's aligned,

105:43

it's a vectorized operation, as we've seen so far. Why is this topic of vectorized operations

105:49

so important? Well, because of the following, which is Boolean arrays. And this is a very,

105:56

very, very important thing. If you don't completely get it now, I ask you, please, to go and

106:02

check the exercises we have for this lesson, because we're gonna use it a ton. And we're

106:07

gonna see that in pandas, the same syntax, the same primitives of Boolean

106:13

arrays apply, and we're going to use the same things. So why are Boolean arrays

106:18

similar to vectorized operations? Well, all these operations we've performed here

106:24

are just arithmetic operations, mathematical operations, plus something times something,

106:29

etc. If you look at the operators that you have in your programming language, it's it's

106:34

not only mathematical operators, like plus or minus, or times, you also have Boolean

106:41

operators. And the question now is going to be what happens when you apply Boolean operations,

106:48

when you apply Boolean operators to an array. So given our array, what ways did we have

106:54

to select different numbers? For example, in this case, we need the first and last element,

107:00

we do zero and minus one. That's the way we saw with NumPy. We also saw the traditional

107:07

Python one, right, so we can say a[0], and also get a[-1]. So this is the

107:13

first, the first way of selecting these elements, we know there's a second way with multi index

107:20

selection. And there is a third way and this is new, which is with Boolean arrays right

107:27

here. So in this case, we're gonna say I want to select the elements in this order, right?

107:33

And you're gonna pass either true or false if you want to actually select the element

107:38

or not, right, so if you have four elements, you have to pass four Boolean values, saying,

107:43

I want to select this element, I don't want to select this one, I don't want

107:48

to select this element, and I do want to select this element right here. So I want the first

107:51

one and the last one, and the result will be the same: 0 and 3, 0 and 3, 0 and 3. So far, it's nothing

107:58

terribly new, right? So this is new, but it's not extremely complicated. We are showing

108:03

you a brand new way of selecting data: you can select with regular Python indexing, with multi-index, or with

108:10

a Boolean array. Now, you might be thinking, well, do I manually write True, False, False, True,

108:18

True, False, for I don't know how many records? If you have a million records, this is not scalable,

108:24

right, you will not sit and write all the Trues and Falses. But this is actually very important,

108:30

because these arrays are the ones that are the result of broadcasting Boolean operations.

108:37

So we saw again, regular arithmetic operation like this, but we also have it for Boolean

108:47

operations. So what happens if we ask for a greater than or equal to the number two,

108:54

right, and array a is this right here, 0, 1, 2, 3? Then the result is False for zero, False for

109:04

one, because they are not greater than or equal to two, True for number two, of course, and

109:10

True for number three. So all the individual elements that match this condition will have

109:17

True, and False in other cases. This is the power of Boolean arrays: we will be able now

109:27

to combine these operations. So now we can do a indexed by a greater than or equal to two, right,

109:35

that is, a of a being greater than or equal to the number two. The

109:44

advantage of this is just filtering: we're filtering numeric arrays very quickly

109:50

with a very familiar syntax, a greater than or equal to two, and we just provide that as the

109:57

index of the operation. It's pretty much what is happening right here: we're saying, use

110:02

this Boolean array, it's a Boolean list, right, a Python list with Booleans, to filter

110:09

or sorry to select elements based on that. But the question is, how do we construct that

110:15

list of Boolean? Well, in this case, we have constructed it by including a predicate by

110:20

including a condition that needs to be matched. The result, again, is filtering. It's a query

110:27

method, you're looking, looking up some data, you're saying, Give me all the elements that

110:32

match this condition. So you can say, for example, these values can be of course calculated,

110:39

you can say, give me all the elements that are greater than the mean. Or you can actually

110:45

provide other Boolean operators, like, for example, all the elements that are

110:49

not greater than the mean. So that means they're less than or equal to the mean. Or you can also

110:55

include other Boolean operators like or, or and. So or and and, in NumPy, are expressed

111:01

with a pipe or an ampersand, because we can't use just the regular or and and

111:07

in Python. We can't, but it's a good choice they've selected. So again, this is the

111:12

concept of Boolean arrays: we are going to construct these arrays that are just Boolean

111:19

representations, or Booleans, based on conditions, right, so we have this matrix, and we're gonna

111:25

say I want to select this one, and this one, and this one, etc. So in that case, this

111:30

is the result right here. This is the result of that. And we can generate a dynamic Boolean

111:38

array; we never manually type all these, right, we don't sit and say True, False, False, True,

111:44

etc. We just run a query, a filtering operation, a Boolean operation, which results in a Boolean

111:51

array. And now we can use it as filtering. So again, the idea here is that the operations

112:00

we saw in broadcasting before, like a times 10, are also defined for Boolean operators. Boolean

112:09

operators return Boolean arrays, which can be used in filtering; that's the idea of all

112:16

of it. And you can even combine these operations, you can say a equals zero or a equals one, or

112:24

a less than or equal to two and also divisible by two; you can combine all these queries.
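
The Boolean-array selection walked through above can be sketched as follows, assuming the same array a with values 0 to 3 (the result variable names are illustrative, not from the lesson):

```python
import numpy as np

a = np.arange(4)                              # array([0, 1, 2, 3])

first_last = a[[0, -1]]                       # multi-index selection
by_flags = a[[True, False, False, True]]      # one Boolean flag per element

mask = a >= 2                                 # a Boolean array built by a condition
filtered = a[mask]                            # same as a[a >= 2]

above_mean = a[a > a.mean()]                  # the mean of 0..3 is 1.5
not_above_mean = a[~(a > a.mean())]           # ~ negates a Boolean array

# | is "or" and & is "and"; each condition needs its own parentheses.
zero_or_one = a[(a == 0) | (a == 1)]
small_and_even = a[(a <= 2) & (a % 2 == 0)]

print(mask, filtered, zero_or_one, small_and_even)
```

The plain Python keywords or and and do not work element by element on arrays, which is why the pipe and the ampersand are used here.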

112:32

So now it looks a lot more powerful than what we were doing before. So moving forward, talking

112:39

about linear algebra very quickly, and with this we're approaching the end of the NumPy

112:44

lesson. The important part of linear algebra is that NumPy already contains

112:51

all the most important operations for it, already optimized with low-level semantics, so it's going

112:57

to be extremely fast: dot products, cross products, transposing matrices,

113:02

all of that works as expected. And again, these might be very important, especially,

113:08

for example, machine learning, etc. It's it's extremely important. And finally, to wrap

113:15

up what we saw in our binary explanation at the beginning, what might have escaped you

113:22

is the difference in sizes between NumPy and Python, and the differences in terms of performance

113:30

between them. So in Python, a regular number, this is just a regular int in Python, the

113:36

total size is 28 bytes. Just let this sink in for a second. The total number

113:46

of bytes, not bits, bytes, that you need in Python to store a simple number, such as the number

113:54

one, is 28. You need 28 bytes to store just the number one. It's extremely,

114:01

super space consuming, right? It's not very efficient. Larger numbers will take even more

114:09

bytes to store them. What's the size of the NumPy integers? Well, we've seen it; we have, for

114:17

example, we can create integers with eight bytes. We can create integers with one byte,

114:22

right, which would be something like here: we have np.int8, and we already know how many bytes

114:33

it has, only one byte, right. So you can have control of how many bytes or bits your

114:39

numbers will take. And you can see here the difference between the size of an integer

114:45

in Python, which is extremely large at 28 bytes, and in NumPy, and also the difference in performance.
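
One way to check these sizes yourself is sketched below. It assumes a standard CPython interpreter; the exact figure reported for a Python int depends on the interpreter build, so the 28 bytes mentioned above is what a typical 64-bit CPython shows rather than a guarantee.

```python
import sys
import numpy as np

# Size of a whole Python int object (value plus object overhead).
python_int_bytes = sys.getsizeof(1)

# Python ints are arbitrary precision, so bigger values need more bytes.
large_int_bytes = sys.getsizeof(10 ** 100)

# NumPy integers have a fixed size per element, chosen by the dtype.
int8_bytes = np.dtype(np.int8).itemsize       # 1 byte
int64_bytes = np.dtype(np.int64).itemsize     # 8 bytes

print(python_int_bytes, large_int_bytes, int8_bytes, int64_bytes)
```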

114:55

Here you also have the difference in size of lists,

115:00

which is also significant. But I want to focus on performance. We have two objects:

115:05

we have one list that has the first 1000 numbers, and we have a NumPy array that has the first

115:11

1000 numbers. We're going to perform the same operation in both of them. Let's use the Python

115:16

one first. In this case, we're squaring

115:22

all the elements in the list, okay, the elements are squared, and then we're summing all the

115:29

results, right. So we express it as saying, create a new list, x times x, sorry, x squared,

115:37

for x in l, and then sum everything. How much time does it take? 321 microseconds. We're gonna

115:46

do the same thing with NumPy, we're gonna say np.sum of a squared. And you're gonna

115:54

see that it's a lot faster from the NumPy perspective than from the Python perspective. And these are

116:01

all very, very tiny, tiny operations with small numbers. What happens if we add more

116:08

numbers? Let's add two more numbers here. Let's add two more numbers here. And we're

116:15

going to do the same two operations. So as you see here, the units have even

116:23

changed, we're still in the microsecond layer here with NumPy, we've gone to the millisecond

116:28

layer in Python. So as the size of your objects increase, NumPy will prove to be extremely

116:36

fast compared to Python. So there are a few other functions you can see here, for example,

116:42

extracting normal random numbers, etc. I'm going to leave these for you to look at, if

116:49

you're interested in them. Remember you have the exercises, which can help you solidify

116:55

all the concepts we discussed. And we're going to move forward now to work with pandas, we're

117:01

going to see also visualizations, and we're gonna keep moving forward with this data analysis with

117:05

Python tutorial.
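
Before moving on, the list versus array timing comparison from the end of the NumPy lesson can be reproduced outside a notebook with something like the sketch below. It uses the standard timeit module instead of the notebook's timing magic, and the size n and the function names are assumptions chosen for illustration. Actual timings vary by machine, so none are quoted here.

```python
import timeit
import numpy as np

n = 100_000
l = list(range(n))
a = np.arange(n, dtype=np.int64)   # int64 so the sum of squares cannot overflow

def python_sum_of_squares():
    # Square every element in a new list, then sum the list.
    return sum([x ** 2 for x in l])

def numpy_sum_of_squares():
    # Same computation expressed at the array level.
    return np.sum(a ** 2)

python_seconds = timeit.timeit(python_sum_of_squares, number=20)
numpy_seconds = timeit.timeit(numpy_sum_of_squares, number=20)

print(f"python list: {python_seconds:.4f}s  numpy array: {numpy_seconds:.4f}s")
```

Both functions return the same number; only the time they take differs.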

117:09

Now, it's finally time to talk about pandas. It's the most important library that we use

117:18

for data analysis on a day-to-day basis with Python. It's a library that will aid

117:25

in the entire process of your data analysis project, you're going to start getting the

117:31

data, step one, getting the data from multiple sources, like databases, Excel files, CSV,

117:39

files, etc. That's all gonna get into pandas, you're going to be processing the data, right?

117:45

So you're going to be combining merging, doing different types of analysis, you're going

117:50

to be visualizing the data, right, so a bar chart, you're going to be visualizing the

117:55

data with pandas, and you're going to be creating reports, you're going to be also doing simple

118:01

statistical analysis, you're going to be doing machine learning, or something close to it, with the help

118:07

of other libraries, but everything from the platform that the pandas library provides.

118:13

It's, again, one of the most important libraries in the data analysis and data science ecosystem

118:20

with Python. pandas has recently released the version 1.0. So we are talking about a

118:28

very mature library. It's been around for a long time now. And again, it's the primary

118:34

library that we use in Python for data analysis and data science. So I'm going to do a quick

118:41

introduction to the data structures that pandas has, and we're gonna understand how they

118:47

work, so you can start building, right, we're gonna start building the foundations.

118:51

I need you to be very familiar with the way the data structures from pandas are processed.

118:58

And then we're going to move into other things like reading files, grouping data, etc. So

119:03

to get things started, we're going to talk about the first data structure that pandas has,

119:08

which is the series. In reality, pandas has two main data structures that it uses all

119:13

the time, and it's the series and the data frame. The data frame is the one you will

119:18

probably be more familiar with. It looks just like an Excel table. But we're gonna start

119:23

first with a series. Okay, so just stay with me here. We're going to talk about a series

119:27

for a second. In this case, we have imported pandas, and we have also imported NumPy. As

119:34

you might imagine, as I told you before in the NumPy part of this tutorial, we were

119:39

saying NumPy is fundamental for data analysis because every other library, pandas, matplotlib,

119:45

they all sit on top of NumPy, and you can see it right here. We're gonna be using some features

119:49

from NumPy within this lesson, too. So this is a series in pandas, what you see right

119:57

here. The concept of a series is this ordered sequence of elements, right, all indexed,

120:09

right, they are all indexed by a given index, of course. And you might think that

120:14

this looks a lot like a Python list, right? So in this case, we're storing the population

120:20

of countries, right, in millions of inhabitants. In this case, it's g7_pop, because

120:27

we're getting the population of the Group of Seven; you can consult the Wikipedia page.

120:33

But basically, we are storing population in here in this series. And again, it looks a

120:39

lot like a list, but we're gonna find a ton of differences in here. So the first one is

120:45

that the series has an associated data type. And this is something we saw in NumPy,

120:52

where a NumPy array couldn't hold different types of objects; we were only

120:59

having one type of object. In this case, it's float64. So all the numbers of the series

121:05

will be of type float64. The underlying data structure that pandas is using to store

121:11

these objects is a NumPy array. So a second difference we see very quickly is that series

121:17

can have a name, right. So now when we display the series, we see that it has a name. Now

121:23

it might not make a ton of sense. But once this series is part of a data frame in the

121:28

form of a column, then the name is going to make a lot more sense. So moving forward,

121:32

again, we saw that it has a type. And again, this is because the data is backed

121:39

by a NumPy array that you can always consult: you can check the values of a series, and you're

121:44

going to get the array that it's backing up that pandas series, right, so you can see

121:49

that it's a NumPy array.
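
A small sketch of the series just described. The population figures and the series name are illustrative values in millions, assumed for this example rather than read from the lesson.

```python
import numpy as np
import pandas as pd

# Illustrative G7 populations in millions (assumed values).
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])
g7_pop.name = "G7 Population in millions"

series_dtype = g7_pop.dtype     # float64: one data type for the whole series
backing = g7_pop.values         # the NumPy array behind the series

# With the default index (0, 1, 2, ...), label 0 is also the first position.
first = g7_pop[0]

print(g7_pop)
print(type(backing))
```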

121:52

Once you have this series, the one we were just consulting here, g7_pop, you can select

121:59

elements as you would in a regular list, right? So for example, give me the first element,

122:05

give me the second element, the last element, etc. And that's because a series inherently

122:12

has an index, similar to a list. When you create a list in Python, right, so if

122:19

I create l equals a, b, and c, there was something wrong here, a missing quote. In this

122:31

list, we don't see it, right, but the idea is that there is an index here: this is zero, this

122:39

is one, and this is two, right? In the pandas series, this is a lot more explicit: each

122:48

element has an associated index value with it. And you might think that is pretty much the

122:57

same thing. They're both, the list and the series, sequences; they're

123:02

ordered sequences of elements. But we're going to see that there is a fundamental difference,

123:08

and is that we can arbitrarily change the index of a series. So by default, when we

123:16

created it, we didn't assign any indices. So by default, it was a range index from zero

123:22

up to n minus one elements. But you can actually arbitrarily again, say, what is the index

123:31

of your series. And in this case, this data structure, this series, now has these indices

123:36

that we're seeing right here. Why is this important? Because now we're going to be referring

123:44

to these values, not by a sequential position, but by a name, but by a label by the index,

123:50

which has a meaningful name for us humans. Okay. So now, these thing looks a little bit

123:58

more like a dictionary, we could say, than a list. We started thinking that a series

124:04

was similar to a list, but now we can think that a series is similar to a dictionary.
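
Changing the index as described above might look like the sketch below. The population figures are illustrative values assumed for the example; the second half shows that the labels can also be passed when the series is created.

```python
import pandas as pd

countries = ["Canada", "France", "Germany", "Italy",
             "Japan", "United Kingdom", "United States"]
populations = [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523]

g7_pop = pd.Series(populations, name="G7 Population in millions")
default_index = list(g7_pop.index)     # 0 up to n - 1 when no index is given

# Replace the default index with meaningful labels.
g7_pop.index = countries
canada = g7_pop["Canada"]              # looked up by label, like a dictionary key

# The same series built in one step, passing the index up front.
g7_pop_one_step = pd.Series(populations, index=countries,
                            name="G7 Population in millions")

print(g7_pop)
```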

124:09

But wait, don't get me wrong here. The series has a fundamental trait, and it's that it's

124:15

still ordered, something that didn't happen with dictionaries. Dictionaries in Python

124:23

are not ordered. Actually, since Python 3.7 they're ordered, but we shouldn't be thinking

124:27

that they are ordered; they're unordered data structures. In this case, a series is indeed

124:34

ordered. So it has both of those advantages. It's ordered, Canada is always before France,

124:41

that's as we decided to create it, but also it has names or labels or keys associated

124:48

with the values as a dictionary. So this is creating the series from scratch. Right? All

124:55

these methods, you can see you can create a series by passing the index; it doesn't have

125:00

to be a two-step process where you first create the series and then add the index. In this

125:06

case, you can do everything at once. And the indexing is now going to be done by those

125:14

indices, right. So those labels that make up the index will be used to index specific

125:21

data. So g7 pop, we see has these countries with these population. And now, before the

125:27

index, we were saying, I want to get what's the population of Canada, and then we had

125:35

to remember what was the position of Canada: oh, it's the first of the countries, so we have

125:39

to do g7_pop[0]. With the index, now we can just consult what's the population of

125:45

Canada, what's the population of Japan. And as you can see, the syntax is the same as

125:51

with a Python dictionary, it's just pretty much same, you pass the key and is going to

125:56

get the value. So again, in summary, the advantage of a series is that it's an ordered sequence

126:03

of elements, backed by a NumPy array, very efficient, very fast. But it also has

126:13

an index that can take any labels we pass, so it's going to make it a lot better for

126:20

indexing. When you have a series, you can still get the elements by the sequential

126:27

ordering. After all, it's a sequential data structure, and it doesn't matter if you have

126:32

an index, you can still say, hey, I know we have an index, but if you want to get

126:37

the last element, or the first element or the second element, you're going to do that

126:42

by using the attribute iloc, and say, from this series, I'm going

126:49

to iloc, locate by sequential position, this element, the element in position zero or the

126:56

last element. And that still works as expected. Series also support multiple indices, as we

127:03

saw with NumPy. So in this case, we can get two elements, or three, or n elements;

127:09

you can pass multiple indices. And the same thing happens with sequential multi-index

127:16

selection. Series also support range selection, or slices. But there is a fundamental difference

127:24

here, this is very important, pay attention: there's a fundamental difference with Python,

127:29

and it's that in Python, the upper limit of a slice is not returned. So from our list

127:40

that we created before, if I do l up to number two, I don't get the element c, right, so

127:49

this is zero, this is one, this is two; two is not included. In our pandas series, the

127:58

upper limit is indeed included. So when you ask from Canada up to Italy, Italy is

128:05

in the result. Okay, so this is something to consider when using index selection in

128:11

pandas. I think this is still valid; I understand the reasoning behind it, it's

128:15

just different from Python. So, you should remember, Boolean arrays, which was a topic

128:21

we discussed in our previous lesson of NumPy. Boolean arrays are still a thing in pandas;

128:28

the difference is that instead of saying Boolean arrays, we should say Boolean series, right?

128:33

The idea is that we will be able to perform operations on top of series. So for example,

128:40

right here we have mathematical operations on top of series. In this case, we have the

128:43

series g7_pop, which as I told you at the beginning is in millions of inhabitants. If

128:52

we want to get the series in terms of units, we will need to do g7_pop times 1 million,

129:01

and there we go, now it is in terms of units. These operations, right, these vectorized operations,

129:09

these broadcasting operations, can also be performed with Boolean operators. So

129:16

instead of a multiplication, a sum, a subtraction, etc., we can use

129:22

Boolean operators. So in this case, we can ask

129:27

what

129:28

are the countries that have more than 70 million inhabitants, and what we will receive

129:36

as the result is a Boolean array, a Boolean series, right, well, a series here, you know, but basically,

129:43

it's the same concept as with a NumPy Boolean array. Canada and France, they do not

129:49

have more than 70 million inhabitants, and Germany does have more than 70 million inhabitants

129:56

here, 80, and the same for Japan, so Japan here is the same, and the same for the US; the US

130:04

also has more than 70 million inhabitants. So again, the Boolean array, or Boolean series

130:12

in this case, works in the same way, as with NumPy. And selection also applies. So I can

130:19

now select, I can say, give me from these series g7 pop, all the countries that have

130:27

more than 70 million inhabitants, the value is more than 70. So now, again, we are building

130:35

filtering, we're building a query language if you want on top of pandas, we're selecting

130:41

data based on this condition. Remember, if you ever have trouble remembering all this,

130:49

the idea is that you can always track down the way this index is being built. In this

130:57

case, it's not that the selection knows anything, that this first selection knows anything

131:04

about how to select countries with more than 70. This operation was performed first, which

131:11

resulted in this series. And now the series will be indexed by this array, this Boolean

131:20

array. And the result is as you can see it. And again, these operations can be run with

131:26

calculated values, and with all the operators we saw in our previous lesson, which were not,

131:33

or, which is this regular pipe, and ampersand, which is the and. All these can be

131:43

applied in any order you want. So if we read this thing, which is complicated on purpose,

131:51

we're saying, give me all the elements that are above the mean, minus two standard

131:57

deviations or below the mean, actually, above the mean, and here was below the mean, or

132:05

if this isn't correct, but it doesn't matter. It's just an OR operation between two ends

132:12

of the it's actually, it's above the mean, minus the standard deviation. So we are applying

132:20

this operation or that operation we had before. So the not, the or, and the

132:28

and, they all work with Boolean selection as well. The operations we saw from a mathematical

132:38

perspective, I mean the statistical operations we saw in NumPy, sum, mean, average, standard

132:46

deviation, we were actually using standard deviation before, they're all still relevant in this

132:51

case. But also you can use traditional NumPy functions with our pandas series, because

132:57

again, a pandas series is internally backed by a NumPy array.
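
The ways of selecting from a series covered so far (by label, by position, by slice, and by Boolean series) can be put side by side in one sketch. The population figures are illustrative values assumed for the example, and the result variable names are hypothetical.

```python
import numpy as np
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=["Canada", "France", "Germany", "Italy",
           "Japan", "United Kingdom", "United States"],
    name="G7 Population in millions",
)

canada = g7_pop["Canada"]                   # by label
last = g7_pop.iloc[-1]                      # by sequential position
two_countries = g7_pop[["Italy", "France"]] # several labels at once

# Label slices include the upper limit: Italy is part of the result.
label_slice = g7_pop["Canada":"Italy"]
# Positional slices with iloc behave like Python lists: the stop is excluded.
position_slice = g7_pop.iloc[0:3]

in_units = g7_pop * 1_000_000               # vectorized arithmetic

over_70 = g7_pop > 70                       # a Boolean series
selected = g7_pop[over_70]                  # filter with it

between = g7_pop[(g7_pop > 80) & (g7_pop < 200)]
either = g7_pop[(g7_pop > 80) | (g7_pop < 40)]

log_pop = np.log(g7_pop)                    # NumPy functions accept a series

print(selected)
```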

133:03

So this is all the same, as you can see, here is an example that it's a little bit more

133:09

clear, we're getting all the countries that have more than 80 million inhabitants, and

133:16

all the countries have less than 200 million inhabitants. So it has to be above 80. But

133:23

it also has to be below 200. Okay, or in this case, we say either above 80, or below 40,

133:32

or below 40. Right. So that's with the OR operator or the NOT operator. Modifying series

133:39

is relatively simple. Whenever you have a value, you can just assign it all together.

133:45

In this case, we're saying Canada is now 40.5. I don't know why we just wanted to do it.
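
A short sketch of assigning values into a series: by label as just shown, and also by sequential position and by Boolean selection. The series is a shortened, illustrative version with assumed values, and the snapshot variable names are hypothetical.

```python
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 318.523],
    index=["Canada", "France", "Germany", "United States"],
)

g7_pop["Canada"] = 40.5              # assign by label
after_label = g7_pop["Canada"]

g7_pop.iloc[-1] = 500                # assign by sequential position
after_position = g7_pop.iloc[-1]

g7_pop[g7_pop < 70] = 99.9           # assign to every element matching a condition

print(g7_pop)
```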

133:52

This is by index, you can also do it by sequential positions. So in this case, we're going to

133:57

say the last country should have 500 now. So we're going to see a right here, the last

134:02

country has 500 now. Or you can also modify elements based on Boolean selection. So

134:09

you can say all the countries that have less than 70 million inhabitants, all these from

134:15

our previous query, all these will now be 99.9. So as you can see, it has changed all

134:24

these countries. So the assignment works by direct indexing, and it also works by Boolean

134:31

indexing. And this is going to be extremely important when we are cleaning data. So let's

134:36

move forward and start talking about data frames now. Before that, you have exercises for

134:43

series, and also for data frames, so I recommend you to check them out. So talking about data

134:51

frames, this is what a data frame is going to look like. It's pretty much the same thing

134:56

as an Excel table. So this was our series and this is going to be our data frame. It's

135:04

a table. So it looks a lot like an Excel spreadsheet. And actually, it's very common to create pandas

135:11

data frames out of CSV files, which are tables, basically. And we're going to create it

135:18

with this DataFrame object. There you go, this is our data frame. And as you can

135:26

see, right, it has columns that we have assigned. In this case, we were assigning the columns,

135:32

and we have rows of values right below each one of these columns. What's the similarity

135:40

with series? It's that a data frame column will be basically a series. So we can

135:49

think a data frame is a combination of multiple series one per column, we're going to assign

135:56

an index to the data frame the same way that we did with our series. So in this case, this

136:02

is our data frame. Sorry, right here. This is our data frame that has the index, right?

136:09

And it has the columns as we had before, what columns Do we have, what's the index of the

136:17

data frame, these are all attributes that you can consult, there are a couple of very

136:22

interesting methods from data frames that we use all the time. The first one is the

136:26

info method. That's going to give you quick quick information about the structure of your

136:31

data frame. Right. So it's going to tell you what columns you have population GDP surface

136:37

area, HDI, continent. And it's also going to tell you the types and how many null values

136:43

you have; it's actually telling you how many non-null values you have. But we use this

136:48

when we're cleaning data to quickly identify those columns that have missing values. We

136:54

can check for the size of the data frame, we can check for the shape. And this is similar

136:59

to a matrix, right, a two-dimensional array in NumPy is pretty much a data frame. And

137:05

also, similar to info, which was, again, to check a summary of the structure of the data frame,

137:11

we can also use describe, which is going to give you a summary of the statistics of

137:17

the data frame. And in this case, what we see is that for each numeric column, only

137:25

those columns that are numeric, continent is not here, for example, this is continent, so you

137:31

can see the type is object, it's a string, basically, for all the numeric columns, we're going to have

137:37

summary statistics for them. So for example, for population, how many elements we have,

137:44

what's the mean, right, what's the average, what's the standard deviation, the

137:47

minimum, the maximum, and in between a couple of percentiles, the 25th, 50th and 75th percentiles.

137:56

So this is quick summary statistics. And we do this a lot. So keep in mind, the describe

138:01

method is very popular.
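
As a sketch of the data frame and the inspection methods described above: the column names follow the ones mentioned in the lesson, while the figures are illustrative values assumed for this example.

```python
import pandas as pd

# Illustrative values (assumed), one entry per G7 country.
df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
        "GDP": [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
        "Surface Area": [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
        "HDI": [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
        "Continent": ["America", "Europe", "Europe", "Europe",
                      "Asia", "Europe", "America"],
    },
    index=["Canada", "France", "Germany", "Italy",
           "Japan", "United Kingdom", "United States"],
)

columns = list(df.columns)
shape = df.shape              # (rows, columns)
size = df.size                # total number of cells

df.info()                     # prints column names, non-null counts and types
stats = df.describe()         # summary statistics for the numeric columns only
type_counts = df.dtypes.value_counts()   # how many columns there are of each type

print(stats)
```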

138:04

As you could see, in the in the info method, the columns have associated types, okay, so

138:13

this is very important. The continent column is an object, which means that it's basically a

138:18

string, HDI is a float, and surface area is an integer. And that's because pandas,

138:25

through NumPy, is automatically recognizing the correct type to assign to

138:32

each one of the columns. This is similar to what we saw with a series, in which the series

138:40

contained a data type; a series was of a given data type. So that's something

138:46

you cannot change. And in this case, checking value counts, you can have a quick reference

138:52

of the types of your columns. So moving forward, how will we be selecting data from

139:01

data frames? Well, there are a couple of methods. And this might be a little bit confusing.

139:08

So what I'm going to do is I'm going to skip and just going to give you a quick reference

139:14

first, and then you can read if you want through the process we follow here, given a data frame,

139:22

and this is just two quick rules, given a data frame, you're going to select by index

139:29

using the loc attribute. So the loc attribute will let you select individual rows. So

139:39

for example, I want to get Canada, and that's the value of Canada. The iloc attribute

139:50

will let you select, similar to the series, the row by sequential position. So let's say

140:00

we want to select the last row. In this case, it's the United States of America. So again,

140:07

loc lets you select rows by index, give me the row under this index;

140:15

iloc will let you select rows by sequential position, give me the last row, the first

140:21

row, the second row, etc. And finally, without using loc, without using iloc, just by saying

140:28

df of something, you are selecting that column: give me the entire

140:35

Population column right here, the entire Population column. So what you're seeing here,

140:45

first of all, is a quick reference: .loc will give you a row by index,

140:51

.iloc will give you an element by position, a row by position, and just doing df of

140:59

something is gonna give you the element, the column, sorry, that you are passing. So it's

141:04

like both loc and iloc work in a horizontal manner, give me this row,

141:13

while df of whatever works in a vertical manner, which is getting you a given column.
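
The three rules above can be sketched as follows. This is a minimal example, not the notebook's own code: the table is a small G7-style data frame whose figures are assumptions made up for illustration.

```python
import pandas as pd

# Illustrative G7-style table; the figures are assumptions for this sketch.
df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
        "GDP": [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
        "HDI": [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
    },
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
)

canada = df.loc["Canada"]       # row selected by index label
last_row = df.iloc[-1]          # row selected by sequential position
population = df["Population"]   # whole column

# All three results are Series. A row comes back as a Series named after
# its index label and indexed by the column names.
print(type(canada).__name__, canada.name)
print(last_row.name)
print(population.name)
```

Note how the row results are "transposed": the column names of the data frame become the index of the returned series.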

141:21

But something more interesting here is that all the results, this one and this one and

141:26

this one, are all series. What is being returned are series. So that's what we saw

141:33

before. And the way it works is first, if we focus in this last example, we're going

141:40

to see that it's pretty standard: just this series right here is the one returned. Remember,

141:46

it has a type and everything. So that's fine. If we ask for a row, like in this

141:57

case, we can get, for example, Italy. There you go. The result is also a series. But

142:08

what you can see here is that this thing is kind of transposed, in a way: that here was the

142:14

value, and this here is Population, and GDP is here, and Surface Area is here, HDI

142:24

and Continent, and here you have the values. So again, it's being transposed,

142:30

right, from vertical to horizontal, in a regular series manner, and the index of this

142:37

series is extracted from the names

142:42

that the columns had. So in this case, the name right there is the value of the index

142:50

that it had. So you can read more about it right here. But I just want you to remember

142:55

these rules: with .loc you select by index, with .iloc you select by sequential position,

143:01

with df of something you go by column. There are times when these might not apply, or you might

143:07

not want to apply them; there will be some issues. So for example, if your index

143:13

is numeric, you might have issues with this form or that form. Just respecting these three rules,

143:20

for now, is gonna get you any element you want to get, either by row or column. So from

143:27

what we've seen, slicing also works as expected. So we can get, for example, Italy,

143:33

or we can get France up to Italy. So the upper limit is included. But again, it's with

143:39

loc, and we select by indices from France to Italy. We can also do the second dimension

143:45

similar to the way we worked with NumPy, we can do second dimension here. And we can get

143:51

all the countries from France to Italy, including Italy, but only the Population

143:57

column or population and GDP. So here you can see the second dimension being applied

144:03

here, the concept of multiple dimensions in selection being applied also to data frames. For

144:09

iloc, it works in the same way, with multiple indices and with slicing. So we get

144:15

for example, from one to three right in sequential positions. In this case, the upper limit is

144:21

not included. So that's another difference from what we have. And we can also do multi

144:28

dimensions we can say give me the countries from one to three and the column should be

144:35

0, 1, 2, 3: the fourth column, the column under index three, which

144:41

is HDI, so that also works as expected. And again, my recommendation: always use loc and iloc

144:49

to select rows and just use the naked data frame to select columns as we saw before.
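
As a rough sketch of the slicing behavior described above (label slices with loc include the upper limit, positional slices with iloc do not), assuming the same kind of illustrative G7-style table with made-up figures:

```python
import pandas as pd

# Illustrative G7-style table; the figures are assumptions for this sketch.
df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
        "GDP": [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
        "Surface Area": [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
        "HDI": [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
        "Continent": ["America", "Europe", "Europe", "Europe", "Asia",
                      "Europe", "America"],
    },
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
)

# Label-based slice: the upper limit (Italy) IS included.
france_to_italy = df.loc["France":"Italy"]

# Second dimension: same rows, but only some columns.
pop_and_gdp = df.loc["France":"Italy", ["Population", "GDP"]]

# Position-based slice: the upper limit (position 3) is NOT included.
rows_1_to_2 = df.iloc[1:3]

# Rows 1 and 2, column at position 3 (HDI).
hdi_slice = df.iloc[1:3, 3]

print(france_to_italy.index.tolist())
print(rows_1_to_2.index.tolist())
print(hdi_slice)
```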

144:56

Now moving forward: conditional selection, Boolean arrays, Boolean series, whatever you want

145:01

to call it. This also works for data frames. And it's very important, it's a way to filter

145:06

data, it's a way for us to query the data. So in this case, what

145:11

we have is, we want to select all those countries in which the population is greater than 70. Okay,

145:19

so all the countries that have more than 70 million inhabitants, similar to what

145:24

we did with a series, but in this case, we want to do it with a data frame. So what you're

145:29

going to see here is that we're going to construct a Boolean series as we did in our previous

145:34

video, right? So every country with more than 70, false false, true false. And we're going

145:41

to pass that result, that Boolean series, into a .loc selection: give me all the countries

145:50

which have a True value in it. And remember, this is just a kind of

145:58

mnemonic, a way to remember: the way pandas knows how to filter things is by matching

146:06

this index, right, from the resulting series, with this index of the data frame.
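
A minimal sketch of that filtering step, assuming an illustrative table with made-up population figures in millions. The Boolean series and the data frame are separate objects that share the same index, which is what lets .loc line them up:

```python
import pandas as pd

# Illustrative figures (population in millions); assumptions for this sketch.
df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
        "GDP": [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    },
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
)

# Step 1: a Boolean Series, one True/False per country.
mask = df["Population"] > 70

# Step 2: pass the Boolean Series to .loc to keep only the True rows.
big_countries = df.loc[mask]

# Optional second dimension: filtered rows, one column.
big_populations = df.loc[mask, "Population"]

print(mask)
print(big_countries)
```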

146:13

These are two different objects, they are completely different objects, but their index

146:19

matches. So here, Japan matches, Germany matches; here, Germany and Japan, they are the same,

146:26

and that's why that thing is working as expected. This is just the first dimension,

146:33

which is give me these rows. You can also add the second dimension, saying give me this

146:38

column, or these columns, right. So that still works as desired. So what

146:46

about dropping stuff? Whenever you have a data frame, you can say give me just

146:51

these pieces, or you can say drop the others, right, it's just pretty much the same. Dropping

146:56

is very simple, you can drop by index, drop this value drop Canada altogether, period,

147:02

or drop these indices, Canada and Japan, or you can also drop columns: drop Population and

147:08

HDI as columns. There is also a more advanced usage, which is with axis, similar

147:16

to NumPy. I don't recommend them so much, but you can still use them and see them here.
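
The dropping variants mentioned above could look like this. It is a sketch over an illustrative table with made-up figures; the axis form is shown last because it is the one described as more advanced:

```python
import pandas as pd

# Illustrative figures; assumptions for this sketch.
df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940, 127.061],
        "GDP": [1785387, 2833687, 3874437, 4602367],
        "HDI": [0.913, 0.888, 0.916, 0.891],
    },
    index=["Canada", "France", "Germany", "Japan"],
)

no_canada = df.drop("Canada")                      # drop one row by index label
no_canada_japan = df.drop(["Canada", "Japan"])     # drop several rows
no_pop_hdi = df.drop(columns=["Population", "HDI"])  # drop columns by name

# Equivalent, NumPy-style form using an explicit axis (1 means columns).
no_pop_hdi_axis = df.drop(["Population", "HDI"], axis=1)

print(no_canada_japan.index.tolist())
print(no_pop_hdi.columns.tolist())
```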

147:24

So all the operations we've seen, so far, they're all working. The most important part

147:29

here is the broadcasting operation that we're going to do between series. So we're going

147:34

to create a new series crisis. And I'm gonna

147:39

show you what it looks like. So we have here crisis. And we're going to perform a broadcasting

147:46

operation between these, I'm going to show you what this thing looks like first,

147:53

between the two, this data frame and the crisis series. And the result will be that we will

148:00

subtract, I don't know what this number is, 1 million, subtract 1 million from each value

148:09

in here. And we're gonna subtract 0.3 of HDI from each one of those. So what you can see

148:16

here is again, this alignment between columns and indices, the GDP here is matched with

148:24

this GDP and the HDI is matched with this HDI. So there are two different objects, two independent

148:30

objects, this series and this data frame here. But when we combine them with an operation

148:39

like this, the columns in this case are aligned, GDP and HDI, and they work together.
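
A small sketch of that alignment, assuming illustrative figures and a crisis series built to match the description (minus one million for GDP, minus 0.3 for HDI). The series index is matched against the data frame's column names, then the operation is broadcast down each column:

```python
import pandas as pd

# Illustrative figures; assumptions for this sketch.
df = pd.DataFrame(
    {
        "GDP": [1785387, 2833687, 3874437],
        "HDI": [0.913, 0.888, 0.916],
    },
    index=["Canada", "France", "Germany"],
)

# The Series index (GDP, HDI) lines up with the data frame's columns.
crisis = pd.Series([-1_000_000, -0.3], index=["GDP", "HDI"])

# Each value of `crisis` is applied to every row of its matching column.
after_crisis = df[["GDP", "HDI"]] + crisis

print(after_crisis)
```

The original df is left untouched; the addition produces a new data frame.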

148:48

So you're gonna subtract this value in all this column, let me remove this, you

148:54

subtract this value in all this column for all these values, and I'm going to subtract this

149:01

value here in this column for all these values. That's the way it's going to work. So moving

149:10

forward, what about modifying data frames? Now I wanna I want to show you something.

149:16

And that's when we were dropping stuff before. We were not actually modifying the data frame.

149:22

So here we did df dot drop Canada, but df still has Canada in it. And that's because

149:31

similar to what happened with NumPy these operations are all immutable. They are not

149:38

changing the underlying data frame. We are creating new

149:46

data frames that store the result of the given operation. So in this case, you have to drop

149:52

Canada. The result is this new data frame, but the underlying data frame is not

150:00

changed. That's because, again, they are immutable operations. 99.9% of operations in pandas

150:08

are immutable, there are ways to change it, there are ways to make the changes permanent.

150:14

But for now, I want you just to think that everything is immutable. Whenever you want

150:19

to perform an operation, it's going to create a new series. If you want to keep track of

150:24

this, you will just need to do something like df2 equals that, or even df equals, you

150:30

know, just to modify the current data frame. Again, there will be a way to not do that.
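
A short sketch of that behavior, using an illustrative two-column table with made-up figures. The name df2 follows the wording above; everything else is an assumption for the example:

```python
import pandas as pd

# Illustrative figures; assumptions for this sketch.
df = pd.DataFrame(
    {"Population": [35.467, 63.951, 80.940], "HDI": [0.913, 0.888, 0.916]},
    index=["Canada", "France", "Germany"],
)

# drop returns a NEW data frame; df itself is not changed.
df2 = df.drop("Canada")
print("Canada" in df.index)    # the original still has Canada
print("Canada" in df2.index)   # the new one does not

# To keep the change under the same name, reassign the result.
df = df.drop("Canada")
print(df.index.tolist())
```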

150:34

But we're going to see that in a sec. So modifying series more explicitly, or rather, modifying

150:42

data frame more explicitly, how can you create a new column? Well, very simple. Assign a

150:47

column. Let's say, in this column right here, it's similar to saying, here,

150:55

language. Oh, it's just read only. But if I say language equals, and I can just write

151:01

whatever I want. In this case, what we've done is that the language, let me show you

151:08

what langs had. In this case, it was a tiny series; it didn't have elements for all the indices

151:16

in the data frame, but that doesn't matter. pandas will match all the indices that actually

151:22

exist, and it will leave the rest as NaN. This NaN is what we use for a blank. It's "not a

151:28

number" from NumPy. We're going to talk more about it when we start doing cleaning data.
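
To illustrate, here is a hedged sketch of adding a column from a series that covers only some of the rows. The langs series and its values follow the description above; the population figures are made up:

```python
import pandas as pd

# Illustrative figures; assumptions for this sketch.
df = pd.DataFrame(
    {"Population": [35.467, 63.951, 80.940, 60.665]},
    index=["Canada", "France", "Germany", "Italy"],
)

# A small Series that only covers three of the four countries.
langs = pd.Series(
    ["French", "German", "Italian"],
    index=["France", "Germany", "Italy"],
    name="Language",
)

# Assignment aligns on the index; rows with no match are left as NaN.
df["Language"] = langs

print(df)
```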

151:32

Data cleaning, sorry. So again, langs: France, Germany, Italy, you can see the values are

151:38

all up there. What happens if you want to change a value? The Language column already

151:45

exists, and you want to change it. So in this

151:50

case, we're going to say df Language equals English. So we're going to change it altogether,

151:56

df now will be affected, and all the values of Language will be English.

152:05

How can you realize when there is an operation that is changing the underlying data frame,

152:09

the underlying series, or the underlying NumPy array? It's usually when you have an equals

152:17

symbol. Remember, with NumPy, we saw something like plus-equals. In this case, whenever you have

152:22

an equals symbol, you're modifying the underlying data frame.
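
As a minimal sketch of that rule, assuming a tiny illustrative table: assigning to a column with an equals sign changes df itself, and a single value is broadcast to every row.

```python
import pandas as pd

# Illustrative table; the values are assumptions for this sketch.
df = pd.DataFrame(
    {"Language": ["French", "German", "Italian"]},
    index=["France", "Germany", "Italy"],
)

# Assignment with "=" modifies df in place; the scalar fills every row.
df["Language"] = "English"

print(df)
```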

152:29

So for example, check this out, the Rename function or method of a data frame will let

152:35

you pass columns and indices to rename. So in this case, we want to change the United

152:41

States to USA, the United Kingdom to UK, and Argentina to AR. Argentina doesn't exist

152:48

in this data frame. But that doesn't cause a problem. And that's why we want to show

152:52

you, the US, UK were modified correctly, and HDI was modified correctly. And a PC which

153:02

doesn't exist, didn't cause any problems. Now, why am I showing you this because remember,

153:10

these operations are immutable. If I check what's the state of the data frame, we see

153:15

that the original data frame has not been changed. HDI is still HDI; it doesn't matter

153:22

if we renamed it before, it's still the same data frame. The same thing goes for the indices;

153:29

all these operations are immutable. A few more examples of modifying data just for you

153:39

to look at. And something that is very common for us is creating columns that are combinations

153:49

of other columns. So again, this is read only, but you can imagine that what I could

153:53

do here is something like, for example, GDP per capita, right? If I go here, and I type

154:02

GDP per capita, and here I say it equals the GDP, this

154:15

column divided by this column, right? So I do something like B3, actually,

154:26

C3 divided by B3, right. And then we would extend the values all the

154:36

way down here. In pandas, we can do something very similar. We can take just any columns, and we

154:44

can just perform operations, broadcasting operations between them, in this case, GDP

154:49

by population. And we can assign that series which is a result right there. So it's a series

154:54

we are going to assign that series to a new column. So GDP per capita Now, there you go

155:01

is now a column of our data frame. Again, all these broadcasting operations are extremely

155:09

fast, they are backed by NumPy arrays, and they result in a series. So very quick

155:16

statistical information, a few methods, right to do summary statistics. We saw them with

155:21

the describe method. But minimum, maximum, mean, median, all that works as expected. Something

155:31

that I want you to note here, if possible, is that with pandas, we have, I'm going to

155:39

change colors here, we're going to use red. With pandas, you have this concept of a data

155:46

frame, right data frame that has multiple columns, multiple rows. And these operations

155:54

are resulting in just one series. So in pandas, you have your

156:01

data frame, and you have your series. And we could say we have individual numbers. And

156:10

it's like always, the data frame is always resorting back to this, it's like some operations

156:16

will just return a series. And the series can be used in a data frame, right. So in

156:20

this case, these resulted in a series, but then we merely use the series to set the value

156:28

of a column. Right. So that's why understanding series is so important. So there are a few

156:38

more assignment exercises for you here. So you can check them out and complete them if you want;

156:42

it's going to make a little bit more sense once you're working with it.
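
To make the GDP per capita step and the summary statistics from this section concrete, here is a short sketch over an illustrative table. The figures and the column name "GDP Per Capita" are assumptions for the example:

```python
import pandas as pd

# Illustrative figures (population in millions); assumptions for this sketch.
df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
        "GDP": [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    },
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
)

# Dividing one column by another is a broadcast operation that yields a Series.
gdp_per_capita = df["GDP"] / df["Population"]

# That Series can then be stored as a new column.
df["GDP Per Capita"] = gdp_per_capita

# Summary statistics on a column return single numbers.
population = df["Population"]
print(population.min(), population.max())
print(population.mean(), population.median())

# describe() gathers the usual statistics for every numeric column at once.
print(df.describe())
```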

156:47

Finally, I want to give you a very quick introduction to reading external data and plotting.

156:54

And to do that, we're going to use a few methods that are very popular in there, maybe we can

157:00

look them up very quickly here, we can say read_csv, use the read_csv function from pandas.

157:10

So this function, read_csv. And as we have read_csv, we actually have a few others: read_sql,

157:17

read_excel, read_xml, read_json, there are multiple ones. read_html will

157:24

be able to automatically parse an HTML page and read it. So a few functions like these

157:31

like what we're going to use, this read_csv; right here is the structure of it. A

157:38

few of these functions will let us import data from an external source into our pandas

157:45

workflow. So in this case, what we're going to read is this BTC market price values file,

157:52

so it's right here, if I open the CSV, this is what it looks like. It's the date of the

157:59

price reading and the value: the timestamp and the value, the timestamp

158:06

of the value and, on this side, the price of Bitcoin in 2017. Now it's close to $9,000, I think. But

158:11

that's just a side note. Again, this is a CSV, and this is the CSV that we're going to be reading.

158:19

To do that, again, we're going to use this function, read_csv. The function will automatically

158:26

parse the CSV, as expected. And there you go. And the process now will be for us to

158:35

start tuning it to get to the right point. So I'm going to show you a few

158:42

customizations we can do with the read_csv function. So the first one, and sorry,

158:47

let me tell you first, we have a ton of attributes here. So we have a ton of customization to

158:53

do with read_csv. You will not remember all this, you will not remember everything off

158:58

the top of your head. So don't worry, you can always go back again to the documentation

159:02

and just practice, it's going to come naturally. So the first thing, the first row of the CSV

159:11

was considered to be the column names. So in this case, this file doesn't have column

159:16

names. Let's say I add them: I'm going to type Timestamp, Price, I'm going to

159:20

save it, and I'm going to re-read the file. There you go. So by default,

159:31

pandas is assuming that the first line of the CSV holds the column names. I'm going to go

159:37

back into what it was. Right, and I'm gonna show you again, that's the assumption that

159:42

pandas is making. We're gonna, of course, change that assumption, because in

159:46

this case, our CSV file does not have column names. So we're going to just say header

159:52

equals none. And this is when we start seeing the attributes that we're going to use from

159:57

the read_csv function. When I do header equals None, the header is going to be None.

160:04

That means don't read a header; don't try to infer a header from

160:11

the CSV file. And the columns are zero and one. So now I'm going to change the columns.

160:17

And I set them, actually, to be Timestamp and Price. And now what I'm going to do is show you the

160:24

first rows. So you're seeing here that I have this df.head method that I'm using. That's

160:32

because this is a significantly large file. Well, I'd say not that long, but

160:38

at least it doesn't fit on my screen. What's the shape of the CSV, or the data frame?

160:45

It has 365 rows, and we have two columns. So we can do df.info, for example, to

160:52

have a little bit more reference: we have 365 values, there are no null values, and

160:59

Price is actually a float, but Timestamp is an object, and we're going to fix that in a second.
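
The steps so far can be sketched like this. The real file is not reproduced here, so the example reads a few made-up rows from an in-memory string instead; the row values are assumptions, and only the calls (read_csv with header=None, columns, head, shape, info) come from the walkthrough.

```python
import io

import pandas as pd

# A few made-up rows standing in for the real headerless CSV file.
csv_text = (
    "2017-09-27 00:00:00,4174.73\n"
    "2017-09-28 00:00:00,4163.07\n"
    "2017-09-29 00:00:00,4193.57\n"
    "2017-09-30 00:00:00,4335.37\n"
    "2017-10-01 00:00:00,4360.72\n"
)

# header=None: do not treat the first line as column names.
df = pd.read_csv(io.StringIO(csv_text), header=None)
print(df.columns.tolist())   # default names are the integers 0 and 1

# Give the columns meaningful names.
df.columns = ["Timestamp", "Price"]

print(df.head())    # first rows (five by default)
print(df.shape)     # (number of rows, number of columns)
df.info()           # Price is a float; Timestamp is still plain text
```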

161:06

I'm sorry, df.head and df.tail are the methods we use to get either

161:13

the first n rows or the last n rows, which are five rows by

161:20

default, you can change that and say, Show me the last three rows, for example, that's

161:24

something you can do. And again, the types: in this case,

161:30

the Timestamp column was not properly parsed as a date; it was parsed as an object, as a

161:37

string, which we don't want. So we're going to use the function pd.to_datetime, something

161:43

we're gonna explore in more detail in the data cleaning

161:49

part of this tutorial. We're gonna use the to_datetime function to turn this

161:56

column, df Timestamp, into an actual date. And now we're going to say df Timestamp

162:04

equals the result of this function, and now everything looks as expected. There is one

162:11

more change that we want to do, we want to set the index of the data frame to be the

162:21

timestamp, because by doing so, we can quickly access price information. Let me see what was

162:27

the price of Bitcoin on 2017-09-29. And I made a mistake here, I forgot to use

162:39

loc. There you go. So we have the value of Bitcoin on this particular date. I forgot

162:48

loc; remember that to get the value from a particular row, you have to use .loc. There we go.
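
A hedged sketch of those three steps (convert the column, make it the index, look up a date with .loc), again using a handful of made-up rows rather than the real file:

```python
import io

import pandas as pd

# A few made-up rows standing in for the real headerless CSV file.
csv_text = (
    "2017-09-27 00:00:00,4174.73\n"
    "2017-09-28 00:00:00,4163.07\n"
    "2017-09-29 00:00:00,4193.57\n"
    "2017-09-30 00:00:00,4335.37\n"
)

df = pd.read_csv(io.StringIO(csv_text), header=None)
df.columns = ["Timestamp", "Price"]

# Turn the text column into real datetimes.
df["Timestamp"] = pd.to_datetime(df["Timestamp"])

# Use the timestamps as the index (set_index returns a new data frame).
df = df.set_index("Timestamp")

# Rows are selected with .loc, here by date label.
price = df.loc["2017-09-29", "Price"]
print(price)
```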

162:54

So we are getting that particular value. Because we've made the timestamp the index,

163:01

we get that value directly from the index. So what happens if you want to turn this thing

163:09

into an automated script, for example, one that runs this process every day at 5am, whatever?

163:14

We want to read the CSV, set the columns, rename them, turn them into timestamps,

163:20

etc. This is what we've done so far. Read the CSV without a header, create the columns,

163:26

turn the timestamp into a datetime and assign it to the index. And that's the

163:32

result again. Well, actually, the read_csv function is so powerful

163:41

that it will let us do all these actions in just one call of read_csv. There

163:51

are parameters that will let you customize the behavior to achieve the same results that

163:57

we did with four lines of code right here. So in this case, we're gonna say, read this

164:02

CSV, don't assign a header, that's something we did already, or rather, don't infer a header

164:09

from the first line. These are the column names. So we don't need an extra line, we

164:14

can just say these are the columns names. Oh, and by the way, the first column is going

164:19

to be the index of the data frame. Oh, and also parse the dates: the index

164:25

is a date, so parse the dates. And we have the same result as before. So now I'm going

164:32

to try the same thing. There we go. So you can see it works. So, very quickly, pandas

164:42

plotting. All right, so what we're going to be doing here is, I want to show you very quickly,

164:49

I don't know what's this thing is as a vertical scrolling. I want to show you very quickly

164:55

that creating plots with pandas is just a breeze. It's so simple to create a plot.

165:02

So in this case, what we're going to be doing is, given a data frame, you can always invoke

165:08

the plot method. And the plot method, what it's doing, it's using the matplotlib library,

165:14

something that you can check if you want in the docs. But for now, it's not necessary

165:19

with this, we're going to have more than enough. What it's doing is just using, again, the

165:24

regular plot library, as you can see, the matplotlib library, which is part of the standard PyData

165:32

stack. And again, for us to access it using pandas is extremely simple: just df.plot,

165:38

you're done, you can set the plot as you want, we're gonna see more details of matplotlib.
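
The single-call version described above can be sketched as follows. The parameters header, names, index_col, and parse_dates are real read_csv parameters; the rows are made up and read from an in-memory string so the example runs on its own.

```python
import io

import pandas as pd

# A few made-up rows standing in for the real headerless CSV file.
csv_text = (
    "2017-09-27 00:00:00,4174.73\n"
    "2017-09-28 00:00:00,4163.07\n"
    "2017-09-29 00:00:00,4193.57\n"
    "2017-09-30 00:00:00,4335.37\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,                   # the file has no header line
    names=["Timestamp", "Price"],  # column names to use instead
    index_col=0,                   # first column becomes the index
    parse_dates=True,              # parse the index as dates
)

print(df.head())
print(df.loc["2017-09-29", "Price"])
```

With matplotlib installed, calling df.plot() on this data frame draws the price series as a line chart.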

165:43

So don't worry too much about that for now. So there is a more challenging example here

165:49

that I can just run very quickly, you can inspect the process we follow to fix the data.

165:57

But this is what we have, there we go.

166:01

And what you can see right here is the difference between the Bitcoin and ether in this period

166:09

of time right here, and they are both loaded in the same chart. And that's because this

166:14

is the resulting data frame, we have Bitcoin on one side, and we have ether on the other

166:19

side, and we are plotting it right here, we're creating one plot with all of it. And we are

166:27

noticing these empty value right here. So what we can do is we can go from December

166:35

1 up to January the 1st, this period. So we can select that period using .loc.

166:43

And we can just go ahead and plot it again. And this is what you see right here, the gap

166:50

that we're seeing. So again, this was the introduction to pandas. We have a real-life

166:55

example of pandas following up. Also, we have a little bit more on data cleaning and

166:59

reading other interesting files and sources of data, for getting more data into the

167:07

pipeline, right. So the idea is going to be showing you how you can import data from Excel

167:11

from SQL and then do the actual processing and analysis.

167:22

Now it's time to talk about data cleaning. We have arrived at that point in our tutorial

167:27

in which we have pulled the data, I've shown you how to manipulate it with pandas, the

167:33

beginning at least the introduction to data manipulation with pandas, and now it's time

167:39

to properly fix it. For the sake of brevity, we are skipping a few parts of the process

167:45

of data cleaning, especially you're going to find it in this first notebook that we

167:49

talks about the basics of missing data, conceptually, with Python and with NumPy. And we're going to

167:55

miss a few other things. But I'm just going to mention them in a pretty generic, pretty general

168:01

form. And then you can of course dig deeper, you can check our courses if you want to know

168:04

more about it. Usually when we talk about data cleaning from a more conceptual

168:10

level, we're going to talk about a four step process. The first step is usually finding

168:16

missing data, which is the simplest problem to identify from a data set when something

168:23

is missing. So you have car sales data. And there is a car that has no name right? Or

168:30

there is a car that has no price, right? So there is a number missing, or there is a category

168:35

missing, or there's a string missing. And of course, each one of those is going to have

168:39

a different meaning. How to fix a data set that is missing data can be very

168:46

simple, if you can just, for example, drop the record, or if you can fill the value, right.

168:51

So for example, if the price is missing, you can fill it with the average value of

168:57

the sales data or something like that. Or it can be very complicated if the value is

169:03

important, if you can't move forward until you actually find that missing value. And

169:08

it can involve something like picking up the phone calling your ETL team asking what's

169:13

going on that the data is missing. Or even if you're buying the data, you have to call

169:18

the vendor and ask them why, if you're paying for that, there is data missing,

169:24

etc. So it can be a very political process. It depends what's your use case. But again,

169:29

from a technical perspective, identifying missing data and fixing it is going to be

169:34

extremely simple. Once you have fixed the missing values, then you start looking for

169:41

invalid data, assuming the data is not clean yet in this process of data cleaning. The

169:45

second step is when there are invalid values. So you have, for example, a

169:52

column that is price and there is a string within it right here. We're expecting only

169:57

numbers and there are strings in it. So it's not going to be complicated to identify,

170:02

it's not going to be too complicated to fix it. But again, we're increasing the complexity

170:08

until, at the end of this data cleaning process, we're gonna reach problems that have to do

170:13

with the domain of the data you're looking at, right. So for example, you have a column that

170:17

is customer age, and there is a value that is 170. Right? So that is not an invalid value,

170:25

it's a perfectly valid integer. The problem is that given the domain, right, but speaking

170:31

about customer age, is highly unlikely that a customer is 170 years old, right? So in

170:38

that case, the value is completely valid, there is no missing data, there are no invalid

170:42

values, etc, is just about the domain. And this is when things get very complicated,

170:48

because in this case, that example of age is something that resonates with all of us,

170:53

we know about age of humans. But if you're working in a domain, if you're working as

170:57

a data analyst, in a domain that you don't know much about, right, then you might not

171:03

be able to judge if a value is invalid or not. If I am working in a biology lab, and

171:09

I have something like white cell count per milliliter of blood, I don't know what

171:16

is a good value, or what's an invalid value, right. So it's something where you need to

171:21

know the domain. So that's usually the the most complicated part of data cleaning, when

171:25

you reach the limit of everything is valid, everything checks out. And now I need to make

171:32

sure that these value is valid for these domain that we're working. So again, this is the

171:37

spectrum that we're going to be revisiting today. So to get things started, the way pandas

171:42

works with null values is that it has four functions, some of which are actually synonyms;

171:50

it's going to be it's going to be relatively simple, just trust me on that. There are a

171:53

few things first, everything that pandas does in the process of missing values, is related

171:59

to the way NumPy works. So again, we're skipping it, you can go to that notebook, check it

172:03

out by yourself. But it's extremely simple. NumPy has this object, NaN, not a number,

172:10

to identify a missing value or null value. In the Python world we have the None value. But again,

172:17

in pandas and NumPy, we're going to use NaN. In this case, at the beginning,

172:21

we have these two functions, isnull and isna, which are complete synonyms. We're going to

172:27

find also notnull and notna, and they're also complete synonyms. So null

172:34

and na for pandas are the same. You can use the one you prefer. Sadly, I like isna because

172:40

it's the way I learned it. I think for my students I usually recommend isnull, because

172:46

it feels more correct. And it feels more self explanatory. So you can use the one you prefer,

172:51

if you can use isnull, I think that's going to be better. If you get used to isna,

172:55

then you're going to be on my side; just do whatever you prefer. So again, isnull is

172:59

gonna say True or False, depending on whether the value is null or not, right? And of course

173:04

notnull, or notna, is going to be the opposite. So notna of not a number

173:11

is False, and notna of three is True. If you go to this first notebook, you're going

173:19

to see all the falsy values and the truthy values in detail. In terms of Python, anything

173:24

that is not empty or None, etc., is going to be considered to be truthy. So anything you pass

173:30

here that, again, is not an empty string or a null is going to be considered a truthy value.

173:37

So isnull and notnull, or isna and notna, also work with entire series or

173:43

entire data frames, right? So it's not just for one of Valley you can pass an entire series.

173:48

And the result back is going to be which values in this

173:55

series are either null or not null, depending on the question you're asking, either is

174:00

null or not null. So in this case, we ask which ones of the series are null: this is

174:06

not null, this is not null, this is null, so only this one is True. And the opposite for the following

174:13

method we are applying, or actually function. And again, the same thing works with an

174:22

entire data frame. So something we do usually, if you look at notnull and isnull,

174:34

a few tricks that we usually apply are the count, or actually the sum, of all the

174:42

null values or not-null values. So we have this entire series, and we can ask how many not-null

174:47

values we have. And if we sum those not-null values, in this case, we're going to get a

174:53

result out which is the total count of the not-null values we have asked for,

175:02

and the same thing is gonna happen if we say isnull. So if I do here isnull and sum, we're

175:09

gonna get how many nulls we have. And it's pretty much the opposite of the previous

175:15

question. And the way it works is that in Python, Booleans are pretty much integers; they're ones and

175:22

zeros. So every True value is going to count as one and every False is going to count as

175:28

zero. So if you ask for the sum of a Boolean series, you're going to get out the result

175:35

of the number of True values that are in that series, right. So, in this case, we

175:40

have two null values; we ask how many null values we have, isnull and then sum, and we get two

175:47

out. You can use these tricks to filter the data with a series. So in this case, we can

175:53

say give me all the values that are not null. Right? Just notnull. Also, something interesting

176:00

is that, both for data frames and for series, the notnull, isnull, isna, and notna

176:06

functions also work as methods. So in this case, we can say, instead of pd

176:13

.isnull, we can say s.isnull or s.notnull. So now, it gets a little bit more,

176:20

a little bit simpler. But if the final objective of this code, selecting

176:29

only the values that are not null, was to drop the null values, then there is a simpler form,

176:36

which is dropna. Okay, so in this case, we can say s.dropna(), and we're basically

176:41

invoking the same thing that is happening here: we're just excluding, sorry,

176:47

all the missing values in the series or the data frame, because this also works for data

176:52

frames. So one important thing to remember here is that all these methods

176:59

are immutable, we are not actually changing or modifying the original series, the underlying

177:06

series is not being modified, there is a new series that is returned. So if I invoke s,

177:13

again, this thing is not modifying the series; you're creating a new series, and

177:19

that's the one that doesn't have the missing values. Everything we've said

177:24

also works for DataFrames. So right here, with this DataFrame, we can ask how many,

177:31

right? The first thing usually is to start with an info method, right? So we have info,

177:37

and we see that there are in total, four entries, four rows, we can also do a shape, if we need

177:43

more information about the structure of our data frame. So there are four rows, four entries

177:49

in our index. Column A has only two non-null values. So that means there are two values

177:55

that are actually non-null, sorry, null. Then there is column B, which has three non-null

178:02

values. So that means that one value must be null, and that's for column B. Again,

178:09

so usually info gets you very close to understanding the structure of your DataFrame and how many

178:15

values are missing. The same thing happens with sum: we can just do df.isnull(), or df.isna(),

178:22

and then sum(), and we're gonna get a quick reference of how many null values we have in that given

178:28

DataFrame. dropna works in the same way, but there is a significant difference. The

178:34

way dropna works in a DataFrame by default is by dropping any row that has at least one

178:42

null value. So this row has a null value: dropped. This row has a null value: dropped. This row

178:49

has two null values: dropped. This is the only one that is not being dropped, right? So

178:54

it's very harsh in that respect. You can change that to make it work on columns instead, to only keep

179:02

the columns that have no null values, and that's by switching to axis=1. And there

179:09

is also a way to select a subset or a threshold, so you only delete rows that have fewer than three

179:19

valid values. For example, in that case, you're going to use something like the strategy of

179:24

dropna. You're gonna say: drop the rows, or the columns, because this

179:29

also works for columns, that have all the values null; or, and this is the default

179:36

behavior, drop all the rows that have any NA value; or specify a threshold, by which

179:45

you're basically saying, I need this number of valid values in order to keep the

179:50

row. That's the way it works: not which ones to drop, but which ones to keep, based on the

179:57

threshold. So once you have identified the null values, it's extremely simple to clean them,

180:03

sorry, fix them. So the first method we're going to see is fillna with a particular

180:09

value. We're going to say: from this series, I want you to fill the blanks, or fill the

180:13

missing values, or fill the NAs; fill them with the number zero in this case. So these

180:20

two are now zero. Or, of course, you can use any statistical method you want. In this

180:25

case, we can use the mean. Remember, this is not altering the series; the original series

180:32

is still the same, we're not changing it, it's creating a new series, because all these

180:36

methods are immutable.
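
A minimal sketch of the fillna calls just described. The series below is a made-up example, not the notebook's exact data; it only illustrates filling with a constant, filling with the mean, and the fact that the original series is left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing values (illustrative data only).
s = pd.Series([1.0, 2.0, 3.0, np.nan, np.nan, 4.0])

filled_zero = s.fillna(0)          # every NaN becomes 0
filled_mean = s.fillna(s.mean())   # every NaN becomes the mean of the valid values

# fillna returns a new Series; the original still has its two NaNs.
missing_after = int(s.isnull().sum())

print(filled_zero.tolist())
print(filled_mean.tolist())
print(missing_after)
```

Because the result is a new object, it has to be assigned to a name (or back to `s`) if the filled version is the one to keep.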

180:40

The following way this method works is by passing a method argument,

180:45

which is forward fill (ffill) or backward fill (bfill); these are the possibilities. And basically the way

180:50

it works is it's carrying all the values top down, at least in forward fill, right? Starting

180:56

here, it's dropping this value here, dropping this value here, and now dropping three here.

181:02

As this thing is a NaN, it gets replaced. So this thing is three now, which gets thrown

181:06

up here. And now this thing is three again. So that's what we have right there. And of

181:11

course, backward fill works the other way: it starts with four and moves

181:15

it here and then moves here, etc. You have to be careful when using these. Because if

181:20

you have null values at the beginning or the end, then you're gonna end up, again, with

181:26

null values, because there is nothing to fill forward, right? This is the first value you

181:31

have in there. And all we've seen also works for DataFrames. So both backward fill and forward

181:36

fill work in terms of rows or columns, right? So we have these data

181:44

sets. So if we do a row-based forward fill, it's going to be one, two, here two, and then five. So that's

181:51

going to be forward fill with axis=1. If you use forward fill with axis=0, then it's a vertical filling,

181:58

right? It's going to go here: 1, 1, 30, 30. So that's for the column, that is why here it's

182:06

1, 1, 30, 30. So it's either forward filling in, sorry, this direction, or

182:17

it's going to be in this direction, depending on the axis that you are passing. And actually,

182:23

let me put the correct forms: with axis=0, it's going to be column-based,

182:28

it's going to be this direction; with axis=1, it's going to be row-based. So

182:34

it's this direction, right? So we had a null value here that got filled in this way. Okay,

182:42

moving forward, what else do we have? We have here checking for null values. And we've pretty

182:52

much seen this already: you can use isnull with the sum method to get how many null values

182:59

you have. And there are also any and all, which give you a very quick answer. These are usually

183:05

called Boolean tests: you can ask if any values are valid, or if all the values

183:11

are valid, just to build more complicated queries. So so far, so good. So the process

183:19

we said was at the beginning, we were fixing missing data, missing values, there is nothing

183:24

in there. We have read a data frame, where's our data frame right here? We have read our

183:29

DataFrame from a CSV, from a database, and a value is missing, null; there is a hole in it.

183:35

So we have quickly identified it with isna or isnull, we were able to drop the ones

183:40

we didn't want to keep with dropna, or we were able to fill the values we wanted to fill

183:45

with fillna. That was simple: isna, dropna, fillna. What happens when you're cleaning

183:51

data that actually has a value, so there is no nothing missing.
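
The missing-data workflow recapped above (identify, drop, fill) could be sketched as follows. The frame is a hypothetical four-row example, not the notebook's data. The forward fill is written with `ffill`, which recent pandas releases offer as the direct equivalent of the `fillna(method="ffill")` form described in the video.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x3 frame with a few holes (illustrative values only).
df = pd.DataFrame({
    "A": [1.0, np.nan, 30.0, np.nan],
    "B": [2.0, 8.0, 31.0, np.nan],
    "C": [5.0, 9.0, 32.0, 100.0],
})

# Identify: Booleans sum as 1/0, so this counts nulls per column.
nulls_per_column = df.isnull().sum()

# Drop: default drops rows with any null; axis=1 drops columns instead;
# thresh keeps only the rows with at least that many valid values.
rows_without_nulls = df.dropna()
cols_without_nulls = df.dropna(axis=1)
rows_min_two_valid = df.dropna(thresh=2)

# Fill: forward fill top-down (axis=0) or left-to-right (axis=1).
filled_down = df.ffill()
filled_across = df.ffill(axis=1)

print(nulls_per_column)
print(filled_down)
print(filled_across)
```

In `filled_across`, the nulls that sit in the first column stay null, since there is nothing to their left to carry forward; this is the caveat mentioned earlier about nulls at the beginning.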

183:56

But those values are invalid. So for example, here, the sex column is a categorical column

184:03

that only accepts M and F. D and question mark, those are invalid. It's very simple

184:12

to see an invalid value here because it's completely out of the scope. The same thing

184:18

happens when we have, for example, a question mark in the age column, where we have a string

184:23

in the age column: it's very simple to identify that. How are we going to clean those? Let's

184:28

start with sex first, because it's simpler. In this case, the first check we can do is

184:34

with either unique or value_counts; I'm going to use value_counts. We've seen

184:40

this method before. It's a quick summary of all the unique values you have. And in this

184:46

case, value_counts also gives you a total count for those values. How can you fix them?

184:52

Well, there is a replace method which is extremely intuitive. You can just replace in this case,

184:58

we're changing all of the Ds to Fs and the Ns to Ms, and it can work on multiple columns.
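
A minimal sketch of the value_counts check and the replace fix described here. The frame and the mapping (D to F, N to M, 290 to 29) are illustrative assumptions about what the intended values were.

```python
import pandas as pd

# Hypothetical frame with invalid categories and one invalid age.
df = pd.DataFrame({
    "Sex": ["M", "F", "F", "D", "N"],
    "Age": [29, 30, 24, 290, 25],
})

# Quick summary of every distinct value and how often it appears.
counts = df["Sex"].value_counts()

# Replace on a single column, using a mapping of bad value -> good value.
fixed_sex = df["Sex"].replace({"D": "F", "N": "M"})

# Replace across several columns at once with a nested mapping.
fixed_df = df.replace({"Sex": {"D": "F", "N": "M"}, "Age": {290: 29}})

print(counts)
print(fixed_df)
```

As with the other methods in this section, replace returns a new object and leaves `df` unchanged.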

185:07

For those values that, again, we said were more complicated to fix, like, in this case,

185:13

the age, which in this case is 290; and we know, because we know the domain, that 290

185:21

is an invalid age for a human. So usually, in those cases, we're going to

185:28

need more complicated fixing, and it will involve more programming, that's the reality,

185:34

you have to be better at coding. In this case, we know that this value is invalid because

185:41

it probably has an extra zero. Say you're pulling a CSV with ages, and

185:46

there are values like 180, 290, 320, for example: invalid values above 100, right, in the hundreds

185:56

places. And that's because there were typos when they were creating the ages. So how are

186:01

you going to fix that? Well, in this case, it involves a little bit more programming,

186:05

we're dividing everything by 10.
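
A sketch of that fix, assuming (as the speaker does) that every age above 100 is a typo with one extra zero. The data is made up, and integer division is used here only so the column keeps its integer type.

```python
import pandas as pd

# Hypothetical ages; 290 and 320 are assumed to be 29 and 32 with an extra zero.
df = pd.DataFrame({"Age": [29, 30, 24, 290, 25, 320]})

# Boolean mask selecting the rows whose age is impossible for a human.
too_old = df["Age"] > 100

# Fix only the selected rows, dividing them by 10.
df.loc[too_old, "Age"] = df.loc[too_old, "Age"] // 10

print(df["Age"].tolist())
```

Whether dividing by 10 is the right repair is a domain decision; the same mask could just as well be used to drop or flag those rows.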

186:10

So also, something that may be useful is dealing with duplicates. And we need to first define

186:15

what's going to be a duplicate value. So this is, this is usually a little bit more political,

186:22

if you want, you have to define what's going to be a duplicate. In this case, we have a

186:26

series that contains ambassadors: each ambassador is the index, and the country of

186:32

the ambassador is going to be the value, right? This is usually the important part. The reading

186:37

here says that we're conducting a party, and we want to invite one ambassador per country;

186:43

we don't want to repeat ambassadors. So in this case, what's going to happen is

186:48

that these two, to our human eyes at least, we can clearly and quickly see that these

186:54

two belong to the same country. And these three belong to the same country. But here

186:59

again, we have to define which ones are the duplicate, if you want, and which ones are

187:04

not duplicate. So for example, maybe we can say the first one is duplicate, or we can

187:09

say the last one is duplicate. So this is the first one not duplicate, or actually can

187:14

say the last one is the one I'm going to invite, so it's not a duplicate. So we're going

187:18

to have political rules if you want for each one of those. So let's see the duplicated

187:23

method and the way it works by default. By default, the duplicated method is going to return

187:32

True for duplicates, or, let me invert it: it's not going to treat

187:39

as a duplicate the first instance that it sees. So the method is actually walking

187:45

top down, saying: Do I have France? No, I don't have France. I'm going to keep

187:51

it here, because it's the first time I see France. Do I have the UK? No, I don't have

187:56

the UK, it's just gonna keep it here. Then it sees the UK again realizes the UK is already

188:02

there, too. It's already present. So this one is going to be considered a duplicate.
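
The top-down walk being described can be sketched like this. The index labels are hypothetical placeholders rather than the ambassadors' names from the notebook; the `keep` values shown are the ones the duplicated method accepts.

```python
import pandas as pd

# Hypothetical ambassadors series: index = ambassador, value = country.
ambassadors = pd.Series(
    ["France", "United Kingdom", "United Kingdom", "Italy",
     "Germany", "Germany", "Germany"],
    index=["amb_fr", "amb_uk_1", "amb_uk_2", "amb_it",
           "amb_de_1", "amb_de_2", "amb_de_3"],
)

# Default keep="first": the first occurrence is NOT flagged, later ones are.
dup_first = ambassadors.duplicated()

# keep="last": the walk is effectively bottom-up, so the last occurrence survives.
dup_last = ambassadors.duplicated(keep="last")

# keep=False: every member of a repeated group is flagged.
dup_all = ambassadors.duplicated(keep=False)

# drop_duplicates applies the same rules and removes the flagged entries.
one_per_country = ambassadors.drop_duplicates()

print(dup_first.tolist())
print(dup_last.tolist())
print(dup_all.tolist())
print(one_per_country.tolist())
```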

188:08

Italy is here, it's fine. The first occurrence of Germany, it's fine, write it down, Germany;

188:15

but then it sees Germany two more times. And it realizes that Germany was there. So those

188:21

are now duplicates, right. So the way it works by default, we can change that and change

188:27

it to last, so the last element is not considered to be a duplicate, and the other two are considered

188:33

to be duplicates. And the same thing here: Kim here is the one considered a duplicate. So

188:39

it's either top down or bottom up, depending on the parameter you're passing: it's

188:47

either keep default or keep last; or you can be a little bit more harsh and say everything

188:53

duplicated actually needs to be considered a duplicate. So these two are duplicates,

188:58

and these three are all duplicates, as you can see, right there. Similar to the duplicated

189:06

method, which pretty much tells you which values are duplicated and helps you identify

189:14

them, you also have drop_duplicates, and in this case, what this method is going to

189:19

do is basically the same thing as before, but dropping all the values that are flagged as

189:27

True, right? If the value is flagged, it's just gonna drop it. And the

189:33

same rules apply: default, last, and False. For subsets, in this case, we

189:42

have multiple players in the DataFrame. But what happens is that this

189:49

player, Kobe, is present three times; with our human eyes we see Kobe three times. What is going to

189:56

happen here is that the The way we're going to think about duplicates is by understanding

190:04

the correct subset that we should check. In this case, Kobe playing as an SG is duplicated

190:11

two times, but Kobe playing as an SF could be considered a different player, if you want,

190:16

because maybe it's a different season, or it's a different, a different position they

190:22

played. So in that case, we need to pass the subset that we are going to consider for

190:27

duplicates: only check the column Name, or not check only

190:34

the column Name, which is the default, and then it's going to check the entire DataFrame. And

190:39

when that happens, then these two are considered to be duplicates. So this one is a duplicate

190:46

with this rule. If we put keep last, sorry, keep=False, both are going to be considered

190:51

duplicates. So this second occurrence is the duplicate one. And the last one is a completely

190:56

different row, because the value in position is different. That's the way it works here.
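
A minimal sketch of the subset idea on a DataFrame. The players and column names (`Name`, `Pos`) are illustrative assumptions modelled on the description above.

```python
import pandas as pd

# Hypothetical players frame: Kobe appears three times, twice as SG and once as SF.
players = pd.DataFrame({
    "Name": ["Kobe Bryant", "LeBron James", "Kobe Bryant",
             "Carmelo Anthony", "Kobe Bryant"],
    "Pos": ["SG", "SF", "SG", "SF", "SF"],
})

# Default: the whole row must match, so only the second (Kobe, SG) row is flagged.
dup_full_row = players.duplicated()

# keep=False: both (Kobe, SG) rows are flagged; (Kobe, SF) is still a different row.
dup_full_row_all = players.duplicated(keep=False)

# subset=["Name"]: only the name is compared, so every later Kobe is flagged.
dup_by_name = players.duplicated(subset=["Name"])

# drop_duplicates accepts the same subset and keep arguments.
last_per_name = players.drop_duplicates(subset=["Name"], keep="last")

print(dup_full_row.tolist())
print(dup_by_name.tolist())
print(last_per_name)
```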

191:05

Moving forward with more cleaning of values, we're going to talk about string handling.

191:08

And this is a very neat feature of pandas, that special types of columns will have special

191:16

attributes. So given the column type, see df.info(), which is object, which means string,

191:23

right, in pandas, all the string columns are going to have this special attribute,

191:29

which is str. All the datetime columns, something we're not going to cover, but you need to

191:34

know, all the datetime columns have a .dt attribute; all the categorical columns

191:41

have a .cat attribute. And those attributes, str, dt, cat, have special methods associated with

191:52

the domain of that column. All the methods associated with str are, of course, for

191:57

string handling, and the methods associated with dt are for date handling. So in this case,

192:02

we're going to review, not all, but a very good subset of the string methods we can apply.

192:10

And something interesting is that all these methods have a lot of relevance.

192:16

And they're related to the ones in pure Python. So if you have a pure Python string, there's

192:21

a split method. There is a contains method, or I don't know if there is a contains, actually;

192:26

it's actually, I think it's the in operator, but there is a strip, and there is a replace,

192:30

right, so most of the methods under the str attribute in pandas have,

192:39

have an analogy in the standard library of string handling with Python. So starting at

192:46

the beginning, this data we have, I'm going to delete this this data we have, what we

192:51

are going to do is split the values right by an underscore. So in this case, that's

192:58

what we have, we have split all the values with that underscore, and we're going to use

193:03

the special parameter expand, sorry, expand=True. And what it's going to do, it's

193:09

going to create a DataFrame out of that. So we create a DataFrame with those columns.
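
A sketch of the split with expand=True, followed by a few of the other str methods discussed in this part. The raw strings and the column names are assumptions made up for the illustration.

```python
import pandas as pd

# Hypothetical raw column: year, sex, country and number of children joined by "_",
# with stray spaces and question marks mixed in.
df = pd.DataFrame({
    "Data": ["1987_M_US _1", "1990?_M_UK_1", "1992_F_US_2",
             "1970?_M_   IT_1", "1985_F_I  T_2"],
})

# Split on the underscore; expand=True returns a DataFrame, one column per piece.
parts = df["Data"].str.split("_", expand=True)
parts.columns = ["Year", "Sex", "Country", "No Children"]

# contains takes a regular expression by default; "?" must be escaped.
has_question_mark = parts["Year"].str.contains(r"\?")

# replace can work with plain text or with regular expressions.
parts["Year"] = parts["Year"].str.replace(r"\?", "", regex=True)
parts["Country"] = parts["Country"].str.replace(" ", "", regex=False)

print(parts)
```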

193:13

And this is what we have now. So we can keep applying methods. So for example, contains

193:19

or contains with regular expressions, right, for you to see

193:25

the power of it. We can do strip, replace, and we can even do regular expressions with

193:30

replace, so we could fix something like this question mark in a string; we could fix it

193:37

with regular expressions if you know how to handle them. And finally, something that is

193:43

going to be very helpful when you're doing data cleaning, is looking at the data from

193:48

a visualization perspective. Data cleaning has a ton to do with statistical understanding

193:54

of your data too: when a value is considered an outlier, for example, it might be invalid,

193:59

and you want to clean it. But that's a lot more about statistics. In this case,

194:04

I want to show you very quickly the matplotlib library, which I've been promising for some

194:09

time now, the matplotlib library. So far, we've accessed it directly from pandas, from

194:15

pandas, where we're doing DataFrame.plot. This library, matplotlib, is the one backing

194:22

all those methods, and we're going to see how to use it directly now. The matplotlib library

194:27

has two important APIs, as we're gonna call them. One is the one that I don't prefer, which

194:34

is the global API, but it's the most common one. It's the one you're gonna find around

194:39

the global API. And the second one is the object oriented API. So it's around here.

194:45

And usually they are just two different ways of doing the same

194:50

thing. Okay. The global API is an API that is in part inspired by MATLAB. It's been

194:56

around for a long time and, sadly, most of the answers you find on Stack Overflow, tutorials,

195:03

and books will be using this global API. The one I prefer the most,

195:09

and I'm gonna explain why in a second, is going to be the object-oriented API.

195:12

But I want to show you both. So you have a reference. If you follow me in this feeling

195:16

of preferring the object-oriented API, you will always have to translate from global to OOP.

195:23

Why is it considered a global API? Well, we have imported matplotlib.pyplot as plt. So

195:30

we have imported the whole module, the whole Python module; depending on how much you

195:35

know about Python programming, this is going to make sense or not. We have imported the whole

195:39

module. And now what we're doing is we're invoking plt.figure, and

195:46

then we're going to do a title. And then finally, we're plotting two things. We're plotting

195:51

x squared and minus x squared. And why is this global? Because we're invoking

195:58

functions that are at module level. And there is an object, the final plot, that is being

196:05

modified by these very generic, global calls, right? So by doing this call right

196:13

here, I'm modifying the final result of the plot. Let me show you a more complicated example.

196:19

So you see the problems with the global API.
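
For reference, a sketch of the kind of global-API code under discussion: one figure, one row, two columns, and module-level calls that act on whichever subplot is currently active. The data, labels, and the Agg backend line are assumptions added so that the snippet runs on its own without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-10, 11)

fig = plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)                 # 1 row, 2 columns: activate the left plot
plt.plot(x, x ** 2)                  # everything from here lands on the left plot
plt.plot([0, 0, 0], [-10, 0, 100])   # vertical line
plt.legend(["X^2", "Vertical line"])
plt.xlabel("X")
plt.ylabel("X squared")

plt.subplot(1, 2, 2)                 # switch: the right plot is now active
plt.plot(x, -1 * (x ** 2))           # the same kind of call now affects the right plot
plt.plot([-10, 0, 10], [-50, -50, -50])  # horizontal line
plt.legend(["-X^2", "Horizontal line"])
plt.xlabel("X")
plt.ylabel("X squared")

left_axes, right_axes = fig.axes
print(len(left_axes.lines), len(right_axes.lines))
```

Nothing in a line such as `plt.plot(x, x ** 2)` says which subplot it targets; that depends entirely on the `plt.subplot` call that came before it.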

196:22

If you look at this line, if you could delete everything; let's actually delete everything.

196:31

What is this line doing? Which plot is it affecting? You do not know. There is no object-oriented

196:42

way of saying in this second plot the plot on the right or the figure on the right, or

196:48

actually the sub plot on the right, I want you to plot this thing, you're just saying

196:54

it to the entire module. And the order in which you say it determines where it's going

197:01

to land: in which plot that particular figure is going to

197:07

land. Again, it's a global API. So we start by saying: I'm going to create a figure, trust

197:11

me; from now on, I'm going to start drawing on it. There's going to be the title.

197:17

And hey, by the way, it's going to have one row, it's going to have two columns. And I'm

197:23

gonna start drawing in the first plot, this one right here, this one right here on the

197:29

left, okay. So now I have kind of activated if you want that plot, it's active. So now

197:39

I'm going to start drawing on it. So every action that happens after this line is going

197:45

to be affecting this plot, right? So then I plot x and x squared, I plot

197:53

this vertical line, I put a legend, I set labels, etc. And at some point, I just stop

198:00

and say, Hey, now I want to switch the plot, I want to now start plotting. Sorry, I want

198:05

to start plotting here in this second one, because I have just changed that the first

198:11

line, this one. Oh, sorry, the way it works is by saying: first row, second column,

198:19

so second plot. So now I want to start plotting in here; every successive line will affect

198:25

that plot. And again, you can see that understanding code that depends on the order, the order in

198:32

the sequence of lines is very hard. If you have to debug a report that has a plot that

198:39

takes 100 lines, then you have to keep in your brain what's happening top down. A different

198:48

approach is going to be the object oriented approach, in which we're creating a figure.

198:55

And we're creating axes. So in this case, we have in this case, we have right here,

199:03

one entire figure in red. And we have in here, in purple, two axes. So this is axes one,

199:13

and this is axes two, so we have two axes. We're going to create those using an object

199:21

oriented approach. And we're going to keep references to them. So we're going to say

199:25

later: to this plot, to these axes, sorry, I want to plot something. And that will be

199:31

very explicit, it's going to be an object oriented way. So the first thing is creating

199:38

the figure and the axes. In this case, we have just one axes, that's it, but you

199:43

can have more. And then you say: in this axes, I want to plot this thing; in this axes, I

199:48

want to plot that thing, etc. When you have multiple axes, so I could show you, I'm

199:55

going to go back again to that in a second. But in this case, in which we have four axes,

200:02

right, so we create one figure, and it has four axes. We do it with this subplots method,

200:10

saying nrows and ncols. Now we say: to axes number one, I want to plot this thing;

200:17

to axes number two, I want to plot that thing, right? So it's one, two, three, four. And now it's a

200:24

lot more explicit, it's not depending on the order, I could change this order, that doesn't

200:31

matter.
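
A sketch of the object-oriented version with four axes. The random data, colors, and the Agg backend line are assumptions for the illustration; the point is that each call names the axes it draws on, so the statements can be reordered freely.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

# One figure, four axes, and an explicit reference to each of them.
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))

# Deliberately out of order: the target is the object, not the position in the code.
ax4.plot(rng.standard_normal(50), color="yellow")
ax1.plot(rng.standard_normal(50), color="red", linestyle="--")
ax3.plot(rng.standard_normal(50), color="green", marker="o", linestyle="None")
ax2.plot(rng.standard_normal(50), color="blue", linestyle=":")

print(ax4.lines[0].get_color())
```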

200:32

The results are gonna be the same: axes number four has yellow, regardless

200:39

of the order that we're following. So, matplotlib: now that we have cleared

200:46

up the differences between both APIs, matplotlib has this very simple plot function, or method,

200:52

depending on whether it's object-oriented or global, that will plot something you specify. In this case,

200:58

we're passing all the values in x and all the values in y. And in this case, we're passing

201:02

a given line style. This can change; with this type of syntax, you're saying: I'm plotting

201:08

this thing in x, I'm plotting this thing in y, the second parameter, and I want you

201:14

to use a straight line, it's a straight line, yes, with this marker, the dot and in green.
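
A sketch of the two equivalent ways of styling a line. The format string `"g.-"` (green, point marker, solid line) is an assumed example of the short syntax, not necessarily the exact string used in the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-10, 11)
fig, ax = plt.subplots()

# Short form: color, marker and line style packed into one format string.
(line_short,) = ax.plot(x, x ** 2, "g.-")

# Long form: the same styling spelled out with keyword arguments.
(line_long,) = ax.plot(x, -1 * (x ** 2), color="green", marker=".", linestyle="-")

print(line_short.get_marker(), line_short.get_linestyle())
print(line_long.get_marker(), line_long.get_linestyle())
```

The keyword form is longer but easier to read for anyone who has not memorised the one-letter codes.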

201:25

So this is if you are very familiar with it. If you're very familiar with matplotlib, you

201:31

can use this syntax; in other cases, you can just say linestyle, marker, sorry,

201:36

color: specific keyword arguments for each one of those. So do we only have line plots

201:44

in matplotlib? No, of course not. We have a huge variety of plots. And by the way, there

201:49

is another one here, if you want to see something more advanced, which are grids: you can create these grids

201:53

and put different things in it. And again, not only line plots; one good example is a

201:58

nice scatterplot. So basically, we're plotting X and Y correlation. And there is also an area

202:04

value and a color map, right? So given the value, there is going to be a change in color.

202:11

So this kind of lets you plot three to four dimensions of your data: the value x, the

202:18

value y, the size of the bubble, and the color of the bubble. So you're pretty

202:24

much encoding four dimensions in just one figure, right. So in this case, we're just

202:31

using two different scatter plots; there's more information here. We can also plot histograms,

202:36

which we've very quickly seen with pandas; with pandas it's very simple, with just plot

202:41

kind histogram, kind hist actually; you can look it up in our previous lessons.

202:48

So just go back into the index in the video. And the histogram is extremely simple just

202:53

takes the values you're plotting and how many bins you want, or some more advanced

202:59

arguments here, like the alpha level, etc. But it's simple. And similar to the histogram,

203:03

you can also create kernel density estimate diagrams, which are very similar,

203:11

to simulate, if you want, a continuous distribution. You can combine these plots if you want; in

203:17

this case, we are creating the plots: we're plotting a histogram, and then we're plotting

203:22

the lines, and then we're changing the limits. But that's pretty much it. And you

203:27

can also create bar plots, right? So in this case, we have plt.bar, or here we have

203:35

two bars that are stacked, right? That's a different way to look at it. And finally, checking outliers:

203:41

You can always plot histograms or box plots, right? So box plots are also a nice feature

203:48

to have in here. So this was all with data cleaning, we're gonna keep moving forward

203:54

this tutorial, I want to mention one more thing here. And it's there are notes here

203:59

for kind of a task that you can follow with data cleaning, where we are identifying

204:06

missing values in given positions with isnull and isna. And right

204:12

here, we're looking into more detail about some statistical properties of the data, in

204:19

case we need to clean it. Okay, so this is a little bit more advanced. And it's related

204:26

to the concept of cleaning data given the domain. So the statistical analysis can tell

204:32

you that this value is an outlier for this distribution. The value might be valid. So

204:38

for example, a human being is 90 years old. That's, that's valid, that's a valid age.

204:44

But if you're analyzing data about high school students, a human that is 90 years

204:51

old is going to be completely invalid, or it's going to be an outlier in that distribution.

204:57

And you can treat it as such, as invalid, and clean it out, remove it, for example.
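
One possible way to flag such a value is sketched below. The 1.5 × IQR rule is a common rule of thumb chosen for this illustration; it is not the method shown in the video, and the ages are made up.

```python
import pandas as pd

# Hypothetical ages of high school students, with one 90-year-old in the data.
ages = pd.Series([15, 16, 17, 16, 15, 18, 17, 90, 16, 17])

# Interquartile range: the spread of the middle half of the distribution.
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Assumed rule of thumb: anything beyond 1.5 * IQR from the quartiles is an outlier.
is_outlier = (ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)

outliers = ages[is_outlier]
cleaned = ages[~is_outlier]

print(outliers.tolist())
print(cleaned.tolist())
```

A box plot of the same series would draw the 90 as an isolated point beyond the whiskers, which is the visual check mentioned above.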

205:02

So that deals a little bit more with the whole statistical analysis you can follow

205:07

here, it's a little bit more advanced for the scenario. So let's move forward with the

205:11

rest of the videos.

205:19

Now it's time to get into more advanced features of pandas to import external data. So we've

205:26

seen already in our real life example, the way we can import data from CSV files, and

205:32

from SQL databases, right, we had actually those two lessons, the objective of these

205:37

part of the tutorial is to show you how you can improve or get into more advanced use

205:45

cases of importing data. So we're going to start, for example, with CSVs and text files.

205:50

And again, you've seen it already. But here, we're gonna give it an extra twist. So I'm

205:56

going to show you more advanced features. And for special use cases, txt files, CSV

206:01

files, is, conceptually speaking, a CSV file is a text file, it's just human readable text,

206:08

right? That it's encoding information. The idea for CSV file is that it's tabular. Right?

206:14

So it's a plain text file that contains tabular data in it, and it's separated. CSV stands

206:20

for comma-separated values, but the separator can be anything; we can see more examples

206:25

later. But basically, the idea is that it's a text file in a tabular

206:31

format. So both CSV files and text files will be read with the same method. So

206:38

to get things started, I want to show you the basic way we import, or read, data

206:43

from external sources using Python, without even starting yet with pandas. So you don't

206:49

need to know this, but it's usually productive, if you want, for data scientists

206:56

or data analysts to understand a little bit more how file reading and writing works in

207:01

computers, because there are multiple concepts aligned here. They involve operating

207:07

systems, processes, your language, right? It's not the same thing to read a file with R or

207:12

with Python or with another language. So there are multiple concepts here. And even though

207:18

pandas in this case can make it simple, very simple to read and write data, you can get

207:25

a little bit of a more advanced use case, if you know the internals of again, both the

207:29

operating system, processes, and your language. So this is the way we read data, read a

207:35

file, sorry, using pure Python, we use a function open. And in this case, we're using a context

207:42

manager, just a security feature, again related to the advanced usage of reading and writing

207:48

files. But it creates a file pointer, right. And with a file pointer, you can then use

207:53

the very simple API exposed by that pointer, which is something like

208:00

readline, readlines, read a number of bytes or characters, or you can even just treat

208:05

fp as an iterator, just do a for line in fp. But basically, we're going to do something

208:10

like this: we'll start reading data from top to bottom, until, I don't know,

208:17

we hit a given limit; in this case, we're doing it just for a couple of lines. What else?

208:26

It gets very difficult when you're reading text files to process them, because it's usually

208:32

hard to parse the structure of the file. So it's not the same thing to have a file that

208:37

is separated by commas, by colons, by pipes, spaces, etc. So you're

208:43

gonna see that once you want to get a little bit more, I don't know a little bit more with

208:52

advanced usage, right, or get a little bit more fancy with your calculations and the way

208:58

you parse the data, it's gonna get harder. So that's why we're going to use

209:02

pandas, or, as I'm going to show you in a second, the csv module that is part of Python.

209:09

So this is the file that we're going to be reading. It's the exam review file, and I'm

209:13

going to open it. And even though it doesn't look like a CSV, it is indeed a CSV. The

209:20

difference is that here the separator is the greater-than sign; it's not the comma, it's a greater-than

209:27

sign. That's going to be what marks the delimitation between different fields in our CSV file.
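
A sketch of reading such a file with pure Python: a context manager around open, and the standard-library csv module told to use `>` as the delimiter. The file name, header, and rows are invented for the example and are written to a temporary directory so the snippet is self-contained.

```python
import csv
import os
import tempfile

# Hypothetical file contents: tabular text with ">" between fields.
sample = (
    "first_name>last_name>age>math_score>french_score\n"
    "Ray>Morley>18>68>75\n"
    "Melvin>Scott>24>77>83\n"
    "Amirah>Haley>22>92>67\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "exam_review.csv")

    # Write the sample so there is a real file to open.
    with open(path, "w", newline="") as fp:
        fp.write(sample)

    # The context manager closes the file even if parsing fails.
    with open(path, "r", newline="") as fp:
        reader = csv.reader(fp, delimiter=">")
        header = next(reader)            # first line holds the column names
        rows = [row for row in reader]   # remaining lines, one list per record

print(header)
print(rows[0])
```

Every field comes back as a string, so converting ages and scores to numbers is still left to the caller; this is part of the work that pandas takes care of.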

209:33

So we're gonna use the csv module. And

209:38

the way right here to parse the data using that module is by passing a special delimiter,

209:46

right? So that's gonna be the type of work you might need to do when you're parsing the

209:52

data. It's not the same thing to have a delimiter that's a greater-than sign. It's not the

209:55

same thing to have numbers for example, that are enclosed in quotes. All those things right

210:02

will change the way you work, and all this is going to be abstracted away by the pandas

210:08

module. So to get things started, again, with pandas, at least, pandas has multiple read

210:15

underscore something methods that will work for different sources, right? So we've already

210:20

seen read_sql, we've seen read_csv; there's also a read_html to directly parse information

210:27

from a table: literally, you just pass a website and it's going to read information

210:31

from a table; or read_json, or read more advanced formats like Parquet, or Stata, etc. And, again,

210:40

each file format will usually have a corresponding method in pandas. I've never had the need

210:49

to write my own stuff, to be honest. The same thing is going to happen for something

210:54

like Excel, which might need external modules, it's not directly provided by pandas, but

210:59

by installing those modules, you can easily incorporate Excel files in your day to day

211:04

work. So the read_csv method already has a ton of parameters. So this is the

211:13

main characteristic of all these read_something methods: given the amount of possibilities

211:18

you're going to have with these files, there exist a ton of different ways to customize

211:24

the method invocation. Alright, so again, CSV files, we saw, there are multiple things

211:29

happen: CSVs that have a header or don't have a header, different delimiters,

211:35

different enclosing of strings or numbers, multiple things, blank lines, etc. Multiple

211:41

things are going to happen, and you're able to customize all that with the

211:46

read_csv method. So this is the reference of all the

211:51

parameters you can pass to it. Usually, something that I do, and I do this very often, and I

211:56

use pandas a lot, and I still do something like read_csv, and I get the documentation

212:01

right here, to look into the, the parameters that I think I need to pass to my particular

212:09

use case. So always keep an eye on the docs, because it's impossible to remember all the

212:14

parameters of read_csv. So in this case, what we're gonna do is something very interesting,

212:19

is we're gonna parse a CSV file, but it's not located in this computer, it's not locally

212:25

available on the computer. The CSV file is this one right here, which actually is the

212:31

source; if I get the raw version, it's this thing. So this is the CSV file. What I could do here

212:37

is download the file, right, so just do File, Save, get the CSV file on my computer, upload it

212:43

here, right, so just copy and paste here, drag and drop it here. But actually pandas

212:48

has this nice characteristic that it will read a CSV that's either local, as we

212:53

did with the BTC market price file, or you can also do it remotely; it's automatically going to

212:58

download the content of those files. And it's going to provide, it's going to save it in

213:05

memory for further usage. So that's a very neat feature. And again, this is the CSV

213:12

file that we are using. And again, the same thing, if it's a local file, it works in the

213:17

same way. So a few features you've seen already: in this case, we can do header=None, if

213:23

you don't want to treat the first row as a header. Or what about missing values, we can

213:29

treat some of these values like a question mark, or like an exclamation mark, or dash

213:34

etc., as not-a-number, not a value, right, so it's a missing value. And now any of these

213:40

values we have passed will be transformed into not-a-number (NaN) for easier

213:47

processing and cleaning. We can pass names, which is going to be basically the column names

213:51

for each one. And we can also specify column types, as you can see, right there. So now

213:57

the types are going to be float and object. We've done this already in one of our lessons,

214:03

we are parsing the time and there you go. So putting it all together, we get to these advanced

214:09

forms of reading CSVs where we're passing column names, we're passing types, we're asking

214:14

to read dates, we're passing null values, headers, etc. So this is a pretty common thing we are

214:22

doing. So what about the exam review file? If we try parsing this thing, we get this very ugly

214:31

format. In this case, the parameter to specify what we used to call delimiter

214:37

in the csv module is now sep, for separator, so the separator is going to be the greater-than sign, and that

214:42

just works as it should. So, there are a few more examples you can check here. The most important part

214:51

is following the documentation, right, to find those particular use cases that you are having.

214:58

So for example, something like skip_blank_lines, or whenever there are like empty rows at the

215:04

beginning, right. So if you have empty rows at the beginning, you can also

215:08

say skiprows, so you don't need to parse those out; it's not going to break, etc. So

215:13

that is all part of the read_csv method. And to finalize this part, at least on CSVs, I'm

215:21

going to tell you something that applies to pretty much every other data format. As you

215:25

have a read_something method, there's going to be a to_something method; it's basically

215:31

the process of writing. So you can do read_csv, or you can do to_csv. So this CSV that

215:39

we imported from the external source, the remote source, we can just do to_csv and it's

215:44

going to store it locally. Alright, and there are multiple options also to pass the CSV

215:49

delimiter, or actually the separator, whether you want to include a header, whether you want to include

215:53

an index, etc. They're pretty much the same as the other one. But the idea is that for

215:58

every read_something method, there's gonna exist a to_something method, which is basically

216:04

the process of writing. So let's move forward with a few more data formats. And, interestingly,

216:12

we're gonna get to read directly HTML pages in just a couple of minutes. And now it's

216:17

time to read data from databases. We have already done that in our real example with

216:23

pandas part of the tutorial. But I want to show you a little bit more detail

216:28

for you to understand how data is being processed. In my case, this is a common scenario:

216:33

importing data from databases. So, first thing, the libraries you will need: depending on what

216:40

database engine you're using, Postgres, MySQL, Oracle, etc., you will need to install different

216:46

libraries. But the APIs, once you have installed those libraries, are going to be the same.

216:50

There's actually a PEP from Python (PEP 249) that defines the interface for database libraries,

216:58

and pandas can work with pretty much any of these common SQL databases

217:03

that comply with that interface. In this example, we're going to use SQLite because the database is

217:09

right here; there's no server to connect to, etc. It's extremely simple to get started.

217:13

And the example we're going to use, or the database example we're going to use, is actually a

217:17

different one from our previous video. In the previous one, we were using Sakila;

217:22

in this case, we're going to be using Chinook, which is smaller both in structure and in

217:26

size. So it's going to be a little bit simpler. So to get things going here, the same thing

217:33

that we did with our previous part, which was how to read data from files: I'll show you how

217:39

to actually read data using Python. So forget about pandas for a second, I told you, if

217:45

we go back again to the beginning of time, there was no pandas; this was the way we

217:49

were reading files: open, fp, fp.readlines(), etc. So I now want to show you what

217:56

predates pandas, what was the default way to read data before pandas, which

218:01

is, again, with the regular interface from Python. So the way it works is we're gonna

218:07

import sqlite3, we're gonna create a connection. And now with this connection,

218:11

we have this common interface that, again, is common for pretty much any other database

218:16

that you're used to. And the default behavior is we're going to create a cursor. And we're

218:22

going to execute queries using that cursor. In this case, we're going to execute a regular

218:27

SELECT * FROM employees LIMIT 5, because we want to have five records out of the

218:33

table employees. Once you have executed a query, it's like it's there waiting; you can

218:39

do a fetchall to get all the results of that query. And here are all these results. As

218:44

you are noticing, the result is a list of tuples. So it's not extremely useful. Now,

218:51

if you combine it with pandas, you can just create a data frame out of that info.
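
A minimal sketch of the cursor workflow just described, using an in-memory SQLite database so it runs anywhere. The employees table and its rows are made up here; the tutorial's own database is a file with many more columns. Taking column names from `cursor.description` is an addition for illustration, not necessarily what the notebook does.

```python
import sqlite3

import pandas as pd

# In-memory database with a made-up employees table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "EmployeeId INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT)"
)
conn.executemany(
    "INSERT INTO employees (FirstName, LastName) VALUES (?, ?)",
    [("Andrew", "Adams"), ("Nancy", "Edwards"), ("Jane", "Peacock")],
)
conn.commit()

# The DB-API way: create a cursor, execute a query, fetch the results.
cur = conn.cursor()
cur.execute("SELECT * FROM employees LIMIT 5;")
results = cur.fetchall()  # a list of tuples, one tuple per row
print(results)

# The tuples carry no column names; the cursor description does.
columns = [col[0] for col in cur.description]
df = pd.DataFrame(results, columns=columns)
print(df)

# Always release the cursor and the connection when done.
cur.close()
conn.close()
```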

218:56

And we're close. It's not perfect, but we're close. So let me show you now... before that, we're

219:01

gonna close the cursor and the connection. Let me show you now how we work with pandas.
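
A minimal sketch of the pandas side of the database workflow, again on an in-memory SQLite database with a made-up employees table. It shows `read_sql` with a query and a connection, the optional `index_col` and `parse_dates` parameters, and writing back with `to_sql` and its `if_exists` parameter. The table and column names are assumptions for illustration.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "EmployeeId INTEGER PRIMARY KEY, LastName TEXT, HireDate TEXT)"
)
conn.executemany(
    "INSERT INTO employees (LastName, HireDate) VALUES (?, ?)",
    [("Adams", "2002-08-14"), ("Edwards", "2002-05-01")],
)
conn.commit()

# First parameter: the query. Second parameter: the connection.
df = pd.read_sql("SELECT * FROM employees;", conn)

# A little fancier: choose the index column and parse a column as dates.
df_indexed = pd.read_sql(
    "SELECT * FROM employees;",
    conn,
    index_col="EmployeeId",
    parse_dates=["HireDate"],
)
print(df_indexed.dtypes)

# Writing back: by default to_sql raises an error if the table already exists.
df.to_sql("employees_copy", conn, index=False)
# if_exists="append" adds rows at the end; "replace" would overwrite the table.
df.to_sql("employees_copy", conn, index=False, if_exists="append")

copy_count = conn.execute("SELECT COUNT(*) FROM employees_copy").fetchone()[0]
print(copy_count)

conn.close()
```

With other database engines, pandas generally expects an SQLAlchemy connection rather than a raw driver connection; SQLite is the exception that works directly.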

219:05

With pandas, just as we have a read_csv method, we also have a read_sql

219:12

method, and in this case, what this method is going to receive is the first parameter

219:15

is going to be the query that we're passing and the second parameter is going to be the

219:20

connection, that is, the connection object used to actually issue the query by

219:25

pandas. So it gets as simple as writing the query. And now everything has been imported

219:31

into a data frame, including column names and all that. If you want to get a little bit

219:35

fancier, you can also specify the index column, which is going to be used, of course,

219:40

as the index, and also what types to parse for a specific column. So now we have pretty much

219:47

all the work done. So we're going from something very manual, like processing things with a cursor,

219:53

etc., which might also be slow, to using pandas to actually import data from the

220:03

database. There is actually a caveat here that I'm going to tell you is kind of a very

220:08

deep detail of the way pandas works, and it's that the read_sql method is actually a shell

220:14

for two other methods, read_sql_query and read_sql_table. All right, so read_sql_table

220:20

and read_sql_query: when you're using read_sql, it's actually kind of forwarding the

220:25

work to either query or table. read_sql_query is the default behavior, what we've done so

220:31

far, so in this case, it's just going to issue a query and the connection is going to read

220:35

it for you. In contrast, read_sql_table is going to read an entire table: you just pass

220:41

a name, and it's going to automatically give you all the information for it. So in this

220:48

case, all the column names, etc. So it's a lot simpler to read an entire table, the only

220:53

thing to keep in mind is that to use this method, you need to install this library,

220:59

SQLAlchemy, and the connection is generated from it. So in this case, we create an engine

221:05

and we create a connection object, and now we can pass an SQLAlchemy connection object, sorry,

221:10

for pandas to do it. So again, it's pretty much the same, if you find yourself doing

221:15

SELECT * FROM this table, SELECT * FROM that table, it's a lot easier just to use

221:21

read_sql_table, and that's going to do it. As we saw that the read_csv file had a to_csv,

221:30

sorry, the read_csv method had a to_csv method, the same thing happens with read_sql: there

221:37

is a read_sql and there's also a to_sql. What it's going to let you do is get

221:43

the data frame and write it down into a database table directly. So it's going to

221:49

also receive the connection, right? So to_sql is gonna receive, well, the name

221:54

for this data frame, what the table name is going to be, and a connection object. Now something

222:01

to keep in mind is that to_sql has an important parameter, which is what happens if the table

222:07

already exists: by default, it's going to fail, it's just going to throw an error

222:14

when you are trying to save data to a table. And this makes sense, because as data analysts

222:18

we're usually reading data and processing it, we're not so much writing it. So we want to

222:23

make sure that it's not by mistake. But if you do actually want to write data, you

222:30

can just change this parameter, if_exists, to something like replace or append. Usually, we're writing

222:37

to intermediary tables. Again, you can choose either to replace the

222:42

whole content of the table (be careful here), or to append, right, just write it at the end of

222:49

the current table. So that's it for to_sql. So this was the way to read data from databases.

222:55

Of course, we're not touching on anything like SQL and all that, which is a lot more

222:58

advanced. It's just that, if you already know SQL, if you're already working with databases,

223:03

you can pretty much copy and paste what we're doing here. And you're gonna, you're gonna

223:07

get your data imported into Python. So let's move forward to read some HTML files. And

223:14

now very quickly, I'm going to show you how to read tables or data frames directly from

223:19

HTML web pages. To be honest, this is a simple method, it's going to be just read_html, but

223:26

it depends a lot on the structure of the web page. So if it's not well structured, or the

223:31

tables are not correctly created, you're going to have issues and you will have to do a ton

223:35

of data cleaning. In my experience, whenever I try to parse a table from a well-structured

223:39

site like Wikipedia, or some stats site, it usually works very well. And it's a very quick

223:46

way of hacking. You know, whenever you have questions, you know, like, I don't know, I

223:50

need to know the GDP of countries. Instead of looking for a GDP data set, you can just

223:55

go to Wikipedia page, there is usually a table there, you can directly parse it and you are

224:00

done. So again, it's it's a relatively simple way to get some data for quick hacking and

224:07

exploration. The way it's going to work is we have this HTML created; it's just for

224:13

testing purposes, to get started. Usually, of course, you will try to read something

224:18

from a live website. So you're going to pass the URL to the read_html method. And the

224:24

read_html method will download the content of the page and parse it. Let's suppose we have the

224:29

content already, the HTML, and this is what it looks like. This is exactly the

224:34

same HTML we have on top, I'm just displaying it here in the notebook. And what we're gonna do

224:39

is invoke the method, read_html. And the read_html method is going to

224:44

parse the entire HTML and look for multiple tables, not just one. A site will potentially

224:50

have multiple tables, even if you don't see them. It is a common way to structure things

224:55

in HTML, to use tables. That's why it's going to parse multiple tables. In this case, we

225:00

stored them all in dfs, in plural, like multiple data frames. And we see that

225:06

there is only one. So in that case, we're just going to get the first data frame. And

225:09

it has correctly parsed what we had before; it's just working in the same way.
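
A minimal sketch of `read_html` on an HTML string held in memory. The table content is made up for illustration. Two assumptions to be aware of: `read_html` needs an HTML parser library such as lxml installed, and recent pandas versions expect literal HTML to be wrapped in `io.StringIO` rather than passed as a plain string.

```python
import io

import pandas as pd

# Made-up HTML with a single, well-formed table (header inside <thead>).
html_string = """
<table>
  <thead>
    <tr><th>Order date</th><th>Region</th><th>Units</th></tr>
  </thead>
  <tbody>
    <tr><td>1/6/2018</td><td>East</td><td>95</td></tr>
    <tr><td>1/23/2018</td><td>Central</td><td>50</td></tr>
  </tbody>
</table>
"""

# read_html always returns a list of data frames, one per table found.
dfs = pd.read_html(io.StringIO(html_string))
print(len(dfs))

# Only one table here, so take the first data frame.
df = dfs[0]
print(df)
```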

225:15

The same is going to happen with, for example, headings and all that: if the table

225:21

doesn't have a header, it's gonna automatically, right, understand that in that case. So that's pretty

225:27

much as we know it already. In this case, what you're going to see is what I told you

225:34

before about the data cleaning process: this table does not have a header like the previous

225:41

one, which has a thead element; in this case, the header is just another row. So

225:48

that's why read_html is going to have issues and you have to provide a little bit of extra

225:52

information. So let's see another more realistic example. And we're going to parse data directly

225:58

from a website. Let me tell you here, this is just for educational purposes; you always

226:02

need to understand if the data is public, so you can actually parse it. Again,

226:07

for Wikipedia, at least from what I know, the content is Creative Commons, so you can get your hands

226:14

on it. What we want to show you here is a very complicated table that has

226:18

multiple headers, etc. So that's why we're using this example. So we're gonna get the

226:24

URL, and we're gonna directly do nba_tables = read_html. The only table on this page

226:32

is this one, the large one. So that works. And now we're gonna get nba, which is going to be

226:38

that, and we see that all the players in this case have been parsed. What about something

226:47

else? Let's actually open this page right here, the Wikipedia page for The Simpsons. And here,

226:54

we will probably find several tables. See, we have one right here, this one. So I'm going

227:03

to import it. We have 27 tables, again, you don't see it. You don't see them, sorry, but

227:11

they are there. And the most important one, the one we care about, is this one right here.

227:17

So the problem you're gonna have with this table is that it's using both colspans

227:21

and rowspans. So in this case, this column here spans one, two, three columns. And

227:29

this row here spans one, two, three, at least three rows. So those colspans result in this

227:36

very ugly data frame, and you will need a little bit of extra cleaning. So that's a problem

227:41

you're going to find with HTML tables: usually there are things that are not well

227:45

formatted for machines; they are formatted for humans. So for example, in this case,

227:51

we have this header repeated, when you parse this data, you're going to find that every

227:55

20 rows, there is going to be a header row, and you will have to clean it: every, in this

228:02

case, 20 rows, you will need to drop it. You will do something like df.drop,

228:07

let's see, actually, if we can see it. I haven't tried this, but let's just do it like that:

228:13

head, and you're going to see 25 records now. So here, at record 22, we find that header,

228:24

so what you will need to do is something like df.drop,

228:31

range, starting at 22, up to df.shape[0], right, this many rows, plus one, plus

228:43

one, and every 20 rows. I don't know if this is going to work; let's just run it. Oops, it

228:51

didn't even work. It didn't compile. Oh, this is the NBA one, actually. There you go.
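
Dropping rows by position, as attempted above, depends on the header repeating at perfectly regular intervals. A different approach, sketched here on a made-up data frame, is to drop every row whose value equals the column name itself. This is an alternative technique for illustration, not what the notebook does, and the column names are hypothetical.

```python
import pandas as pd

# Made-up parsed table in which the header row is repeated inside the data.
df = pd.DataFrame({
    "Player": ["A. Smith", "B. Jones", "Player", "C. Brown", "Player", "D. Green"],
    "Points": ["31", "27", "Points", "18", "Points", "22"],
})

# A repeated header row is one where the cell equals its own column name.
is_repeated_header = df["Player"] == "Player"

# Keep everything else and rebuild a clean 0..n-1 index.
clean = df[~is_repeated_header].reset_index(drop=True)

# With the stray header text gone, the numeric column can be converted.
clean["Points"] = clean["Points"].astype(int)
print(clean)
```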

229:00

So maybe it works, you can check it. But what I'm going to say is, again, there is some

229:06

cleaning to do because HTML pages are optimized for humans, not for machines. So usually,

229:15

it's going to take a little bit more time. The good news is that there is usually a service

229:21

associated that you can consult. So for example, there is a Wikipedia API that you can use

229:26

instead of a page. But again, sometimes it's just easier to pull the data directly from Wikipedia.

229:32

So that's it. You can also write data to CSV or, of course, to HTML. That's pretty much the

229:40

standard. As we've said, this is about all we had for the read data portion. And we're gonna

229:47

move forward now with a few other methods, especially what we call data wrangling. We're

229:51

going to do a little bit of grouping and keep moving forward with our tutorial. We have

229:56

decided, kind of last minute, to add a final source of external data, which is going to be an

230:02

Excel file. It's just a common Excel file, you know it, because we imagine that you might

230:08

come from an Excel background, so you can just export the data you have in your Excel files,

230:13

Excel spreadsheets, and load them into Jupyter Notebook and start working with them with

230:18

pandas, so you can try things out and kind of draw the parallels between Excel and what

230:23

you do with pandas and Python. So the first thing is, an Excel file is not a text file.

230:30

So if you try getting the content of it, it's not a text file, it's not so simple to parse

230:35

it. So that's why it's gonna require external tools that are already installed in

230:40

Notebooks.ai; they might be also installed in Google Colab, but it depends on your computer how you're

230:44

going to install it. So just keep in mind that there might be issues when importing

230:50

data from Excel, if there is low compatibility between the library you're using and the spreadsheet

230:55

version you're using. But without getting into those details, there is the

231:01

read_excel method, which pretty much takes care of everything for you. It has different parameters,

231:07

like defining the sheet that you're reading from, of course the path, etc. So

231:12

we're going to start reading this file, which is a products file that has three sheets: products,

231:20

descriptions, and merchants. And it's actually something we use in an Excel file to sorry,

231:24

in our data analysis, from Excel to pandas course, to show how to merge data and all

231:29

that. And from this file, what we're gonna do is just read_excel. And what you're gonna

231:35

see is that it reads the first sheet of the Excel file; I mean, a data frame just corresponds

231:42

to one sheet only, right? And the first one is products. So that's what we are reading.

231:50

There are different behaviors for it: you can change the way you parse headers, etc.,

231:56

you know, already defining a specific index; that's pretty much everything we have seen.

232:01

So, selecting specific sheets is simple: just pass the sheet name, and you

232:07

can choose to read either products, merchants, whatever is available in the current

232:14

Excel file. There is another format or a new specific class that it's a little bit more

232:22

advanced: it's the ExcelFile class. So it's not, as we were doing here, read_excel

232:27

directly, which is going to read that Excel file into a data frame; you're going to instantiate

232:33

this ExcelFile class, with the parameter being the file name. And now this file's gonna

232:39

have just a reference of everything you have. In this case, we can do, for example,

232:43

sheet_names; it's going to tell you: products, descriptions, merchants. This is a little bit

232:47

more exploratory data analysis. So let's say you can't use Excel to actually see the contents

232:53

of the Excel file, this is going to be helpful, you're going to first parse the Excel file,

232:58

get the sheet names, and a little bit more of an understanding of it. And now we can

233:02

say, from this file we have previously parsed right here, or instantiated, we can parse the

233:08

products, the products sheet, and that's going to get you that data frame. And the same thing

233:13

is going to happen with all the parameters we can pass; they are the same as read_excel.

233:20

Finally, you can see that there's also a to_excel method. And it works pretty much the

233:25

same way as to_csv: you decide if you pass an index or not. And also you can define if

233:31

you're going to pass a sheet name or not, or it's just going to be the default one. So as

233:35

you can see, getting your data from an Excel file into a data frame

233:44

is extremely simple. There are more customizations to do: let's say your whole file is shifted,

233:50

right, either rows or columns: you can change that with startrow or startcol, and that's going

233:55

to work, too. So that's pretty much the only thing we need. If your writing process is

234:01

a little bit more complicated, like, for example, you want to write specific sheets in a

234:07

multi-sheet Excel file, you can use what we call an ExcelWriter, and it's also part of pandas;

234:11

you instantiate the writer, and then you can start the write process, saying which sheets

234:16

you want to write with each one of those data frames. So again, reading and writing data

234:21

from and to Excel files is relatively simple. It all depends on the libraries that are installed.
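
The Excel workflow described in this section can be sketched as a round trip: write two data frames into a multi-sheet workbook with `ExcelWriter`, then read it back with `read_excel` and with the `ExcelFile` class. The file name, sheet names, and data are made up, and the sketch assumes an Excel engine such as openpyxl is installed.

```python
import os
import tempfile

import pandas as pd

# Made-up data standing in for the products workbook.
products = pd.DataFrame({"product_id": [1, 2], "price": [9.99, 24.5]})
merchants = pd.DataFrame({"merchant_id": [10, 20], "name": ["Acme", "Globex"]})

# Write to a temporary location so nothing is left in the working directory.
path = os.path.join(tempfile.mkdtemp(), "products_demo.xlsx")

# ExcelWriter lets you choose which data frame goes into which sheet.
with pd.ExcelWriter(path) as writer:
    products.to_excel(writer, sheet_name="Products", index=False)
    merchants.to_excel(writer, sheet_name="Merchants", index=False)

# read_excel reads the first sheet unless a sheet name is given.
first_sheet = pd.read_excel(path)
merchants_back = pd.read_excel(path, sheet_name="Merchants")

# ExcelFile gives a handle on the workbook: list the sheets, then parse one.
with pd.ExcelFile(path) as excel_file:
    sheet_names = excel_file.sheet_names
    products_back = excel_file.parse("Products")

print(sheet_names)
print(products_back)
```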

234:27

It depends on what libraries you have in your current environment, if it's Windows

234:31

or if it's Linux or Mac. The documentation of pd.read_excel

234:40

might have more details for the given platform that you have. So let's see if it names per

234:54

document, if it's not here, it's gonna be in the pandas documentation, but there might

234:59

be a requirement for each one of the platforms on which pandas

235:03

is supported. So just check it out, check for your own platform, if you're

235:11

on Windows, Mac, Linux, how to get those libraries installed.

235:21

So in case you're just getting started with Python, and you might come from another language,

235:26

the objective of this quick section is to show you Python. Ideally, in under 10 minutes,

235:32

I think it's going to take a little bit more. But this is a very, very, very quick reference

235:38

of Python, again, just the high level features of the language, how to use it, how to code

235:44

functions, how to import modules, variables, data types, collections, etc. You can just

235:50

scroll through this notebook, if you want to take less time, I will be providing an

235:54

explanation on top of all the topics, but this is a very good reference of the entire

236:00

language. So to get things started, Python is an old language. It

236:08

has caught more attention in the past five to 10 years. But it's a very old language.

236:14

It's even older than Java. It appeared in the 1990s. And it was created by this person,

236:21

Guido van Rossum. And he's an important actor in our ecosystem; he used to be, I

236:28

think he's still, the one deciding discussions, etc., when it comes to defining features of

236:36

the language, etc. Python is a high level interpreted dynamic language. And this means

236:44

a ton, actually. If we read this entire sentence: interpreted, high-level, general-purpose; this

236:50

is basically a high-level programming language. It's object-oriented, and it also includes

236:56

functional attributes or functional features like functions as first class objects, etc.

237:02

And it also, of course, it supports imperative programming. And it has a wide variety of

237:09

applications: you can do web development with Python, you can do scripting, it's used a lot

237:14

for system development for configuring machines in general. And of course, you can also do

237:22

data science; it has multiple applications. It has a couple of interesting features, like

237:27

indentation for defining blocks, etc., that make it a very good language to get started

237:35

with programming. So if Python is your first language, you should be comfortable with it.

237:40

It's a very good idea. For me, it wasn't my first language. I wish it was, but it wasn't.

237:47

But I, I have taught people programming with Python as their first language. Seriously,

237:54

it's always been very good for them, because Python doesn't have weird things like you might have

238:00

in JavaScript or Java. So it's a very concise language and consistent language to be honest.

238:08

So let's get started very quickly. First of all, you can install Python on

238:13

your own computer, or you can use Notebooks.ai or Google Colab. But if you're installing

238:17

in your own computer, you might see that you can install either Python two, or Python three,

238:24

or actually, if you're reading tutorials online, etc, you might see Python two and Python three,

238:30

the reality is that Python two was deprecated in 2020, so you should not

238:41

use it anymore. There are still ways to install Python two, but it was deprecated. So you

238:47

shouldn't use Python two, you should stick with Python three, which is the evolution

238:53

of the language. It has a ton of fixes from Python two in the way things happen in the

239:00

language that used to confuse beginners. So that's no longer a problem. Python three,

239:05

again, is what you should use. You will read multiple tutorials, etc., that are

239:13

using Python two, you should try using Python three, and sometimes the code will break,

239:19

but the changes to fix it are not very hard. So to get things started here, I will be drawing

239:26

parallels with regular syntaxes. For example, this is the way you will define

239:32

a function in for example, JavaScript. And it's also very similar to something like C

239:37

or Java based languages, the function keyword, curly braces, etc. So I will be drawing a

239:44

parallel with this sort of languages. So, to get things out of the way, the way to define a

239:49

function in Python is in this way. And the main characteristic of this language is that

239:56

the way we're going to define blocks is by using different indentation levels. So this

240:03

is a valid function in Python: def is the keyword we use, then the name of the function, the

240:08

parameters it receives. And the way to define the body of the function is by just indenting

240:15

everything one level to the right. Usually, this is just four spaces.
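
A minimal sketch of a function definition as just described. The function name and its body are made up for illustration; the notebook's own example may differ.

```python
def add(x, y):
    # The body is everything indented one level (four spaces here).
    result = x + y
    return result


# Back at the left margin, we are outside the function again.
total = add(2, 3)
print(total)
```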

240:21

Another example is an if else statement. So if this thing happens, do that if else do

240:27

something else, right? This is JavaScript. In Python, again, it's defined by indentation.
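
A minimal sketch of an if/else statement in Python, including a nested block, with made-up variable names and messages. Each new block is opened by a colon and one more level of indentation, and it ends when the indentation goes back.

```python
language = "Python3"

if language.startswith("Python"):
    # First block: indented one level.
    message = "Hello from Python"
    if language.endswith("3"):
        # Nested block: indented one more level.
        message += " 3"
else:
    message = "Hello from another language"

# Back at the left margin: this line runs regardless of the branch taken.
print(message)
```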

240:32

If this thing happens, we indent one level to the right, do this else do something else,

240:38

if there was another if statement here, if, I don't know, language ends with something

240:47

like I don't know, three, then do something else. Print pi, three, for example. So we're

240:58

indenting everything to the right, every time we start a new block, whenever the block finishes

241:04

is just when you go back again: this print is in the first block, right? That's the way it's

241:10

going to work, by indenting our blocks. This is very good because, first, we don't

241:16

have debates of where we should place the curly braces. And also, because it makes it

241:24

a lot more readable. It's a lot easier to read this code, because there is obligated

241:30

obligatory indentation to even make the code work. So you can see, that's just

241:38

how it works. How are we going to make comments in Python? Just by using the number sign (#),

241:44

there we go. And the way to define variables is just by specifying the name. So Python

241:53

is a language in which you don't need to declare variables; you just declare and define everything

241:59

in just one pass, you know, you define a variable as you go. Python is dynamically

242:08

typed. But it's also strongly typed. And this might kind of cause confusion. But basically,

242:13

you can assign variables to any value you want. And you will see that collections etc,

242:21

are heterogeneous in terms of types, etc. It is a very dynamic language. Talking about

242:27

types, I'm going to show you the most important types that we have in Python, especially we

242:32

have numbers, of course: integers. We don't have so many, like you might find in

242:38

other languages, like different precisions, etc. We have integers; there is also the

242:42

concept of longs; that has changed from Python two. On Python three, to be

242:48

honest, we use just integers, that's the way we work. It's a, it's a smart enough type

242:53

to save storage when needed. So that's good. And we also have floats, right,

242:58

which is the regular float type for floating point arithmetic in other languages. And of

243:06

course, it suffers, if you want, from the strange behavior of floating point arithmetic,

243:12

like in this case. You can prevent that by using the decimal module, which, as you can

243:20

see, doesn't suffer from this issue. So numbers: we have integers, floats, and we

243:26

also have decimals. Strings are just of type str, and they are defined literally, right,

243:33

as in this one you can see right here. You can just type the string as it goes. There

243:40

was a difference in Python 2 between Unicode and

243:46

strings, etc. In Python 3, that has all been fixed. So with Python 3, this is all

243:52

Unicode. And there is a conceptual difference between something being

244:00

text, that is, the Unicode code points of the string, and the underlying encoding that will

244:08

turn it into binary. So in Python 3 we still have a few ways to differentiate between

244:15

whether it's a binary string or whether it's a text based string. But you shouldn't worry

244:20

about it. I just want you to know that if you're reading a Python tutorial, for example, you

244:24

might find a difference between Unicode strings and regular strings, which is no longer

244:30

something that we should be worrying about. If you have a string that's too long and

244:35

it spans multiple lines, you can always write it using three quotes, which can be double

244:40

quotes or single quotes.
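To make the number and string types just described a bit more concrete, here is a minimal sketch. The values and variable names are made up for illustration and are not taken from the notebook shown in the video.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 exactly, so the sum drifts.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)                 # False

# Decimal values built from strings keep exact decimal digits.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))    # True

# Triple quotes (single or double) let a string literal span several lines.
poem = """Roses are red,
violets are blue."""
print(poem.count("\n"))                 # 1

# In Python 3, str is text (Unicode); encoding it produces bytes.
encoded = "café".encode("utf-8")
print(type(encoded))                    # <class 'bytes'>
print(len("café"), len(encoded))        # 4 5
```

The last two lines show the text versus binary distinction: four code points in the string, five bytes once encoded as UTF-8.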

244:44

So creating multi-line strings is extremely simple. Booleans: there are two Boolean

244:52

objects. Boolean objects are unique, right? They're kind of singletons, which are the True and

244:58

False objects. They are of type bool. There is also the concept of null in Python,

245:04

which is None. We don't have null, we have None, but it serves pretty much the same purpose.
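A small sketch of the Boolean and None objects mentioned here; the variable names are hypothetical.

```python
is_active = True
print(type(is_active))      # <class 'bool'>

# None is Python's "no value" object; it is the only instance of NoneType.
result = None
print(type(result))         # <class 'NoneType'>

# An identity check with `is` is the idiomatic way to test for None.
print(result is None)       # True

# True and False are unique objects, so a comparison always yields one of them.
print((1 < 2) is True)      # True
```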

245:10

In Python, everything is an object. So even these strange objects, like None, will

245:17

have an associated class, if you want. Everything in Python is an object. So all these types

245:25

that you have seen. So for example, we have this string, which is age as a string. The type

245:32

is str. You can use the int, str, float, and bool types, right, the ones you get from the type function,

245:44

also as functions. So, in this case, in order to cast a string

245:53

into an integer, you will do it using the int function, which is the

245:59

same thing that you get with these, for example, so this is the same as this, as you can see,

246:09

what we have to show. So, functions: def is the keyword we use. We don't use

246:15

function, we use def; you can use "define" as a mnemonic. Then comes the name of the function; parameters

246:21

are optional, and finally we have the return keyword. You should always include a return;

246:27

usually, 99% of the time, the function should return something, because that's going

246:34

to be the result assigned once we invoke the function. This is pretty regular. If your

246:39

function doesn't return anything explicitly, meaning you haven't written down

246:45

a return statement anywhere in your function, the function will still return something. So

246:51

the fact that you haven't included a return statement explicitly doesn't mean that

247:00

the function is not returning anything. Implicitly, it actually is returning something: it's

247:04

returning None, right? So by default, if you don't include a return, Python will do this.
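The implicit None can be seen in a short sketch. The two functions below, `add` and `greet`, are hypothetical examples, not the ones from the notebook.

```python
def add(x, y):
    """Return the sum of the two arguments."""
    return x + y


def greet(name):
    """Print a greeting; note that there is no return statement."""
    print(f"Hello, {name}!")


total = add(2, 3)
print(total)              # 5

outcome = greet("Ana")    # prints the greeting
print(outcome is None)    # True: the missing return became an implicit None
```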

247:12

Just so you know, a function always returns something. Specifying parameters and passing

247:20

parameters is pretty standard. Python has some advanced features with parameters like

247:25

for example, variable-length arguments, where we can pass as many arguments as we want to make

247:30

it very dynamic, keyword arguments, named arguments, etc. So, all the arithmetic operators you know

247:37

already: division, modulus. In this case, we're doing a power operation. All this is pretty

247:49

standard. And the same thing happens with all our Boolean operators: greater than, greater

247:54

than or equal, etc. There is type checking. So this is where we see the strongly typed

248:02

feature. Even though Python is dynamically typed, the types are enforced. In this

248:08

case, you cannot compare a 2 with this; it doesn't make any sense, and Python is going to complain

248:14

about that. So this is an example of an error in Python. The exception TypeError was raised.

248:22

And the same thing with Booleans and the not, and, and or operators. As we saw before, control flow is

248:29

defined by the indentation so every new block is defined with an indentation level. Python

248:37

includes if, else, and also elif, which is very convenient. And this is an example: if this

248:43

happens, elif, elif, etc. Python does not have a switch statement, for example. Loops:

248:52

how are you going to loop through something in Python? Loops and lists,

248:56

or

248:57

collections in general, are very interconnected. Because in reality, when you're looping in

249:03

Python, you're not doing a regular... In Python, we don't have something like, in Java,

249:09

you're gonna have something like int i equals zero.

249:13

What else I

249:14

it's been decades since I've coded in Java. So, I don't know, i

249:22

less than 10. And here we do i plus plus. There you go. So we don't, we

249:34

don't have this in Python. We have a way to mimic it. But in Python, we always

249:42

iterate over a collection. So what we're going to do is we're going to create a range of elements,

249:47

and we're going to iterate over it. So the way it works is very close to what in other languages

249:54

is going to be a for each. Alright, so in this case, we have all these elements and

249:59

we're going to do for name in names, that's it. And at any moment, the name is going to

250:04

be associated with an element in the list. While loops are part of the language, but they

250:11

are usually discouraged in favor of for loops. If something can be coded with a for loop,

250:16

it should be coded with a for loop and not a while loop. Because as you might know, already,

250:20

this might result in an infinite loop if you're not checking the

250:26

conditions correctly. So, the collections we have in Python, the fundamental ones,

250:33

the primitive ones, the most important ones, are, first, the list. In Python we make heavy

250:38

use of lists. And it's just a heterogeneous data structure. So you can put anything in

250:44

it. And actually, all these collections are heterogeneous; you can mix values as you

250:49

want. And in this case, we have the elements that we have added: one string, one integer,

250:54

one string, and one Boolean. And let me say something here. Even though Python

250:59

supports mixed types in the collections, it doesn't mean that you should do it. To be

251:06

honest, you should usually avoid mixing types in collections, because that

251:10

means we don't know what we're putting in them, right? So we should be consistent.

251:16

So, if possible, revisit your code if you have too many different types in a collection. Checking

251:23

the length is done with the len function. Accessing elements is zero-indexed, and we use square brackets.
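A minimal sketch of the list basics covered so far. The element values are invented for illustration; only the mix of types follows what was described.

```python
# A list can hold mixed types, although consistent types are easier to reason about.
items = ["hello", 42, "world", True]

print(len(items))    # 4

# Indexing starts at zero and uses square brackets.
print(items[0])      # hello
print(items[1])      # 42

# A more consistent alternative: every element has the same type.
names = ["Ana", "Luis", "Marta"]
print(all(isinstance(n, str) for n in names))    # True
```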

251:30

So in this case, give me the first element, give me the second element. And also we can

251:34

index starting from behind, from the end. So in this case, minus one, minus two,

251:40

minus three. So in this case, minus one and minus two, again, give you different elements. You

251:46

can check the operations associated with all these collections. Very quickly, for a list, it's

251:55

l.append: we're going to append the new element. So the list now has that element

252:00

at the end. And we can check if an element is part of the list: in this case it's True, in

252:09

this case it's False. Tuples are similar to lists; they are also sequences, but the main

252:17

difference is that they are immutable: there is no way to add new elements to a tuple,

252:21

or remove elements from a tuple once it has been created. So in this case, we have created

252:27

a list with three elements. No, a tuple, sorry, with three elements. We can access it, we

252:32

can check if something is in it in the same way that we did with a list. But in this case,

252:37

with a tuple, again, you cannot modify it. A tuple never changes; you can't add elements

252:43

to it. Another important data structure is a dictionary. In Python, a dictionary is a

252:50

key-value mapping, right? It's similar to an object in JavaScript or a hash table in

252:57

Java. It's a key-value mapping type. And in this case, we are going to associate

253:03

values to names, as you can see here. The way I like to explain it is, if you create

253:09

a list, right? So let's say we're going to create a list out of all these elements,

253:14

give me one second, we're going to create a list. There we go, we're gonna copy these

253:22

elements. And we're gonna associate that to our list. There you go. So this is a list;

253:32

we could very well store the information about our customers in a list, right? That works.
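The contrast being drawn here, the same record stored as a list versus as a dictionary, can be sketched as follows. The field names follow the ones mentioned in the lesson (name, email, age, subscribed); the values are hypothetical.

```python
# Storing the record as a list works, but each field is known only by position.
user_as_list = ["Jane Doe", "jane@example.com", 30, True]
print(user_as_list[1])         # jane@example.com (position 1 happens to be the email)

# A dictionary maps explicit keys to the same values.
user = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "age": 30,
    "subscribed": True,
}
print(user["email"])           # jane@example.com
print("age" in user)           # True
print("last_name" in user)     # False
```

With four fields the list is still manageable; the dictionary pays off once the record grows and positions become hard to remember.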

253:38

I mean, I can get it done. The problem is that whenever I need to access information

253:43

about this list, we're going to say, for example, I don't know, give me the email for

253:49

this

253:50

customer, I have to remember the position where the email is located. So in this case

253:56

it's going to be position number one. If this information grows, and instead of having one, two, three, four

254:03

values or four pieces of information for our user, we have 100. Right, then it's gonna

254:10

be very hard to access those individual values. So that's why we create dictionaries. Dictionaries

254:16

are collections of values. The important part is on the right, the important part is the

254:22

value. But instead of just being indexed by position, we give them arbitrary names;

254:31

we give them very explicit names: this is the name, this is the email, this is the age.

254:37

And this is if they are subscribed or not. So once we create this dictionary, we can

254:43

access those values by name: give me the email of this user, or is the age present

254:50

for the user, is the last name present in the user dictionary. So

254:55

again, it's a way to store information, associating labels in order to make it simpler for us later.

255:02

Let me delete this and move on to sets. Sets are a very common data structure when

255:10

you're learning about collections and data structures in general. It's

255:16

not so common in many languages; I mean, it's not very popular. In Python, we use it

255:22

often because it has a very interesting feature, first of all, and it's something that I forgot

255:28

to tell you about dictionaries: both sets and dictionaries are what we call unordered

255:35

data structures: you never know the order of the elements. In recent versions of Python,

255:44

there have been changes, which make Python dictionaries ordered. But for now, I'm going

255:53

to say you shouldn't rely on it. You should think of your dictionaries as if they were completely

255:57

unordered data structures, and the same thing for sets. A set is a bag that contains

256:05

elements, you know, it's a big bag, you keep throwing elements inside of the set, there

256:11

is no order in it. And what's going to happen with it? You're going to add elements, for example,

256:18

to the set, or you're going to remove elements from the set. And there is one important thing

256:24

that makes this set so useful, and it's the membership operation, I'm gonna

256:30

write it down here: membership operation. There you go. So you can access these notebooks

256:38

later.
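A short sketch of the set behavior described up to this point. The literal below is an assumption, chosen only because it contains repeated values like the example on screen.

```python
# Repeated values collapse: a set keeps only unique elements.
s = {3, 1, 3, 7, 9, 1, 3, 1}
print(len(s))      # 4

s.add(10)          # add an element
s.remove(3)        # remove an element (raises KeyError if it is absent)

# Membership test with the `in` operator.
print(9 in s)      # True
print(3 in s)      # False
```

Sets are hash-based, so the `in` check takes roughly constant time on average, whereas the same check on a list has to scan the elements one by one.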

256:39

So the membership operation, the process of checking if something, say, nine

256:46

in s, the process of checking this is extremely fast; it's what's called O(1). And this is

256:59

because as you might have seen here, when I created this set, I included a couple of

257:05

repeated elements, 3, 3, 3, right, 1, 1, 1, 7, 9. The resulting set doesn't have those repeated elements.

257:17

These are two features of the set: the set will only contain unique values, and the

257:24

way it's implemented behind the scenes makes those unique values extremely

257:31

simple to check. Whenever you use this membership operation, it is extremely simple, or sorry,

257:38

extremely performant. It's very fast, different from, for example, a list. So keep in mind:

257:45

sets are very, very useful when you're checking for members. So again, as I told you before,

257:53

we're going to iterate over collections with the for loop. So in this case, if we have

257:57

a list, it's going to be for element in list. There you go. If you have a user dictionary,

258:02

a user dictionary, sorry, in this case user, the default iteration is by

258:10

key: we're going to get name, email, age, subscribed, and we have to extract the value

258:18

out of the dictionary. We could also do for value in user.values(). There

258:32

you go. Or you can iterate over both key and value with items. Key. And value. There you

258:45

go. So iteration in Python is very readable, to put it one way. And again, remember,

258:55

we're always using the for loop that assumes that you're iterating over a collection. We

259:00

don't have the for i equals zero, i less than 10, i plus plus.

259:16

We don't have that, right? In Python, we can simulate it with for i in range(5), for

259:24

example, print. We can simulate it with the range function, which generates pretty much

259:32

those elements. Something that you might have heard about Python is that it has a huge library

259:40

of built-in modules, right, that you can just import and they're just going to work. There are so many

259:47

things already coded in Python, that it makes it very simple for you to create something

259:54

on top. Do you want a library for, I don't know, security, cryptography, math, numeric processing,

260:02

NumPy, right? Machine learning, web development, creating games? There is Pygame. Do you

260:09

want to create a graphical user interface, whatever you want to do, there is usually

260:14

a library that has already been coded and will make your job easier. On top of that,

260:21

there's the built-in standard library, right, which is already included with Python; it's

260:25

not third party. In this case, it's already created by the Python core team. It's a huge

260:35

library with so many modules. And the way it works is by importing the module. So this is the

260:42

way we work with packages and modules. There are differences between modules and packages,

260:46

third party and built-in, but this is a little bit more advanced. But again, this is the

260:51

random number generator, it's already built in. And you can check the docs

260:56

right here.
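As a sketch of importing a standard library module, here is the random number generator in use. The specific calls and the seed value are chosen for illustration and may differ from what the notebook runs.

```python
import random

# Seeding makes the sequence repeatable; the seed value here is arbitrary.
random.seed(0)

roll = random.randint(1, 6)    # integer between 1 and 6, both included
pick = random.choice(["red", "green", "blue"])

print(1 <= roll <= 6)                       # True
print(pick in ["red", "green", "blue"])     # True
```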

260:58

Exceptions: whenever you do something that doesn't work. So in this case, we say, if

261:04

the age is greater than 21, but age is a string, it's not an integer, so this is going

261:10

to fail. We can catch exceptions when they happen; that's going to be with a try and

261:19

except block, right? In that case, if this fails, if anything here fails, this block is

261:25

going to kick in, and you can catch the exception without the program failing.
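A minimal sketch of that try and except flow, assuming an `age` variable that holds a string. The value "30" is hypothetical, and the `int()` fallback is one possible way to recover, not necessarily what the notebook does.

```python
age = "30"    # a string, not an integer (hypothetical value)

try:
    # Comparing a str with an int using > is not supported and raises TypeError.
    is_adult = age > 21
except TypeError as error:
    print(f"Could not compare: {error}")
    # Casting with int() gives the comparison a consistent type.
    is_adult = int(age) > 21

print(is_adult)    # True
```

Naming `TypeError` in the except clause keeps unrelated errors from being silently swallowed.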

261:33

And you can be more explicit about the error you expect. So again, this is just an introduction.

261:40

It might be useful if you're coming from another language, especially to keep this notebook

261:45

as a reference. We're going to be using Python a lot, of course, and it's a great language

261:50

if you want to do scripting, web development, and of course data processing, data analysis,

261:55

visualizations, machine learning, etc. Python is just great. So I hope this tiny, tiny review

262:03

lesson helps you port your knowledge from other languages into Python. And that's it.
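As a compact recap of the looping and branching forms covered in this lesson, here is a small sketch. All names and values are hypothetical.

```python
names = ["Ana", "Luis", "Marta"]
for name in names:                  # "for each" over a list
    print(name)

user = {"name": "Jane Doe", "email": "jane@example.com", "age": 30}

for key in user:                    # default iteration is by key
    print(key, user[key])

for value in user.values():         # values only
    print(value)

for key, value in user.items():     # key and value together
    print(key, value)

# range() stands in for the C/Java-style counter loop.
squares = []
for i in range(5):
    squares.append(i * i)
print(squares)                      # [0, 1, 4, 9, 16]


# if / elif / else; blocks are delimited by indentation.
def describe(n):
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"


print(describe(-3), describe(0), describe(7))    # negative zero positive
```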
