TRANSCRIPT (English)

SWIGGY Data Pipeline | End To End Data Engineering Project In Snowflake

2h 1m 33s · 17,746 words · 2,584 segments · English

FULL TRANSCRIPT

0:00

Swiggy is a well-known food delivery and

0:02

quick Commerce platform in India I'm

0:05

sure many of us have used Swiggy to order

0:08

food or groceries and every time we do

0:11

those flashy offers pop up on our mobile

0:14

screens tempting us to place an order

0:16

before the deals vanish I'm sure you

0:19

know that these deals are not created

0:21

randomly. Platforms like Swiggy use a data

0:24

driven approach to design these offers

0:26

and customize them based on their user

0:29

preferences the goal is to retain the

0:31

users and encourage them to spend more

0:33

and more on the platform so it helps

0:36

increase their overall Revenue as well

0:38

as Revenue per customer per

0:41

order according to data published in

0:43

newspapers, Swiggy handles 1.4 million food

0:47

orders daily across India that's nearly

0:50

1,000 orders per minute. Last year this

0:53

number was

0:54

700k and now it has doubled. They operate

0:58

in 500 cities partner with 140k

1:01

restaurants and have 200k active

1:04

delivery executives. Incredible, isn't it?

1:07

this clearly shows how fast they are

1:09

growing and to sustain this growth they

1:12

must have a powerful data and analytics

1:15

platform to support their data driven

1:17

decision making millions of users

1:19

actively use platforms like Swiggy and all

1:22

their transactions like placing the

1:24

order Etc and mobile activities are

1:26

stored in OLTP systems and some SaaS

1:29

platforms respectively. These OLTP

1:32

systems could be one or more RDBMS under

1:35

the hood to support transactional

1:37

business processes the transactional

1:39

data that is captured by the OLTP system

1:42

then move to their data warehouse or

1:44

data Lake platform for reporting

1:47

Advanced analytics and machine learning

1:49

workloads to handle such high volume and

1:52

highspeed data many of us might think of

1:55

massive parallel processing so-called

1:57

MPP or distributed computing like Apache

2:00

Spark or Databricks, or cloud services

2:03

like AWS data services or Azure data

2:06

services however in this end to end data

2:08

engineering Hands-On tutorial we will

2:10

use snowflake data platform where you

2:13

don't need to worry about any

2:14

infrastructure setup and we will learn

2:17

how easily you can build a powerful

2:20

self-service data warehouse system in

2:22

just a few

2:23

hours. You might be wondering: why Swiggy

2:26

for this end to end data engineering

2:27

project? Well, to truly learn end-to-end

2:30

data engineering you need to solve a

2:32

real life business problem and for that

2:35

you need some understanding of how a

2:37

business application works and how it

2:40

stores data, before diving into the data

2:42

engineering and analytics activities we

2:45

have all used the swiggy or similar food

2:47

delivery apps so as we go through this

2:50

project it will be much easier to

2:52

connect the dots and understand why we

2:54

are doing what we are doing any food

2:57

aggregator app will have restaurant

2:59

entity that will have catalog or a menu

3:02

to serve the food it will be clubbed

3:05

with some discount and promotions there

3:08

will be orders delivery locations and

3:10

payment activities and then delivery

3:13

executive will deliver the food and

3:15

finally a rating for the service it is

3:18

also easy to understand that what kind

3:20

of reporting requirement such food

3:22

aggregator will have to monitor the

3:25

health of the business month over month, and

3:28

hence their Founders and leaders would

3:30

like to know total revenue average

3:32

revenue per order average revenue per

3:35

item top performing restaurants Revenue

3:37

Trend over time Revenue by customer

3:40

segment Revenue by restaurant and

3:42

location delivery performance Geographic

3:45

Revenue insight and many more such

3:48

metrics and that's why we will use this

3:51

food aggregator example to learn end to

3:53

end data engineering

3:55

work so who is this course designed for

3:58

if you are a data engineer or a data

4:00

analyst or a SQL Developer or a python

4:03

developer working on a data engineering

4:05

project or a cloud developer and wants

4:07

to learn how to build an end to end data

4:10

warehouse platform using snowflake then

4:13

this course is for you if your next

4:15

question is do I need to buy any tool to

4:18

complete this Hands-On guide the answer

4:20

is no we will use the free trial edition

4:23

of Snowflake, which provides $400 in

4:26

free credits and lasts for 30 days. All

4:29

the exercise in this tutorial can be

4:31

completed at no cost your only

4:33

investment is your

4:35

time it is important to let you know

4:38

what is not covered in this tutorial

4:40

when you say an end to end data project

4:42

it starts with extracting data from

4:45

Source systems and pushing it to the

4:48

data platform automatically however in

4:50

this project we will use synthetically

4:53

generated data in CSV format of

4:55

different entities we will load this

4:58

data using snowflakes file upload

5:00

feature via the Snowsight web UI, which allows up

5:04

to 250 MB of data to be loaded at one

5:07

time if we have to visualize the overall

5:10

architecture diagram of this end to end

5:12

data project this is how it looks like

5:15

we will load the data using snowflakes

5:17

file load feature into the stage

5:19

location after that the data will go

5:22

through a series of Transformations and

5:24

finally we will use the Streamlit app to

5:26

visualize the kpi for the food

5:28

aggregator business

5:32

a food aggregator platform like Swiggy has

5:34

many business processes and it is

5:37

impossible to cover them all in one

5:39

tutorial so in this guide we will focus

5:42

on their food ordering subprocess the

5:44

various tables and data set that need to

5:46

be ingested and how they are interlinked in

5:49

their OLTP

5:50

system welcome to my channel data

5:53

engineering simplified the only YouTube

5:55

channel offering practical snowflake

5:57

video tutorials for both beginners and

5:59

expert data

6:01

professionals if you are curious where

6:03

you can find source code and the data

6:05

files that are used in this tutorial

6:08

please find the link below in the

6:09

description section and before we

6:11

proceed a quick note all the Hands-On

6:14

exercises are done using Snowflake's free

6:16

trial Enterprise Edition hosted on

6:19

AWS. I suggest watching this video in 4K

6:23

resolution and if you are a fast learner

6:26

consider speeding up the video to

6:28

1.25x or 1.5x you can also change the

6:32

language and listen to the tutorial in your

6:34

preferred language like German or

6:37

Hindi if you are looking to learn in a

6:39

more structured way with complete source

6:42

code and data files be sure to check out

6:44

my courses on Udemy and some of them are

6:47

also available through Udemy Business. All

6:49

of my courses have a rating of 4.7 or

6:53

higher. For queries, suggestions, or

6:56

feedback drop a direct message to my

6:58

Instagram account the link is given

7:00

below and yes if you would like to stay

7:02

updated on Snowflake's new and popular

7:05

features, end-to-end data engineering

7:07

projects architectural concept live demo

7:09

videos don't forget to subscribe to this

7:12

channel data engineering simplified so

7:14

let's start this

7:16

tutorial so before any food order is

7:18

placed food aggregator needs to Target a

7:21

location where they will serve once

7:23

location is finalized they onboard

7:26

restaurants each restaurant then will

7:28

have their menu uploaded into the system

7:30

then the customer will login using their

7:32

email or a phone number create a

7:35

customer profile including their

7:37

delivery location or so-called customer

7:40

address then this customer will search

7:43

and place an order once order is placed

7:46

the Delivery Agent will be picked for

7:48

delivery and then delivery will happen

7:51

so if you look into this picture at very

7:54

high level minimum 9 to 10 such entities

7:56

or tables are involved to support the

7:59

order business subprocess in real life

8:02

it would be much more than this table

8:05

but for this tutorial we are considering

8:07

9 to 10

8:08

entities now first let's go through the

8:11

ER model that represents these entities

8:14

with relationships. I'm using DBeaver Community

8:17

Edition to draw this ER model and I am

8:20

assuming my OLTP system is hosted on a

8:22

Postgres RDBMS database. And if you use DBeaver and

8:26

if there are relationship between the

8:28

tables this ER diagram will come

8:30

automatically you don't have to draw

8:31

that and using reverse engineering you

8:33

would be able to get this diagram

8:35

through DBeaver. So this is my customer entity

8:40

and each customer has one or many

8:43

addresses so there's relationship

8:45

between these two entities here on the

8:47

other side I have this

8:49

restaurant and each restaurant will have

8:53

menu item again this is one to many

8:56

relationship and on the other side we

8:58

have this Delivery Agent

8:59

and this is my delivery table whenever

9:03

an order is placed by the customer the

9:06

order table gets one record and order

9:10

item may get one or more record based on

9:13

the number of order item per order and

9:16

this is again one to many relationship

9:18

because each customer may login multiple

9:21

time during this order process and this

9:24

is my location entity where the location

9:27

is tracked so when I click on this thing

9:30

it says customer has a relationship and

9:32

this is one to many when I click on this

9:34

customer to login and here my customer

9:36

to order likewise if I say this is menu

9:40

this is order item is also linked with

9:43

menu and here this is order to order

9:45

item this is my order to delivery and

9:49

this is my delivery to Delivery

9:52

Agent so whenever you are working on any

9:55

data engineering project for any domain

9:57

it is very very important to understand

9:59

The Source system ER diagram because

10:02

based on that relationship only you

10:04

would be able to design your data

10:06

warehouse table under different layers

10:08

let's see how the data looks like

10:09

quickly so if I say customer this is how

10:12

the synthetic data looks like here I

10:14

have a customer ID customer name mobile

10:17

email address a login using Gmail or

10:21

Facebook account, and this is gender,

10:23

date of birth,

10:25

anniversary, and different food

10:27

preferences when this record is created

10:30

and when this record is modified okay
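Based on the columns walked through above, the source-side customer and customer address tables might look roughly like this in Postgres; every name and type here is an assumption inferred from the walkthrough, not the actual source DDL:

```sql
-- Hypothetical OLTP source tables, reconstructed from the columns
-- described in the walkthrough (names/types are assumptions).
CREATE TABLE customer (
    customer_id    SERIAL PRIMARY KEY,
    name           TEXT NOT NULL,
    mobile         TEXT,
    email          TEXT,
    login_by       TEXT,      -- e.g. 'Gmail' or 'Facebook'
    gender         TEXT,
    dob            DATE,
    anniversary    DATE,
    preferences    TEXT,      -- food preferences
    created_date   TIMESTAMP, -- when the record was created
    modified_date  TIMESTAMP  -- when the record was last modified
);

-- Each customer has one or many addresses (1:N), so the address
-- table carries customer_id as a foreign key.
CREATE TABLE customer_address (
    address_id     SERIAL PRIMARY KEY,
    customer_id    INT REFERENCES customer (customer_id),
    flat_no        TEXT,
    house_no       INT,       -- numeric in the sample data
    floor          TEXT,
    building       TEXT,
    landmark       TEXT,
    created_date   TIMESTAMP,
    modified_date  TIMESTAMP
);
```

This mirrors the one-to-many relationship shown in the ER diagram: deleting nothing, just one parent row per customer and zero or more child rows in the address table.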

10:32

now going to the customer address

10:36

book so here I have customer ID as a

10:40

foreign key rest of the information is

10:43

available here like flat number house

10:45

number floor building landmark and this

10:48

if you pay attention this is a number

10:51

field this is a text field and so this

10:54

tool also helps you to verify the data

10:57

type and you can decide by looking at

10:59

the data type what kind of data type you

11:01

will have it when you host this data

11:02

into your data platform now let's

11:05

quickly see the delivery

11:07

agent now this is the Delivery Agent

11:09

where Delivery Agent ID is a primary key

11:12

then we have a name phone number vehicle

11:14

type location ID status rating and

11:18

created date. If I go to the

11:20

delivery so here when I talk about

11:23

delivery so delivery is a transaction

11:25

table where every delivery will be

11:27

mapped to an order ID and it will be

11:29

delivered by a Delivery Agent so this is

11:32

a child table of order and Delivery

11:34

Agent and for every delivery primary key

11:37

you will have this foreign key

11:39

associated with that and here we have a

11:41

delivery status estimated time and

11:44

address ID and so on now let's talk

11:46

about restaurant. So here for our end-to-end

11:50

data processing we have generated quite

11:52

a lot of synthetic data and uh this data

11:55

only belongs to location New Delhi and

11:58

we are not considering all the cities so

12:00

for simplicity I am only considering

12:02

one location and these are the number of

12:04

restaurant we have it under this

12:06

location so this is a master table where

12:08

you have the name of the restaurant type

12:11

of Restaurant pricing for two okay and

12:14

restaurant phone number operating hours

12:16

location ID if you see the location ID

12:18

is always one because this belongs to

12:21

one location and then we have active

12:23

flag open status and all those things

12:25

again this is a completely synthetic

12:27

data if I go to the location let's see

12:29

how the location looks

12:32

like so if you see location I have a

12:35

Delhi Delhi state ZIP code an active

12:37

flag and all those other information and

12:39

this is a standard created date and

12:41

modified date which is all technical

12:43

field so this is my another master table

12:47

okay and uh this will be converted into

12:49

a location Dimension when we get into

12:51

our data warehouse platform now let's go

12:54

to menu so menu is a child master table

13:00

and uh here this is a restaurant ID okay

13:04

so for restaurant one I have two menus

13:06

restaurant 2 I have quite a lot of menu

13:08

again restaurant 5 has a menu so the

13:11

data is not very consistent this is for

13:13

only the demonstration purpose but this

13:14

will be converted into another dimension

13:17

when we move this into the Snowflake data

13:19

warehouse platform now let's go to the

13:21

order so when you look into the order ID

13:23

which is a primary key of this table

13:25

each order belongs to a single customer

13:29

and I can have many orders for one

13:31

single customer then they will place an

13:33

order from a particular restaurant so

13:35

this is a restaurant ID foreign key this

13:37

is a date when the order is placed the

13:39

total amount what is the status

13:41

currently uh payment method so these are

13:44

the important

13:45

information this is not a master table

13:48

this is a transaction table now let's

13:50

see how the order item looks like so one

13:53

single order can have multiple order

13:55

item this is a menu ID because you can

13:57

have one or many order item as a part of

13:59

your order this is a quantity total

14:02

price and subtotal and when it is

14:04

created and when it is

14:06

modified

14:07

this is also a transactional table

14:10

when we are going to create a fact and

14:12

dimension The fact granularity will be

14:14

at the item level and not at the order

14:16

level so every Enterprise grade data

14:20

project starts with source analysis

14:22

sometimes you may have access to the ER

14:24

diagram of the source system and

14:26

sometimes you may not have it. However, you

14:28

need to understand relationship between

14:30

two tables and what is the data type

14:32

what is the volume of the data how the

14:34

data is being generated whether it is a

14:36

master table or a transaction table

14:38

because this information is very very

14:40

important to map it when you try to

14:42

create dimensional modeling and we will

14:44

see that at the later part of this video

14:46

so in this section we have already seen

14:47

how the ER diagram looks like for our

14:50

food aggregator system and how the data

14:52

looks like. So the first part of the tutorial

14:54

will deal with subset of data and we are

14:57

going to see them in a CSV form so let

15:00

me jump into my vs code editor and show

15:02

you how those sample data looks like

15:05

this is my vs code editor where I have

15:08

already kept those generated data under

15:11

different folders. In this end-to-end data

15:13

project we will demonstrate how we can

15:16

have the initial load and then how we

15:18

can process the Delta load which is a

15:20

very common pattern in many batch

15:22

processing systems. This is my location

15:24

data, restaurant data, customer, customer

15:28

address, menu,

15:29

order, order item, delivery agent and

15:32

delivery CSVs. For each of these entities I

15:36

have a handful of records. So first we will

15:38

build end to end solution then we'll go

15:41

and try to load thousands of record and

15:43

this is a common practice in any data

15:45

engineering project where business team

15:48

will give you a sample data set and

15:50

based on the sample data set you will

15:52

build your entire NN data Pipeline and

15:55

then once it is deployed you would be

15:57

able to validate the full data set. So if

16:00

you're looking to this location I have

16:01

this location five row CSV and then I

16:04

have delta data which simulates insert as

16:07

well as insert followed by update

16:10

likewise if I go to the restaurant I

16:12

have only five restaurant and then in

16:15

the Delta file I have again handful of

16:19

restaurant then this is my customer

16:21

where I have this customer detail and

16:24

then you can see customer insert and

16:26

customer insert

16:27

update then I have customer

16:31

address then we'll go to the menu this

16:33

is how the menu looks like and if you

16:35

see here I have some invalid record and

16:38

in any data engineering project you will

16:40

see a lot of inconsistent data sets which

16:42

you need to handle as a part of your

16:44

transformation

16:46

layer. Now I have a handful of orders here

16:51

then I have the Delta order coming in

16:53

two different

16:55

files likewise we have order item and

16:58

then you have order item Delta

17:02

files this is the delivery agent and

17:06

this is the Delivery Agent Delta files

17:09

finally we have delivery transaction

17:11

table and these are the two Delta

17:14

files this is how my CSV file looks like

17:17

which were exported from my Postgres database

17:20

as a handful of records

17:33

now let's quickly understand the overall

17:35

end to end data architecture for this

17:38

food aggregator data platform we already

17:41

have a lot of CSV files per entity and

17:44

those CSV files would be loaded through

17:47

the file loader in

17:49

Snowflake. Those CSV files would be

17:52

finally placed into the stage location

17:54

and from the stage location we will run

17:57

the copy command, and that copy

17:58

command will load the data into a table

18:02

now from the table the data will move

18:05

further under the clean layer and this

18:07

clean layer is nothing but a schema

18:09

where we will do a basic cleansing

18:11

processing and transformation and from

18:14

this clean layer the data will finally

18:16

move to the consumption layer where we

18:19

are going to perform the fact and

18:21

dimension table we will go step by step

18:24

starting from our location entity in our

18:27

order item fact table once the data is

18:29

available in fact and dimension table we

18:31

will create some View and display the

18:33

data through our Streamlit app for our

18:35

end users. 90 to 95% of data engineering

18:38

project will follow this architecture

18:41

where you may have a slightly different

18:42

naming convention some organizations say

18:45

raw or stage some organization will give

18:48

names like landing and curation, and here

18:51

in the consumption they may give a name

18:52

called the modeled layer. If you follow the

18:55

Databricks architecture, they call this

18:57

architecture the medallion architecture,

18:59

where it starts with bronze, then

19:02

silver and then gold as you go from left

19:04

to right the data quality increases and

19:06

become more and more

19:08

consumable at the end of this process

19:11

what we are trying to achieve by

19:13

following this overall architecture our

19:15

primary objective is to consume Source

19:18

data into a data platform and after it

19:21

goes through set of transformation it

19:23

will form the data which follows the

19:26

star schema dimensional modeling where

19:29

you have all the master data like a

19:31

customer customer address restaurant

19:34

menu restaurant location Delivery Agent

19:37

will be converted into a dimension table

19:39

and order item along with order table

19:41

will be converted into a order item fact

19:44

so your clean schema would primarily

19:46

represent the OLTP-style data set and

19:50

your consumption schema will have the

19:52

data available in form of dimensional

19:54

modeling. So let's go to our Snowsight web

19:56

UI and start working on a code
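The star schema just described (master tables become dimensions, order plus order item become an item-grain fact) can be sketched as DDL; every table and column name below is an illustrative assumption, not the tutorial's actual script:

```sql
-- Illustrative consumption-layer star schema (all names assumed).
-- Dimensions carry surrogate keys plus an SCD2 validity window.
CREATE TABLE consumption_sch.restaurant_dim (
    restaurant_hk   NUMBER PRIMARY KEY,  -- surrogate key
    restaurant_id   NUMBER,              -- natural key from source
    name            STRING,
    cuisine_type    STRING,
    eff_start_date  TIMESTAMP,           -- SCD2: when this version became active
    eff_end_date    TIMESTAMP,           -- SCD2: NULL while current
    is_current      BOOLEAN
);

-- The fact is at order-item grain, one row per item, not per order,
-- and references each dimension by its surrogate key.
CREATE TABLE consumption_sch.order_item_fact (
    order_item_id          NUMBER,
    order_id               NUMBER,
    customer_dim_key       NUMBER,
    restaurant_dim_key     NUMBER,
    menu_dim_key           NUMBER,
    delivery_agent_dim_key NUMBER,
    quantity               NUMBER,
    price                  NUMBER(10,2),
    subtotal               NUMBER(10,2),
    order_date             DATE
);
```

Keeping the fact at item grain is what lets metrics like average revenue per item roll up cleanly to per-order and per-restaurant totals.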

20:28

so this is my Snowsight web UI and I

20:31

have already created different worksheet

20:35

and each of this worksheet will help us

20:38

to create different object under

20:39

different schema so this is my first

20:42

script called create database and schema

20:45

and as a part of this script we are

20:47

going to create a database called

20:48

sandbox and that is what we generally do

20:51

and inside the sandbox I'm going to

20:53

create stage schema clean schema

20:57

consumption schema and and one schema

20:59

called common where all the common

21:01

objects would reside. So let me shrink the

21:04

left

21:05

panel and first let me switch the

21:09

role. Now my role is switched and I'm

21:12

also going to choose a different virtual

21:15

Warehouse so I will do ad hoc virtual

21:19

warehouse now let's create a database

21:21

called

21:23

sandbox and I will first use this

21:26

database so if you notice my context got

21:29

changed to sandbox.public, and public is the

21:32

default schema here and I'm going to

21:35

create the four different schema called

21:37

stage schema clean schema consumption

21:39

schema and common

21:41

schema if you look into our architecture

21:44

here, so I have created stage schema, clean

21:46

schema consumption schema and one of the

21:49

common schema that will hold all the

21:51

policy object another common object

21:53

which can be accessed by any of the

21:56

schema in our data flow. Now if we run

21:59

this use schema command now here my

22:02

schema is stage schema and as we are

22:05

going to process lot of CSV file I'm

22:08

going to create a file format called CSV

22:11

file

22:12

format. So Snowflake supports six types of

22:15

file format and CSV is a very common one

22:18

it also supports Parquet, JSON, XML, ORC, Avro, and

22:23

if you are not familiar what is the use

22:25

of this file format and what is the use

22:26

of stage location I would suggest you to

22:29

go and watch this particular

22:31

video now I'm going to create a stage

22:34

called CSV stage where we are going to

22:36

load all the CSV file what we have

22:38

reviewed at the beginning of this

22:41

tutorial and I'm also enabling the

22:43

directory table feature, so this is

22:46

created now this is another tab where I

22:49

would like to see the sandbox database

22:51

and all other schema let's refresh it so

22:54

this is my sandbox database this is my

22:57

stage schema

22:59

followed by Clean schema followed by

23:01

consumption schema and this is my common

23:03

schema if I go inside this stage I have

23:07

this one CSV stage visible and another

23:10

is file format and this file format will

23:12

be helpful to parse our CSV files when we

23:15

are loading them through the copy

23:16

command I'm going to create couple of

23:19

policy tag and we will see how this

23:21

policy tag will appear when we attach

23:23

them with the table if you're not

23:24

familiar with the tag and masking

23:26

policies I would suggest you to go and

23:28

watch these videos which covers tags and

23:31

policy in

23:32

detail so my policy created under common

23:36

schema and let me select all of them and

23:38

run

23:40

it if I refresh let's see whether those

23:43

policy tags are visible here or not

23:44

policy tags are in general not visible as

23:46

of today in the Snowsight web UI. So as a

23:49

first step we manage to create a Sandbox

23:52

couple of schemas, the stage environment,

23:55

an internal stage and the file format. So stage

23:58

schema is created, clean schema is created,

24:00

consumption schema is created and this

24:02

location is also created now we are

24:05

going to use the file upload feature and

24:08

try to load the data into this stage

24:10

location so that's our next step you can

24:13

find all the scripts available in my

24:16

blog, refer to the link below. The step one

24:17

script is already available so you can

24:19

go and copy paste them and before going

24:21

to the next step please make sure that

24:23

these objects are available in your

24:25

snowflake data platform so now we got

24:28

our

24:28

stage location which is called csv_

24:31

stage. This is an internal Snowflake stage

24:34

where we can load the CSV data so since

24:37

we have two types of data files, as we

24:40

have seen the First Data file we call it

24:42

initial load and then we have Delta

24:45

loads so we are going to create two

24:48

different partition one is called

24:49

initial and second is called Delta so

24:52

let me click on this plus files and here

24:56

this is my sandbox database and the stage

24:59

schema is already selected

25:01

and this is the location where the data

25:04

would be loaded and here I will give

25:06

initial the first I will start with

25:08

location and I will quickly drag drop

25:10

the location file so this is a location

25:13

file and this location file will be

25:15

residing under folder or a context

25:18

called initial followed by

25:22

location so this is my initial likewise

25:25

I will create a data and I will drag

25:27

drop all the files from my location

25:30

Delta folder I have multiple files and I

25:32

will demonstrate you how the copy

25:34

command works when you try to load

25:36

an invalid file or a duplicate file. Here it

25:38

will also go under the context

25:42

location snowflake is very fast so even

25:45

if you have 100 MB or 200 MB data file

25:47

it will be loaded very very fast

25:48

compared to any other platform now let

25:51

me shrink this so this is my stage

25:53

location this is my context initial this

25:56

is my another context location and then

25:58

I have this location-5-row.csv

26:02

likewise I have this Delta inside Delta

26:04

I have a location and inside location I

26:07

have a multiple files so my location

26:09

data is loaded now I will follow the

26:11

same process to load other data file and

26:15

I will do it in a fast forward

26:19

mode so for this tutorial we are only

26:22

and only focusing on or I would say one

26:24

location at this stage to keep the

26:26

tutorial simple and easy to understand

26:29

now here you can see my another context

26:32

under this initial is created so as I

26:35

add more and more file I will always

26:37

create a context and under the context

26:39

or a partition I'm going to place the

26:40

file let me go how does it look like so

26:42

here this is restaurant in Delhi plus

26:49

NCR here I have two

26:51

files so now you can see this is my

26:54

stage location this is the Delta as a

26:56

partition or a context under that I have

26:59

two different context or partition

27:00

restaurant and location and inside

27:03

restaurant I have these CSV files looks

27:06

good now I will go ahead and do for

27:08

other data file
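The drag-and-drop upload shown here can also be scripted. With SnowSQL (or the Snowflake CLI), the equivalent is a PUT command per file into the same partition path; the local paths, stage name, and schema name below are assumptions based on this tutorial's layout, so substitute your own:

```sql
-- SnowSQL equivalent of the Snowsight drag-and-drop upload.
-- Stage/schema/file names are assumed from the tutorial's folders.
-- AUTO_COMPRESS=FALSE keeps the files as plain .csv; by default
-- PUT gzips them and appends a .gz suffix.
PUT file:///tmp/swiggy-data/initial/location/location-5-row.csv
    @sandbox.stage_sch.csv_stg/initial/location/
    AUTO_COMPRESS = FALSE;

-- Wildcards upload a whole delta folder in one command.
PUT file:///tmp/swiggy-data/delta/location/*.csv
    @sandbox.stage_sch.csv_stg/delta/location/
    AUTO_COMPRESS = FALSE;
```

Scripting the upload this way makes the initial/delta partition layout repeatable, which matters once there are ten entities and twenty-plus files to place.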

27:49

so you can see total 10 initial load CSV

27:52

files are loaded and this is again

27:54

following a different partition so it is

27:56

easy to understand. One thing to remember:

27:57

when we are going to run a copy command

27:59

we need to make sure that we do not

28:01

use a loose pattern but instead follow

28:02

the exact path, because sometimes if I

28:05

use the pattern delivery it will match both

28:07

this location and will try to copy all

28:09

the CSV files available under it. These

28:12

are the small things you will only

28:13

come to know when you work in a project

28:16

like this by doing yourself otherwise

28:18

and if you are going to appear for a

28:20

snowflake interviews and if somebody ask

28:22

what kind of challenges have you faced

28:24

you can clearly tell them these are the

28:25

challenges we have seen while loading

28:27

the data or while processing the data or

28:29

while transforming the data now I'm

28:31

going to load the Delta files quickly in

28:33

a fast forward mode so this is a Delta

28:36

and I have only restaurant and location. I

28:39

will quickly process other eight Delta

28:55

files so for every entity I have two

28:59

Delta files primarily one Delta file

29:01

represent your insert operation and

29:04

another Delta file will represent insert

29:06

or update operation and when we are

29:08

trying to follow the SCD2 (Slowly

29:10

Changing Dimension Type 2) approach

29:13

in our consumption layer these sample

29:15

files will help you
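The SCD2 handling these delta files are meant to exercise is typically implemented with a MERGE that expires the current dimension row, followed by an insert of the new version. This is a minimal sketch; the table, stream, and column names are assumptions, not the tutorial's actual code:

```sql
-- Minimal SCD2 sketch with assumed object names.
-- Step 1: close out the current row when a tracked attribute changed.
MERGE INTO consumption_sch.restaurant_dim AS dim
USING clean_sch.restaurant_stm AS src       -- stream of changed rows
   ON dim.restaurant_id = src.restaurant_id
  AND dim.is_current = TRUE
WHEN MATCHED AND dim.name <> src.name THEN
  UPDATE SET eff_end_date = CURRENT_TIMESTAMP(),
             is_current   = FALSE;

-- Step 2: insert a fresh 'current' version for new and changed rows.
-- Run both statements in one transaction so the stream offset
-- advances once and both see the same change set.
INSERT INTO consumption_sch.restaurant_dim
    (restaurant_id, name, eff_start_date, eff_end_date, is_current)
SELECT restaurant_id, name, CURRENT_TIMESTAMP(), NULL, TRUE
FROM clean_sch.restaurant_stm;
```

The insert-only delta file should flow through step 2 alone, while the insert-plus-update file exercises both steps, which is exactly why the tutorial ships two delta files per entity.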

29:28

another important thing to remember in

29:31

any SQL-based tool, ORDER is a keyword

29:35

and that's why, instead of naming it order,

29:37

I have named it orders. And again, when you

29:39

are following these naming conventions

29:41

and partition make sure that you cover

29:44

different kind of scenarios and

29:46

accordingly you create a pattern for

29:48

your project and then follow that

29:50

pattern for all the stage location or a

29:53

table or a schema
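The reserved-word point above is easy to see in code: an unquoted table named order fails to parse, so you either quote the identifier everywhere or, as this tutorial does, pick a non-reserved name:

```sql
-- ORDER is a reserved keyword, so this fails to parse:
-- CREATE TABLE order (order_id NUMBER);

-- Quoting works, but then every query must quote it too:
CREATE TABLE "order" (order_id NUMBER);

-- The simpler convention, used in this tutorial, is a safe name:
CREATE TABLE orders (order_id NUMBER);
```

The same reasoning applies to any entity name that collides with SQL keywords, which is worth checking before fixing a naming convention for stages, tables, and schemas.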

30:21

so here we managed to load total 23

30:24

files and these 23 files are part of

30:27

this total 10 partitions or so-called

30:30

context so when we are going to use the

30:32

copy command we are going to follow

30:35

stage name followed by Delta or initial

30:38

followed by name of the entity and then

30:40

we are going to specify the name of the

30:42

CSV file and then we'll run the copy

30:44

command. Snowflake has recently added

30:47

this feature called lineage and if you

30:50

want to know more about lineage how does

30:51

it work you can refer this particular

30:53

video which primarily focuses on this data

30:56

lineage and how it works in Snowflake

30:58

and as we progress in this tutorial we

31:00

are going to use the lineage feature to

31:03

understand the overall flow and validate

31:05

that flow with our architecture so in

31:07

this step all the CSV files are loaded

31:09

inside initial and Delta location so I

31:12

would say this part is done and in the

31:15

next step we are going to create tables

31:17

for each entity and run a copy command

31:20

and we'll see how we can utilize the

31:22

stream object to capture the changes

31:24

while loading the data let's quickly

31:26

explore one more thing how you can check

31:29

the files which are available in your

31:31

internal stage or external stage for

31:33

that snowflake gives a command called

31:36

LIST followed by the @ sign, and then

31:39

you can give the name of your stage

31:41

location either fully qualified or just

31:44

the name of the stage so we know that we

31:46

created our stage location inside this

31:49

stage_sch schema and the name of the stage

31:51

location is csv_stg so if I run this

31:55

command let's see what result does it

31:57

bring
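The command being run here, in sketch form; the schema and stage names (stage_sch, csv_stg) are assumptions since the audio garbles them:

```sql
-- list every staged file (stage paths are case-sensitive)
LIST @stage_sch.csv_stg;

-- narrow down to the initial partition only
LIST @stage_sch.csv_stg/initial/;
```

The second form is what the narrowed-down rerun below uses.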

32:03

I can see total 33 files are available

32:06

all the files are loaded on 1st December

32:09

2024 Sunday and here I have this delta

32:13

partition followed by customer address

32:15

and so on all the file sizes are

32:17

available including the md5 if I have to

32:20

narrow down and check how many files are

32:22

available under my initial partition I

32:25

can add the partition name and let's

32:28

rerun the

32:36

query now I can see only 10 records and

32:40

all the partitions with initials are

32:42

available here remember whenever you are

32:44

referring to this stage location this is

32:47

case sensitive in Snowflake so if it is

32:49

lower case you have to follow the lower

32:50

case if it is uppercase you have to

32:52

follow the uppercase so during your

32:54

interview if somebody asks you whether stage

32:57

location file names are case sensitive or not

32:59

you have to say yes you can also use the

33:01

remove command to delete the file or

33:04

from the web UI you can also click on

33:06

Triple Dot and delete the file you can

33:08

also download from

33:09

there so you can remove it or you can go

33:13

to a particular file and with triple dot

33:16

you can download or you can load into a

33:18

table or you can just copy the path okay
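In SQL form, the remove option and the direct stage query discussed here look roughly like this; the stage, partition, file, and file-format names are assumptions:

```sql
-- delete one staged file without touching any table
REMOVE @stage_sch.csv_stg/initial/location/location-initial.csv;

-- query a staged CSV directly with $ column notation,
-- attaching a file format so the CSV is parsed correctly
SELECT t.$1, t.$2, t.$3
FROM @stage_sch.csv_stg/initial/location/
     (FILE_FORMAT => 'stage_sch.csv_file_format') t;
```

Both run against the stage itself, so nothing needs to be loaded first.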

33:21

now it is also possible that you can

33:23

query this stage location files without

33:26

loading into a table and for that you

33:28

can use this dollar notation so you have

33:32

to write select the name of the column

33:34

as we have shown here from stage

33:37

location so this is the schema name this

33:39

is stage location name this is the

33:41

partition initial and this is the

33:43

partition location and I can attach a

33:45

file format and this file format

33:48

understand what is the CSV file and if I

33:51

go

33:52

up here we have created this file format

33:56

where type is CSV compression is auto so

33:58

whether it is CSV or gzip it will

34:01

or a compressed file it will

34:04

understand the field delimiter is comma record

34:07

delimiter is new line skip header yes

34:10

it will skip the header and if the

34:12

description fields are enclosed in

34:14

double quotes then I have to keep this

34:16

quote option so this is what the definition of

34:19

my CSV file so if I run this let's see

34:22

what

34:24

happens so it is clearly representing

34:27

the location data which we have seen so

34:30

this is my location data first line is

34:32

skipped because it is a header and rest

34:34

of the five records are available looks

34:37

good if I make any mistake for example

34:39

if I give a capital N and try to rerun

34:42

this it will not bring any result so you

34:45

have to be very very careful with your

34:47

stage location partition name followed

34:49

by your name of the file I can also go

34:55

and so let me pick this file name and I

34:59

can give very specific file name if I

35:01

have a conflict with my partition name

35:03

and I can rerun

35:06

this and I will still get the same

35:08

result always remember since we are

35:11

running a query and getting the data

35:12

from the stage location this needs a

35:14

virtual warehouse and if you do not have

35:16

a virtual Warehouse you may not be able

35:18

to run the query so we finished creating

35:20

database object different schema object

35:24

stage location file format we loaded the

35:26

file into the stage location and now we

35:29

are going to process each of this entity

35:31

one by one until we reach to the

35:33

consumption layer and in the next step

35:35

we start with the location which is our

35:37

topmost master information

35:40

in this use case this is my next

35:43

worksheet called location entity let me

35:46

shrink this and first I'm going to

35:49

change my

35:51

context so this is my sandbox stage

35:54

schema this is the role and this is the

35:57

virtual Warehouse all looks

35:59

good we have already tried this query

36:02

where we are trying to fetch the data

36:04

from the location CSV file let me rerun

36:08

this so I have total five record looks

36:11

good so whenever we are loading the data

36:14

from stage location it is important that

36:18

we add some additional column which is

36:20

called audit column and this audit

36:22

column captures your file name from

36:25

where this data is coming from what time

36:27

the file was modified what is the md5

36:30

value and the time stamp when the file

36:32

is loaded so let me run this query and

36:34

show you the

36:36

result so let me shrink

36:40

this so this is the stage file name so

36:43

in the next time if you're going to load

36:45

the same file by running this query you

36:47

would be able to understand where this

36:49

record is coming from and this is a load

36:51

time and this is the

36:53

md5 this is the time when we queried the

36:55

data okay so let's go through this

36:57

architecture diagram once again where we

36:59

will look from the lens of location

37:01

entity so first we loaded this location

37:04

CSV file into the stage location so this

37:07

is available here into this stage

37:09

location which is my csv_stg from here

37:13

we will run a copy command and that copy

37:16

command would follow the same structure

37:18

what we have just seen and from there we

37:21

will move the data to this clean schema

37:25

and from the clean schema we are going

37:27

to create location uncore team table

37:29

before we run the copy command we have

37:31

to create this location table along with

37:33

the location table we are also going to

37:35

create a stream object and stream object

37:38

will help us to track the changes and if

37:41

you do not know how a stream object

37:42

Works in Snowflake I have couple of

37:44

tutorials which had already explained

37:46

the stream object in detail and from

37:49

this stream object we are going to

37:50

create a next location table under the

37:53

clean schema and when we go to the clean

37:55

schema the primary objective of the

37:57

schema is to have the appropriate data

37:59

type and from there we will create this

38:02

location Dimension table with hashkey so

38:05

let's go back to our worksheet so this

38:07

is the ddl for my location table where

38:10

location ID city state ZIP code

38:14

activation flag created date and

38:16

modified date these are the domain

38:18

entities and all of them are having a

38:21

data type text and this is the

38:24

additional technical column which

38:25

generally I prefer to give with

38:26

underscore

38:27

so you are very clear that these are not

38:30

coming from the source system good and

38:32

this is the

38:33

comment now let me create this

38:39

table so my table is created

38:41

successfully Let me refresh this is my

38:43

another Tab and let's see my location

38:45

table is created and the description is

38:48

visible and all the columns are

38:51

visible so I have total 11 column and I

38:54

can always go back and check all the

38:56

column names are matching as per this

38:58

CSV file definition looks good I have

39:01

total seven column coming from the

39:03

location entity and four column which

39:05

are technical columns before I do

39:08

anything I am going to create a stream

39:10

object and the primary purpose of the

39:12

stream object is to capture any change

39:15

which happens to this table and once we

39:17

run the copy command you would

39:19

understand why this stream object is

39:20

created and this stream object is created on

39:23

the location table which is this

39:24

particular table and the property

39:27

append only is true and this is the comment
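The stage-layer table and its append-only stream, as just described, might look like this sketch; the exact identifiers and column names are assumptions since the audio garbles them:

```sql
-- stage-layer table: everything lands as text, plus audit columns
-- (the underscore prefix marks columns not coming from the source)
CREATE TABLE IF NOT EXISTS stage_sch.location (
    locationid        TEXT,
    city              TEXT,
    state             TEXT,
    zipcode           TEXT,
    activeflag        TEXT,
    createddate       TEXT,
    modifieddate      TEXT,
    _stg_file_name    TEXT,
    _stg_file_load_ts TIMESTAMP,
    _stg_file_md5     TEXT,
    _copy_data_ts     TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- append-only stream that records every insert into the table above
CREATE OR REPLACE STREAM stage_sch.location_stm
ON TABLE stage_sch.location
APPEND_ONLY = TRUE
COMMENT = 'append-only stream on the stage location table';
```

Because the stage layer only ever receives inserts from the copy command, an append-only stream is enough here.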

39:30

so let me create this stream

39:33

object so this also got created when I

39:36

switch my

39:37

tab I can see here I have a stream

39:41

object and it is created on the location

39:45

this is the copy command where I am

39:47

loading these specific columns and this

39:50

is my dollar notation to fetch the data

39:53

from the CSV file and this is the

39:54

location of the CSV file I have just

39:56

given the path name and this is the file

39:59

format and in case of error it will

40:01

abort it so this is how the copy

40:03

construct looks like in Snowflake and as

40:06

soon as I run this let's see what

40:08

happens so this is the file which is

40:11

being processed the status is Success it

40:13

means all the records are loaded and

40:15

this is the total number of record which

40:17

Got Loaded I can go back to my table

40:19

page and Let me refresh this and when I

40:21

click on this location here I can see

40:23

total five record got loaded and let's

40:26

see how how the data looks

40:28

like so I got all the data and here is

40:32

my audit column shown so data is loaded

40:36

on this particular time and if I click

40:38

on a copy

40:40

history I would be able to see that this

40:42

particular file is copied and if I click

40:45

on the

40:46

lineage this is a table lineage and it

40:49

clearly says that this table is

40:51

populated from this stage location looks

40:56

good we can run the select statement

40:58

here

41:00

and I can see the data and I can also

41:04

run the select statement on the stream

41:07

object let's see what result does it

41:12

bring so it has brought exactly the same

41:15

data set I will show you the

41:18

difference so up to here I have total 11

41:22

column and these three columns are added

41:25

by the stream object and it clearly

41:28

says that whatever operation you will

41:29

perform on the main object a stream

41:31

object will capture that and I can run a

41:34

query on the stream object and push the

41:36

data from the stream object to the next

41:37

layer so if you look into this diagram

41:39

this is my stream object and when I'm

41:41

going to populate data to my clean

41:44

schema under location table the data

41:46

will not come from this location table

41:48

rather it will come from this stream to

41:51

the location table and I will show you

41:53

what do I mean by that coming back to my

41:55

location worksheet here first of all I

41:58

will change the schema name to clean

42:01

schema now you can see my schema name

42:04

got changed and here I'm going to create

42:07

exactly same table the only difference

42:09

you would notice that here the data type

42:12

is not text and the data type is

42:14

matching the expected data type from The

42:16

Source system so location ID is number

42:19

city state code they are all following

42:22

the source system data type including

42:24

the limitation however here I have added

42:27

some additional information which says

42:29

whether it is Union territory or not

42:31

whether it is a capital city flag or not

42:33

zip code and additional things right and

42:36

we will see how we can enrich the data

42:37

into the clean layer okay and here every

42:40

clean table will have a surrogate key

42:42

which is nothing but Auto incremented

42:44

and primary key for this table location

42:46

ID should be the unique in this table so

42:48

this is how this table looks like so let

42:51

me create this

42:55

table so this location table or a

42:58

restaurant_location table is created

43:00

Let me refresh

43:02

this and I come here and I can see I

43:05

have a restaurant location

43:08

table this does not have any data at

43:10

this stage and I am also going to create

43:12

a stream object called restaurant_

43:15

location_stm and this stream object is

43:18

created on restaurant_location it

43:20

means that any kind of change whether it

43:22

insert update delete this stream object

43:24

would be able to capture that so let me

43:26

do

43:29

that quickly refresh to check yes my

43:33

stream object is available I'm going to

43:35

run a merge operation and that merge

43:37

operation will load the data into the

43:39

restaurant_location table where it is

43:42

going to run a query and this query is

43:45

running on your stream object let me

43:48

show you so here if you see this is a

43:50

stream object STG stream object and in

43:53

this query it is trying to enrich the

43:56

data so here I have added a couple of

43:58

case statement if it is Delhi then make

44:00

it New Delhi and for other state it will

44:03

retain the same text and I'm adding a

44:05

new column called state code likewise I

44:07

am also adding another field called is

44:11

Union territory so if any state matches

44:13

with this name it will be flagged as a

44:15

yes otherwise it is no this is another

44:17

case statement the state is this and

44:19

city is that then we have another called

44:21

Capital City flag likewise if it is a

44:24

tier one tier 2 tier three cities so

44:27

these different additional columns are

44:29

added to this and when the location ID

44:32

matches with the location ID of the

44:34

source if any changes are being recorded

44:37

into the source table then it will be

44:39

updated if the new location is coming

44:41

then there will be insert

44:56

operation
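The merge from the stage stream into the clean table, with the CASE-based enrichment just described, can be sketched roughly as follows; the schema names, column names, and the union-territory list are assumptions, abbreviated for illustration:

```sql
-- merge changes captured by the stage stream into the clean table,
-- enriching state names and deriving extra flags on the way
MERGE INTO clean_sch.restaurant_location AS target
USING (
    SELECT
        CAST(locationid AS NUMBER)            AS location_id,
        city,
        CASE WHEN state = 'Delhi' THEN 'New Delhi'
             ELSE state END                   AS state,
        CASE WHEN state IN ('Delhi', 'Chandigarh', 'Puducherry')
             THEN 'Y' ELSE 'N' END            AS is_union_territory,
        zipcode
    FROM stage_sch.location_stm
) AS source
ON target.location_id = source.location_id
WHEN MATCHED THEN UPDATE SET
    target.city               = source.city,
    target.state              = source.state,
    target.is_union_territory = source.is_union_territory,
    target.zipcode            = source.zipcode
WHEN NOT MATCHED THEN INSERT
    (location_id, city, state, is_union_territory, zipcode)
VALUES
    (source.location_id, source.city, source.state,
     source.is_union_territory, source.zipcode);
```

Running the merge consumes the stream, so the same changes are not reapplied on the next run.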

45:46

I assume you are fully aware how the

45:48

merge operation

45:53

works you can pause and go to my blog

45:56

post and go through the query to

45:58

understand how this merge statement is

45:59

running on the stream object let me run

46:01

this merge

46:05

statement so it says five record

46:08

inserted and there is no update because

46:10

our data does not have any update

46:12

operation so looks good now let me

46:14

refresh this and sometime you may have

46:16

to refresh the entire page because this

46:19

only refresh the object entries but it

46:21

does not refresh this page but if I go

46:23

to the restaurant location and this is

46:26

still showing zeros so let me refresh

46:27

the entire browser

46:31

page now I can see total five record if

46:34

I click on a data

46:36

preview so this is looking nice and if I

46:39

see the

46:41

column I have total 16 column I added

46:44

some additional column and let me see

46:46

how does it look like so if you look

46:47

into this the state code is added is

46:49

Union territory true or false is a

46:52

capital flag true or false and if you

46:53

see that this is false because in Union

46:56

territory we do not have a capital but

46:58

we can change the Case Logic and make

47:00

sure that this stands true and then I

47:03

have a different tier tier one tier 2

47:05

tier 3 City and this is zip code

47:07

likewise active flag yes right now only

47:11

New Delhi is a active City where this

47:13

food aggregator is serving all other

47:15

cities are having a no flag okay and

47:18

then these are additional information

47:20

looks good so coming back to my diagram

47:23

from the stream object we ran the merge

47:26

operation to populate the location and

47:28

this location table in the clean schema also

47:31

having a stream object so any changes

47:32

done on this location table can be

47:35

tracked by this stream object so as a

47:37

next step we are going to create a

47:39

location Dimension table and location

47:41

Dimension table will also take the data

47:44

from the stream object as stream object

47:46

only shows the changes and not the

47:48

entire data set and that's what we need

47:50

to have it whenever we are going to

47:52

process the Delta so let me go back to

47:54

my

47:55

worksheet so before we run another merge

47:58

operation where we are going to load the

48:00

data from clean location table to

48:03

location Dimension table this is how the

48:05

ddl looks like if you see I have a

48:08

primary key which is called restaurant_

48:09

location_hk and then all other data

48:13

set is available here and to tag the

48:16

slowly changing Dimension I am having

48:18

effective start date effective end date

48:20

and current flag right so the location

48:23

information may change over a period of

48:25

time and to make sure that we track

48:27

those changes we have to follow this

48:30

SCD2 and these are the columns which we

48:32

added into our Dimension table so let me

48:35

create this table

48:38

first so my table is created and now I

48:41

have another merge operation if you look

48:43

into this merge operation this merge

48:45

operation is trying to insert the record

48:47

into my restaurant location dim table

48:50

and it is querying the restaurant_

48:52

location_stm and let's see what data set do

48:55

I have

48:58

so I have total five records available

49:00

here and if I look into this metadata

49:03

column here metadata action says all the

49:06

records in the Stream objects are insert

49:09

operation none of them are update and

49:12

this is the row ID looks good now if you

49:15

look into this so basically this merge

49:17

operation will join this target table

49:19

and stream object table and it will try

49:22

to fetch if the location ID matches with

49:24

the location ID between these two table

49:26

and when the metadata action is delete

49:28

and metadata update is true it means

49:31

that record has changed so it will make

49:34

effective end date equal to current time stamp

49:37

and the flag equals to false in the

49:39

sense this row is no more valid row if

49:41

it does not match and it is an insert

49:44

operation and is update is equals to

49:45

true it will insert a new record if it

49:48

is completely new location then it will

49:50

primarily insert the data into the

49:53

dimension table and first let me run

49:55

this query and when we load the next set

49:57

of data you would come to know how this

49:59

merge statement is actually performing

50:01

let me select

50:04

this so whenever we run a merge

50:07

operation it always shows the number of

50:09

record inserted and the number of record

50:11

updated right now we are doing a initial

50:14

load we are not observing any change on

50:16

an existing record that's why our update

50:18

values are zero looks good now we will

50:21

see how our Dimension table looks like

50:26

so let me refresh this and under the

50:28

consumption schema I should see one

50:31

location table so I have a restaurant _

50:33

location_dim looks good and when I

50:36

click on that and let me see the data

50:39

preview I have created a hash key and

50:41

this is my restaurant location and if

50:44

you see these are my location ID which

50:46

I'm keeping it as is and this is a city

50:48

Delhi Mumbai Agra Ahmedabad Ajmer and this is

50:52

a state and this is how the data looks

50:55

like this is active flag which is coming

50:58

from the source system and all these

51:00

records are effective start date of this

51:03

record is this effective end date is null

51:06

and this is a true it means that if any

51:08

changes happens to any one of the record

51:10

this particular row item will set the

51:12

effective end date and the flag will go

51:14

to false it means that do not refer that

51:17

record now if I click on the lineage you

51:19

would be able to understand how the data

51:20

is

51:22

Flowing I can click on this plus

51:25

sign and you can click on this plus

51:27

sign again and this plus sign I do not

51:30

have any upstream so this location

51:32

Dimension is Created from this

51:35

restaurant location under the clean

51:37

schema and this is populated from this

51:40

which is again in the stage schema and

51:42

this data is coming from this stage

51:44

location which is this so if you see

51:46

this is a stage location this is my

51:48

table this is another table and this is

51:51

another table if you click on

51:53

this this is the description of the

51:55

table and the number of total columns

51:58

is 13 number of rows are five and if you

52:00

would like to know where is this state

52:02

code coming from I can click on it I do

52:05

not have any Downstream lineage I can

52:07

click on Upstream lineage and it will

52:10

show that this is coming from this

52:12

restaurant location under clean schema

52:15

looks good however if I select City and

52:19

click on this view

52:21

lineage click on this up stream and if

52:24

you see the distance is 1 and 2 because

52:27

city is not the derived data and this is

52:29

coming from the location table under

52:32

stage

52:33

schema so if you look into this lineage

52:36

this matches with our lineage okay and

52:40

this is what we managed to achieve and

52:42

this is the same process we are going to

52:44

repeat for all other entities here we

52:46

did not follow any primary key and we

52:48

just kept the source data as-is by adding

52:51

four extra columns when we came to

52:54

this location we added some

52:57

additional field along with data type

52:59

changes and when we came to this

53:01

location Dimension which is following

53:04

SCD2 so this table will always match

53:07

the source table however this table will

53:10

track all the changes and the number of

53:12

record between this two table will

53:14

always vary now let's go and push some

53:16

Delta record let's see what happens so

53:19

now I'm going to process this Delta day

53:21

one which has only two record and then I

53:23

will process Delta day two where I am

53:26

making some changes and this changes

53:28

will say this cities becomes active it

53:31

is going to add 8 9 10 as insert and

53:35

these are the changes and let's see how

53:37

these two files works when we use the

53:40

stream object as a tracker so if

53:43

you look here I am running a list

53:45

command on my Delta location and let's

53:47

see how many files do I get

53:51

it so I have so many files I'm going to

53:54

pick my day Zero let me copy this entire

53:58

location and then I can change this

54:01

location up to the CSV file this is going to

54:03

the data folder location folder and this

54:06

is the name of the CSV and now let me

54:08

run this copy command before I run this

54:10

copy command which will load the data

54:13

into this location table let me quickly

54:15

query the location table and location

54:17

stream under this stage

54:25

schema I have total five records and

54:27

all the data is coming from this initial

54:29

location CSV

54:36

files My Stream object is completely

54:39

empty it does not have any record now

54:42

let me run this copy command and before

54:45

running this copy command if I check my

54:47

copy history it has only one

54:50

file now let's run

54:53

this so it says two records loaded

54:56

which matches with this two record looks

54:59

good now if I go to this location

55:03

table I have two record number location

55:06

ID 6 and location ID 7 and again when

55:09

I'm talking about stage location it is

55:11

having exactly the same representation

55:13

as a source so looks

55:15

good and if you look into this stage

55:18

file name it is telling that these two

55:20

records are loaded from this CSV file at

55:23

this particular time which is

55:26

and if you look into this this time is

55:28

having a different time stamp so when

55:30

you're trying to debug where this data

55:32

set is coming from you can really run a

55:34

query on this technical column and

55:37

understand where from which file this

55:39

rows are

55:41

populated now pause and think what would

55:44

the location stream object bring when

55:48

we run a select

55:50

statement since we inserted only two

55:52

record into the main location table or

55:55

we call it base table stream object

55:58

captured only the insert operation

56:00

because that's what this stream object

56:01

is designed for and that's what we need

56:04

to push the data from our stage layer to

56:06

the clean

56:08

layer right now I have only five records

56:11

into my clean restaurant

56:13

location and I'm going to run this merge

56:17

statement again and this merge statement

56:20

will pick only the two record from the

56:22

stream object and insert into the

56:24

restaurant_location table inside

56:27

the clean schema so let me run

56:30

it since it is a merge operation it says

56:33

two records inserted and zero updated

56:35

looks good now let me reload the entire

56:41

page now this also got all the surrogate

56:45

keys starting from 1 to 5 and then 6 and 7 and

56:48

I can see my data set is populated

56:51

clearly and all of them are looking good
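To check what the clean-layer stream is holding before the next merge, its metadata columns can be queried directly; the schema and stream names are assumptions:

```sql
-- the three METADATA$ columns are added by the stream itself and
-- describe the captured change, not the underlying row data
SELECT metadata$action, metadata$isupdate, metadata$row_id, *
FROM clean_sch.restaurant_location_stm;
```

A plain SELECT like this only reads the stream; it does not advance the offset, so the rows stay pending for the merge.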

56:54

okay now I need to make sure from

56:57

this stream the data is consumed and

56:59

goes to the location dim table let's try

57:02

that out so if I run the select

57:06

statement on my restaurant_location_

57:09

stm I should expect two records looks good

57:13

and these two records will move to the

57:16

dimension table through this merge

57:17

statement so let me run

57:21

it now I can see two records inserted and

57:24

zero rows updated now

57:28

let me go to the consumption schema and

57:30

see how this location Dimension looks

57:36

like so here I can see also location ID

57:39

6 and 7 and this is my hash

57:42

key here my SCD2 columns are looking good

57:47

we managed to process this newly coming

57:49

record through this day 01 CSV file now

57:51

if you look into the day 02 CSV file it

57:54

has three new records location ID 8 9 10

57:57

and it has four old record and what this

58:00

old record says it is primarily making

58:02

the active flag equals to True which was

58:05

false in the first case so this four

58:08

location ID will also be true again this

58:10

is a very hypothetical case just trying

58:12

to show you how this Delta processing

58:14

through the stream object really

58:16

captured your slowly changing dimension

58:18

in your consumption layer so let me run

58:20

this list

58:22

command and this is my Delta 2 so I

58:25

copied the

58:27

name and I just change this and if I run

58:31

into my location stream I do not have

58:35

any data because I already consumed that

58:36

and I am going to run this copy

58:39

command so seven rows loaded which is

58:43

here I have total seven rows because it

58:48

skips the first header row now if I

58:50

run this statement again you can pause

58:52

and think since it is an append-only stream

58:54

and only inserts happened I will have

58:57

only seven

59:00

record so here all the flags are yes so

59:04

when this stream object is going to be

59:06

joined with the restaurant location table it

59:10

will compare which location ID already

59:12

exists in my clean schema if it exists it

59:16

will pick the changes and the changes

59:18

will be applied and if it is a newly

59:20

inserted record which is 8 9 and 10 it

59:23

will get inserted into this one so here

59:26

if you see the modify date is also

59:27

coming from the source system but here

59:29

there is no modification it is coming

59:31

null looks good so let me run this merge

59:34

operation and this will populate those

59:37

seven records by following this approach

59:40

if the location ID matches with the

59:41

location ID but any one of them got

59:43

changed then it will run the update

59:45

statement if not then it will run the

59:47

insert statement so I am running this

59:51

merge

59:53

statement three records inserted

59:56

and four record got updated looks good
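The SCD2 merge pattern that drives these insert and update counts can be sketched as follows; the table, column, and stream names are assumptions, and the column list is abbreviated:

```sql
-- SCD2 merge: expire the old version of a changed row, insert the new one
MERGE INTO consumption_sch.restaurant_location_dim AS dim
USING clean_sch.restaurant_location_stm AS stm
ON dim.location_id = stm.location_id
   AND dim.is_current = TRUE
   AND stm.metadata$action = 'DELETE'
-- a changed row surfaces as DELETE with ISUPDATE = TRUE: close it out
WHEN MATCHED AND stm.metadata$isupdate THEN UPDATE SET
    dim.eff_end_dt = CURRENT_TIMESTAMP(),
    dim.is_current = FALSE
-- the new version of a changed row, or a brand-new location,
-- surfaces as INSERT: open a fresh current row
WHEN NOT MATCHED AND stm.metadata$action = 'INSERT' THEN INSERT
    (restaurant_location_hk, location_id, city, state,
     eff_start_dt, eff_end_dt, is_current)
VALUES
    (HASH(stm.location_id, stm.city, stm.state), stm.location_id,
     stm.city, stm.state, CURRENT_TIMESTAMP(), NULL, TRUE);
```

With this shape, a batch of four changed rows and three new rows produces exactly four updates (expired versions) and seven inserts (four new versions plus three new locations).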

60:00

so this is the three insert and this is

60:02

your four

60:04

update now for Dimension table to

60:07

capture those update and insert let's

60:09

see how the record looks like in the

60:11

restaurant location stream

60:18

object so I'm going to pick one example

60:21

Mumbai so there are two record because

60:23

this record got updated and

60:28

so I just applied a filter and if I

60:32

look into this

60:34

record and if I look into this stream

60:37

metadata operation first the stream object

60:40

records the delete operation the is-update

60:42

flag is true and for the second new

60:44

record it is called insert update

60:47

operation is also true and if you see

60:48

the change initially the active flag was

60:51

no which got a delete flag and when

60:55

the new record which has captured the

60:57

active flag equals to yes this is coming

60:59

as a insert as a part of update so this

61:01

is how the stream object captures a

61:03

detail so now I'm going to run this

61:05

merge statement and you can go to my

61:07

blog page and you can understand how

61:09

this merge statement is written let me

61:11

run

61:13

this so total seven records inserted

61:18

and four rows updated so when I say 4 +

61:21

3 = 7 and to understand more let's

61:24

see how the data set looks

61:31

like so let me show the entire table

61:38

first

61:43

so if I look into this example Mumbai

61:47

and initially Mumbai was having a no

61:50

active flag now it is having a yes

61:51

active flag and if you look into the

61:54

result here the current flag is true and

61:57

here the current flag is false and

61:59

whenever a change happens effective end

62:01

date column gets a value okay so this is

62:05

how we managed to get the SCD2

62:07

implemented in our location Dimension

62:09

table our entire data is Flowing from

62:12

Source location to the stage location

62:14

through the stream it is getting into

62:15

the clean layer and again through the

62:17

stream it is getting into the

62:18

consumption layer we are also able to

62:20

get the SCD2 type in our location

62:22

Dimension so now we managed to cover

62:25

location dimension let's go to the next

62:28

object restaurant and we are going to

62:30

follow the same process and I'm not

62:32

going to explain everything in detail

62:34

before we go to the next entity let's

62:35

try to simulate if we run a copy command

62:39

with invalid data what happens to entire

62:41

flow this is another Delta file called

62:44

Delta day 03 which has an invalid

62:46

delimiter so if you look into the

62:48

delimiter it is a pipe sign instead of

62:50

a comma let's say your data platform gets a

62:52

data like this what would happen let's

62:55

try that out let me go back to my

62:58

worksheet so this is the file and let me

63:02

copy

63:04

it I copied

63:06

name and here I pasted the name and

63:11

again I would like to check if I have

63:13

any data in stream I do not have any

63:16

data in my stream now I'm running this

63:19

copy

63:22

command so it says three rows parsed and

63:25

three rows were loaded because if there is

63:27

no comma in that case all the data set

63:30

will be loaded into the First

63:31

Column now if I run a select star

63:36

from this

63:37

location so if you see in these three

63:44

rows I got incorrect data now my

63:44

stream will also have the incorrect

63:47

data because all the columns are coming

63:49

filled into the First Column and if you

63:52

look into this this is coming from the

63:53

invalid delimiter
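What happened here can be reproduced with a mismatched file format. A minimal sketch, assuming illustrative object names (`stage_sch`, `csv_stg`, `csv_ff`) rather than the exact ones used in the project:

```sql
-- The COPY below expects commas, so a pipe-delimited file parses as a single
-- field per line and every value lands in the first column.
CREATE OR REPLACE FILE FORMAT stage_sch.csv_ff
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;

COPY INTO stage_sch.location
FROM @stage_sch.csv_stg/delta/location/
FILE_FORMAT = (FORMAT_NAME = 'stage_sch.csv_ff');
```

Because the stage table is all-text, the load itself still succeeds; the damage only surfaces downstream.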

63:57

now I'm going to run my merge statement

63:59

and let's see what

64:03

happens so here it failed because it

64:06

says that numeric value cannot be

64:09

recognized okay and

64:11

since this statement got

64:14

failed my stream object will still have

64:17

the data and it will not be cleaned up
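This is standard stream behavior: a stream's offset advances only when the stream is read inside a DML statement that commits successfully, so a failed MERGE leaves the captured rows pending. A sketch with illustrative names:

```sql
-- The failed MERGE did not consume the stream, so the bad rows are still there.
SELECT COUNT(*) FROM stage_sch.location_stm;

-- Any committed DML that reads the stream advances its offset; consuming it
-- into a scratch table is one way to drain it without touching the target.
CREATE OR REPLACE TABLE stage_sch.location_tmp AS
SELECT * FROM stage_sch.location_stm;
```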

64:19

now I would allow you to go and think

64:21

how would you solve this problem this is

64:23

one of the common problems we see

64:23

when a bad record gets into your system

64:28

how would you handle that but somehow I

64:30

need to clean this stream data otherwise

64:33

it will continue to fail and it would

64:35

not allow me to proceed to the next

64:41

level so I created a location_temp

64:45

table through the stream and by

64:47

that I consumed all the data from my

64:50

stream My Stream is clean now however my

64:53

location table is still having the odd

64:56

so I would go and delete that odd record

64:59

from the location

65:08

table so wherever I have this pipe sign

65:11

those record would be

65:13

deleted so it deleted three records and

65:16

since my stream object created on the

65:20

top of location is append-only this

65:23

would not be recorded Let Me reconfirm

65:25

this so I do not

65:27

have now this is my fourth day invalid

65:30

file let's see what happens if I try to

65:33

push

65:42

this so I ran the copy command with that

65:45

file and it says four rows loaded and if

65:49

I go and run this select

65:53

statement I can see this junk data Got

65:56

Loaded into my system and likewise My

65:59

Stream will also have this data set

66:01

available which is not

66:06

right and again when I try to run this

66:09

merge statement since it will not get

66:10

converted into a number this merge

66:12

statement will fail let me run

66:16

this so numeric value is not recognized

66:20

when you're trying to run this entire

66:22

process through an orchestration tool you

66:24

need to make sure that you run the copy

66:26

command only when you have a clean data

66:28

otherwise your merge statement will

66:30

continue to fail so let me clean this

66:33

once

66:35

again and I will use the word

66:41

junk so whenever you are building an end

66:44

to end data pipeline there could be many

66:46

such scenarios which you need to

66:48

simulate in advance and accordingly you

66:50

have to write your merge statement or

66:52

you need to have a SQL statement which

66:54

would check whether it is possible or

66:56

not one of the possible solutions is you can

66:58

try to cast it and if the cast is not

67:00

possible instead of on error you can say

67:02

continue that way you can solve this

67:04

problem so only those records which

67:06

qualify this check will get into the

67:09

stage table otherwise it will

67:17

not we finished the location entity to

67:20

be loaded into our snowflake environment

67:23

and converted this master data into

67:25

our Dimension table now let's work on

67:29

the restaurant entity which is another

67:31

Master data set here we have quite a lot

67:34

of columns we are going to load this data

67:37

and follow the similar pattern that we

67:39

have seen and done for location

67:42

entity our objective at the end of this

67:45

section is to convert the restaurant

67:47

Master data into a restaurant Dimension

67:49

and here we will have a hash key

67:52

followed by restaurant ID and rest of

67:54

the information about the

67:56

restaurant for restaurant I have this

67:59

initial load which has total five

68:02

records starting from restaurant ID 1

68:04

to 5 then if I go to the day one Delta

68:07

record it adds three additional

68:09

restaurant IDs and if you look into the

68:11

third day I add two additional

68:14

restaurants but there are some changes on

68:17

restaurant ID 3 4 and 5 just to simulate

68:20

if our SCD2 is working appropriately or

68:23

not so let's go to our worksheet

68:31

so my context is already set this is my

68:34

restaurant entity worksheet and this is

68:37

my system admin ad hoc virtual Warehouse

68:40

so I'm going to create a restaurant

68:43

stage

68:45

table and all the columns are text type

68:49

if you look into this restaurant phone

68:51

number I am adding a tag where the

68:54

restaurant phone number is a sensitive

68:56

information or this column is classified

68:59

as sensitive and you will see how this

69:02

appears on the table description

69:05

page so this table got created

69:08

successfully and now I'm going to create

69:10

a stream object on the top of this

69:13

restaurant

69:14

table so I can see a restaurant table

69:18

here and all the information is visible

69:20

looks good likewise I have the

69:23

restaurant stream also available this is

69:25

looking good now we are going to run a

69:27

copy command and this copy command will

69:30

have all the domain entities from the

69:33

restaurant and these four additional

69:35

columns which we are going to add it and

69:37

let's run this copy

69:40

command so my copy got executed total

69:43

five rows

69:44

loaded when I come to this

69:47

table I can

69:50

see so these are my five rows

69:56

all the data Got Loaded successfully no

69:59

issues if I look into the copy

70:02

history it also looks green

70:05

great now you can also run this

70:09

information schema table function to

70:11

check whether your copy executed or not

70:14

rather than changing the screen and let

70:15

me run

70:17

that so this will show where

70:21

this copy command ran what is the file

70:22

name and what is the stage location and

70:24

the load time and all other

70:27

details now next I'm going to create

70:31

exactly the same structure of the restaurant

70:33

table the only difference is that this

70:34

time I'm using surrogate key as a

70:36

restaurant surrogate key auto increment

70:38

primary key and all the information is

70:41

almost the same and now you can see I have

70:43

applied some data types and uh some

70:46

additional constraints and I'm keeping

70:48

the same tag for restaurant phone number

70:51

now let me create this quickly

70:57

so this got created and I'm also going

71:00

to create stream

71:03

object so this also got created now this

71:07

is the insert statement which you can do

71:10

or you can also run the merge statement

71:13

as we have done earlier so if you look

71:14

into the merge statement it will select

71:17

from the stream and it will try to cast

71:21

because this time I have used try cast

71:23

in case if it fails it will convert into

71:24

a null value however if it is a not null

71:27

column in our clean layer it will not

71:29

get into that table if the clean layer

71:31

restaurant ID matches with the stream

71:33

restaurant ID all the columns will be

71:35

updated otherwise it will follow insert

71:38

approach so let's run

71:43

this so I got total five rows inserted

71:46

into my clean restaurant table so this

71:49

is my clean this is my table this is my

71:52

restaurant and if I look into the data

71:54

preview I have the data and if I

71:57

look into the lineage I can see the

71:59

lineage is built so this is my clean

72:04

goes to stage and Stage table goes to

72:08

Stage location looks

72:10

good now I am creating a restaurant

72:13

Dimension very simple here I have

72:16

restaurant HK restaurant ID and all the

72:19

information is almost the same if you look

72:21

into the line number 265 I have this

72:23

effective start date effective end date and

72:25

is current if the record is coming first

72:27

time then this will be a true value if

72:31

it is being updated then this gets the

72:34

current time stamp of the time of update

72:36

and this becomes false so I am creating

72:40

this restaurant

72:45

Dimension this got created and now I am

72:48

following exactly the same merge

72:50

statement we are going to check if the

72:52

stream has the new restaurant ID

72:55

then it will be inserted otherwise it

72:57

will be

72:59

updated here it checks this flag in the

73:02

stream object which allows it to change the

73:05

current time stamp and is current equals

73:07

to false and this update statement gets

73:10

into a new insert statement with current

73:13

date as the effective date and the

73:15

is current column equals to true and if

73:17

it is a completely new data set then it

73:20

will be anyhow inserted and if you look

73:23

into the hash we are creating the hash

73:25

by taking all the column which uniquely

73:27

identify the record now let me run this

73:30

merge

73:33

statement now it got the five records

73:36

inserted let me go to the table view now

73:38

you can see restaurant

73:41

Dimension and here if I see the data

73:44

preview so I have all the restaurant 1 2

73:47

3 4 5 they all belong to City Delhi and

73:51

my location ID FK is 111 so I'm making

73:54

sure that this location ID which is

73:56

coming as a referential Integrity from

73:57

The Source system is maintained as is

73:59

for better

74:01

traceability so if you look into the

74:03

system the data from here got into the

74:06

stage from the stage we ran a copy

74:08

command into the restaurant table and as

74:11

soon as a copy command inserted data

74:13

into the restaurant table the stream

74:15

captured those changes and using a merge

74:18

statement it moved from stage table to

74:21

the clean table and again here also we

74:24

have all the changes captured and

74:26

finally it moves to the restaurant

74:28

Dimension table under the consumption

74:30

schema so this is how the resturant data

74:32

moves now let's check the Delta data

74:54

processing
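The SCD2 merge just described can be sketched as follows. Column names here (such as `cuisine_type`) are simplified assumptions; as in the project, the dimension is keyed on a hash of the business columns, so a changed row no longer matches its old hash:

```sql
-- A standard stream emits an update as a DELETE/INSERT pair, which maps
-- onto "close the old version, open a new one".
MERGE INTO consumption_sch.restaurant_dim AS tgt
USING clean_sch.restaurant_stm AS src
  ON tgt.restaurant_hk = HASH(src.restaurant_id, src.name, src.cuisine_type)
WHEN MATCHED
 AND src.METADATA$ACTION = 'DELETE'
 AND src.METADATA$ISUPDATE THEN UPDATE SET
  tgt.eff_end_date = CURRENT_TIMESTAMP(),
  tgt.is_current   = FALSE
WHEN NOT MATCHED AND src.METADATA$ACTION = 'INSERT' THEN INSERT
  (restaurant_hk, restaurant_id, name, cuisine_type,
   eff_start_date, eff_end_date, is_current)
VALUES
  (HASH(src.restaurant_id, src.name, src.cuisine_type),
   src.restaurant_id, src.name, src.cuisine_type,
   CURRENT_TIMESTAMP(), NULL, TRUE);
```

The old image closes the current row, the new image hashes to a value that does not yet exist and is inserted as the new current version.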

75:23


75:56

so let me run this list

76:00

command so I have two Delta Data let me

76:03

first pick the first

76:12

one so I have records 6 7 8 and this should

76:16

get into the

76:17

system now let me run this copy

76:22

command so my copy command got executed

76:25

and it confirms that total three rows got

76:31

added and the copy history also

76:34

confirms this now I don't need to do

76:37

anything I can simply go and run my

76:40

merge statement and the data will go to

76:42

the dimension table

76:49

automatically so this is my first merge

76:51

statement into my clean restaurant table

76:57

so it confirms that three records got

76:59

inserted now I am going

77:02

to run my next merge statement which

77:05

will perform the merge operation on the

77:07

restaurant dim based on the restaurant

77:09

STM from the

77:14

clean so I got three records added

77:22

good now here I can see 1 2 3 4 5 and 6

77:26

7 8 looks

77:28

good now if I go to this restaurant

77:33

dim and if I refresh

77:36

this I also got 1 2 3 4 5 6 7 8

77:43

and all the is current flag is true for

77:46

all of them because so far we haven't

77:48

got any updated record it is all insert

77:50

record now let me rerun this list

77:53

command

77:55

this is my second day

78:02

Delta this is my second day Delta where 9

78:06

and 10 are new records but 3 4 5 are old

78:09

records with some changes we do not know

78:11

what changes are there but let's see now

78:13

I am running the copy

78:16

command so my copy got executed with

78:20

five records where two are new three are

78:22

changes

78:30

I am running my merge statement on my clean

78:36

restaurant so here it made two inserts

78:40

and three updates which looks good and

78:43

with that we can certainly track the

78:47

stream object so if I run this stream

78:50

let's see what

78:51

happens if you look 9 and 10 are new

78:54

records because here my flag is

78:56

false for all other records the flag is

78:59

true and there are three deletes in the

79:01

sense old records are internally deleted

79:04

and then it is newly inserted and if I

79:08

go to a particular

79:12

restaurant if I select

79:16

this so here this type information got

79:19

changed okay and all other information

79:23

here also the restaurant phone number

79:24

got changed all other information

79:27

looks good so this change should reflect

79:29

into my Dimension table and for that I

79:32

need

79:33

to run this merge statement on my

79:36

restaurant Dimension table so let's

79:38

execute

79:43

this so it says five records got inserted

79:46

so 3 + 2 and three got updated let's

79:49

refresh

79:51

this so if you see this The Big Chill

79:57

here I have two entries for The Big Chill

79:57

one of them should be false and one of

79:59

them should be true

80:02

so the first one looks true and this one

80:05

looks false and wherever I have a false

80:07

I got the effective

80:09

end date so I managed to populate the second

80:12

dimension table looks

80:20

good if I refresh my ER diagram I can

80:23

see my restaurant Dimension and

80:26

location Dimension so as we

80:28

continue to add more and more Dimension

80:30

this ER diagram will refresh so we got

80:34

the restaurant object populated in our

80:37

consumption layer as a restaurant_

80:39

dim now next we are going to touch

80:41

upon the customer_

80:47

dim this is my customer entity worksheet

80:50

and so first I'm going to create this

80:53

customer table in the stage location without

80:56

specifying any data type and all of them

80:58

are having a text data type so this got

81:01

created successfully now as we have done

81:04

for other entities we are going to

81:06

create a stream

81:08

object let's refresh and see how does it

81:12

look like here so I got the

81:14

customer and if I look into the

81:18

columns I can see my PII tag likewise in

81:22

the restaurant also we had one

81:29

column this restaurant phone number also

81:32

having the sensitive tag looks good so

81:36

this is my initial load CSV file and

81:39

this is a standard copy command which I'm

81:40

going to run

81:42


81:46

so my copy got executed and total five

81:49

customers are loaded let's see how this

81:51

customer file looks

81:53

like so this is my customer CSV file

81:56

where I have total five customers as

81:58

initial load in the second file I'm

82:01

having 6 7 8 three additional customers and

82:04

then on the third file I have 9 10 two

82:07

customers but customers 2 3 4 have got some

82:10

changes as we have done for our

82:13

restaurant just to see whether our SCD2

82:15

is working fine or not so I already got

82:18

the five records loaded now I'm going to

82:21

create another customer table under

82:23

clean schema this time I have customer

82:25

surrogate key customer ID and rest of the

82:28

information is exactly the same and each

82:30

of these columns got an appropriate data type

82:33

with not null wherever applicable so let

82:35

me quickly run

82:44

this now I'm going to create a stream

82:47

object on my clean

82:52

customer now that is also done
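Every entity follows the same chained pattern, which can be sketched like this (schema and stream names are illustrative):

```sql
-- A stream on the stage table feeds the clean layer, and a stream on the
-- clean table feeds the dimension; each MERGE consumes its own stream.
CREATE OR REPLACE STREAM stage_sch.customer_stm ON TABLE stage_sch.customer;
CREATE OR REPLACE STREAM clean_sch.customer_stm ON TABLE clean_sch.customer;

-- stage -> clean : MERGE INTO clean_sch.customer USING stage_sch.customer_stm ...
-- clean -> dim   : MERGE INTO consumption_sch.customer_dim USING clean_sch.customer_stm ...
```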

82:55

now as we have done in our earlier

82:58

entities I am going to run a merge

83:00

statement if it is a new customer it

83:02

will get inserted if it is an existing

83:04

customer then it will update each of

83:06

this record so my clean layer exactly

83:09

represents my source data now let me

83:11

quickly run this merge

83:16

statement so I got my five records

83:19

inserted now customer is also a master

83:22

information so I'm going to create a

83:24

customer dim

83:25

here I have a customer HK which is a

83:26

hash key and customer ID as a standard

83:29

Source primary key and rest of the

83:31

information as

83:35

is so this table got created I'm going

83:39

to follow the merge approach where if

83:41

the customer Dimension will take the

83:43

customer stream object from the clean

83:45

layer and if all the values are same

83:47

including the action equals to delete

83:49

flag then it is a change and a new record

83:53

will be inserted and if it is a new

83:55

customer then a new insert will happen

83:57

as a part of this merge

84:00

strategy since we are getting all the

84:01

new five rows it will insert and let's

84:03

go back and check how the customer

84:05

Dimension looks

84:07

like so let me refresh

84:11

it and I can see a customer Dimension

84:13

available here and if I see the data

84:16

preview so this matches with my data set

84:21

so this is looking great again this is

84:24

all mock data created for our simulation

84:27

purpose so let's run this list command

84:30

first to check the file name so this is

84:32

my file

84:34

name which matches day one insert

84:37

customer looks good so let me run this

84:40

copy

84:41

command so I ran the copy command and it

84:45

loaded three

84:48

rows now this is my first merge statement

84:52

let me run this

84:55

three rows got added moving to the next

85:00

merge this is my customer dim

85:03

merge so now let me refresh this here so

85:07

I got total 1 to 8 all looks good now

85:11

let's quickly check our update with

85:15

SCD2 so this

85:19

is let me copy the name

85:27

so this time I have total five entries

85:29

looks good so in these five entries two

85:32

are insert operations and three are

85:35

update operations so going to my clean

85:40

merge so if you see two records got

85:43

inserted and three records got updated as

85:46

a part of

85:47

first customer insert now let me go to

85:50

the dimension merge statement

85:58

and here I have got five records inserted

86:00

which is 3 + 2 and three rows got updated

86:04

okay which is expected let's go and

86:05

refresh our customer dim

86:08

table now if you see some of these

86:13

customers like number two and number three

86:13

are having two entries so one of them

86:15

will be

86:17

false so this is false and all of them

86:21

are true so this is working fine let's

86:23

quickly check the

86:24

lineage so the customer dim is coming from

86:28

customer clean layer and customer clean

86:30

has an upstream of customer stage and

86:33

customer stage has an upstream from the CSV

86:36

stage looks good so now let me refresh

86:40

this now when I refresh my ER diagram I

86:43

can see my customer Dimension is also

86:47

available along with my restaurant

86:49

Dimension and my location Dimension so

86:52

we managed to complete the flow

86:54

for three

86:55

entities now let's see how we can follow

86:58

the same approach for customer address

87:01

which is having a one to many

87:02

relationship with customer Dimension so

87:04

these are the sample CSV files which we

87:06

loaded for customer address and the

87:09

first file has total five address

87:11

entries the second file follows exactly

87:14

the same rule where we have 6 7 8 IDs for

87:17

the customer address and then I do not

87:20

have any update here I'm just keeping it

87:22

very simple so the next records are

87:24

having a 9 and 10 okay so here I'm also

87:28

following the same process I'm going to

87:29

create the customer address

87:35

table looks good and next I'm going to

87:38

create the customer address

87:43

stream now I am running a copy

87:45

statement and loading the data from this

87:48

customer address location

87:56

now five rows Got Loaded let's quickly

87:59

validate

88:01

that so if I go to data

88:05

preview I can see address ID customer ID

88:08

flat number floor and that is all the

88:11

data looks good so now I'm going to

88:13

create a customer address under the

88:15

clean

88:17

location so it got created now we are

88:20

going to create a stream object

88:24

this is also done now I'm running a

88:26

merge operation as we have done it

88:28

earlier and this merge operation again

88:30

running a customer address

88:36

Stream So five rows Got Loaded now I'm

88:40

going to create a customer address Dimension

88:41

here this is my customer address HK and

88:44

this customer ID I change the name to

88:47

customer ID FK so we know that this is a

88:49

source system foreign key relationship

88:51

though we are not going to build a

88:52

relationship here but I'm just keeping

88:54

it for better

88:58

traceability so my address Dimension

89:02

table got created now I'm running the

89:03

merge statement as we have done for

89:05

other three

89:08

entities so my five rows Got Loaded let

89:12

me refresh this so I got my customer

89:16

dim and if I click on

89:19

a data preview I have all the data set

89:23

since I have all the updates available

89:25

I'm directly running from this Delta

89:27

customer

89:29

address and I'm not going to run one by

89:33

one so since I just gave the root folder

89:36

name it loaded all the five rows and two

89:40

different files looks good now I will

89:42

quickly run the merge statement so this

89:45

is the first merge statement on clean

89:51

layer and this is my second merge statement

89:54

on dim

89:57

table looks good so if I refresh

90:02

this I got all the 10 records order is

90:05

not maintained during the copy

90:08

so this is how it looks like so we

90:11

managed to complete the customer address

90:14

dim

90:17

also so now when I refresh this I also

90:22

got my customer address in this ER

90:24

diagram looks

90:27

good next we are talking about menu

90:31

object and here this is my menu initial

90:34

load CSV file and I have two other files

90:38

at the first go we are going to load

90:39

around 12 menus then on the second Delta

90:43

I have from 13 to 20 and on third I have

90:46

21 to 26 again we are not making any

90:49

changes here and this is all a one-time

90:52

operation so let's quickly create a menu

90:56

Dimension table so I'm going to create

90:59

menu table inside my stage

91:05

schema so this got created next I'm

91:08

going to create menu

91:14

stream then we are going to run a copy

91:16

command which will pick the CSV file

91:18

from initial menu location

91:24

so 12 records got

91:26

loaded if I see I have total 12 records

91:29

looks

91:30

good and let's quickly see how the data

91:33

looks

91:37

like so this is my menu let me shrink I

91:41

see copy got executed and in data preview I

91:45

have total 1 to 12 menus under different

91:49

restaurants and this looks good to me

91:56

now I'm going to create a menu table

91:59

under the clean

92:04

schema and this is my merge

92:11

statement now next I'm going to create

92:15

menu Dimension and if you look into this

92:17

this is menu Dimension HK hash key and I

92:20

have given menu Dimension hash key which

92:22

is my internal data warehouse key and this

92:26

primary key is from The Source system

92:28

and this foreign key is from The Source

92:29

system so this is how you can also

92:31

follow some kind of a column level

92:34

description to make sure that your users

92:36

who are doing the development on the top

92:39

of this table will have a better

92:41

understanding so let me create this menu

92:43

dim

92:45

table it got created now I'm going to

92:49

run a merge operation on menu dim by

92:52

considering menu stream from the clean

92:55

schema so total 12 rows got

93:01

added Let me refresh

93:05

this I can see menu

93:08

dim and when I go to the data

93:11

preview I can see Total 1 to 12 menu

93:14

item and here different restaurant and

93:17

all that information is visible looks

93:19

good now again we are not loading

93:24

individual files because both the files

93:26

are having only new records so let me run

93:29

it

93:33

quickly so I can see total eight and six

93:36

records got loaded so here I have eight

93:38

records and here I have six records looks

93:42

good
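The information-schema check mentioned earlier for validating loads looks roughly like this (the table name and time window are illustrative):

```sql
-- COPY_HISTORY is an information-schema table function; it reports, per file,
-- where the data came from, how many rows loaded, and the load status.
SELECT file_name, stage_location, row_count, status, last_load_time
FROM TABLE(information_schema.copy_history(
  TABLE_NAME => 'MENU',
  START_TIME => DATEADD(hour, -1, CURRENT_TIMESTAMP())
));
```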

93:43

now I will run my first merge on clean

93:48

menu so expected number of rows inserted

93:53

14 and now running the next Dimension

93:56

merge

93:57

statement now let me refresh

94:01

this so I can see all the data set

94:04

available here looks good sometimes this

94:07

does not get refreshed and if you have

94:09

to get this refreshed I have to reload

94:10

the page so let me do

94:13

that so I can see total 26 entries which

94:17

matches with this looks good Let me

94:20

refresh this too so I can see my menu

94:24

Dimension is available here looks

94:28

good next we are going to work on

94:30

Delivery Agent which is another Master

94:33

data set and these are the CSV files

94:36

which we have already loaded into our

94:38

stage location if you look into the

94:41

first file we have total five Delivery

94:44

Agent in the second file we have another

94:47

five delivery agents and in the third

94:49

file it is only updating the first and

94:52

10th record just to see if our SCD2 is

94:54

working properly for this Delivery Agent

94:57

entity so first we are going to create

95:00

Delivery Agent table in the stage

95:02

location let me quickly create

95:05

this and we are going to create the

95:09

delivery agent_

95:14

STM so now I'm going to load the data

95:18

from this initial Delivery Agent

95:20

location

95:26

let's quickly validate

95:34

this

95:36

so this looks

95:40

good now I am creating the Delivery

95:43

Agent table under the clean schema with

95:46

primary key Auto increment as surrogate

95:49

key and rest of the information looks

95:51

almost the same
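The clean-layer tables all share the shape described here; a minimal sketch with assumed column names:

```sql
-- Typed columns, NOT NULL where applicable, and an auto-increment surrogate
-- key on top of the source system's own primary key.
CREATE OR REPLACE TABLE clean_sch.delivery_agent (
  delivery_agent_sk  NUMBER AUTOINCREMENT PRIMARY KEY,  -- warehouse surrogate key
  delivery_agent_id  NUMBER NOT NULL,                   -- source system primary key
  name               STRING NOT NULL,
  phone              STRING,                            -- could carry a sensitive tag
  vehicle_type       STRING
);
```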

95:55

now like we have done earlier I am going to

95:57

create a stream object in my clean layer

95:59

to populate the Dimension

96:03

table this got created running a merge

96:07

statement very similar pattern so five

96:10

rows got

96:14

inserted so I'm creating Delivery Agent

96:16

dim table and here I have hash key and

96:20

other primary keys and rest of the

96:22

columns from the domain

96:25

entity so this is created and now my

96:28

merge statement which will insert or

96:30

update data into the Delivery Agent dim

96:33

table based on the Delivery Agent stream

96:41

object so this is a similar process and

96:45

uh now I got my delivery agent

96:49

Dimension table created let's quickly

96:51

review all looks good since I have

96:54

inserts and updates I am going to run one

96:56

by one so I will first take

97:01

this and this is my Delta I'm going to

97:04

take

97:05

this and make sure that I run the copy

97:09

command so copy command got executed

97:12

five

97:14

records this

97:17

is my clean layer merge statement let me

97:20

run it

97:23

this got

97:25

updated now my next merge statement on

97:28

the dim

97:32

table this also got updated let's review

97:36

this so I have total 10 records

97:41

great and now let's rerun and get the

97:46

second file

97:51

name let me run the copy statement

97:56

so two rows got added looks good now

98:01

they are both update statement so let's

98:03

run this clean layer

98:06

first it is only update no insert looks

98:10

good now if I go to my Dimension

98:13

table it will have two updates and two

98:16

deletes looks

98:20

good so for the 10 I have two entries and

98:23

for the 1 I have two entries so two of

98:26

them will be having false and two of

98:28

them will be having true looks good
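The insert/delete pairs seen here come from the stream's metadata columns, which can be inspected directly (illustrative query):

```sql
-- A standard stream represents an UPDATE on the source table as a
-- DELETE (old image) plus an INSERT (new image), both with
-- METADATA$ISUPDATE = TRUE; brand-new rows are INSERTs with FALSE.
SELECT delivery_agent_id,
       METADATA$ACTION,    -- 'INSERT' or 'DELETE'
       METADATA$ISUPDATE   -- TRUE when the row is half of an update pair
FROM clean_sch.delivery_agent_stm;
```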

98:30

these are false and these are true so I

98:33

managed to get Delivery Agent Dimension

98:34

table populated now let me refresh

98:44

this so this is my Delivery Agent

98:47

Dimension table so we managed to get six

98:51

Dimension table in my consumption layer

98:53

now we are going to process the

98:54

transaction table but for the

98:56

transaction table we are not going to

98:57

populate the dimension table they will

99:00

represent as a fact

99:02

table so these are the dimension table

99:06

or Master data set which we have already

99:08

populated in our consumption schema and

99:11

it is already having slowly changing

99:13

Dimension type 2 implemented now this

99:16

next set of three tables are primarily

99:19

transaction table and we are not going

99:20

to populate them as dimension tables in

99:22

the consumption layer rather they

99:24

will be created as a fact table and when

99:27

we talk about a fact table we always ask

99:29

ourselves what would be the grain here

99:31

the granularity of the fact table would

99:34

be order item so delivery entity order

99:37

entity and order item entity will be

99:40

populated until the clean layer and from there

99:42

we will create only one single fact and

99:45

we also have to create a date Dimension

99:47

before that okay so let's start with

99:49

delivery entity
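The date dimension mentioned above is typically generated rather than loaded; one common sketch (the range and column list are assumptions):

```sql
-- Generate two years of dates with a row generator. For a guaranteed
-- gap-free sequence, ROW_NUMBER() over SEQ4() can be used instead.
CREATE OR REPLACE TABLE consumption_sch.date_dim AS
SELECT DATEADD(day, SEQ4(), TO_DATE('2024-01-01')) AS calendar_date,
       YEAR(calendar_date)                         AS year,
       MONTH(calendar_date)                        AS month,
       DAYOFWEEK(calendar_date)                    AS day_of_week
FROM TABLE(GENERATOR(ROWCOUNT => 730));
```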

99:55

so let me change the context

100:00

quickly so if you look into this data

100:02

set this is my delivery initial load I

100:05

have total eight delivery IDs and this

100:08

is my order ID and this is the Delivery

100:10

Agent ID so it's starting from 1 to 8

100:13

they are one to one mapping here when

100:15

I'm going to day two I'm having 9 10 11 12

100:18

and these are all 9 10 11 12 order ID

100:21

when I'm going to second Delta file I

100:24

have 13 14 15 16 exactly 13 14 15 16

100:28

order IDs let me first create the

100:31

delivery

100:35

table it got created now as usual I'm

100:40

creating this stream

100:42

object this is all done now I'm running

100:45

my copy

100:47

command so eight records Got Loaded

100:50

looks

100:51

good now I'm creating a delivery table on the

100:55

clean

100:59

layer so it is created likewise I'm

101:02

creating a stream

101:04

object this got created now I'm running

101:07

this merge statement as we have done

101:11

earlier so this is looking good and

101:15

Let me refresh here so this is my stage

101:19

delivery and I will go to clean delivery

101:22

first

101:26

so I have total eight records looks good

101:28

now I would not go beyond that because

101:30

this is my transaction

101:32

table moving to order entity and let's

101:36

quickly see how the order entity CSV

101:38

file looks like so I have total eight

101:40

orders then I have total 12 9 10 11 12

101:45

orders in the first Delta and I have

101:47

another four orders in the second Delta

101:49

which is exactly matching with our

101:51

delivery CSV file and I'm going to

101:54

follow the same

101:55

path
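For transactional entities the pipeline is shorter: a COPY into stage and a single stream-driven MERGE into clean, with no dimension merge. A sketch with assumed names and columns:

```sql
COPY INTO stage_sch.orders
FROM @stage_sch.csv_stg/initial/orders/
FILE_FORMAT = (FORMAT_NAME = 'stage_sch.csv_ff');

-- TRY_CAST guards the typed clean layer against junk text values.
MERGE INTO clean_sch.orders AS tgt
USING stage_sch.orders_stm AS src
  ON tgt.order_id = TRY_CAST(src.order_id AS NUMBER)
WHEN MATCHED THEN UPDATE SET
  tgt.order_status = src.order_status
WHEN NOT MATCHED THEN INSERT (order_id, customer_id_fk, order_status)
VALUES (TRY_CAST(src.order_id AS NUMBER),
        TRY_CAST(src.customer_id AS NUMBER),
        src.order_status);
```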

101:58

so let me change the context quickly

102:03

first and this is my order table here I

102:06

have given the column level

102:08

comment and I have kept this is a source

102:11

system primary key FK and FK so looks

102:17

good so my orders table created now I'm

102:21

creating this stream object

102:24

on orders table this got created now I

102:28

am simply running this copy

102:35

statement and total eight rows Got

102:37

Loaded looks good now I am creating a

102:40

orders table in the clean

102:45

schema this also got created and I'm

102:48

creating this stream object on my orders

102:51

table in the clean schema

102:55

looks good now this is my merge

102:57

statement which will merge data on clean

103:00

schema orders table from this STM so I

103:03

have to make it

103:07

STM now let's run

103:11

this okay so eight rows Got Loaded looks

103:20

good now let me run the Deltas one by

103:25

one I think there is a spelling mistake

103:28

this is orders

103:31

okay let me quickly check

103:35

this yes it is orders so let me run

103:39

this and since I'm not making any

103:42

changes to the orders I will run both

103:44

the order files first and then

103:47

follow with the merge

103:48

statement so it got loaded

103:53

I have single merge statement because

103:55

I'm not going to the dimension

103:58

layer so total eight rows got added now

104:02

let me refresh this in the clean so this

104:04

is my orders and this is how the data

104:07

preview order ID customer ID restaurant

104:09

ID FK and if I go to the

104:12

column I can see these descriptions and

104:15

snowflake also allows you to change the

104:18

description if you have necessary

104:20

privileges on the table okay and this is

104:24

my surrogate key looks good now
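The column descriptions and the surrogate key just shown can be expressed with Snowflake DDL roughly as below; the table and column names here are assumptions for illustration, not the project's exact schema:

```sql
-- Illustrative sketch only: names are assumed, not the project's.
CREATE TABLE IF NOT EXISTS clean_sch.orders (
    order_sk         NUMBER AUTOINCREMENT
                     COMMENT 'Surrogate key generated in the warehouse',
    order_id         STRING  COMMENT 'Primary key as per the source system',
    customer_id_fk   STRING  COMMENT 'FK to the customer entity',
    restaurant_id_fk STRING  COMMENT 'FK to the restaurant entity',
    order_date       TIMESTAMP,
    total_amount     NUMBER(10, 2)
)
COMMENT = 'Orders entity in the clean layer';

-- With the necessary privileges, a description can be changed later:
COMMENT ON COLUMN clean_sch.orders.order_id IS
    'Natural key received from the OLTP source';
```

These comments are what shows up under the Columns tab in Snowsight, which is why they are worth adding at table-creation time.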

104:27

moving to order item entity this is the

104:30

granularity which we are going to follow

104:32

to create our order item fact

104:35

table let's see how the CSV file looks

104:37

like so this is my initial load having

104:41

total 14 entries and you can see here

104:46

order one has only one item, order two

104:48

has two items, order three has one, order

104:50

four has one, order five has two items

104:53

likewise we have that if I go to Second

104:55

Delta file this also has around 14

104:57

records and if I go to this I have total

105:00

another 12 records looks good again I am

105:03

not making any changes here these are

105:05

all direct insert

105:06

statements so let me create the order

105:09

item table inside my stage

105:12

schema so first let me change the

105:20

context so my order item table is

105:22

created I am creating the order item

105:25

stream object on my order item

105:27

table and now I'm running this copy

105:32

command so there is some problem I think

105:37

so this is order

105:42

items now I am creating a clean layer

105:45

order

105:49

item and now I'm creating this stream

105:52

object

105:55

so all good

105:57

now this is my merge statement which

106:00

will read data from stream object and

106:01

insert into order item inside the clean

106:04

layer looks

106:07

good so 14 rows got loaded now here I do

106:12

not have anything to update so I'm just

106:16

directly running this copy statement

106:23

I can run this merge

106:27

statement so 25 rows got added and let

106:31

me refresh this so I got my order item

106:33

also populated here and if I say order

106:36

item here this is also populated I can

106:39

see data preview and I have total 39

106:43

rows available looks good now every data

106:46

warehouse project needs a date Dimension

106:48

and that's what we are going to do after

106:50

creating this transactional table up to

106:53

clean layer so to create a date

106:54

Dimension we will start with a minimum

106:57

order date and we will get the order

106:59

date from the order table and we'll take

107:01

the minimum of that and from there we

107:03

are going to use a Common Table

107:05

expression approach to create a

107:06

dimension table if you're not very sure

107:08

how to use a Common Table expression I

107:10

have a detailed video which shows how

107:12

you can use a recursive Common Table

107:14

expression to populate a date Dimension
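The recursive common table expression approach described above looks roughly like this; column and object names are illustrative, not the project's exact ones:

```sql
-- Rough sketch of the recursive CTE approach (names are illustrative).
-- Anchor: the minimum order date; recursive step: add one day at a time.
INSERT INTO consumption_sch.date_dim
WITH RECURSIVE cal AS (
    SELECT MIN(order_date)::DATE AS calendar_date
    FROM clean_sch.orders
    UNION ALL
    SELECT DATEADD(day, 1, calendar_date)
    FROM cal
    WHERE calendar_date < CURRENT_DATE()
)
SELECT
    HASH(calendar_date)       AS date_dim_hk,   -- hash key column
    calendar_date,
    YEAR(calendar_date)       AS year,
    QUARTER(calendar_date)    AS quarter,
    MONTH(calendar_date)      AS month,
    WEEKOFYEAR(calendar_date) AS week,
    DAYOFYEAR(calendar_date)  AS day_of_year,
    DAYNAME(calendar_date)    AS day_name
FROM cal;
```

The anchor member runs once to seed the first row, and the recursive member keeps appending one day until it reaches the stop condition, which is what produces one row per calendar date.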

107:17

now let me shrink

107:21

this so if you look into this date

107:23

Dimension table I have hash key calendar

107:27

date followed by different date

107:30

columns and let me create this

107:32

particular date

107:35

Dimension it is created and if you look

107:38

into this insert statement which is

107:40

primarily taking the minimum value from

107:43

the orders table and once it gets this

107:46

one it recursively follows the approach

107:49

to get the date Dimension let me run

107:52

this insert statement with the

107:53

Common Table

107:57

expression now I got 376 rows inserted

108:01

looks

108:02

good and if I come to this

108:07

layer so this is my date Dimension I

108:10

have all the data available 376 rows we

108:13

have quarter month week days of the year

108:16

and everything is available good now we

108:18

got all the necessary transaction table

108:20

up to the clean layer and the date

108:22

Dimension in the next part of the video we

108:24

are going to create order item fact and

108:27

before that let's refresh our ER diagram

108:30

to see whether all these tables are

108:32

available there or

108:45

not so now I can see date Dimension also

108:48

available okay so I'm going to create a

108:51

fact table quickly along with the

108:53

relationship and obviously it will

108:55

convert into a star

108:58

schema so this is my 12th order item

109:01

fact

109:04

worksheet so first I'm going to create a

109:07

order item fact table which has a

109:10

primary key which is auto increment I

109:12

can also make it SK surrogate key feel

109:15

free to change as per your requirement

109:17

but this is surrogate key so I will make

109:19

it SK and now I have order item

109:21

which is the primary natural key coming

109:23

from the source system and then there is the

109:25

order ID and the customer dim key

109:29

customer address dim Key Restaurant dim

109:31

Key Restaurant location dim key menu

109:33

Dimension key Delivery Agent Dimension

109:36

key order date Dimension key followed by

109:38

some of the measure which I have added

109:40

here so let me create this fact

109:44

table now this fact table is created now

109:47

this is a merge statement which is

109:48

joining all the dimension table based on

109:52

the FKs available in each of the tables

109:55

and finally if the record gets changed

109:58

then it gets updated and if the record

110:01

is a new record then it is an insert

110:03

statement so let me quickly run

110:12

this so I can see total 63 rows got

110:16

added and let's see how this fact table

110:19

looks like so if I refresh this this is

110:21

my consumption schema the fact table

110:24

appears here and this is the

110:29

complete fact table structure and if I

110:32

say data preview I can see a lot of data

110:35

available and this is how there are some

110:39

null values because there might be some

110:41

issues with my data but finally it is

110:43

showing the quantity and subtotal and

110:46

everything right so the small data set I

110:49

have might have some gaps and that's

110:51

why we got some null values so if

110:53

you have to avoid the null values you

110:55

cannot have a left join you can just

110:57

keep an inner join and then you will not

110:59

have null values in your

111:02

system once our fact and dimensional

111:05

tables are populated I'm going to

111:07

quickly run this alter statement which

111:10

will create this relationship so let me

111:14

run

111:15

this so all of them got executed

111:18

successfully if I refresh this ER

111:21

diagram let's see what happens

111:27

friends now you can see my ER diagram got

111:31

refreshed and I have this star schema

111:34

where at the center of this star schema

111:36

I have this order item fact table and if

111:39

I click on any of this relationship I

111:41

can see this relationship is built based

111:44

on the hash

111:46

key now since my fact table is created I

111:50

can quickly create views
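Before the views, the fact-load merge described above can be sketched as follows: clean-layer order items are joined to each dimension on its source FK to resolve the dimension keys. All names are illustrative assumptions, and the LEFT JOINs are what allow null dimension keys when a lookup misses (INNER JOINs would silently drop those rows instead):

```sql
-- Rough shape of the fact-load merge (illustrative names only).
MERGE INTO consumption_sch.order_item_fact AS tgt
USING (
    SELECT oi.order_item_id,
           o.order_id,
           c.customer_hk,          -- dimension hash keys
           r.restaurant_hk,
           d.date_dim_hk,
           oi.quantity,
           oi.price,
           oi.subtotal
    FROM clean_sch.order_item oi
    JOIN clean_sch.orders o
        ON oi.order_id_fk = o.order_id
    LEFT JOIN consumption_sch.customer_dim c
        ON o.customer_id_fk = c.customer_id
    LEFT JOIN consumption_sch.restaurant_dim r
        ON o.restaurant_id_fk = r.restaurant_id
    LEFT JOIN consumption_sch.date_dim d
        ON o.order_date::DATE = d.calendar_date
) AS src
    ON tgt.order_item_id = src.order_item_id
WHEN MATCHED THEN UPDATE SET
    tgt.quantity = src.quantity,
    tgt.price    = src.price,
    tgt.subtotal = src.subtotal
WHEN NOT MATCHED THEN INSERT
    (order_item_id, order_id, customer_dim_key,
     restaurant_dim_key, order_date_dim_key,
     quantity, price, subtotal)
VALUES
    (src.order_item_id, src.order_id, src.customer_hk,
     src.restaurant_hk, src.date_dim_hk,
     src.quantity, src.price, src.subtotal);
```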

111:53

let me create this view which is

111:54

primarily annual revenue and let's do

112:00

that and if I run select a star from

112:03

this annual revenue let's see how does

112:05

it look

112:08

like so if you see on 2024 my total

112:12

revenue was

112:14

15,44 and on 2023 it is

112:17

9,880 5 orders 10 orders this is looking

112:20

good now I'm going to create the

112:23

monthly Revenue

112:26

View and this also got

112:29

created so if I see what is my monthly

112:31

number looks

112:33

like this is how it looks like on the

112:36

year 2024 this is all month and these

112:38

are the sample data looking good now if

112:42

I go and create a daily revenue

112:45

then this is my daily revenue and if I

112:49

run for 2024 month 7

112:54

this is how it looks

112:56

like and I am using these views in my

113:01

Streamlit dashboard and let's see how

113:03

does it look

113:07

like so here this is my select year

113:10

based on the dummy data we have

113:11

populated this is my total revenue this

113:14

is my average revenue per order this is

113:16

my maximum order value total number of

113:19

orders and average revenue per item

113:23

and if you see the total revenue for all

113:26

years number of orders and maximum

113:29

order and this is all the monthly Trend

113:33

though they are not very well organized

113:35

but at least you can see the clear data

113:37

and if I select year 2023 the number

113:41

changes since I only have data for two

113:43

years for this sample data set it is not

113:46

showing the comparative

113:48

analysis and this is December and

113:51

November okay so this is how you can

113:54

really create end to end data flow and

113:57

make sure that you do everything in a

113:59

single platform and in the next step I

114:02

will rerun the entire script by

114:03

populating large amount of data and

114:05

we'll come back to this dashboard and

114:07

see how does it look like let's quickly

114:10

discuss how this lineage looks like once

114:13

we create all the object under different

114:17

schema so this is my consumption schema

114:20

and this is my view and this is my daily

114:23

revenue kpi and if you look into this

114:26

this is my daily revenue kpi and this

114:29

daily revenue kpi is based on my fact

114:32

table and this is also depends on my

114:35

date Dimension and this fact table

114:37

depends on multiple Dimension object and

114:41

that's what you can see here these all

114:42

dimensions are Upstream object for my

114:45

fact table now if you look into this

114:47

order item delivery item they are under

114:50

the clean schema and this fact table is

114:53

also populated based on this two object

114:56

and from here I can see

114:59

my entire lineage up to this stage

115:03

location and this lineage diagram helps

115:05

me to understand where all my fact table

115:08

is being populated and how I'm creating

115:11

my view so this is a very very useful

115:14

feature which snowflake has added as a

115:16

preview as on today but maybe in future

115:19

it will add more and more column lineage

115:21

and many other enhancement to this

115:23

preview version we already loaded a very

115:25

small set of data and that data flow

115:29

automatically built this lineage diagram

115:31

now I have already loaded the large

115:34

amount of data under a different

115:36

database called production and I have

115:39

followed exactly the same process what I

115:41

have already shown you now let me show

115:43

you what is the data size which I have

115:45

loaded and

115:49

processed so this is my stage location

115:52

and if I look into my

115:54

orders I have quite a lot of data set

115:58

available here and if I go to order

116:03

item here you see I have total 41 MB of

116:07

data size and I ran exactly the same

116:09

process what we have done for a small

116:11

data set under this production database

116:15

and if I go to my consumption

116:17

layer and if I click on my order item

116:21

fact table

116:23

it has got close to 6.2 million records

116:27

and if I click onto this

116:30

tables you can see the number of rows as

116:33

well as the size of the data so we have

116:36

total 6.2 million records and that data

116:40

set is available from

116:42

2019 to 2024 total five year of data set

116:47

available and now let's see how our

116:50

dashboard looks like so this is my

116:53

project and this is my Streamlit

116:55

dashboard and I am clicking on this

116:58

swiggy Revenue

117:02

dashboard right now it is getting

117:05

loaded and here you see I have total 1 2

117:10

3 4 5 six years data set available if I

117:12

click on 2019 the data gets changed so

117:16

this is my total revenue this is my

117:18

average revenue per order maximum order

117:21

value total orders and average revenue

117:25

per

117:27

item this is my monthly Revenue trend

117:31

for year 2019 and if I hover over this bar

117:36

it is showing the exact monthly

117:38

revenue and this is the monthly Revenue

117:40

Trend and if you look into the top 10

117:43

restaurant for December 2019 this is how

117:46

the data looks like I can

117:50

check so data gets changed for a

117:53

different month likewise if I go and

117:56

choose like 2021 let's see what happens

117:59

now here it shows how much revenue has

118:02

grown compared to the last year so you

118:05

can really use this Streamlit widget and

118:08

this is the total revenue and again this

118:10

is all dummy data which I have created

118:13

to demonstrate this example now here

118:16

this is 2021 monthly revenue and this is

118:19

the trend right

118:22

I say February let's see what happens so

118:25

this got changed and here I can also

118:29

maximize it and minimize it whatever I

118:31

want to

118:32

do now let's see and choose 2023 so the

118:36

revenue has jumped a lot and this is how

118:39

this overall dashboard looks like and

118:40

you can create a different kind of

118:41

dashboard once you have this fact and

118:43

dimension table you can always take one

118:45

of the dimension and around the

118:47

dimension you can ask different kind of

118:49

questions and help your business users

118:51

to get the answer through this data

118:53

platform once everything is done now it

118:56

is time to think how to automate the

118:58

entire flow so it runs without any human

119:01

intervention and as soon as the CSV

119:04

files are dropped into the stage

119:05

location this automated job runs and

119:08

starts processing the CSV files one by one

119:11

there are multiple approaches and we

119:13

will not practice them here in this

119:15

video else this tutorial will be too

119:17

long we will only discuss the commonly

119:20

practiced approach that many data

119:22

project teams follow let's start with

119:24

first approach in this approach each

119:27

data set like customer menu delivery and

119:30

so on will be wrapped inside stored

119:32

procedures using Snowflake SQL scripting so

119:35

if we have around 9 to 10 entities if I

119:38

take this example each of these stored

119:40

procedures will run the copy command then

119:43

run the merge statement to push the data

119:45

from stage to consumption layer now this

119:48

individual stored procedure will be

119:49

called by another parent stored

119:51

procedure and this parent stored procedure

119:54

will be invoked using a task and that task

119:56

runs once a day or multiple times a day

119:59

based on the data arrival frequency
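This first approach, a parent stored procedure calling per-entity child procedures and scheduled by a task, can be sketched like this; every procedure and warehouse name here is an assumption for illustration:

```sql
-- Illustrative sketch of the orchestration approach (assumed names).
CREATE OR REPLACE PROCEDURE run_swiggy_pipeline()
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
    -- each child procedure runs its COPY command and merge statements
    CALL load_customer();
    CALL load_restaurant();
    CALL load_delivery();
    CALL load_orders();
    CALL load_order_item();
    RETURN 'pipeline completed';
END;
$$;

-- A task invokes the parent procedure on a schedule:
CREATE OR REPLACE TASK swiggy_pipeline_task
    WAREHOUSE = adhoc_wh
    SCHEDULE  = 'USING CRON 0 2 * * * UTC'   -- once a day at 02:00 UTC
AS
    CALL run_swiggy_pipeline();

ALTER TASK swiggy_pipeline_task RESUME;      -- tasks are created suspended
```

The cron expression would be adjusted to match the actual data arrival frequency, and the task must be resumed explicitly before it starts firing.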

120:02

alternatively the same approach can also

120:04

be implemented in a slightly different

120:06

manner where all the copy commands can

120:09

be wrapped in a single stored procedure

120:11

all the stage to clean merge SQL

120:13

statements can be wrapped in a single

120:15

stored procedure and all the clean to

120:17

consumption merge statements can be

120:19

wrapped in another single stored

120:21

procedure and then all of these

120:23

will be called one after another from

120:25

the parent stored procedures and this

120:28

parent stored procedure will be again

120:30

called using a task based on certain

120:33

frequency there are other approaches

120:35

like creating a DAG using a task tree

120:38

where each of the tasks is stream-aware

120:40

so it will trigger only if there is a

120:42

data available in the Stream object
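A stream-aware task uses a `WHEN` condition so that it skips a scheduled run whenever its stream is empty; a minimal sketch, with assumed object names, looks like this:

```sql
-- Sketch of a stream-aware task (illustrative names): it wakes up on
-- schedule but only runs the merge when the stream has new rows.
CREATE OR REPLACE TASK merge_delivery_task
    WAREHOUSE = adhoc_wh
    SCHEDULE  = '5 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('STAGE_SCH.DELIVERY_STM')
AS
    MERGE INTO clean_sch.delivery AS tgt
    USING stage_sch.delivery_stm  AS src
        ON tgt.delivery_id = src.delivery_id
    WHEN MATCHED THEN UPDATE SET
        tgt.delivery_status = src.delivery_status
    WHEN NOT MATCHED THEN INSERT
        (delivery_id, delivery_status)
    VALUES
        (src.delivery_id, src.delivery_status);
```

Chaining such tasks with `AFTER` clauses is what turns them into the DAG (task tree) mentioned above, with each downstream task firing only when its upstream work has produced new stream data.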

120:45

alternatively you can also solve this

120:47

problem using Dynamic tables this video

120:50

become very very long if I start

120:52

explaining each of these approaches and

120:54

if you are really really interested to

120:56

go beyond this video I suggest you to

120:58

visit my Udemy page where all the

121:00

scenarios are discussed and explained in

121:02

detail with a small as well as large

121:05

data set so you can download the content

121:08

and also practice to enhance your

121:09

snowflake data engineering

121:12

skill thank you so much for watching I

121:15

hope you found this video informative if

121:17

you learned something new please give it

121:19

a thumbs up and share your thoughts in

121:22

the comment section below I would love

121:24

to hear from you don't forget to

121:26

subscribe for more content like this

121:28

happy learning and keep growing
