SWIGGY Data Pipeline | End To End Data Engineering Project In Snowflake
FULL TRANSCRIPT
Swiggy is a well-known food delivery and quick-commerce platform in India. I'm sure many of us have used Swiggy to order food or groceries, and every time we do, those flashy offers pop up on our mobile screens, tempting us to place an order before the deals vanish. I'm sure you know that these deals are not created randomly: platforms like Swiggy use a data-driven approach to design these offers and customize them based on user preferences. The goal is to retain users and encourage them to spend more and more on the platform, which helps increase overall revenue as well as revenue per customer and per order.
According to data published in newspapers, Swiggy handles 1.4 million food orders daily across India. That's nearly 1,000 orders per minute. Last year this number was 700k, and now it has doubled. They operate in 500 cities, partner with 140k restaurants, and have 200k active delivery executives. Incredible, isn't it?
This clearly shows how fast they are growing, and to sustain this growth they must have a powerful data and analytics platform to support their data-driven decision making. Millions of users actively use platforms like Swiggy, and all their transactions (like placing an order) and mobile activities are stored in OLTP systems and some SaaS platforms, respectively. These OLTP systems could be one or more RDBMSs under the hood, supporting the transactional business processes. The transactional data captured by the OLTP systems then moves to a data warehouse or data lake platform for reporting, advanced analytics, and machine learning workloads. To handle such high-volume, high-speed data, many of us might think of massively parallel processing (so-called MPP), distributed computing like Apache Spark or Databricks, or cloud services like AWS data services or Azure data services. However, in this end-to-end data engineering hands-on tutorial we will use the Snowflake data platform, where you don't need to worry about any infrastructure setup, and we will learn how easily you can build a powerful self-service data warehouse system in just a few hours.
You might be wondering: why Swiggy for this end-to-end data engineering project? Well, to truly learn end-to-end data engineering you need to solve a real-life business problem, and for that you need some understanding of how a business application works and how it stores data before diving into the data engineering and analytics activities. We have all used Swiggy or similar food delivery apps, so as we go through this project it will be much easier to connect the dots and understand why we are doing what we are doing. Any food aggregator app will have a restaurant entity with a catalog or menu of the food it serves, clubbed with some discounts and promotions. There will be orders, delivery locations, and payment activities; a delivery executive will deliver the food; and finally there is a rating for the service. It is also easy to understand what kind of reporting requirements such a food aggregator will have to monitor the health of the business month over month. Their founders and leaders would like to know total revenue, average revenue per order, average revenue per item, top-performing restaurants, revenue trend over time, revenue by customer segment, revenue by restaurant and location, delivery performance, geographic revenue insight, and many more such metrics. That's why we will use this food aggregator example to learn end-to-end data engineering work.
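Metrics like these ultimately boil down to simple aggregate queries over the final fact and dimension tables. As a rough, hypothetical sketch (the table and column names here are illustrative, not the exact objects we build later in the tutorial):

```sql
-- Illustrative KPI queries over an order-item-grain fact table.
-- Total revenue, average revenue per order, average revenue per item:
SELECT
    SUM(oi.subtotal)                               AS total_revenue,
    SUM(oi.subtotal) / COUNT(DISTINCT oi.order_id) AS avg_revenue_per_order,
    AVG(oi.subtotal)                               AS avg_revenue_per_item
FROM order_item_fact oi;

-- Top-performing restaurants by revenue:
SELECT r.restaurant_name, SUM(oi.subtotal) AS revenue
FROM order_item_fact oi
JOIN restaurant_dim r ON oi.restaurant_dim_key = r.restaurant_dim_key
GROUP BY r.restaurant_name
ORDER BY revenue DESC
LIMIT 10;
```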
So who is this course designed for? If you are a data engineer, a data analyst, a SQL developer, a Python developer working on a data engineering project, or a cloud developer, and you want to learn how to build an end-to-end data warehouse platform using Snowflake, then this course is for you. If your next question is "do I need to buy any tool to complete this hands-on guide?", the answer is no. We will use the free trial edition of Snowflake, which provides $400 in free credits and lasts for 30 days. All the exercises in this tutorial can be completed at no cost; your only investment is your time.
It is important to let you know what is not covered in this tutorial. An end-to-end data project, strictly speaking, starts with extracting data from source systems and pushing it to the data platform automatically. However, in this project we will use synthetically generated data in CSV format for the different entities. We will load this data using Snowflake's file upload feature via the Snowsight web UI, which allows up to 250 MB of data to be loaded at one time. If we visualize the overall architecture diagram of this end-to-end data project, this is how it looks: we will load the data using Snowflake's file load feature into the stage location; after that the data will go through a series of transformations; and finally we will use a Streamlit app to visualize the KPIs for the food aggregator business.
A food aggregator platform like Swiggy has many business processes, and it is impossible to cover them all in one tutorial. So in this guide we will focus on the food ordering subprocess: the various tables and datasets that need to be ingested, and how they are interrelated in the OLTP system.
Welcome to my channel, Data Engineering Simplified, the only YouTube channel offering practical Snowflake video tutorials for both beginner and expert data professionals. If you are curious where you can find the source code and the data files used in this tutorial, please find the link below in the description section. Before we proceed, a quick note: all the hands-on exercises are done using Snowflake's free trial Enterprise Edition hosted on AWS. I suggest watching this video in 4K resolution, and if you are a fast learner, consider speeding up the video to 1.25x or 1.5x. You can also change the language and listen to the tutorial in your preferred language, like German or Hindi. If you are looking to learn in a more structured way with complete source code and data files, be sure to check out my courses on Udemy; some of them are also available through Udemy Business, and all of my courses have a rating of 4.7 or higher. For queries, suggestions, or feedback, drop a direct message to my Instagram account; the link is given below. And yes, if you would like to stay updated on Snowflake's new and popular features, end-to-end data engineering projects, architectural concepts, and live demo videos, don't forget to subscribe to this channel, Data Engineering Simplified. So let's start this tutorial.
Before any food order is placed, the food aggregator needs to target a location where they will serve. Once the location is finalized, they onboard restaurants, and each restaurant then has its menu uploaded into the system. The customer logs in using their email or phone number and creates a customer profile, including their delivery location, the so-called customer address. The customer then searches and places an order; once the order is placed, a delivery agent is picked for the delivery, and the delivery happens. So if you look at this picture at a very high level, a minimum of 9 to 10 such entities or tables are involved to support the order business subprocess. In real life there would be many more tables than this, but for this tutorial we are considering 9 to 10 entities.
Now, first, let's go through the ER model that represents these entities and their relationships. I'm using DBeaver Community Edition to draw this ER model, and I'm assuming my OLTP system is hosted on a PostgreSQL RDBMS. If you use DBeaver and there are relationships between the tables, this ER diagram comes automatically; you don't have to draw it, and using reverse engineering you can generate the diagram through DBeaver. This is my customer entity, and each customer has one or many addresses, so there's a relationship between these two entities. On the other side I have the restaurant, and each restaurant has menu items; again a one-to-many relationship. Then we have the delivery agent, and this is my delivery table. Whenever an order is placed by a customer, the order table gets one record, and the order item table gets one or more records based on the number of order items per order; again a one-to-many relationship. Each customer may log in multiple times during the order process, and this is my location entity, where the location is tracked. When I click here, it shows the customer relationships, all one-to-many: customer to login, and customer to order. Likewise, order item is also linked with menu; then there is order to order item, order to delivery, and delivery to delivery agent. Whenever you work on a data engineering project in any domain, it is very important to understand the source system's ER diagram, because based on those relationships you will design your data warehouse tables under the different layers.
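To make one of these one-to-many relationships concrete, here is a minimal, hypothetical PostgreSQL sketch of two of the entities. The real source DDL isn't shown in the tutorial, so the column names are illustrative:

```sql
-- Illustrative OLTP-side DDL: one customer has many addresses.
CREATE TABLE customer (
    customer_id   SERIAL PRIMARY KEY,
    name          TEXT NOT NULL,
    mobile        TEXT,
    email         TEXT,
    created_date  TIMESTAMP DEFAULT now()
);

CREATE TABLE customer_address (
    address_id    SERIAL PRIMARY KEY,
    -- the foreign key is what draws the 1:N edge in the ER diagram
    customer_id   INT NOT NULL REFERENCES customer (customer_id),
    flat_no       TEXT,
    city          TEXT,
    pincode       TEXT
);
```

Tools like DBeaver read exactly these foreign-key constraints to reverse-engineer the diagram.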
Let's quickly see what the data looks like. If I open customer, this is how the synthetic data looks: I have a customer ID, customer name, mobile, email address, whether the login used a Gmail or Facebook account, gender, date of birth, anniversary, different food preferences, when the record was created, and when it was modified. Now moving to the customer address book: here I have customer ID as a foreign key, and the rest of the information, like flat number, house number, floor, building, and landmark. If you pay attention, this is a number field and this is a text field, so the tool also helps you verify the data types, and by looking at them you can decide what data types to use when you host this data in your data platform. Now let's quickly look at the delivery agent: delivery agent ID is the primary key, then we have a name, phone number, vehicle type, location ID, status, rating, and created date. If I go to delivery: delivery is a transaction table where every delivery is mapped to an order ID and delivered by a delivery agent, so it is a child table of both order and delivery agent. For every delivery primary key you have these foreign keys associated with it, and here we have a delivery status, estimated time, address ID, and so on.
Now let's talk about restaurant. For our end-to-end data processing we have generated quite a lot of synthetic data, and this data only belongs to the location New Delhi; we are not considering all the cities. For simplicity I am considering only one location, and these are the restaurants we have under it. This is a master table with the name of the restaurant, type of restaurant, pricing for two, the restaurant phone number, operating hours, and location ID; if you look, the location ID is always 1 because everything belongs to one location. Then we have the active flag, open status, and so on. Again, this is completely synthetic data. If I go to location, let's see what it looks like: I have Delhi, the state, the ZIP code, an active flag, and the other information, plus the standard created date and modified date, which are technical fields. So this is another master table, and it will be converted into a location dimension when we get into our data warehouse platform. Now let's go to menu. Menu is a child master table keyed by restaurant ID: for restaurant 1 I have two menu items, restaurant 2 has quite a lot, and restaurant 5 has a menu as well. The data is not very consistent; that is fine for demonstration purposes, and this too will be converted into another dimension when we move it into the Snowflake data warehouse platform.
Now let's go to order. The order ID is the primary key of this table; each order belongs to a single customer, and one customer can have many orders. An order is placed from a particular restaurant, so the restaurant ID is a foreign key. Then there is the date when the order was placed, the total amount, the current status, and the payment method; these are the important pieces of information. This is not a master table; it is a transaction table. Now let's see how the order item looks. One single order can have multiple order items, and each carries a menu ID, because you can have one or many order items as part of your order; then there is a quantity, total price, and subtotal, plus when the record was created and when it was modified. This is also a transactional table. When we create the fact and dimension tables, the fact granularity will be at the item level, not at the order level.
Every enterprise-grade data project starts with source analysis. Sometimes you have access to the ER diagram of the source system and sometimes you don't; either way, you need to understand the relationships between tables, the data types, the data volume, how the data is generated, and whether each table is a master table or a transaction table, because this information is very important when you map the source into your dimensional model. We will see that in a later part of this video.
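That master-versus-transaction distinction maps directly onto the dimensional model: master tables become dimensions, and the transaction tables feed the fact. A hedged sketch of the item-grain join (table and column names are placeholders, not the exact objects built later):

```sql
-- Illustrative fact-grain query: one row per order item, so we join
-- the order header onto order_item. Master-table keys on these rows
-- are later resolved to dimension surrogate keys.
SELECT
    oi.order_item_id,      -- grain: one row per item
    o.order_id,
    o.customer_id,         -- -> customer dimension
    o.restaurant_id,       -- -> restaurant dimension
    oi.menu_id,            -- -> menu dimension
    oi.quantity,
    oi.subtotal,
    o.payment_method
FROM order_item oi
JOIN orders o ON o.order_id = oi.order_id;
```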
So in this section we have seen what the ER diagram looks like for our food aggregator system and what the data looks like. The first part of the tutorial deals with a subset of the data, which we will look at in CSV form, so let me jump into my VS Code editor and show you those sample data files.
This is my VS Code editor, where I have already kept the generated data under different folders. In this end-to-end data project we will demonstrate how to do the initial load and then how to process the delta load, which is a very common pattern in many batch processing systems. This is my location data, restaurant data, customer, customer address, menu, order, order item, delivery agent, and delivery CSV; for each of these entities I have a handful of records. First we will build the end-to-end solution, then we'll go and try to load thousands of records. This is a common practice in any data engineering project: the business team gives you a sample dataset, based on that sample you build your entire end-to-end data pipeline, and once it is deployed you validate it against the full dataset. If you look at location, I have the five-row location CSV, and then I have delta data which simulates inserts as well as inserts followed by updates. Likewise, if I go to restaurant, I have only five restaurants, and in the delta file again a handful of restaurants. Then this is my customer data, where you can see the customer insert and customer insert-update files, and then customer address. Next we go to the menu; this is how it looks, and if you see here I have some invalid records. In any data engineering project you will see a lot of inconsistent data, which you need to handle as part of your transformation layer. I have a handful of orders here, then the delta orders coming in two different files; likewise order item and its delta files, the delivery agent and its delta files, and finally the delivery transaction table with its two delta files. This is how my CSV files look; they are exported from my PostgreSQL database as a handful of records.
Now let's quickly understand the overall end-to-end data architecture for this food aggregator data platform. We already have a lot of CSV files for the entities, and those CSV files will be loaded through the file loader in Snowflake. They will be placed into the stage location, and from the stage location we will run the COPY command, which loads the data into a table. From that table the data moves further into the clean layer, and this clean layer is nothing but a schema where we do basic cleansing, processing, and transformation. From the clean layer the data finally moves to the consumption layer, where we build the fact and dimension tables. We will go step by step, starting from our location entity through to our order item fact table. Once the data is available in the fact and dimension tables, we will create some views and display the data through our Streamlit app for the end users. 90 to 95% of data engineering projects follow this architecture, though you may see slightly different naming conventions: some organizations say raw or stage, some call it landing and curation, and the consumption layer may be called the modeled layer. If you follow the Databricks architecture, they call this the medallion architecture, starting with bronze, then silver, then gold; as you go from left to right, the data quality increases and the data becomes more and more consumable.

What are we trying to achieve by following this overall architecture? Our primary objective is to consume source data into a data platform and, after a set of transformations, shape it into star-schema dimensional modeling: all the master data (customer, customer address, restaurant, menu, restaurant location, delivery agent) is converted into dimension tables, and order item, along with the order table, is converted into an order item fact. So your clean schema primarily represents the OLTP-style dataset, and your consumption schema has the data in the form of dimensional modeling. So let's go to our Snowsight web UI and start working on the code.
This is my Snowsight web UI, and I have already created different worksheets; each worksheet will help us create different objects under different schemas. This is my first script, called "create database and schema". As part of this script we are going to create a database called sandbox, which is what we generally do, and inside the sandbox I'm going to create a stage schema, a clean schema, a consumption schema, and one schema called common, where all the common objects will reside. Let me shrink the left panel, and first let me switch the role. Now my role is switched, and I'm also going to choose a different virtual warehouse, the ad hoc virtual warehouse. Now let's create the database called sandbox, and I will first use this database; if you notice, my context changed to SANDBOX.PUBLIC, where PUBLIC is the default schema. Next I'm going to create the four different schemas: the stage schema, clean schema, consumption schema, and common schema. If you look at our architecture, these correspond to the stage, clean, and consumption layers, plus the common schema that holds all the policy objects and other common objects which can be accessed from any of the schemas in our data flow. Now I run the USE SCHEMA command, so my current schema is the stage schema, and since we are going to process a lot of CSV files, I'm going to create a file format called csv_file_format.
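The step-1 script described above would look roughly like this; the role, warehouse, and schema names are reconstructed from the narration and may differ slightly from the published script:

```sql
-- Illustrative step-1 setup: database plus the four schemas.
USE ROLE sysadmin;          -- role name is an assumption
USE WAREHOUSE adhoc_wh;     -- ad hoc virtual warehouse, name illustrative

CREATE DATABASE IF NOT EXISTS sandbox;
USE DATABASE sandbox;

CREATE SCHEMA IF NOT EXISTS stage_sch;        -- raw landing layer
CREATE SCHEMA IF NOT EXISTS clean_sch;        -- typed, cleansed layer
CREATE SCHEMA IF NOT EXISTS consumption_sch;  -- facts and dimensions
CREATE SCHEMA IF NOT EXISTS common;           -- tags, policies, shared objects

USE SCHEMA stage_sch;
```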
Snowflake supports six types of file format, and CSV is a very common one; it also supports Parquet, JSON, XML, ORC, and Avro. If you are not familiar with what a file format or a stage location is used for, I would suggest you go and watch this particular video. Now I'm going to create a stage called CSV stage, where we will load all the CSV files we reviewed at the beginning of this tutorial, and I'm also enabling the directory service. So this is created. In this other tab I would like to see the sandbox database and all the schemas; let's refresh it. This is my sandbox database, my stage schema, followed by the clean schema, the consumption schema, and the common schema. If I go inside the stage schema, I can see the CSV stage and the file format, and this file format will help parse our CSV files when we load them through the COPY command.
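A minimal sketch of the file format and internal stage DDL; the option values follow the properties described later in the narration, and the exact names in the published script may differ:

```sql
USE SCHEMA sandbox.stage_sch;  -- schema name is illustrative

-- CSV file format: comma field delimiter, newline record delimiter,
-- skip the header row, auto-detect compression (plain or gzip),
-- and handle fields optionally enclosed in double quotes.
CREATE FILE FORMAT IF NOT EXISTS csv_file_format
    TYPE = 'CSV'
    COMPRESSION = 'AUTO'
    FIELD_DELIMITER = ','
    RECORD_DELIMITER = '\n'
    SKIP_HEADER = 1
    FIELD_OPTIONALLY_ENCLOSED_BY = '"';

-- Internal stage with the directory table enabled.
CREATE STAGE IF NOT EXISTS csv_stg
    DIRECTORY = (ENABLE = TRUE)
    COMMENT = 'Internal stage for initial and delta CSV loads';
```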
I'm going to create a couple of policy tags, and we will see how they appear when we attach them to a table. If you're not familiar with tags and masking policies, I would suggest you go and watch these videos, which cover tags and policies in detail. My policies are created under the common schema; let me select all of them and run it. If I refresh, let's see whether those policy tags are visible here or not; as of today, policy tags are generally not visible in the Snowsight web UI. So as a first step we managed to create the sandbox database, a couple of schemas, a stage environment with an internal stage, and the file format: the stage schema, clean schema, and consumption schema are created, and the stage location is created. Next we are going to use the file upload feature and load the data into this stage location. You can find all the scripts in my blog; refer to the link below. The step-one script is already available there, so you can copy and paste it, and before going to the next step, please make sure these objects are available in your Snowflake data platform.
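The tag and masking-policy objects might look something like this; the tag name, allowed values, and masking rule are assumptions for illustration, since the script itself isn't shown on screen:

```sql
USE SCHEMA sandbox.common;  -- shared governance objects live here

-- Hypothetical classification tag for sensitive columns.
CREATE TAG IF NOT EXISTS pii_policy_tag
    ALLOWED_VALUES 'PII', 'SENSITIVE', 'EMAIL', 'PHONE'
    COMMENT = 'Marks columns that carry personal data';

-- Hypothetical masking policy: reveal the value only to an
-- elevated role, mask it for everyone else.
CREATE MASKING POLICY IF NOT EXISTS pii_masking_policy AS (val STRING)
RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() IN ('SYSADMIN') THEN val
        ELSE '** MASKED **'
    END;
```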
Now we have our stage location, called csv_stg; this is an internal Snowflake stage where we can load the CSV data. Since we have two types of data files, as we have seen, the initial load and then the delta loads, we are going to create two different partitions: one called "initial" and a second called "delta". Let me click on the plus-files button; my sandbox stage schema and stage are already selected, and this is the location where the data will be loaded. Here I will type "initial". First I will start with location and drag and drop the location file; this file will reside under a folder, or context, called initial, followed by location. Likewise I will create "delta" and drag and drop all the files from my location delta folder; I have multiple files, and I will demonstrate how the COPY command behaves when you try to load an invalid or duplicate file. These also go under the location context. Snowflake is very fast, so even if you have a 100 MB or 200 MB data file, it will be loaded very quickly compared to many other platforms. Now let me shrink this: this is my stage location, this is my "initial" context, this is the "location" context inside it, and here is the five-row location CSV. Likewise I have "delta", inside delta I have "location", and inside that multiple files. So my location data is loaded; I will now follow the same process for the other data files, in fast-forward mode. For this tutorial we are only focusing on one location, to keep things simple and easy to understand. Now you can see another context under "initial" is created; as I add more files, I always create a context, and under the context or partition I place the files. Let's see how it looks: here are the restaurants in Delhi plus NCR, with two files. So now you can see this is my stage location, "delta" as a partition or context, and under it two different contexts, restaurant and location, with the CSV files inside restaurant. Looks good; now I will go ahead and do the other data files.
You can see a total of 10 initial-load CSV files are loaded, again following the partition convention, so it is easy to understand. One thing to remember: when we run a COPY command, we must not rely on a loose pattern but follow the exact path, because if I just say "delivery", it can match both locations and try to copy all the CSV files under them. These are the small things you only come to know by working on a project like this yourself; and if you appear for a Snowflake interview and somebody asks what kind of challenges you have faced, you can clearly tell them about the challenges seen while loading, processing, or transforming the data. Now I'm going to load the delta files quickly in fast-forward mode. This is the delta: I have only restaurant and location so far, and I will quickly process the other eight delta files. For every entity I have two delta files: one represents the insert operation, and the other represents insert plus update operations. When we follow the SCD2 (slowly changing dimension type 2) architecture in our consumption layer, these sample files will help us. Another important thing to remember: in any SQL-based tool, ORDER is a keyword, which is why, instead of naming the table "order", I have named it "orders". Again, when following these naming conventions and partitions, make sure you cover different kinds of scenarios, create a pattern for your project accordingly, and then follow that pattern for all your stage locations, tables, and schemas.
So here we managed to load a total of 23 files, and these 23 files are spread across 10 partitions, or so-called contexts. When we use the COPY command, we will specify the stage name, then "delta" or "initial", then the name of the entity, and then the name of the CSV file, and run the COPY command. Snowflake has recently added a feature called lineage; if you want to know more about how lineage works, you can refer to this particular video, which focuses on data lineage in Snowflake. As we progress in this tutorial, we will use the lineage feature to understand the overall flow and validate it against our architecture. So in this step all the CSV files are loaded inside the initial and delta locations; I would say this part is done. In the next step we will create tables for each entity, run a COPY command, and see how we can utilize the stream object to capture the changes while loading the data.
Let's quickly explore one more thing: how you can check the files available in your internal or external stage. For that, Snowflake gives a command called LIST, followed by the at sign (@), and then the name of your stage location, either fully qualified or just the stage name. We know we created our stage location inside the stage schema, and the name of the stage is csv_stg, so if I run this command, let's see what result it brings. I can see a total of 33 files; all the files were loaded on 1st December 2024, a Sunday, and here I have the delta partition, followed by customer address, and so on. All the file sizes are available, including the MD5. If I want to narrow down and check how many files are available under my initial partition, I can add the partition name and rerun the query. Now I see only 10 records, all under the initial partitions. Remember, whenever you refer to a stage location path, it is case-sensitive in Snowflake: if it is lowercase you have to use lowercase, and if it is uppercase you have to use uppercase. So during your interview, if somebody asks whether stage location file paths are case-sensitive, you have to say yes. You can also use the REMOVE command to delete a file, or from the web UI you can click the triple-dot menu and delete the file; from there you can also download a file, load it into a table, or just copy the path.
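The stage-inspection commands from this step look roughly like this; the fully qualified names follow the objects created earlier, and the file path in the REMOVE example is illustrative:

```sql
-- List everything in the internal stage (path matching is case-sensitive).
LIST @sandbox.stage_sch.csv_stg;

-- Narrow the listing to the initial-load partition only.
LIST @sandbox.stage_sch.csv_stg/initial;

-- Remove a single staged file if needed (illustrative file name):
-- REMOVE @sandbox.stage_sch.csv_stg/initial/location/location-5rows.csv;
```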
It is also possible to query the stage location files without loading them into a table, and for that you can use the dollar notation: you write SELECT with the column positions, as shown here, FROM the stage location. This is the schema name, this is the stage name, this is the partition "initial", and this is the partition "location", and I attach a file format so Snowflake understands the shape of the CSV file. If I scroll up, we created this file format with type CSV and compression AUTO, so whether it is a plain CSV or a gzip-compressed file, it will be understood; the field delimiter is a comma, the record delimiter is a newline, skip-header is on so the header row is skipped, and if the description fields are enclosed in double quotes, the quote option handles that. That is the definition of my CSV file format. So if I run this, let's see what happens: it clearly shows the location data we saw earlier; the first line is skipped because it is a header, and the remaining five records are returned. Looks good. If I make any mistake, for example giving a capital letter in the path, and rerun it, it will not bring any result, so you have to be very careful with your stage location, partition names, and file names. I can also give a very specific file name if I have a conflict with my partition name, rerun it, and still get the same result. Always remember: since we are running a query and reading data from the stage location, this needs a virtual warehouse, and without one you may not be able to run the query.
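The dollar-notation query described above looks roughly like this; `$1`, `$2`, and so on refer to column positions in the CSV, and the column aliases are illustrative:

```sql
-- Query staged CSV data directly, without loading a table.
SELECT
    t.$1 AS location_id,
    t.$2 AS city,
    t.$3 AS state,
    t.$4 AS zip_code,
    t.$5 AS active_flag
FROM @sandbox.stage_sch.csv_stg/initial/location
    (FILE_FORMAT => 'sandbox.stage_sch.csv_file_format') t;
```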
We finished creating the database, the different schema objects, the stage location, and the file format; we loaded the files into the stage location; and now we are going to process each entity one by one until we reach the consumption layer. In the next step we start with location, which is the topmost master information in this use case. This is my next worksheet, called location entity; let me shrink this and first change my context. This is my sandbox stage schema, this is the role, and this is the virtual warehouse; all looks good. We have already tried the query that fetches the data from the location CSV file; let me rerun it, and I get a total of five records. Looks good. Whenever we load data from a stage location, it is important to add some additional columns, called audit columns, which capture the file name the data came from, the time the file was modified, the MD5 value, and the timestamp when the file was loaded. Let me run this query and show you the result. Let me shrink this: here is the stage file name, so the next time you load the same file, running this query tells you where each record came from; this is the load time, this is the MD5, and this is the time when we queried the data.
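Snowflake exposes these audit values through metadata columns on staged files. A hedged sketch of the query (the first two column aliases are illustrative; the metadata column names are standard Snowflake):

```sql
-- Staged-file metadata columns alongside the data columns.
SELECT
    t.$1                        AS location_id,
    t.$2                        AS city,
    METADATA$FILENAME           AS _stg_file_name,     -- source file path
    METADATA$FILE_LAST_MODIFIED AS _stg_file_load_ts,  -- file modification time
    METADATA$FILE_CONTENT_KEY   AS _stg_file_md5,      -- content checksum
    CURRENT_TIMESTAMP()         AS _copy_data_ts       -- query/load time
FROM @sandbox.stage_sch.csv_stg/initial/location
    (FILE_FORMAT => 'sandbox.stage_sch.csv_file_format') t;
```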
architecture diagram once again where we
will look from the lens of location
entity so first we loaded this location
CSV file into the stage location so this
is available here into this stage
location which is my CSP STG from here
we will run a copy command and that copy
command would follow the Sate structure
what we have just seen and from there we
will move the data to this clean schema
and from the clean schema we are going
to create location uncore team table
before we run the copy command we have
to create this location table along with
the location table we are also going to
create a stream object and stream object
will help us to track the changes and if
you do not know how a stream object
Works in Snowflake I have couple of
tutorials which had already explained
the stream object in detail and from
this stream object we are going to
create a next location table under the
clean schema and when we go to the clean
schema the primary objective of the
schema is to have the appropriate data
type and from there we will create this
location Dimension table with hashkey so
let's go back to our worksheet so this
is the ddl for my location table where
location ID city state ZIP code
activation flag created date and
modified date these are the domain
entities and all of them are having a
data type text and this is the
additional technical column which
generally I prefer to give with
underscore
so you are very clear that these are not
coming from the source system. Good, and
this is the comment. Now let me create this
table so my table is created
successfully Let me refresh this is my
another Tab and let's see my location
table is created and the description is
visible and all the columns are
visible so I have total 11 column and I
can always go back and check all the
column names are matching as per this
CSV file definition looks good I have
total seven column coming from the
location entity and four column which
are technical columns before I do
anything I am going to create a stream
object and the primary purpose of the
stream object is to capture any change
which happens to this table and once we
run the copy command you would
understand why this stream object is
created, and this stream object is created on
the location table, which is this
particular table; the property append_only
is true, and this is the comment.
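The statement being described might look like this; the object names are assumed:

```sql
-- Append-only stream on the stage table: it records INSERTs (from COPY)
-- and exposes METADATA$ACTION, METADATA$ISUPDATE and METADATA$ROW_ID
CREATE OR REPLACE STREAM stage_sch.location_stm
    ON TABLE stage_sch.location
    APPEND_ONLY = TRUE
    COMMENT = 'Append-only stream on the stage location table';
```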
so let me create this stream
object so this also got created when I
switch my
tab I can see here I have a stream
object and it is created on the location
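A sketch of the COPY construct about to be walked through; the stage path, file format, and column names are assumptions, and the audit columns are filled from Snowflake's stage metadata columns:

```sql
COPY INTO stage_sch.location (locationid, city, state, zipcode, activeflag,
                              createddate, modifieddate,
                              _stg_file_name, _stg_file_load_ts,
                              _stg_file_md5, _copy_data_ts)
FROM (
    SELECT t.$1, t.$2, t.$3, t.$4, t.$5, t.$6, t.$7,
           METADATA$FILENAME,              -- which file the row came from
           METADATA$FILE_LAST_MODIFIED,    -- when that file was last modified
           METADATA$FILE_CONTENT_KEY,      -- content checksum (the "md5")
           CURRENT_TIMESTAMP               -- when the row was loaded
    FROM @stage_sch.csv_stg/initial/location/ t
)
FILE_FORMAT = (FORMAT_NAME = 'stage_sch.csv_file_format')
ON_ERROR = ABORT_STATEMENT;  -- abort on any bad record, as described
```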
this is the copy command where I am
loading these specific columns, and this
is my dollar notation to fetch the data
from the CSV file and this is the
location of the CSV file I have just
given the part name and this is the file
format, and in case of error it will
abort. So this is how the copy
construct looks like in Snowflake and as
soon as I run this let's see what
happens so this is the file which is
being processed the status is Success it
means all the records are loaded and
this is the total number of record which
Got Loaded I can go back to my table
page and Let me refresh this and when I
click on this location here I can see
total five record got loaded and let's
see how the data looks
like so I got all the data and here is
my audit column shown so data is loaded
on this particular time and if I click
on a copy
history I would be able to see that this
particular file is copied and if I click
on the
lineage this is a table lineage and it
clearly says that this table is
populated from this stage location looks
good. We can run the select statement
here
and I can see the data and I can also
run the select statement on the stream
object let's see what result does it
bring so it has brought exactly the same
data set I will show you the
difference so up to here I have total 11
column and these three columns are added
by the stream object, and it clearly
says that whatever operation you will
perform on the main object a stream
object will capture that and I can run a
query on the stream object and push the
data from the stream object to the next
layer so if you look into this diagram
this is my stream object and when I'm
going to populate data to my clean
schema under location table the data
will not come from this location table
rather it will come from this stream on
the location table, and I will show you
what do I mean by that coming back to my
location worksheet here first of all I
will change the schema name to clean
schema now you can see my schema name
got changed and here I'm going to create
exactly same table the only difference
you would notice that here the data type
is not text and the data type is
matching the expected data type from The
Source system so location ID is number
city state code they are all following
the source system data type including
the limitation however here I have added
some additional information which says
whether it is Union territory or not
whether it is a capital city flag or not
zip code and additional things right and
we will see how we can enrich the data
into the clean layer okay and here every
clean table will have a surrogate key
which is nothing but Auto incremented
and primary key for this table location
ID should be the unique in this table so
this is how this table looks like so let
me create this
table. So this location table, or rather the
restaurant_location table, is created.
Let me refresh
this and I come here and I can see I
have a restaurant_location
table this does not have any data at
this stage, and I am also going to create
a stream object called restaurant_loc_stm,
and this stream object is
created on restaurant_location; it
means that any kind of change whether it
insert update delete this stream object
would be able to capture that so let me
do
that quickly refresh to check yes my
stream object is available I'm going to
run a merge operation and that merge
operation will load the data into the
restaurant_location table, where it is
going to run a query and this query is
running on your stream object let me
show you so here if you see this is a
stream object STG stream object and in
this query it is trying to enrich the
data. So here I have added a couple of
case statement if it is Delhi then make
it New Delhi and for other state it will
retain the same text and I'm adding a
new column called state code likewise I
am also adding another field called is
Union territory so if any state matches
with this name it will be flagged as a
yes otherwise it is no this is another
case statement the state is this and
city is that then we have another called
Capital City flag likewise if it is a
tier one tier 2 tier three cities so
these different additional columns are
added to this and when the location ID
matches with the location ID of the
source if any changes are being recorded
into the source table then it will be
updated if the new location is coming
then there will be insert
operation.
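The enrichment merge just described can be sketched as follows; the names, columns, and CASE values are illustrative, with the full query on the blog post the transcript points to:

```sql
MERGE INTO clean_sch.restaurant_location AS target
USING (
    SELECT
        CAST(locationid AS NUMBER)                                AS location_id,
        city,
        CASE WHEN state = 'Delhi' THEN 'New Delhi' ELSE state END AS state,
        CASE WHEN state IN ('Delhi', 'Chandigarh', 'Puducherry')
             THEN 'Y' ELSE 'N' END                                AS is_union_territory
    FROM stage_sch.location_stm   -- read the changes from the stream, not the full table
) AS source
ON target.location_id = source.location_id
WHEN MATCHED THEN UPDATE SET
    target.city               = source.city,
    target.state              = source.state,
    target.is_union_territory = source.is_union_territory
WHEN NOT MATCHED THEN INSERT (location_id, city, state, is_union_territory)
    VALUES (source.location_id, source.city, source.state, source.is_union_territory);
```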
I assume you are fully aware how the
merge operation
works you can pause and go to my blog
post and go through the query to
understand how this merge statement is
running on the stream object let me run
this merge
statement so it says five record
inserted and there is no update because
our data does not have any update
operation so looks good now let me
refresh this and sometime you may have
to refresh the entire page because this
only refresh the object entries but it
does not refresh this page but if I go
to the restaurant location and this is
still showing zeros so let me refresh
the entire browser
page now I can see total five record if
I click on a data
preview so this is looking nice and if I
see the
column I have total 16 column I added
some additional column and let me see
how does it look like so if you look
into this the state code is added is
Union territory true or false is a
capital flag true or false and if you
see that this is false because in Union
territory we do not have a capital but
we can change the Case Logic and make
sure that this stands true and then I
have a different tier tier one tier 2
tier 3 City and this is zip code
likewise active flag yes right now only
New Delhi is a active City where this
food aggregator is serving all other
cities are having a no flag, okay, and
then these are additional information
looks good so coming back to my diagram
from the stream object we ran the merge
operation to populate the location and
this location table in the clean schema also
has a stream object, so any changes
done on this location table can be
tracked by this stream object so as a
next step we are going to create a
location Dimension table and location
Dimension table will also take the data
from the stream object as stream object
only shows the changes and not the
entire data set and that's what we need
to have it whenever we are going to
process the Delta so let me go back to
my
worksheet so before we run another merge
operation where we are going to load the
data from clean location table to
location Dimension table this is how the
ddl looks like if you see I have a
primary key which is called
restaurant_location_hk, and then all the other data
set is available here and to tag the
slowly changing Dimension I am having
effective start date effective end date
and current flag right so the location
information may change over a period of
time and to make sure that we track
those changes we have to follow this
SCD2, and these are the columns which we
added into our Dimension table so let me
create this table
first so my table is created and now I
have another merge operation if you look
into this merge operation this merge
operation is trying to insert the record
into my restaurant location dim table
and it is querying the
restaurant_loc_stm, and let's see what data set do
I have
so I have total five records available
here and if I look into this metadata
column here metadata action says all the
records in the Stream objects are insert
operation none of them are update and
this is the row ID. Looks good. Now if you
look into this so basically this merge
operation will join this target table
and stream object table and it will try
to fetch if the location ID matches with
the location ID between these two table
and when the metadata action is delete
and metadata update is true it means
that record has changed, so it will set
the effective end date equal to the current
timestamp and the flag equal to false, in the
sense that this row is no longer a valid row. If
it is not matched and it is an insert
operation and is-update equals
true, it will insert a new record; if it
is completely new location then it will
primarily insert the data into the
dimension table and first let me run
this query and when we load the next set
of data you would come to know how this
merge statement is actually performing
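A hedged sketch of this SCD2 merge pattern, with column lists abbreviated and names assumed. Matching on the hash of the attribute values means the old version of a changed row (the stream's DELETE image) matches the current dimension row and gets closed, while the new version (the INSERT image) does not match and gets inserted:

```sql
MERGE INTO consumption_sch.restaurant_location_dim AS dim
USING (
    SELECT location_id, city, state,
           METADATA$ACTION   AS action,      -- 'INSERT' or 'DELETE'
           METADATA$ISUPDATE AS is_update    -- TRUE when part of an UPDATE
    FROM clean_sch.restaurant_loc_stm
) AS stm
ON dim.restaurant_location_hk =
       SHA1_HEX(CONCAT_WS('|', stm.location_id, stm.city, stm.state))
WHEN MATCHED AND stm.action = 'DELETE' AND stm.is_update THEN
    -- the old version of a changed row: close it out
    UPDATE SET dim.eff_end_dt = CURRENT_TIMESTAMP(), dim.is_current = FALSE
WHEN NOT MATCHED AND stm.action = 'INSERT' THEN
    -- a brand-new row, or the new version of a changed one
    INSERT (restaurant_location_hk, location_id, city, state,
            eff_start_dt, eff_end_dt, is_current)
    VALUES (SHA1_HEX(CONCAT_WS('|', stm.location_id, stm.city, stm.state)),
            stm.location_id, stm.city, stm.state,
            CURRENT_TIMESTAMP(), NULL, TRUE);
```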
let me select
this. So whenever we run a merge
operation it always shows the number of
record inserted and the number of record
updated right now we are doing a initial
load we are not observing any change on
an existing record that's why our update
values are zero looks good now we will
see how our Dimension table looks like
so let me refresh this and under the
consumption schema I should see one
location table. So I have a
restaurant_location_dim; looks good, and when I
click on that and let me see the data
preview I have created a hash key and
this is my restaurant location and if
you see these are my location ID which
I'm keeping it as is and this is a city
Delhi, Mumbai, Agra, Ahmedabad, Ajmer, and this is
a state and this is how the data looks
like this is active flag which is coming
from the source system and all these
records are effective start date of this
record is this, the effective end date is null,
and this is a true it means that if any
changes happens to any one of the record
this particular row item will set the
effective end date and the flag will go
to false it means that do not refer that
record now if I click on the lineage you
would be able to understand how the data
is
Flowing I can click on this plus
sign, and you can click on this plus
sign again, and beyond this plus sign I do not
have any upstream. So this location
Dimension is Created from this
restaurant location under the clean
schema and this is populated from this
which is again in the stage schema and
this data is coming from this stage
location which is this so if you see
this is a stage location this is my
table this is another table and this is
another table if you click on
this this is the description of the
table, and the number of total columns
is 13 number of rows are five and if you
would like to know where is this state
code coming from I can click on it I do
not have any Downstream lineage I can
click on Upstream lineage and it will
show that this is coming from this
restaurant_location under the clean schema
looks good however if I select City and
click on this view
lineage, click on this upstream, and
you see the distance is 1 and 2, because
city is not derived data and this is
coming from the location table under
stage
schema so if you look into this lineage
this matches with our lineage okay and
this is what we managed to achieve and
this is the same process we are going to
repeat for all other entities here we
did not follow any primary key and we
just kept the source data as is by adding
some four extra column when we came to
this location we added some
additional field along with data type
changes and when we came to this
location dimension, which is following
SCD2. So this table will always match
the source table however this table will
track all the changes and the number of
record between this two table will
always vary now let's go and push some
Delta record let's see what happens so
now I'm going to process this Delta day
one which has only two record and then I
will process Delta day two where I am
making some changes and this changes
will say this cities becomes active it
is going to add 8 9 10 as insert and
these are the changes and let's see how
these two files work when we use the
stream object as a tracker. So if
you look here I am running a list
command on my Delta location and let's
see how many files do I get
it so I have so many files I'm going to
pick my day one delta file. Let me copy this entire
location, and then I can change this
location up to the CSV file. This is going to
the data folder location folder and this
is the name of the CSV and now let me
run this copy command before I run this
copy command which will load the data
into this location table let me quickly
query the location table and the
location_stm stream under this stage
schema. I have total five records, and
all the data is coming from this initial
location CSV
file. My stream object is completely
empty it does not have any record now
let me run this copy command and before
running this copy command if I check my
copy history it has only one
file now let's run
this so it says two records loaded
which matches with this two record looks
good now if I go to this location
table I have two record number location
ID 6 and location ID 7 and again when
I'm talking about stage location it is
having exactly the same representation
as a source so looks
good and if you look into this stage
file name it is telling that these two
records are loaded from this CSV file at
this particular time which is
and if you look into this this time is
having a different time stamp so when
you're trying to debug where this data
set is coming from you can really run a
query on this technical column and
understand where from which file this
rows are
populated now pause and think what would
the location_stm stream object bring when
we run a select
statement since we inserted only two
record into the main location table or
we call it base table stream object
captured only the insert operation
because that's what this stream object
is designed for and that's what we need
to push the data from our stage layer to
the clean
layer right now I have only five records
into my clean restaurant
location and I'm going to run this merge
statement again and this merge statement
will pick only the two record from the
stream object and insert into the
restaurant_location table inside
the clean schema. So let me run
it. Since it is a merge operation, it says
two records inserted and zero updated;
looks good now let me reload the entire
page now this also got all the surrogate
keys starting from 1 to 5 and then 6, 7, and
I can see my data set is populated
clearly and all of them are looking good
okay, now I need to make sure from
this stream the data is consumed and
goes to the location dim table let's try
that out so if I run the select
statement on my restaurant_loc_stm,
I should expect two records; looks good,
and these two records will move to the
dimension table through this merge
statement so let me run
it now I can see two record inserted and
zero rows updated now
let me go to the consumption schema and
see how this location Dimension looks
like so here I can see also location ID
6 and 7 and this is my hash
key. Here my SCD2 columns are looking good;
we managed to process this newly coming
records through this day 01 CSV file. Now
if you look into the day 02 CSV file, it
has three new records, location IDs 8, 9, 10,
and it has four old record and what this
old record says it is primarily making
the active flag equals to True which was
false in the first case so this four
location ID will also be true again this
is a very hypothetical case just trying
to show you how this Delta processing
through the stream object really
captured your slowly changing dimension
in your consumption layer so let me run
this list
command and this is my Delta 2 so I
copied the
name, and I just change this, and if I run
a select on my location stream, I do not have
any data because I already consumed that
and I am going to run this copy
command. So seven rows loaded, which matches
what is here; I have total seven rows because
we skip the first header row. Now if I
run this statement again, you can pause
and think: since only inserts
happened on this append-only stream, I will have
only seven
record so here all the flags are yes so
when this stream object is going to be
joined with the restaurant_location table, it
will compare which location IDs already
exist in my clean schema; if one exists, it
will pick the changes and the changes
will be applied and if it is a newly
inserted record which is 8 9 and 10 it
will get inserted into this one so here
if you see the modify date is also
coming from the source system but here
there is no modification it is coming
null looks good so let me run this merge
operation and this will populate those
seven records by following this approach
if the location ID matches with the
location ID but any one of them got
changed then it will run the update
statement if not then it will run the
insert statement so I am running this
merge
statement three record inserted
and four record got updated looks good
so this is the three insert and this is
your four
update now for Dimension table to
capture those update and insert let's
see how the record looks like in the
restaurant location stream
object. So I'm going to pick one example,
Mumbai. There are two records because
this record got updated, so I just applied
a filter, and if I look into this
record and into the stream metadata
operation: first the stream object
records the delete operation, with the
is-update flag true, and for the second,
new record the action is insert, with the
update flag also true. And if you see
the change, initially the active flag was
no, which got a delete action, and
the new record which captured the
active flag equal to yes is coming
as an insert as part of the update. So this
is how the stream object captures an update
in detail. So now I'm going to run this
merge statement and you can go to my
blog page and you can understand how
this merge statement is written let me
run
this. So total seven records inserted
and four rows updated, and 4 +
3 = 7. To understand more, let's
see how the data set looks
like so let me show the entire table
first
so if I look into this example, Mumbai:
initially Mumbai was having a no
active flag now it is having a yes
active flag and if you look into the
result here it current flag is true and
here the current flag is false and
whenever a change happens the effective end
date column gets a value. Okay, so this is
how we managed to get SCD2
implemented in our location Dimension
table our entire data is Flowing from
Source location to the stage location
through the stream it is getting into
the clean layer and again through the
stream it is getting into the
consumption layer we are also able to
get the SCD2 type in our location
Dimension so now we manage to cover
location dimension let's go to the next
object restaurant and we are going to
follow the same process and I'm not
going to explain everything in detail
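Since the same process repeats for every entity, the per-entity skeleton can be summarized in one hedged sketch, using a generic two-column entity with all names assumed:

```sql
-- 1. Stage layer: everything as TEXT, plus an append-only stream
CREATE OR REPLACE TABLE stage_sch.entity (id TEXT, name TEXT, _stg_file_name TEXT);
CREATE OR REPLACE STREAM stage_sch.entity_stm
    ON TABLE stage_sch.entity APPEND_ONLY = TRUE;

-- 2. COPY the CSV into the stage table (path and file format assumed)
COPY INTO stage_sch.entity
FROM (SELECT t.$1, t.$2, METADATA$FILENAME FROM @stage_sch.csv_stg/entity/ t)
FILE_FORMAT = (FORMAT_NAME = 'stage_sch.csv_file_format');

-- 3. Clean layer: typed table, fed by a merge over the stage stream;
--    TRY_CAST turns uncastable junk into NULL instead of failing the merge
CREATE OR REPLACE TABLE clean_sch.entity (id NUMBER, name STRING);
CREATE OR REPLACE STREAM clean_sch.entity_stm ON TABLE clean_sch.entity;

MERGE INTO clean_sch.entity c
USING (SELECT TRY_CAST(id AS NUMBER) AS id, name FROM stage_sch.entity_stm) s
ON c.id = s.id
WHEN MATCHED THEN UPDATE SET c.name = s.name
WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name);

-- 4. Consumption layer: SCD2 dimension, fed by a merge over the clean stream
--    (close the old version, insert the new one, as shown for location)
```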
before we go to the next entity let's
try to simulate if we run a copy command
with invalid data what happens to entire
flow. This is another delta file, called
delta day 03, which has an invalid
delimiter; if you look into the
delimiter, it is a pipe sign instead of
a comma. Let's say your data platform gets
data like this, what would happen? Let's
try that out. Let me go back to my
worksheet so this is the file and let me
copy
it I copied
name and here I pasted the name and
again I would like to check if I have
any data in stream I do not have any
data in my stream now I'm running this
copy
command. So it says three rows parsed and
three rows loaded, because if there is
no comma in that case all the data set
will be loaded into the First
column. Now if I run a select star
from this location, you see for
these three rows I got incorrect data. Now my
stream will also have the incorrect
data because all the columns are coming
filled into the First Column and if you
look into this, this is coming from
the invalid-delimiter file.
now I'm going to run my merge statement
and let's see what
happens so here it failed because it
says that numeric value cannot be
recognized okay and
since this statement
failed, my stream object will still have
the data and it will not be cleaned up
now I would allow you to go and think
how would you solve this problem this is
one of the common problem when we see
that if bad record gets into your system
how would you handle that but somehow I
need to clean this stream data otherwise
it will continue to fail and it would
not allow me to proceed to the next
level. So I created a location_temp
table through the stream, and by
that I consumed all the data from my
stream; my stream is clean now. However, my
location table still has the bad records,
so I will go and delete those bad records
from the location
table so wherever I have this pipe sign
those record would be
deleted so it deleted three records and
since my stream object created on
top of location is append-only, this delete
would not be recorded. Let me reconfirm
this; so I do not
have any. Now this is my fourth-day invalid
file let's see what happens if I try to
push
this so I ran the copy command with that
file and it says four rows loaded and if
I go and run this select
statement, I can see this junk data got
Loaded into my system and likewise My
Stream will also have this data set
available which is not
right. And again when I try to run this
merge statement, since it will not get
converted into a number, this merge
statement will fail let me run
this. So numeric value is not recognized;
when you're trying to run this entire
process through orchestration tool you
need to make sure that you run the copy
command only when you have a clean data
otherwise your merge statement will
continue to fail so let me clean this
once
again and I will use the word
junk. So whenever you are building an
end-to-end data pipeline, there could be many
such scenarios which you need to
simulate in advance and accordingly you
have to write your merge statement or
you need to have a SQL statement which
would check whether it is possible or
not. One possible solution: you can
try to cast it, and if the cast is not
possible, instead of aborting on error you can say
continue; that way you can solve this
problem, so only those records which
qualify this check will get into the
stage table, otherwise they will
not. We finished loading the location entity
into our Snowflake environment
and converted this master data into
our dimension table. Now let's work on
the restaurant entity, which is another
Master data set here we have quite a lot
of column we are going to load this data
and follow the similar pattern what we
have seen and done for location
entity our objective at the end of this
section is to convert the restaurant
Master data into a restaurant Dimension
and here we will have a hash key
followed by restaurant ID and rest of
the information about the
restaurant for restaurant I have this
initial load which has total five
records starting from restaurant ID 1
to5 then if I go to the day one Delta
record it adds three additional
restaurant IDs and if you look into the
third day I add two additional
restaurant but there are some changes on
restaurant IDs 3, 4 and 5, just to simulate
whether our SCD2 is working appropriately or
not so let's go to our worksheet
so my context is already set this is my
restaurant entity worksheet and this is
my system admin ad hoc virtual Warehouse
so I'm going to create a restaurant
stage
table, and all the columns are text values;
if you look into this restaurant phone
number I am adding a tag where the
restaurant phone number is sensitive
information, or this column is classified
as sensitive, and you will see how this
appears on the table description
page. So this table got created
successfully and now I'm going to create
a stream object on the top of this
restaurant
table so I can see a restaurant table
here and all the information is visible
looks good likewise I have the
restaurant stream also available this is
looking good now we are going to run a
copy command and this copy command will
have all the domain entities from the
restaurant and these four additional
columns which we are going to add it and
let's run this copy
command so my copy got executed total
five rows
loaded when I come to this
table I can
see so these are my five rows
all the data Got Loaded successfully no
issues if I look into the copy
history it also looks green
great. Now you can also run this
information_schema table function to
check whether your copy executed or not,
rather than changing the screen and let
me run
that. So this will show the details of
this copy command: the file
name, the stage location,
the load time, and all other
detail now next I'm going to create
exactly the same structure of the restaurant
table; the only difference is that this
time I'm using a surrogate key, a
restaurant surrogate key, auto-increment
primary key and all the informations are
almost same and now you can see I have
applied some data types and some
additional constraint and I'm keeping
the same tag or restaurant phone number
now let me create this quickly
so this got created and I'm also going
to create stream
object so this also got created now this
is the insert statement which you can do
or you can also run the merge statement
as we have done earlier so if you look
into the merge statement it will select
from the stream and it will try to cast
because this time I have used try cast
in case if it fails it will convert into
a null value however if it is not null
column in our clean layer it will not
get into that table if the clean layer
restaurant ID matches with the stream
restaurant ID all the columns will be
updated otherwise it will follow insert
approach so let's run
this so I got total five rows inserted
into my clean restaurant table so this
is my clean this is my table this is my
restaurant and if I look into the data
preview I have the data, and if I
look into the lineage I can see the
lineage is built so this is my clean
goes to stage and Stage table goes to
Stage location looks
good now I am creating a restaurant
Dimension very simple here I have
restaurant HK restaurant ID and all the
informations are almost same if you look
into the line number 265 I have this
effective start date, effective end date, and
is current if the record is coming first
time then this will be a true value if
it is being updated then this gets the
current time stamp of the time of update
and this becomes false so I am creating
this restaurant
dimension. This got created, and now I am
following exactly the same merge
statement; we are going to check if the
stream has the new restaurant ID
then it will be inserted otherwise it
will be
updated. Here it checks the flags in the
stream object: the matched, changed row
gets its effective end date set to the
current timestamp and is_current equal
to false, and the new version gets
inserted with the current date as the
effective start date and the is_current
column equal to true. And if
it is a completely new data set then it
will be anyhow inserted and if you look
into the hash we are creating the hash
by taking all the column which uniquely
identify the record now let me run this
merge
statement now it got the five record
inserted let me go to the table view now
you can see restaurant
Dimension and here if I see the data
preview so I have all the restaurant 1 2
3 4 5 they all belong to City Delhi and
my location ID FK is 1 for all of them, so I'm making
sure that this location ID which is
coming as a referential Integrity from
The Source system is maintained as is
for better
traceability so if you look into the
system the data from here got into the
stage from the stage we ran a copy
command into the restorant table and as
soon as a copy command inserted data
into the restant table the stream
captured those changes and using a merge
statement it moved from stage table to
the clean table and again here also we
have all the changes is captured and
finally it moves to the restaurant
Dimension table under the consumption
schema. So this is how the restaurant data
moves. Now let's check the delta data
processing
so let me run this list
command so I have two Delta Data let me
first pick the first
one. So I have records 6, 7, 8, and these should
get into the
system now let me run this copy
command so my copy command got executed
and it confirms that total three rows got
added, and the copy history also
confirms this. Now I don't need to do
anything I can simply go and run my
merge statement and the data will go to
the dimension table
automatically. So this is my first merge
statement, into my clean restaurant table,
and it confirms that three records got
inserted. Now I am going
to run my next merge statement which
will perform the merge operation on the
restaurant dim based on the restaurant
stream from the
clean schema. So I got three records added;
good. Now here I can see 1 2 3 4 5 and 6
7 8 looks
good now if I go to this restaurant
dim and if I refresh
this I also got 1 2 3 4 5 6 7 8
and the is_current flag is true for
all of them because so far we haven't
got any updated record it is all insert
record now let me rerun this list
command
this is my second-day
delta, where 9
and 10 are new records but 3, 4, 5 are old
records with some changes; we do not know
what changes are there, but let's see. Now
I am running the copy
command. So my copy got executed with
five records, where two are new and three are
changes.
I am running my merge statement on my clean
restaurant so here it made two insert
and three updated which looks good and
with that we can certainly track the
stream object so if I run this stream
let's see what
happens. If you look, 9 and 10 are new
records, because here my is-update flag is
false; for all other records the flag is
true, and there are three deletes, in the
sense that old records are internally deleted
and then it is newly inserted and if I
go to a particular
restaurant if I select
this. So here this type information got
changed, okay, and
here also the restaurant phone number
got changed; all other information
looks good so this change should reflect
into my Dimension table and for that I
need
to run this merge statement on my
restaurant Dimension table so let's
execute
this. So it says five records got inserted,
so 3 + 2, and three got updated. Let's
refresh
this. So if you see this, The Big Chill:
here I have two entries for The Big Chill,
one of them should be false and one of
them should be true.
So the first one looks true and this one
looks false, and wherever I have a false
I got the effective
end date. So I managed to populate the second
dimension table; looks
good if I refresh my ER diagram I can
see my restaurant Dimension and
restaurant location Dimension so as we
continue to add more and more Dimension
this ER diagram will refresh so we got
the restaurant object populated in our
consumption layer as restaurant_dim.
Now the next we are going to touch
upon is customer_dim.
This is my customer entity worksheet,
and so first I'm going to create this
customer table in the stage schema without
specifying any particular data type; all of them
are having a text data type. So this got
created successfully now as we have done
for the other entities, we are going to create a stream object. Let's refresh and see how it looks: I got the customer table, and if I look into the columns I can see my PII tag — likewise in the restaurant we had one column, the restaurant phone number, carrying the sensitive tag. Looks good. So this is my initial-load CSV file, and this is the standard copy command which I'm going to run.
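A copy command of the kind being run here might look like the following; the stage name, path and file-format details are assumptions for illustration:

```sql
-- Illustrative COPY from an internal CSV stage into the stage-layer table
COPY INTO stage_sch.customer
FROM @stage_sch.csv_stg/initial/customer/
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1)
ON_ERROR = 'ABORT_STATEMENT';   -- fail fast rather than load partial files
```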
So my copy got executed and a total of five customers are loaded. Let's see how this customer file looks: this is my customer CSV file, with five customers as the initial load; the second file has customers 6, 7 and 8 — three additional customers; and the third file has customers 9 and 10, but customers 2, 3 and 4 have got some changes, just as we did for the restaurant, to see whether our SCD2 is working fine or not. So I already got the five records loaded; now I'm going to
create another customer table under the clean schema. This time I have a customer surrogate key, a customer ID, and the rest of the information exactly the same, with each column given an appropriate data type and NOT NULL wherever applicable — let me quickly run this. Now I'm going to create a stream object on my clean customer table; that is also done.
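Creating such a stream is a one-liner; the names here are illustrative. The stream then exposes each change along with Snowflake's metadata columns, which is where the true/false flags seen earlier come from:

```sql
-- Illustrative change-tracking stream on the clean-layer table
CREATE OR REPLACE STREAM clean_sch.customer_stm
    ON TABLE clean_sch.customer;

-- Querying the stream returns the changed rows plus metadata columns:
-- METADATA$ACTION (INSERT/DELETE) and METADATA$ISUPDATE
-- (TRUE when a delete+insert pair is really an update)
SELECT * FROM clean_sch.customer_stm;
```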
Now, as we have done for the earlier entities, I am going to run a merge statement: if it is a new customer it gets inserted; if it is an existing customer each record gets updated, so my clean layer exactly represents my source data. Let me quickly run this merge statement — I got my five records inserted. Now, customer is also master information, so I'm going to create a customer dimension.
Here I have a customer HK, which is a hash key, customer ID as the standard source primary key, and the rest of the information as is. So this table got created. I'm going to follow the merge approach, where the customer dimension reads the customer stream object from the clean layer: if any of the values changed — including the action-equals-delete flag — it is a change and a new record is inserted, and if it is a new customer a new insert happens as part of this merge strategy. Since all five rows arrive as new, it will insert them; let's go back and check how the customer dimension looks
like. So let me refresh it — I can see the customer dimension available here, and the data preview matches my data set.
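The HK columns in these dimensions are hash keys computed from the business attributes; one hedged way to build them, with illustrative column names, is:

```sql
-- Illustrative hash-key expression for the dimension's HK column
SELECT
    SHA1(CONCAT_WS('|', customer_id, name, email, phone)) AS customer_hk,
    customer_id,
    name
FROM clean_sch.customer;
```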
So this is looking great — again, this is all mock data created for our simulation purposes. Let's run the LIST command first to check the file name: this is my file name, which matches the day-one insert customer file, looks good. So let me run this copy command — I ran it and it loaded three rows. Now this is my first merge
let me run this
Z rows got added moving to the next
Merch this is my customer dim
much so now let me refresh this here so
I got Total 1 to7 all looks good now
let's quickly check our update with
sd2 so this
is let me copy the name
so this time I have total five entries
looks good so in this five entries two
are insert operation and three are
update operation so going to my team
merge so if you see two record got
inserted and three record got updated as
a part of
first customer insert now let me go to
the dimension margin statement
and here I have five records inserted — which is 3 + 2 — and three rows updated, which is expected. Let's refresh our customer dim table: some of these customers, like numbers 2, 3 and 4, now have two entries, and for each pair one of them will be false — this one is false and the current one is true — so this is working fine. Let's quickly check the lineage: the customer dim comes from the customer clean layer, customer clean picks up a stream from the customer stage, and the customer stage loads from the CSV stage — looks good. Now when I refresh my ER diagram I can see my customer dimension available along with my restaurant dimension and my location dimension, so we managed to complete the flow for three
entities. Now let's see how we can follow the same approach for customer address, which has a one-to-many relationship with the customer dimension. These are the sample CSV files we loaded for customer address: the first file has five address entries; the second file follows exactly the same rule with IDs 6, 7 and 8; and I do not have any updates here — I'm keeping it very simple, so the last file has IDs 9 and 10. Here I'm also following the same process: I'm going to create the customer address table — looks good — and next the customer address stream. Now I'm running a copy statement, loading the data from this customer address location.
Five rows got loaded; let's quickly validate that. In the data preview I can see address ID, customer ID, flat number, floor and so on — the data looks good. Now I'm going to create a customer address table under the clean schema; it got created, and next we are going to create a stream object.
This is also done. Now I'm running a merge operation as we have done earlier, and this merge again reads the customer address stream — five rows got loaded. Next I'm going to create the customer address dimension: this is my customer address HK, and I renamed the customer ID column to customer_id_fk so we know it represents a source-system foreign-key relationship; we are not going to build the relationship here, but I keep it for better traceability. So my address dimension
table got created now I'm running the
merge statement as we have done for
other three
entities. So my five rows got loaded; let me refresh — I got my customer address dim, and the data preview shows all the data. Since both delta files contain only inserts, I'm loading directly from the delta customer address location rather than file by file; because I gave just the root folder name, it loaded all five rows from the two files — looks good. Now I will quickly run the merge statements: the first merge on the clean layer and the second on the dim table. Looks good. So if I refresh
this, I got all 10 records — note the order is not maintained during the copy —
so this is how it looks. We managed to complete the customer address dim as well, and when I refresh, the customer address also appears in this ER diagram — looks good. Next we are talking about the menu
object. Here is my menu initial-load CSV file, and I have two other files: in the first go we load around 12 menus, the second delta has menus 13 to 20, and the third has 21 to 26; again we are not making any changes here — these are all one-time inserts. So let's quickly build the menu entity: I'm going to create the menu table inside my stage
schema so this got created next I'm
going to create menu
stream then we are going to run a copy
command which will pick the CSV file
from initial menu location
So 12 records got loaded; if I check, I have a total of 12 records, looks good. Let's quickly see the data: this is my menu — let me shrink it — the copy got executed, and in the data preview I have menus 1 to 12 under different restaurants, which looks good to me. Now I'm going to create a menu table under the clean schema, and this is my merge
statement. Next I'm going to create the menu dimension; if you look into this, the menu dimension HK is the hash key — my internal data warehouse key — while this primary key and this foreign key come from the source system. This is how you can also add column-level descriptions, so that users developing on top of this table have a better understanding. So let me create this menu dim
table — it got created. Now I'm going to run a merge operation on menu dim, reading the menu stream from the clean schema — a total of 12 rows got added. Let me refresh: I can see menu dim, and the data preview shows menu items 1 to 12 with their different restaurants and all that information, looks good. Again, we are not loading the delta files individually because both contain only new records, so let me run them quickly: eight and six records got loaded — eight here and six here — looks good.
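The column-level descriptions mentioned for the menu dimension can be attached with standard Snowflake comment syntax; the names below are illustrative:

```sql
-- Illustrative column comments so downstream developers see key semantics
COMMENT ON COLUMN clean_sch.menu.menu_id IS 'Source system primary key';
ALTER TABLE clean_sch.menu
    MODIFY COLUMN restaurant_id_fk
    COMMENT 'Source system foreign key to restaurant';
```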
Now I will run my first merge on clean menu — expected rows inserted: 14 — and then the next dimension merge statement. Let me refresh: I can see all the data available here, looks good. Sometimes this view does not refresh, and to force it I have to reload the page, so let me do that: I see a total of 26 entries, which matches — looks good. Let me refresh this too, and my menu dimension is available here. Looks
good. Next we are going to work on the delivery agent, another master data set; these are the CSV files already loaded into our stage location. The first file has five delivery agents, the second has another five, and the third only updates the first and tenth records, just to see if our SCD2 works properly for this entity. So first we create the delivery agent table in the stage location — let me quickly create this — and then the delivery agent stream. So now I'm going to load the data
from this initial Delivery Agent
location
Let's quickly validate this — it looks good. Now I am creating the delivery agent table under the clean schema, with an auto-increment primary key as the surrogate key; the rest of the information looks almost the same.
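The clean-layer pattern just described — an auto-increment surrogate key plus typed, constrained source columns — might be sketched as follows, with assumed column names:

```sql
-- Illustrative clean-layer table: surrogate key + typed source columns
CREATE OR REPLACE TABLE clean_sch.delivery_agent (
    delivery_agent_sk NUMBER AUTOINCREMENT PRIMARY KEY, -- surrogate key
    delivery_agent_id NUMBER NOT NULL,                  -- source primary key
    name              STRING NOT NULL,
    phone             STRING,
    vehicle_type      STRING,
    status            STRING
);
```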
Now, as before, I create a stream object in my clean layer to populate the dimension table — created — and run a merge statement with a very similar pattern: five rows got inserted. Then I create the delivery agent dim table with the hash key, the source primary key and the rest of the columns from the domain entity. This is created, and my merge statement inserts or updates data in the delivery agent dim table based on the delivery agent stream object — a similar process — so my delivery agent dimension table is ready. Let's quickly review — all looks good. Since I have
inserts and updates, I am going to run the files one by one: first this one, and this is my delta — I'll take it and make sure I run the copy command. The copy command got executed with five records. This is my clean-layer merge statement — let me run it; it got updated. Next, the merge statement on the dim table — this also got updated. Let's review: I have a total of 10 records, great. Now let's rerun and get the second file name, and run the copy statement.
So two rows got added, looks good. These are both update statements, so let's run the clean layer first — only updates, no inserts, looks good. Going to my dimension table, it gets two inserts and two updates: for agent 10 I have two entries and for agent 1 I have two entries, so two of them will be false and two will be true — these are false and these are true, looks good. So I managed to get the delivery agent dimension table populated; now let me refresh
this. So this is my delivery agent dimension table, and with that we have six dimension tables in our consumption layer. Now we are going to process the transaction tables, but for these we are not going to populate dimension tables — they will be represented as a fact table. These are the dimension tables, or master data sets, already populated in our consumption schema with slowly changing dimension type 2 implemented. Now these
next three sets of tables are primarily transaction tables; we will not populate them as dimension tables in the consumption layer — they will feed a fact table instead. When we talk about a fact table we always ask ourselves what the grain is, and here the granularity of the fact table is the order item. So the delivery, order and order item entities will be populated up to the clean layer, and from there we create one single fact; we also have to create a date dimension before that. Okay, let's start with the delivery entity
— so let me change the context quickly. If you look at this data set, this is my delivery initial load: eight delivery IDs, each with an order ID and a delivery agent ID, running from 1 to 8 with a one-to-one mapping. In the first delta I have 9, 10, 11, 12, matching order IDs 9 to 12, and in the second delta I have 13, 14, 15, 16, matching order IDs 13 to 16. Let me first create the delivery
table — it got created. As usual I create the stream object — done — and run my copy command: eight records got loaded, looks good. Then I create a delivery table in the clean layer, plus its stream object, and run the merge statement as before — looking good. Let me refresh: this is my stage delivery, and in clean delivery I have a total of eight records, looks good. I will not go beyond that because this is my transaction
table. Moving to the order entity — let's quickly see how the order entity CSV files look: a total of eight orders, then orders 9 to 12 in the first delta and another four orders in the second delta, exactly matching our delivery CSV files, and I'm going to follow the same path.
So let me change the context quickly first. This is my order table; here I have given column-level comments marking the source-system primary key and the FKs, which looks good. My orders table is created; I create the stream object on the orders table — done — and simply run this copy statement: a total of eight rows got loaded, looks good. Then I create an orders table in the clean schema — also created — with its stream object, looks good. This merge statement merges data into the clean-schema orders table from that stream — I just have to fix the stream name. Now let's run this: eight rows got loaded, looks good. Now let me run the deltas one by
one — I think there is a spelling mistake; it should be orders. Let me quickly check — yes, it is orders, so let me run this. Since I'm not making any changes to the orders, I'll run both order delta files first and then the merge statement. They got loaded; I have a single merge statement because I'm not going to the dimension layer, and a total of eight rows got added. Now
let me refresh this in the clean schema: these are my orders, and the data preview shows order ID, customer ID, restaurant ID FK; under the columns I can see the descriptions — Snowflake also lets you change a description if you have the necessary privileges on the table — and this is my surrogate key, looks good. Now, moving to the order item entity: this is the granularity we will follow to create our order item fact
table. Let's see how the CSV files look: this is my initial load with 14 entries — order one has one item, order two has two items, order three has one, order four has one, order five has two, and so on. The second delta file also has around 14 records and the third has another 12, looks good; again I am not making any changes here — these are all direct insert
statements. So let me create the order item table inside my stage schema — first let me change the context. The order item table is created; I create the order item stream object on it and run this copy command — there is a small problem, I think: it should be order items. Then I create a clean-layer order item table and its stream object — all good.
now this is my merge statement which
will read data from stream object and
insert into order item inside the clean
layer looks
good. So 14 rows got loaded. Here I do not have any updates, so I directly run the copy statement and then the merge statement — 25 rows got added. Let me refresh: my order item is populated here too, and the data preview shows a total of 39
rows available, looks good. Now, every data warehouse project needs a date dimension, and that's what we'll do after bringing these transaction tables up to the clean layer. To create the date dimension we start with the minimum order date — taken from the order table — and from there use a common table expression to build the dimension. If you're not sure how to use one, I have a detailed video showing how a recursive common table expression can populate a date dimension. Now let me shrink
this. If you look into this date dimension table, I have the hash key and calendar date followed by the various date columns; let me create this particular date dimension — created. The insert statement primarily takes the minimum value from the order table, and once it has that it recursively builds out the date dimension; let me run this insert statement with the common table expression — I got 376 rows inserted,
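The recursive CTE being run here generates one row per day from the minimum order date onward; a minimal sketch, with assumed table and column names, looks like:

```sql
-- Illustrative recursive date generator feeding the date dimension
INSERT INTO consumption_sch.date_dim (calendar_date, year, month, day_of_year)
WITH RECURSIVE dates AS (
    SELECT MIN(order_date)::DATE AS calendar_date
    FROM clean_sch.orders
    UNION ALL
    SELECT DATEADD(day, 1, calendar_date)
    FROM dates
    WHERE calendar_date < CURRENT_DATE()
)
SELECT calendar_date,
       YEAR(calendar_date),
       MONTH(calendar_date),
       DAYOFYEAR(calendar_date)
FROM dates;
```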
looks good. If I come to this layer, here is my date dimension with all 376 rows — quarter, month, week, day of the year, everything is available. Good; now we have all the necessary transaction tables up to the clean layer plus the date dimension. In the next part of the video we will create the order item fact, but before that let's refresh our ER diagram to see whether all these tables are available there or not. Now I can see the date dimension is also
available. Okay, so I'm going to create the fact table quickly, along with the relationships, and it will naturally turn into a star schema. This is my 12th worksheet, the order item fact. First I create an order item fact table with an auto-increment primary key — I can also name it SK for surrogate key; feel free to change it as per your requirements, so I will make it SK. Then I have the order item ID, which is the primary key coming from the source system, the order ID, and the customer dim key, customer address dim key, restaurant dim key, restaurant location dim key, menu dimension key, delivery agent dimension key and order date dimension key, followed by the measures I have added here. So let me create this fact
table. Now the fact table is created, and this is a merge statement which joins all the dimension tables based on the FKs available in each table; if a record has changed it gets updated, and if it is a new record it gets inserted. So let me quickly run
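The shape of this fact load — resolving each source FK to its dimension key — can be sketched roughly as below; all names are illustrative, and the LEFT JOINs are what let null keys through when a dimension lookup misses:

```sql
-- Illustrative fact population: resolve FKs to dimension hash keys
INSERT INTO consumption_sch.order_item_fact
SELECT oi.order_item_id,
       o.order_id,
       cd.customer_hk,
       rd.restaurant_hk,
       dd.date_hk,
       oi.quantity,
       oi.price,
       oi.subtotal
FROM clean_sch.order_item oi
JOIN clean_sch.orders o
  ON o.order_id = oi.order_id_fk
LEFT JOIN consumption_sch.customer_dim cd
  ON cd.customer_id = o.customer_id_fk AND cd.is_current = TRUE
LEFT JOIN consumption_sch.restaurant_dim rd
  ON rd.restaurant_id = o.restaurant_id_fk AND rd.is_current = TRUE
LEFT JOIN consumption_sch.date_dim dd
  ON dd.calendar_date = o.order_date::DATE;
```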
this. I can see a total of 63 rows got added; let's see how the fact table looks. If I refresh, this is my consumption schema — the fact table appears here, and this is its complete structure. The data preview shows plenty of data; there are some null values because there may be some issues or gaps in my small data set, but it is showing the quantity, subtotal and everything. If you have to avoid the null values, don't use a left join — keep an inner join and you will not have null values in your system. Once our fact and dimensional
tables are populated, I'm going to quickly run these ALTER statements, which create the relationships — let me run them. All of them got executed successfully; if I refresh the ER diagram, you can see it got refreshed and I now have a star schema with the order item fact table at the center, and if I click on any of these relationships I can see it is built on the hash
key. Now that my fact table is created, I can quickly create views. Let me create this view, which is primarily the annual revenue, and run a SELECT * from it to see how it looks: for 2024 my total revenue was 15,44 and for 2023 it is 9,880, with 5 and 10 orders respectively — this is looking good. Now I'm going to create the monthly revenue
view — this also got created. If I check what my monthly numbers look like, this is how it looks for the year 2024, all the months with sample data, looking good. Now I create the daily revenue view, and if I run it for 2024, month 7, this is how it looks
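Views like these are thin aggregations over the fact table and date dimension; a hedged sketch of the monthly one, with assumed names, might be:

```sql
-- Illustrative monthly revenue view over the star schema
CREATE OR REPLACE VIEW consumption_sch.vw_monthly_revenue AS
SELECT d.year,
       d.month,
       SUM(f.subtotal)            AS total_revenue,
       COUNT(DISTINCT f.order_id) AS total_orders
FROM consumption_sch.order_item_fact f
JOIN consumption_sch.date_dim d
  ON d.date_hk = f.order_date_dim_key
GROUP BY d.year, d.month;
```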
like. And I am using these views in my Streamlit dashboard — let's see how it looks. Here is the year selector; based on the dummy data we populated, this shows my total revenue, average revenue per order, maximum order value, total number of orders and average revenue per item
and if you see the total revenue for all years, the number of orders and the maximum order — these are the monthly trends; they are not very polished, but at least you can see the data clearly. If I select year 2023 the numbers change, and since I only have two years of data in this sample set it is not showing a comparative analysis — this is December and November. Okay, so this is how you can create an end-to-end data flow and do everything in a single platform. In the next step I will rerun the entire script with a large amount of data and come back to this dashboard to see how it looks. Let's quickly
discuss how the lineage looks once we create all the objects under the different schemas. This is my consumption schema, this is my view — the daily revenue KPI — and this daily revenue KPI is based on my fact table; it also depends on my date dimension, and the fact table depends on multiple dimension objects, which is what you see here: all these dimensions are upstream objects for my fact table. If you look at order item and delivery, they are under the clean schema, and the fact table is also populated from those objects; from here I can see my entire lineage up to the stage location. This lineage diagram helps me understand where my fact table is populated from and how I'm creating my views — a very useful feature which Snowflake has added as a preview as of today; in future it may add column lineage and many other enhancements to this preview version. We already loaded a very small set of data, and that data flow automatically built this lineage diagram.
Now, I have already loaded a large amount of data under a different database called production, following exactly the same process I have shown you. Let me show you the data size I loaded and processed: this is my stage location; my orders have quite a lot of data available, and if I go to order item you see a total of 41 MB of data. I ran exactly the same process we used for the small data set under this production database, and if I go to my consumption layer and click on the order item fact table, it has close to 6.2 million records; clicking on the tables shows the number of rows as well as the data size. So we have a total of 6.2 million records, and that data set runs from 2019 to 2024 — about six years of data. Now let's see how our dashboard looks: this is my project, this is my Streamlit app, and I am clicking on this Swiggy revenue
dashboard; right now it is getting loaded, and here you see I have six years of data available. If I click on 2019 the data changes: this is my total revenue, average revenue per order, maximum order value, total orders and average revenue per item. This is my monthly revenue trend for 2019, and if I hover over a bar it shows the exact monthly revenue. Looking at the top 10 restaurants for December 2019, this is how the data looks, and I can check — the data changes for a
check so data gets changed for a
different month likewise if I go and
choose like 2021 let's see what happens
now here it shows how much revenue has
grown compared to the last year so you
can really use this streamlet widget and
this is the total revenue and again this
is all dummy data which I have created
to demonstrate this example now here
this is 2021 monthly revenue and this is
the train right
I say February let's see what happens so
this got changed and here I can also
maximize it and minimize it whatever I
want to
do now let's see and choose 23 so the
venue has jumped a lot and this is how
this overall dashboard looks like and
you can create a different kind of
dashboard once you have this fact and
dimension table you can always take one
of the dimension and around the
dimension you can ask different kind of
questions and help your business users
to get the answer through this data
platform once everything is done now it
is time to think how to automate the
entire flow so it runs without any human
intervention and as soon as the CSV
files are dropped into the stage
location this automated job runs and
start processing the CC file one by one
there are multiple approaches and we
will not practice them here in this
video else this tutorial will be too
long we will only discuss the commonly
practiced approach that many data
project teams follows let's start with
first approach in this approach each
data set — customer, menu, delivery and so on — is wrapped inside a stored procedure using Snowflake SQL scripting; if we have around 9 or 10 entities, each stored procedure runs the copy command and then the merge statements to push the data from stage to the consumption layer. These individual stored procedures are then called by a parent stored procedure, and the parent stored procedure is invoked by a task that runs once a day or multiple times a day based on the data arrival frequency.
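That orchestration can be sketched as below; the procedure names are hypothetical placeholders for the per-layer wrappers described:

```sql
-- Illustrative parent procedure plus a daily task to invoke it
CREATE OR REPLACE PROCEDURE run_swiggy_pipeline()
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
    CALL run_copy_commands();          -- hypothetical: all COPY INTO statements
    CALL merge_stage_to_clean();       -- hypothetical: stage -> clean merges
    CALL merge_clean_to_consumption(); -- hypothetical: clean -> dim/fact merges
    RETURN 'pipeline complete';
END;
$$;

CREATE OR REPLACE TASK swiggy_daily_task
    WAREHOUSE = etl_wh                      -- assumed warehouse name
    SCHEDULE  = 'USING CRON 0 2 * * * UTC'  -- once a day at 02:00 UTC
AS
    CALL run_swiggy_pipeline();

ALTER TASK swiggy_daily_task RESUME;  -- tasks are created suspended
```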
Alternatively, the same approach can be implemented slightly differently: all the copy commands wrapped in one stored procedure, all the stage-to-clean merge SQL statements in another, and all the clean-to-consumption merge statements in a third, with all of them called one after another from a parent stored procedure, which is again invoked by a task on a certain schedule. There are other approaches too, like creating a DAG using a task tree where each task is stream-aware, so it triggers only if there is data available in the stream object.
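A stream-aware task uses SYSTEM$STREAM_HAS_DATA in its WHEN clause so it only spends compute when there is something to process; a sketch with hypothetical names:

```sql
-- Illustrative stream-aware task: skipped when the stream is empty
CREATE OR REPLACE TASK load_customer_clean_task
    WAREHOUSE = etl_wh                -- assumed warehouse name
    SCHEDULE  = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('STAGE_SCH.CUSTOMER_STM')
AS
    CALL merge_customer_stage_to_clean();  -- hypothetical wrapper procedure
```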
Alternatively, you can also solve this problem using dynamic tables. This video would become very long if I started explaining each of these approaches, so if you are really interested in going beyond this video, I suggest you visit my Udemy page, where all these scenarios are discussed and explained in detail with both small and large data sets; you can download the content and practice to enhance your Snowflake data engineering skills. Thank you so much for watching — I hope you found this video informative. If you learned something new, please give it a thumbs up and share your thoughts in the comment section below; I would love to hear from you. Don't forget to subscribe for more content like this. Happy learning and keep growing!