Snowflake Vs. AWS RedShift Vs. GCP BigQuery Vs. Azure Synapse for Data Warehousing!

15m 9s · 2,740 words · 418 segments · English

FULL TRANSCRIPT

0:00

hey y'all data guy here so I've seen

0:03

some comments asking for some more

0:05

comparison videos for people that might

0:07

be just starting out in kind of the dataverse,

0:09

and I thought a really pertinent

0:11

video to make would be comparing and

0:14

contrasting each of the major

0:16

cloud providers' data warehouse options

0:18

that they're pushing versus snowflake

0:20

which is kind of the cloud-agnostic one, but

0:23

the only one that's really of equal

0:24

popularity, if not greater popularity,

0:27

than the, you know, cloud-specific

0:28

providers. Because if you have a cloud provider, it's

0:31

really easy to just pull that off the

0:32

rack so you kind of have a captive

0:34

audience: Amazon has all the AWS

0:37

customers, and so on and so forth,

0:39

versus Snowflake, where you have to go

0:40

out and buy it. But Snowflake is so

0:43

great that a lot of people do go out and

0:44

buy it so what I'm going to go through

0:46

is just kind of talk at a high level about

0:48

the architecture, performance, scalability,

0:52

and then the ecosystem

0:53

integration of each of these different

0:55

tools, each of these different types of data

0:57

warehouses, to just give you a framework

1:00

and kind of an understanding of, hey,

1:01

which one is best for me um while they

1:04

are all you know data warehouses they do

1:06

all have kind of their own Specialties

1:08

and their own unique quirks and

1:10

strengths um that make them better

1:12

suited to a particular use case so

1:14

that's what we're going to explore today

1:16

and without further Ado let's get into

1:18

it so now the first database we're going

1:20

to talk about is the star of the show in

1:22

my opinion, which is Snowflake. And

1:25

Snowflake is a fully managed cloud-native

1:27

data warehouse that's designed

1:29

with a unique architecture that

1:30

actually separates compute and storage.

1:32

You can see that illustrated here, where

1:34

you have your compute available under

1:36

Snowpark Container Services directly

1:38

next to your Snowflake data warehouse,

1:40

and this separation allows users to

1:43

scale each of those components

1:44

individually which provides greater

1:46

flexibility for workloads that have

1:48

varying performance and storage needs

1:50

Now, Snowflake's architecture relies on

1:52

virtual warehouses for processing and

1:55

then a Central Storage layer that is

1:57

accessible to all compute nodes you

1:58

essentially have one large storage layer

2:00

which is the snowflake optimized compute

2:02

and then you have the snowflake native

2:04

apps which are the uh smaller data

2:06

warehouses the virtual data warehouses

2:08

that while they can access all of that

2:10

data they are typically only containing

2:13

a certain small subset of it. And this

2:15

multi-cluster shared data design ensures

2:18

high concurrency without compromising

2:20

performance because each of those data

2:21

warehouses can run their own queries

2:23

independently so this makes it a really

2:25

great choice for super large-scale

2:27

analytics. Another reason why Snowflake

2:30

excels in scalability is because, again,

2:32

of that decoupled storage and compute

2:34

architecture. So a user can spin up

2:36

multiple virtual warehouses to handle

2:38

concurrent workloads without affecting

2:40

one another and this is really

2:42

beneficial for organizations that are

2:43

running complex queries and need

2:45

consistent performance during peak times,

2:47

when you might have a really large

2:49

number of users executing those queries

2:51

concurrently. Snowflake's automatic

2:54

scaling ensures that optimal resource

2:56

utilization is attained while they're

2:58

minimizing the downtime there um so you

3:00

really make sure that hey you know I

3:02

don't need to single thread and

3:04

bottleneck everything through one data

3:06

warehouse or one query engine I can

3:07

actually have many different query

3:08

engines handling simpler smaller jobs at

3:11

the same time for a lower overall

3:13

processing time of those individual jobs
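
[Editor's sketch] The point above, that several independent query engines finish a batch of small jobs sooner than one shared engine, can be illustrated with a toy scheduling model. This is only a sketch under stated assumptions: the job runtimes are made up, `completion_times` is a hypothetical helper, and nothing here calls Snowflake or models how it actually schedules work.

```python
import heapq


def completion_times(job_minutes, engines):
    """Finish time of each job when jobs are taken in order by the first free engine."""
    free_at = [0.0] * engines  # min-heap: the time at which each engine becomes free
    heapq.heapify(free_at)
    finished = []
    for minutes in job_minutes:
        start = heapq.heappop(free_at)   # earliest available engine
        end = start + minutes
        finished.append(end)
        heapq.heappush(free_at, end)     # that engine is busy until `end`
    return finished


def average(values):
    return sum(values) / len(values)


jobs = [4, 2, 6, 2, 4, 6]  # hypothetical query runtimes in minutes

single = completion_times(jobs, engines=1)  # everything queues behind one engine
multi = completion_times(jobs, engines=3)   # three independent engines

print("one engine, average finish time:", average(single))
print("three engines, average finish time:", average(multi))
```

The total work is the same in both runs; only the waiting changes, which is the effect described above for concurrent workloads.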

3:16

now in terms of pricing and how you're

3:18

going to be paying for all this

3:19

Snowflake uses a consumption-based

3:21

pricing model where you'll pay

3:23

separately for storage and compute and

3:26

so compute costs are based on the amount

3:28

of time that your virtual warehouses are

3:30

actually running, and those virtual

3:31

warehouses are how you're running those

3:32

queries, versus the storage cost

3:35

for just, you know, that base layer

3:36

storage, which is just going to depend on

3:38

the amount of data stored for x amount

3:40

of time um and this model is flexible

3:43

and semi-predictable. A big issue for

3:46

a lot of organizations is that, because

3:48

Snowflake offers discounts for reserved

3:50

capacity, people will either overestimate

3:53

their capacity and just have a bunch

3:55

of excess Snowflake credits or, vice

3:57

versa, they won't buy enough and they'll

4:00

end up paying for a ton of Snowflake credits

4:02

when things just kind of run, if they don't

4:03

Implement really good cost control

4:04

measures. So just something to always

4:06

keep in mind. Now, Snowflake is also a

4:10

cloud-agnostic platform, so if you're on

4:12

AWS, Google Cloud, or Azure, it doesn't matter: it

4:15

integrates with all three of the major

4:17

Cloud ecosystems um and also integrates

4:19

really well with a bunch of different

4:21

tools for data ingestion, transformation,

4:23

and analytics: Airflow, Spark, Tableau,

4:26

and Power BI, you know, really any kind

4:29

of analytics tool. And it's also

4:32

got a really strong data marketplace

4:34

ecosystem and that's actually what

4:35

you're seeing here with the snowflake

4:36

Marketplace so there's actually a ton of

4:39

different uh organizations that have

4:40

their own kind of defined connectors or

4:42

data sharing tools that you can actually

4:44

leverage through the snowflake

4:45

Marketplace which is nice if you don't

4:46

want to have to build everything

4:47

yourself um and then also just

4:49

underpinning all of this is you know

4:51

enterprise-grade security features, so

4:54

end-to-end encryption, role-based access

4:56

control, and HIPAA, GDPR, and SOC compliance.

5:00

And also, because it's multicloud, you

5:02

can meet those regional data residency

5:04

requirements. So that is Snowflake in

5:07

a nutshell. Now let's move on to AWS

5:10

Redshift. Now here to help us with this

5:12

Redshift expertise is the Data Dog. Say

5:15

hi, everyone. And so now, moving into

5:18

Redshift: AWS Redshift kind of follows

5:21

a more traditional data warehouse

5:23

architecture where you have really

5:25

tightly coupled compute and storage but

5:28

you also have new features like Redshift

5:30

Spectrum that allow querying

5:33

external data in kind of a, I would

5:35

say, data-lake-style format, as right here

5:37

you can see, for, you know, querying

5:41

less schema-dependent data, so more

5:43

object-storage-type data. And Redshift

5:46

also, as you can see here, uses a

5:48

cluster-based approach where users

5:50

define node types and node sizes to

5:52

manage storage and performance. And

5:55

while this makes Redshift predictable

5:57

in terms of resource allocation it also

5:59

leads to challenges in scaling because

6:01

compute and storage need to be scaled

6:03

together unlike in Snowflake so you

6:05

can't just scale up additional compute

6:07

to handle more complex queries unless

6:09

you also scale up

6:11

storage. Now, you do have some recent

6:13

improvements such as RA3 nodes that

6:15

enable separate scaling of compute and

6:17

storage, which has helped to narrow the

6:19

gap with Snowflake, but it's generally not as

6:22

easy of a process and user experience as

6:25

within Snowflake. And then Redshift

6:28

also supports scalability through

6:29

cluster resizing, as I said, but requires

6:32

manual intervention. So unlike

6:35

Snowflake, which will pretty much just

6:36

auto-scale for you, Redshift Spectrum,

6:40

or the compute resource within Redshift,

6:42

needs to be explicitly defined. And so

6:45

things like, hey, if I need to use Redshift

6:47

Spectrum to query external S3 data

6:50

without moving it into the cluster, I will

6:52

still need to scale compute resources in

6:55

the Redshift cluster, and that's going

6:57

to involve downtime just to be able

6:59

to access that S3 data.

7:02

And again, you have the introduction of

7:04

RA3 nodes, which has allowed you to

7:06

decouple the two to some extent,

7:09

but still not a perfect solution and

7:12

hopefully they kind of go more towards

7:13

the path of actually decoupling storage

7:15

and compute. Now, Redshift also, just

7:18

in terms of how they work pricing-wise, they

7:20

offer both on-demand and reserved

7:22

instance types. So on-demand

7:25

pricing charges for the active

7:27

cluster, which includes compute and storage,

7:30

and then reserved instances actually let

7:33

you, you know, basically say, hey, I'm going to

7:35

reserve an instance for use at this time, and

7:37

that can provide significant

7:38

cost savings if you have really

7:40

predictable workloads. And then also,

7:43

if you want to use that Redshift

7:45

Spectrum tool I mentioned, it does incur

7:48

extra cost for querying that external S3

7:50

data um

7:52

So, you know, it's not great to

7:55

pay extra just to access S3, but it does

7:56

allow you to avoid upfront storage

7:59

expenses for less frequently accessed data,

8:01

so you can kind of use S3 as your

8:02

archival storage and only query it as

8:05

needed. And as an AWS product, as part

8:08

of the AWS ecosystem, Redshift

8:10

integrates really well natively with, you

8:12

know, all the other Amazon services, so S3,

8:15

Glue, Athena. And if you're using a

8:17

full AWS stack it's a really you know

8:21

easy tool to integrate into ETL

8:22

workflows pretty attractive option for

8:25

organizations already using AWS, but it's

8:28

not amazingly compatible with non-AWS

8:31

tools, so definitely keep that in mind if

8:33

you have a lot of non-AWS tools in your

8:34

stack. And then, similarly to Snowflake,

8:38

but honestly even more so because AWS

8:40

has a bunch of government contracts, it

8:43

provides a ton of security features:

8:44

encryption at rest and in transit, VPC isolation,

8:47

IAM-based access control. It's got HIPAA,

8:50

GDPR, PCI, SOC 2. So if you have an

8:55

organization that has really stringent

8:56

regulatory requirements and you're using AWS, it's

8:58

probably a good choice for you. Now, next

9:01

on the docket we have Google BigQuery.

9:04

And so Google BigQuery, again, is

9:07

different from, you know, Redshift

9:09

and Snowflake in that it is a

9:10

serverless fully managed data warehouse

9:13

that is obviously really deeply

9:14

integrated with Google Cloud um and uses

9:16

a distributed architecture um which

9:18

abstracts compute and storage entirely

9:20

from the user which allows you to just

9:23

focus on running queries without

9:24

worrying about infrastructure management

9:27

And so it's basically just pay-per-query

9:29

pricing, so you only pay for the

9:30

queries you run. And BigQuery is

9:33

really optimized for analytics workloads,

9:35

hence the pay-per-query model, because

9:37

that's what analytics workloads are

9:39

centered around. And it also uses columnar

9:41

storage, which makes it really

9:42

efficient for those kind of large scale

9:44

queries that are needed for analytics

9:47

workloads. And then also, you know, as is

9:50

kind of intended with pay-per-query, it's

9:52

really seamlessly autoscaling, so it

9:54

makes it really ideal for dynamic or

9:55

unpredictable workloads because it'll

9:57

just automatically adjust to meet the

9:58

needs of that workload. And really,

10:01

the biggest benefit of BigQuery's serverless

10:04

nature is it completely eliminates the

10:06

need for resource provisioning or

10:08

cluster management and it'll just

10:10

automatically scale compute resources

10:12

based on query demands um which helps

10:14

Ensure High Performance for even the

10:16

largest workloads and this Dynamic

10:18

scalability is also really ideal for

10:20

organizations that can experience

10:21

unpredictable or spiky query

10:24

patterns. Google's, and, you know, BigQuery's

10:27

distributed infrastructure,

10:28

because they have so much infrastructure,

10:30

can help ensure minimal query latency

10:32

even for really, really large data sets.

10:35

And BigQuery's pricing is also

10:37

relatively straightforward it's just

10:39

purely based on storage and query

10:40

execution. Storage costs are

10:42

charged per gigabyte of data stored;

10:44

queries are based on the amount of data

10:46

processed. So it's very much just a pay-as-you-go

10:48

model, so pretty ideal for

10:51

organizations that have unpredictable

10:53

workloads or ones that don't really know,

10:54

hey, how much this workload might

10:56

actually need, and you want to avoid

10:58

over- or under-provisioning.
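
[Editor's sketch] The pay-as-you-go model just described (a storage charge by volume stored plus a query charge by volume processed) comes down to simple arithmetic. The sketch below is illustrative only: the rates are placeholder numbers, not Google's published prices, the per-terabyte query unit is an assumption, and `PayAsYouGoRates` and `monthly_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PayAsYouGoRates:
    storage_per_gb_month: float    # placeholder rate, not a real price
    query_per_tb_processed: float  # placeholder rate, not a real price


def monthly_cost(rates, stored_gb, processed_tb):
    """Storage charge plus query charge for one month."""
    storage = stored_gb * rates.storage_per_gb_month
    queries = processed_tb * rates.query_per_tb_processed
    return storage + queries


rates = PayAsYouGoRates(storage_per_gb_month=0.02, query_per_tb_processed=5.0)

# Same stored data, very different query volume from month to month.
quiet_month = monthly_cost(rates, stored_gb=1000, processed_tb=2)
busy_month = monthly_cost(rates, stored_gb=1000, processed_tb=40)

print("quiet month:", quiet_month)
print("busy month:", busy_month)
```

With the storage held constant, the bill tracks how much data the queries scan, which is why this model suits spiky workloads and why unbounded queries can get expensive.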

11:00

um and then Google also offers flat rate

11:03

pricing for organizations that do have

11:04

more consistent query needs so you can

11:06

get some discounts there so it really is

11:08

still versatile for a wide range of use

11:11

cases. And similarly to AWS, Google

11:15

BigQuery is really tightly integrated

11:17

with Google Cloud services like Dataflow,

11:18

Dataproc, and Looker. It supports federated

11:21

queries, which actually allow you to

11:23

analyze data stored in cloud storage

11:25

Google Sheets or even external databases

11:27

which is really cool. And then also,

11:29

BigQuery has a really tight integration

11:31

with machine learning tools

11:32

like TensorFlow, so it's a good choice for

11:34

AI and ML workloads as well. And then

11:37

additionally, similarly to AWS, BigQuery

11:40

emphasizes security pretty heavily, so

11:43

you've got things like customer-managed

11:44

encryption keys, fine-grained IAM permissions,

11:46

and support for GDPR and CCPA compliance.

11:50

and then obviously Google's Global

11:51

infrastructure pretty much ensures High

11:53

availability and also data residency

11:55

compliance no matter what region you

11:57

deploy your data into. So now the

11:59

final stop on this tour is Azure Synapse

12:03

Analytics, kind of the newer kid on the

12:05

block and, honestly, kind of

12:07

almost not even a total data

12:09

warehouse um because it's more of a

12:11

combined big data and data warehouse

12:14

platform where there's a bunch of

12:15

different kind of subtools that make up

12:17

Azure Synapse Analytics that are now

12:18

bundled into Synapse by Azure. So it

12:22

has two different

12:24

modes: provisioned, which is like

12:25

dedicated, and serverless. Provisioned mode

12:28

is similar to, you know, Redshift: you

12:31

have a traditional cluster-based

12:32

architecture where you allocate resources up

12:34

front, versus serverless mode, which is

12:36

more similar to BigQuery, where it's

12:38

on-demand query execution: pay by query, pay

12:41

by gigabyte stored. And so you have

12:44

the optionality based on you know what

12:45

you need so if you're more dynamic or

12:47

more predictable you can choose the type

12:49

of mode you want to run in um which is

12:52

you know really nice to have and so

12:54

Azure Synapse has some good scalability

12:57

options through that dual

12:59

architecture, where the provisioned mode

13:01

gives you that upfront resource

13:02

allocation. Scaling might lead to some

13:04

downtime, but you can get some, you know,

13:06

credits and some discounts

13:07

for provisioning and buying everything

13:09

up front, versus the serverless mode, which will

13:11

automatically scale your compute

13:13

resources based on the complexity of

13:14

queries. And then for large-scale

13:17

operations, Synapse can actually

13:19

parallelize query execution across

13:21

distributed compute nodes um which makes

13:23

it really well suited for big data

13:25

analytics. And similarly, Synapse has

13:28

two pricing models.

13:30

You have the provisioned model, which is

13:31

going to charge based on the allocated

13:33

resources regardless of utilization, so

13:34

even if it sits idle, you pay for it, versus

13:37

the serverless model, which is going to charge

13:39

per terabyte of data processed. And so

13:42

while this provides flexibility it also

13:45

can lead to much higher costs for

13:46

workloads that require really frequent

13:48

resource adjustments or if you don't

13:49

have a defined strategy. So with great

13:53

power comes great responsibility, so

13:55

keep that in mind here. And then,

13:57

similarly to AWS and BigQuery, Synapse is really

14:00

deeply embedded in the Azure ecosystem

14:02

so if you're using tools like Data Factory,

14:04

Power BI, or Azure Data Lake, it's got

14:06

really nice built-in support for

14:08

Spark pools and Synapse pipelines for just

14:11

a, you know, all-in-one unified experience

14:13

across data engineering, data science,

14:15

analytics um but obviously it's going to

14:17

struggle going outside of the Azure

14:19

ecosystem so it's a pretty good tool if

14:21

you're already invested in Azure

14:23

Services otherwise probably not worth

14:26

getting into Azure just for it. And then,

14:28

similarly,

14:30

you know, Azure is a government

14:32

contractor now, so Synapse supports all the

14:36

advanced security features. You know,

14:38

so you have HIPAA, GDPR, and ISO

14:40

certifications, and you also have

14:42

column-level security and dynamic data

14:45

masking, and you can integrate with Azure

14:47

Active Directory for federation. So

14:51

those are the big four data warehouses

14:54

out there in the market just kind of a

14:55

quick and dirty guide to each of them. I

14:58

hope you've enjoyed this video. I

14:59

hope it's helped you make an informed

15:01

decision on which one is best for you,

15:02

and I hope you have a great rest of your

15:04

day. Data Guy out, Data Dog out too.
