Data & Analytics
info
These are the topics I created flashcards for (in Remnote) and revised using spaced repetition. The formatting is an export from Remnote.
- Amazon Athena
- use Amazon Athena when you're asked to analyse something using "\{\{serverless\}\} SQL"
- features
- \{\{columnar\}\} data for cost savings
- \{\{compresses\}\} data for smaller retrieval
- use \{\{larger\}\} files to minimise overhead
- Federated Query
- you can run \{\{SQL\}\} queries across data in \{\{relational\}\} & \{\{non-relational\}\} DBs, on-prem and cloud.
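The "columnar data for cost savings" point above can be sketched with some rough arithmetic. Athena bills per amount of data scanned, so a columnar format that reads only the needed columns scans less. The price of 5.0 per TB, the equal-width columns and the function name below are assumptions for illustration, not figures from these notes.

```python
def estimate_scan_cost(total_tb: float, columns_total: int, columns_read: int,
                       columnar: bool, price_per_tb: float = 5.0) -> float:
    """Rough cost of one query, assuming all columns are the same size.

    price_per_tb is an assumed figure; check current pricing for your region.
    """
    if columnar:
        # Columnar formats (e.g. Parquet, ORC) let the engine read only the
        # columns the query references.
        scanned_tb = total_tb * columns_read / columns_total
    else:
        # Row-oriented formats (e.g. CSV) are scanned in full.
        scanned_tb = total_tb
    return scanned_tb * price_per_tb


# Hypothetical 1 TB dataset with 10 columns, query touches 2 of them.
csv_cost = estimate_scan_cost(1.0, 10, 2, columnar=False)
parquet_cost = estimate_scan_cost(1.0, 10, 2, columnar=True)
print(f"row format: {csv_cost:.2f}, columnar: {parquet_cost:.2f}")
```

Under these assumptions the columnar query costs roughly a fifth of the row-format one; compression would shrink the scanned volume further.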
- Amazon Redshift
- based on PostgreSQL
- used for Data Warehousing (DWH)
- is \{\{10\}\}x faster than other DWHs.
- why is Redshift faster than Athena in terms of queries, joins, aggregations?―Redshift uses indexes (in practice, sort keys over columnar storage).
- What kind of queries is Athena better for than Redshift?―Ad-hoc queries
- Redshift Cluster
- What is a Redshift Cluster made up of (components)? ↓
- Leader Node
- Compute Nodes
- What interface connector is used to talk to a Redshift Cluster?―JDBC or ODBC
- Is Redshift serverless?―Not for provisioned clusters: you need to provision the nodes. (A separate Redshift Serverless option also exists.)
- Disaster Recovery (DR)
- Does Redshift support Multi-AZ?―Only for some cluster types; otherwise a cluster runs in a single AZ.
- how would you do Redshift DR?―configure automatic copy of cluster snapshots to another AWS Region.
- Data Loading
- what are the three options for loading data into Redshift (hint: near, copy, vm)?― ↓
- Kinesis Data Firehose (KDF)
- S3 Copy
- via "enhanced routing" through the \{\{VPC\}\}
- without "enhanced routing" over the \{\{internet\}\}
- EC2 Instance (JDBC driver)
- which is better? Large inserts or small inserts?―Large.
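As a rough illustration of the "S3 Copy" loading option above, here is a minimal sketch that builds a Redshift `COPY` statement as a string. The table name, bucket path and IAM role ARN are hypothetical placeholders, and the helper `build_copy_statement` is not part of any AWS library; the statement would still need to be run through a SQL client (e.g. over JDBC/ODBC).

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str,
                         data_format: str = "PARQUET") -> str:
    """Return a COPY statement that bulk-loads data from S3 into a table.

    All argument values are supplied by the caller; nothing here talks to AWS.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS {data_format};"
    )


# Hypothetical names, for illustration only.
statement = build_copy_statement(
    table="sales",
    s3_uri="s3://example-bucket/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(statement)
```

A single bulk `COPY` like this fits the "large inserts are better than small inserts" note above.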
- Redshift Spectrum
- this is useful for when you want to \{\{query\}\} data in S3 without \{\{loading\}\} it into your Redshift Cluster
- Amazon OpenSearch
- aka \{\{ElasticSearch\}\}
- ... you can search any \{\{field\}\}, even \{\{partial\}\} matches.
- is OpenSearch serverless?―No. It requires a cluster of instances.
- does OpenSearch support SQL?―Not natively. It has its own query language (SQL can be enabled via a plugin).
- what are the three main INGESTION sources for OpenSearch (hint: near-realtime, I C)?― ↓
- Kinesis Data Firehose (KDF)
- AWS IoT (via CRUD)
- CloudWatch logs
- Security Stack for OpenSearch i.e. what security services? (hint: think mobile, rest, flight)― ↓
- Cognito & IAM (AuthN, AuthZ)
- KMS encryption
- TLS
- visualisation option for OpenSearch?―OpenSearch Dashboards.
- OpenSearch Patterns
- DynamoDB
- CloudWatch
- Kinesis
- Kinesis Data Firehose
- Kinesis Data Streams
- Amazon EMR
- helps create \{\{Hadoop\}\} clusters (Big Data).
- think of EMR scale in terms of \{\{hundreds\}\} of EC2 instances.
- EMR comes bundled with (RDBMS)― ↓
- Apache Spark
- HBase
- Presto
- Apache Flink
- Node Types
- EMR structure consists of (M C T)― ↓
- Master Node i.e. manage cluster
- Core Node i.e. run task, store data (long running)
- Task Node (optional) i.e. run tasks (short-lived, usually spot)
- pricing: think on-demand, RIs, spot.
- Amazon QuickSight
- Serverless \{\{machine\}\} learning-powered \{\{business\}\} intelligence service to create \{\{interactive\}\} dashboards.
- uses what engine for in-memory compute?―SPICE
- in enterprise edition, what kind of security is available?―Column-Level Security (CLS)
- Dashboards
- can be shared with \{\{Users\}\} or \{\{Groups\}\} (note: these are not \{\{**IAM**\}\} users)
- AWS Glue
- managed \{\{extract\}\}, \{\{transform\}\}, \{\{load\}\} service
- is Glue fully serverless?―Yes.
- can Glue convert data to Parquet format?―Yes.
- Glue functions, what do these do?
- Job Bookmarks?―prevent re-processing old data
- Elastic Views?―like table views
- DataBrew?―clean & normalise data
- Streaming ETL?―continuous ETL streaming
- AWS Lake Formation
- A Data lake is a \{\{central\}\} data store for the purposes of \{\{analytics\}\}.
- TWO key features of AWS Lake Formation (hint: quickly start with..., security)?― ↓
- (Data) Source Blueprints e.g. S3, RDS, Relational & NoSQL DBs
- Fine-grained Access Controls for your apps at row and column-level.
- Centralised Permissions
- Kinesis Data Analytics for SQL Apps
- KDA for SQL ingests what Kinesis sources? ↓
- Kinesis Data Streams
- Kinesis Data Firehose
- KDA sends to what downstream sinks (hint: K K)?― ↓
- Kinesis Data Streams ⇒ Lambda ⇒ anywhere
- Kinesis Data Firehose ⇒ S3 or Redshift (COPY through S3)
- KDA for Apache Flink
- what streaming services can 'Kinesis Data Analytics For Apache Flink' ingest from? ↓
- Kinesis Data Streams
- Amazon MSK
- Amazon Managed Streaming for Apache Kafka (MSK)
- TWO options for running Apache Kafka on AWS ↓
- Fully Managed (MSK)
- Data is stored for \{\{as long as you want\}\}.
- Serverless (MSK)
- List MSK downstream consumers? (hint: think analytics + K G L A)― ↓
- KDA for Apache Flink
- AWS Glue with 'Streaming ETL Jobs'
- Lambda
- Apps running on ‒ EC2, ECS, EKS
- KDS vs MSK
- KDS 1 MB message size limit vs MSK 1 MB default (configurable higher)
- KDS Streams with Shards vs MSK Topics with Partitions
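The message size difference above can be turned into a simple check. This is a minimal sketch assuming the 1 MB KDS limit from these notes is counted as 1024 * 1024 bytes; the constant and function names are made up for illustration, and the current service quotas should be checked before relying on the number.

```python
# Assumed limit, taken from the "KDS 1 MB" note above (verify against current quotas).
KDS_MAX_RECORD_BYTES = 1024 * 1024


def candidate_services(payload: bytes) -> list[str]:
    """List which streaming services could carry a single message of this size."""
    if len(payload) <= KDS_MAX_RECORD_BYTES:
        return ["KDS", "MSK"]
    # Larger messages only fit MSK, and only after raising its default size limit.
    return ["MSK"]


small = candidate_services(b"x" * 1024)               # 1 KB payload
large = candidate_services(b"x" * (2 * 1024 * 1024))  # 2 MB payload
print(small, large)
```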
- Big Data Ingestion Pipeline
- scenario IoT incoming data...
- Requirements
- serverless
- data in real time
- transform data
- use SQL
- reports saved to S3
- load data to DWH, create dashboards