Continuing with my 30 Days of DevOps experience at NTT Data Impact I came up with a couple of projects where my client was asking to move traditional SQL Server Databases to the cloud, and given that the majority of our services were living on few AWS accounts, I made a spike to show them what will be more feasible for them.

Choosing the right database is tough, as we have a lot of databases to choose from, specially when you move to the cloud. Amazon Web Services offers an amazing variety of database services, and the list and the features of each is so extent that we can get lost in information.

Before we start exploring what database technology we might use, we should ask ourselves the next questions:

How is our workload? Is it balanced in terms of reads and writes or is more Read-heavy or Write-heavy focused.
What’s the throughput we need? Will this throughput change during the day or the solution lifecycle?
Will we need to scale?
How much data are you storing and for how long?
Will the data grow? How big the data object will be? How the data will be accessed?
What’s the retention policy for your data? How often do you want to back the data up?
How many users will access the data? What is the response time we are expecting?
What’s the data model and how will it be queried? Will be structured? Does it has a schema?
What’s the main purpose of the DB? Is it going to be used for searches? Reporting? Analytics?

And of course, on of the most important ones, is there any license cost associated?

Once we have these questions answered, we can start exploring what database type we need.

As I said before, AWS has a big variety of Database Types:

Relational Database Service (based on relational models) makes itself available through RDS. Within RDS you can find:
- Amazon Aurora
- Managed MySQL
- MariaDB
- PostgreSQL
- Oracle
- Microsoft SQL Server
NoSQL Database (Key-value):
- DynamoDB and DynamoDB Accelerator
- ElastiCache: Redis / Memcached
- Neptune
Document:
- Amazong DocumentDB
Object Storage:
- S3 (for big objects)
- Glacier (for backups / archives)
Data Warehouse:
- Redshift (OLAP)
- Athena
Search:
- ElasticSearch (fast unstructured data searches)
Graphs:
- Neptune (represents data relationships)

RDS

At the contrary of what many people think, moving to RDS doesn’t mean the whole DB platform is fully managed by AWS, we have some flexibility on how we setup our services, for example with the deployment we do provision the EC2 instance sizes and the EBS volume type and size. rds

The advantages of using RDS over deploying the DB in our own EC2 are:

OS patching level by AWS.
Continuous backups and restore to specific timestamp (Point in Time Restore).
Monitoring dashboards
Read replicas for improved read performance
Multi AZ setup for DR
Maintenance windows for upgrades
Scaling capability (vertical and horizontal)

Let’s take as an example SQL Server.

We can select, the license model, the db engine version, the ec2 instance size, if we want Multi-AZ deployment with mirroring (Always On), the storage type, size and IOPS, network options (VPC, subnet, public or private access, availability zones, security groups) and it integrates with Windows Authentication through Active Directory.

As in Microsoft Azure, it comes with encryption, backup, monitoring, performance insights and automatic upgrades. Who said we have to go to Azure to use SQL Server?

Cost: You pay for your underlying EC2 instances and EBS volumes.

Amazon Aurora

Amazon Aurora is a MySQL- and PostgreSQL-compatible enterprise-class database. As main characteristics:

Up to 5 times the throughput of MySQL and 3 times the throughput of PostgreSQL
From 10 GB and up to 64TiB of auto-scaling SSD storage
Data is hosted in 6 replicas across 3 Availability Zones
Up to 15 Read Replicas with sub-10ms replica lag
Auto healing capability: automatic monitoring and failover in less than 30 seconds
Multi AZ, Auto Scaling Read Replicas. Also Replicas can be Global
Aurora database can be Global (good for DR)
Aurora Serverless option: here

Use case: as every other RDS service but with more performance and less maintenance.
Operations: less operations than other RDS services
Security: we take care of KMS, SGs, IAM policies, SSL enablement and authorise users.
Reliability: possibly the most reliable of all RDS, with also Serverless as an option.
Performance: 5x performance than other RDS services (also more expensive than others, except for the Enterprise grade editions like Oracle one)

MySQL

MySQL is the most popular open source database in the world. MySQL on RDS offers the rich features of the MySQL community edition with the flexibility to easily scale compute resources or storage capacity for your database.

Supports database size up to 64 TiB.
Supports General Purpose, Memory Optimized, and Burstable Performance instance classes.
Supports automated backup and point-in-time recovery.
Supports up to 5 Read Replicas per instance, within a single Region or cross-region.

MariaDB

MariaDB Community Edition is a MySQL-compatible database with strong support from the open source community, and extra features and performance optimizations.

Supports database size up to 64 TiB.
Supports General Purpose, Memory Optimized, and Burstable Performance instance classes.
Supports automated backup and point-in-time recovery.
Supports up to 5 Read Replicas per instance, within a single Region or cross-region.
Supports global transaction ID (GTID) and thread pooling.
Developed and supported by the MariaDB open source community.

PostgreSQL

PostgreSQL is a powerful, open-source object-relational database system with a strong reputation of reliability, stability, and correctness.

High reliability and stability in a variety of workloads.
Advanced features to perform in high-volume environments.
Vibrant open-source community that releases new features multiple times per year.
Supports multiple extensions that add even more functionality to the database.
Supports up to 5 Read Replicas per instance, within a single Region or cross-region.
The most Oracle-compatible open-source database.

Oracle

You do not need to purchase Oracle licenses as this has been licensed by AWS. “License Included” pricing starts at $0.04 per hour, inclusive of software, underlying hardware resources, and Amazon RDS management capabilities.

Available on different editions:

Oracle Enterprise Edition

Oracle Standard Edition: up to 32 vCPUs

Oracle Standard Edition One: up to 16 vCPUs

Oracle Standard Edition Two: up to 16 vCPUs (replacement for Standard editions)

Microsoft SQL Server

As in Oracle Databases, it supports the “License included” licensing model.

Available on the next editions:

SQL Server Express: up to 10 GiB. No licenses

SQL Server Web Edition: Only used for supporting public websites or web applications.

SQL Server Standard Edition: Supporting up to 16 GiB for data processing.

SQL Enterprise: Supports up to 128 GiB for data processing and data encryption.

NoSQL Database

ElastiCache

Amazon ElastiCache offers fully managed Redis and Memcached.

It is an In-memory data store, with an extremely good latency (sub-millisecond). As in RDS, we have to provide an EC2 instance and the cost comes with the EC2 usage per hour and the storage usage. Among its main characteristics:

It supports clustering (Redis) and Multi AZ, Read Replicas (sharding)
Security is provided through IAM policies. Despite can’t authenticate straight away, we can use Redis Auth for this purpose.
Backup, Snapshot and Point in time restore feature
Managed and Scheduled maintenance
Monitoring is provided through CloudWatch

Use Case: Key/Value Store. Low volume of writes, high volume of reads. Storing sessions data for websites.
Operations: same as RDS
Security: We don’t get IAM authentication, users are provided through Redis Auth
Reliability: Clustering, Multi AZ
Performance: In memory database, under millisecond performance

DynamoDB

Pure AWS NoSQL Database technology working as Serverless, provisioned capacity, auto scaling and on demand capacity.
It can serve as a replacement for ElastiCache as a key/value store, and despite we don’t get as much speed for performance as ElastiCache we don’t need to provide an EC2 instance as is Serverless and still we get between 1 and 9 ms performance on reads. dynamodb

It’s highly available, supports multi AZ
Reads and Writes are decoupled so we can balance them according to our needs.
We pay per Reads/Writes units.
Comes with DAX: a fully-managed, highly-available, in-memory caching service for DynamoDB.
Integrates with SNS and DynamoDB Streams (which integrates with Lambda) that enables table changes monitoring.
Backup/Restore and Global Table (only with DynamoDb Streams) features
Has transactions capability

Use Case: Mostly on pure serverless app development and for small documents (<100Kbs).
Operations: Fully managed.
Security: Authentication and Authorisation done through IAM, KMS encryption and SSL in transit.
Reliability: Multi AZ, Backups.
Performance: No performance degradation on scaling. DAX available for reading cache.
Cons: Only query on primary key, sort keys or indexes.
Pros: Pay per provisioned capacity and storage usage

Document

DocumentDB

Designed to store semi-structured data as documents where data is typically represented as a readable document.

It’s MongoDB compatible

Operations: Fully managed
Reliability: Replicates 6 copies of our data across 3 AZs. Health-checks and failover read replica in less than 30 seconds.
Performance: Millions of requests per second adding up to 15 low latency read replicas. Auto-scales up to 64 TB.
Security: VPC integration, KMS, auditing, SSL in transit.

Use case: Content management, personalisations, and mobile applications.

Object Store

S3 (Amazon Simple Storage Service)

Basically is storage for the Internet. It’s the equivalent to Blob Storage Accounts in Microsoft Azure. It does not replace RDS or NoSQL services.

It comes as S3 Standard, S3 IA, S3 One Zone IA, S3 Intelligent Tiering and Glacier (for backups).

Amazon S3 Standard is designed for high-usage data storage and can be used for website hosting, content distribution, cloud applications, mobile apps and big data.

Amazon S3 IA is designed for the data which require less frequent access. Minimum storage period is 30 days, and the minimum size of the object is 128 KB.

Amazon S3 Standard One-Zone Infrequent Access is 20% less expensive than the Standard IA and has a lower availability as only gets stored in one AZ.

Amazon Glacier is the best solution for long period storage and archiving. It has glacier extremely low cost but the minimum period of storage is 90 days.

Star features:

Versioning
Encryption
Cross region replication
Server access logging
Static website hosting
Object-level logging
Lifecycle rules!
Object lock
Transfer acceleration
Requester pays (requester pays for the data transfer and requests instead us)

Pros:

Great for big objects
Can be used as a key/value store for objects
It’s serverless and scales infinitely, allowing to host objects up to 5 TB each.
Integrates with IAM, has bucket policies and ACL
Supports encryption: SSE-S3, SSE-KMS, SSE-C, client side encryption and SSL in transit.

Cons:

Not great for small objects
Data not indexed

Use Case: Static files, static website hosting and storage of big files.
Operations: fully managed
Reliability: 99.99% availability, Multi AZ and Cross Region Replication
Performance: it supports transfer acceleration with CloudFront, multi-part for big files and can scale up to thousands of read/write operations per second.
Cost: pay per storage use and number of requests.

Data Warehouse

Athena

It’s a fully serverless databe with SQL capabilities. In most of the common use cases is used to query data in S3. Basically can be considered a query engine for S3 and even results are sent back to S3.

Operations: fully managed
Security: IAM and S3 security for bucket policies
Reliability: same as S3
Performance: based on data size
Use Case: queries on S3 and log analytics.
Cost: Pay per query or TB of data scanned

Notes: Output results can be sent back to S3.

Redshift

It’s a fully managed, petabyte-scale data warehouse service in the cloud based on PostgreSQL. It’s OLAP, which means that is used for analytics and data warehousing. redshift

Scales to PBs of data.
Columnar storage data (instead row based).
Pay as you go based on the instances provisioned.
Has a SQL interface for running queries.
Data is loaded from S3, DynamoDB, DMS or other DBs.
Scales from 1 node to 128 nodes, up to 160GB of space per node.
Composed of Leader node and Compute nodes.
Redshift Spectrum: run queries directly against S3.

Operations:
Security: VPC, IAM, KMS. Redshift Enhanced VPC Routing allows to copy or unload straight through the VPC.
Performance: 10x performance than other data warehouses. MPP (Massive Parallel Query Execution).
Reliability: highly available and auto healing.
Use Case: useful to be used with BI tools such as AWS Quicksight or Tableau.

As a final point, AWS states that its costs is around 1/10th of the cost versus other data warehouse technologies.

Search

Elastic Search

Elasticsearch is a search engine based on Lucene and is developed in Java and is released as open source under the terms of the Apache License. At the contrary of DynamoDB, where you can only find an object by primary key or index, with ElasticSearch, you can query by any field that has been previously indexed, even partial matches (see use of ElasticSearch Analysers).
You can find out more about ElasticSearch capabilities in AWS in my previous article

It has integrations with Amazon Kinesis Data Firehose, AWS IoT and Amazon CloudWatch Logs for data ingestion.
It comes with Kibana and Logstash
It comes with a full Rest API.

Operations: Similar to RDS.
Security: Cognito and IAM, KMS encryption, SSL and VPC.
Reliability: Multi-AZ, clustering (shards technology).
Performance: Petabytes of data.
Cost: pay per node.
Use Case: Indexing and catalog searches.

Graphs

Neptune

It’s a fully managed graph database optimised for leading graph query languages. . The core of Neptune is a purpose-built, high-performance graph database engine optimised for storing billions of relationships and querying the graph with milliseconds latency.

Highly available across 3 AZ with up to 15 read replicas
Point-in-time recovery, continuous backup to Amazon S3
Support for KMS encryption at rest and SSL in transit.
Supports Open Graph APIs
Supports network Isolation (VPC)
It can query billions of relationships in milliseconds.
Can be used with Gremlin and SPARQL.

neptune

Security: IAM, VPC, KMS, SSL and IAM Authentication.
Performance: best suited for graphs, clustering to improve performance.
Use Case: Social networking, knowledge graphs, detect network event and anomalies.

In most of my projects at NTT Data Impact we end up recommending the usage of Serverless platforms in AWS, that’s one of the reasons why S3 is a key component in most of our solutions, I really recommend to take a look into its capabilities.

I hope that this short guide helps you to decide what Database service is more adequate to your needs. Also take in consideration all the main questions at the beginning of this article, as sometimes, we tend to think short term and we end up regretting our decisions after a while 🙂

30 Days of DevOps: Choosing the right DB in AWS

RDS

Amazon Aurora

MySQL

MariaDB

PostgreSQL

Oracle

Microsoft SQL Server

NoSQL Database

ElastiCache

DynamoDB

Document

DocumentDB

Object Store

S3 (Amazon Simple Storage Service)

Data Warehouse

Athena

Redshift

Search

Elastic Search

Graphs

Neptune

About Eduardo Ortega Bermejo

Leave a comment Cancel reply

Recent Posts

Categories

Archives

Meta

30 Days of DevOps: Choosing the right DB in AWS

RDS

Amazon Aurora

MySQL

MariaDB

PostgreSQL

Oracle

Microsoft SQL Server

NoSQL Database

ElastiCache

DynamoDB

Document

DocumentDB

Object Store

S3 (Amazon Simple Storage Service)

Data Warehouse

Athena

Redshift

Search

Elastic Search

Graphs

Neptune

Share this:

Related

About Eduardo Ortega Bermejo

Leave a comment Cancel reply

Recent Posts

Category Cloud

Categories

Archives

Meta