Redshift and Athena

Introduction

Companies are increasingly using cloud services to store, work with, and analyze their data in today’s data-driven world. Cloud services provide a convenient and secure way for companies to manage large amounts of information. They rely on cloud services to store data, collaborate on projects, and gain insights from their data analysis. This trend reflects the growing importance of data in the business world. Two popular services offered by Amazon Web Services (AWS) are Amazon Redshift and Athena. Both tools are good for working with data in the cloud. However, they have different methods and purposes.

This article will compare Redshift and Athena, looking at how they save data in the cloud, manage access, and track usage. We will examine the main differences between the two services. The comparison will highlight their approaches to data storage, access control, and usage auditing. When choosing a tool for your data needs, it’s important to make informed decisions.

Understanding the strengths and limitations of each service is crucial. This will help you select the right tool for your specific requirements. Choosing the right option will effectively meet your data needs.

What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale data warehousing service designed for large-scale data storage and analysis.

It relies on PostgreSQL and provides a cost-effective solution for running complex queries on large datasets. Redshift stores data in columns and uses many processors to quickly run queries and perform well.” It is well-suited for data warehousing, business intelligence, and reporting workloads.

What is Amazon Athena?

Amazon Athena is a service that lets you analyze data from Amazon S3 using standard SQL. It is serverless, meaning you don’t need to provision or manage any infrastructure.

Athena enables users to perform ad-hoc querying and exploration of data stored in S3. It supports a wide range of data formats, including CSV, JSON, ORC, Avro, and Parquet.

Athena is great for quickly analyzing data without dealing with complex ETL processes or data warehousing.

It is ideal for situations that require fast data analysis. Athena eliminates the need for complicated ETL processes and data warehousing.

Cloud Data Storage

When it comes to cloud data storage, Redshift and Athena have different approaches. Redshift stores data in its own managed storage layer, which it optimizes for fast querying and analysis.

You load data into Redshift clusters, which consist of nodes that store and process the data. Redshift automatically distributes data across nodes based on the distribution style you select, such as even, key, or all. This allows for efficient parallel processing and query execution.

Example

To store data in Redshift, you first create a cluster with the desired number of nodes and node types. Then, you can use the COPY command to load data from S3 into Redshift tables. For instance, to load data from a CSV file in S3 into a Redshift table named “users,” you can use the following command:

COPY users FROM 's3://your-bucket/path/to/file.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS CSV;

Athena, on the other hand, does not store data itself. Instead, it uses Amazon S3 as the underlying storage layer. You define tables in Athena that map to the data stored in S3, specifying the schema and data format. Athena uses this metadata to query the data directly from S3 without the need for loading or transformation.

Example

To create a table in Athena that maps to data in S3, you can use the CREATE EXTERNAL TABLE statement. Here’s an example:

CREATE EXTERNAL TABLE users (
id INT,
name STRING,
age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://your-bucket/path/to/data/';

This creates an external table named “users” that maps to CSV data stored in the specified S3 location. Athena can now query this data using standard SQL.

Access Control and Security

Redshift and Athena both have strong security features to protect your data.

Redshift offers several security mechanisms, including:

Network isolation: Redshift clusters can be launched in a Virtual Private Cloud (VPC), providing network-level security and isolation.
Encryption: Redshift supports encryption at rest using AWS Key Management Service (KMS) and encryption in transit using SSL/TLS.
Access control: Redshift integrates with AWS Identity and Access Management (IAM) to manage user authentication and authorization. You can define fine-grained access policies to control who can access and manipulate data within Redshift.
Auditing: Redshift provides detailed audit logging and integration with AWS CloudTrail to track user activities and changes made to the cluster.

Example: To grant a user read-only access to a specific Redshift table, you can create an IAM policy like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials",
                "redshift:DescribeTable",
                "redshift:SelectFromTable"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:your-cluster-name",
                "arn:aws:redshift:us-west-2:123456789012:table:your-schema-name/your-table-name"
            ]
        }
    ]
}

Athena leverages Amazon S3’s security features, including:

Access control: Athena uses IAM policies to control access to S3 buckets and objects. You can define policies to restrict access to specific S3 paths or Athena actions.
Encryption: Athena supports querying data that is encrypted at rest in S3 using AWS KMS.
Auditing: Athena integrates with AWS CloudTrail to log all API calls and user activities, providing an audit trail for compliance and security purposes.

For a user to run queries on a certain Athena table, you can make an IAM policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution",
                "athena:GetQueryExecution",
                "athena:GetQueryResults"
            ],
            "Resource": [
                "arn:aws:athena:us-west-2:123456789012:workgroup/your-workgroup-name"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-bucket-name",
                "arn:aws:s3:::your-bucket-name/*"
            ]
        }
    ]
}

This policy allows the user to run queries in the chosen Athena workgroup and view the required S3 objects.

Cloud Data Auditing and Compliance

Auditing is crucial for maintaining compliance and tracking user activities in both Redshift and Athena.

Redshift provides the following auditing capabilities

System tables: Redshift maintains system tables that contain information about user connections, queries executed, and changes made to the cluster configuration.
AWS CloudTrail integration: Redshift integrates with CloudTrail to log API calls and management events, providing a comprehensive audit trail.
Logging: Redshift can log user activity, connections, and queries to Amazon S3 or Amazon CloudWatch Logs for analysis and monitoring.

To log user connections and activity in Redshift, adjust the cluster parameter group and set specific parameters.

enable_user_activity_logging = true
enable_user_activity_logging_level = STANDARD

Athena, being a serverless service, relies on AWS CloudTrail for auditing

AWS CloudTrail integration: Athena automatically logs all API calls and user activities to CloudTrail, providing a detailed audit trail.
S3 server access logging: You can enable server access logging on the S3 buckets used by Athena to track object-level access and operations.

Example: To analyze Athena queries using CloudTrail logs, you can use Amazon Athena itself to query the logs stored in S3. First, create an Athena table that maps to the CloudTrail logs:

CREATE EXTERNAL TABLE cloudtrail_logs (
eventversion STRING,
useridentity STRUCT<
type: STRING,
principalid: STRING,
arn: STRING,
accountid: STRING,
invokedby: STRING
>,
eventtime STRING,
eventsource STRING,
eventname STRING,
awsregion STRING,
sourceipaddress STRING,
useragent STRING,
errorcode STRING,
errormessage STRING,
requestparameters STRING,
responseelements STRING,
additionaleventdata STRING,
requestid STRING,
eventid STRING,
resources ARRAY<STRUCT<
arn: STRING,
accountid: STRING,
type: STRING
>>,
eventtype STRING,
apiversion STRING,
readonly STRING,
recipientaccountid STRING,
serviceeventdetails STRING,
sharedeventid STRING,
vpcendpointid STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://your-cloudtrail-bucket/AWSLogs/123456789012/CloudTrail/';

Note the usage of < and > for defining struct data types is specific to Amazon Athena’s SQL dialect.

Athena uses a variant of SQL called Presto SQL, which is based on the Presto distributed SQL query engine. Presto SQL introduces some extensions and syntax differences compared to standard SQL.

Then, you can query the CloudTrail logs to analyze Athena usage:

SELECT
useridentity.arn,
eventtime,
eventsource,
eventname,
requestparameters
FROM cloudtrail_logs
WHERE eventsource = 'athena.amazonaws.com'
AND eventname LIKE 'Start%'
ORDER BY eventtime DESC
LIMIT 100;

This query retrieves the last 100 Athena query execution events, including the user ARN, event time, event source, event name, and request parameters.

Redshift vs Athena: Conclusion

They are both powerful solutions offered by AWS for working with data in the cloud. Redshift is a managed data warehousing solution for storing and querying large datasets. Athena is a serverless query service for analyzing data from S3 using SQL. Both services provide robust access control, security, and auditing capabilities to ensure the protection and compliance of your data.

When choosing between Redshift and Athena, consider your specific use case and requirements. If you have a need for complex queries, large-scale data storage, and high-performance analytics, Redshift may be the better choice. On the other hand, if you have data stored in S3 and need ad-hoc querying and exploration without the overhead of managing infrastructure, Athena is a great option.

It is important to have proper access controls, encryption, and auditing in place to protect your data. No matter what you choose, these measures are crucial for the security and integrity of your data.

Proper access controls ensure that only authorized users can access the data. Encryption helps to secure the data from unauthorized access. Auditing helps to monitor and track any changes or access to the data.

You can keep your data safe in the cloud by using security features and following best practices. This will ensure that your data stays secure and complies with regulations.

DataSunrise offers user-friendly and flexible tools for both Redshift and Athena data security, audit rules, masking, and compliance. Our solutions seamlessly integrate with AWS services, providing an additional layer of security and control over your data.

Are you in need of top-notch data security and compliance solutions? Look no further than the DataSunrise team! Join us for an online demo to see how our offerings can benefit your organization.