
A data warehouse is a database optimized to analyze relational data coming from transactional systems and line of business applications. Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze data using standard SQL and existing Business Intelligence (BI) tools.

To get information from unstructured data that would not fit in a data warehouse, you can build a data lake. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. With a data lake built on Amazon Simple Storage Service (Amazon S3), you can easily run big data analytics and use machine learning to gain insights from your semi-structured (such as JSON, XML) and unstructured datasets.

Today, we are launching two new features to help you improve the way you manage your data warehouse and integrate with a data lake:

- Data Lake Export to unload data from a Redshift cluster to S3 in Apache Parquet format, an efficient open columnar storage format optimized for analytics.
- Federated Query to be able, from a Redshift cluster, to query across data stored in the cluster, in your S3 data lake, and in one or more Amazon Relational Database Service (RDS) for PostgreSQL and Amazon Aurora PostgreSQL databases (see the sketch below this list).

This architectural diagram gives a quick summary of how these features work and how they can be used together with other AWS services.
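As a quick taste of the Federated Query side before the walkthrough, here's a minimal sketch. The schema name, endpoint, table names, and Secrets Manager ARN below are all hypothetical placeholders, not part of the original setup:

-- Map an Aurora PostgreSQL database into Redshift as an external schema
-- (endpoint, database name, and secret ARN are hypothetical).
CREATE EXTERNAL SCHEMA apg
FROM POSTGRES
DATABASE 'mydb' SCHEMA 'public'
URI 'my-aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123412341234:role/myRedshiftRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123412341234:secret:apg-creds-AbCdEf';

-- Join live operational data with a (hypothetical) table in the cluster
SELECT o.orderid, o.status, s.total_price
  FROM apg.orders o
  JOIN sales_summary s ON s.orderid = o.orderid;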

Let's explain the interactions you see in the diagram better, starting from how you can use these features, and the advantages they provide.

You can now unload the result of a Redshift query to your S3 data lake in Apache Parquet format. The Parquet format is up to 2x faster to unload and consumes up to 6x less storage in S3, compared to text formats. This enables you to save data transformation and enrichment you have done in Redshift into your S3 data lake in an open format. You can then analyze the data in your data lake with Redshift Spectrum, a feature of Redshift that allows you to query data directly from files on S3. Or you can use different tools such as Amazon Athena, Amazon EMR, or Amazon SageMaker.
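Syntactically, the new capability boils down to the FORMAT AS PARQUET option of the UNLOAD command. As a minimal sketch, with placeholder bucket and role values (a complete, real command follows in the walkthrough below):

UNLOAD ('SELECT ...')                -- any valid SELECT statement
TO 's3://your-bucket/your-prefix/'   -- S3 location for the output files
FORMAT AS PARQUET                    -- new: write Parquet instead of text
CREDENTIALS 'aws_iam_role=arn:aws:iam::111122223333:role/yourRedshiftRole';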

To try this new feature, I create a new cluster from the Redshift console, and follow this tutorial to load sample data that keeps track of sales of musical events across different venues. I want to correlate this data with social media comments on the events stored in my data lake. To understand their relevance, each event should have a way of comparing its relative sales to other events.

Let's build a query in Redshift to export the data to S3. My data is stored across multiple tables. I need to create a query that gives me a single view of what is going on with sales. I want to join the content of the sales and date tables, adding information on the gross sales for an event (total_price in the query), and the percentile in terms of all-time gross sales compared to all events.

To export the result of the query to S3 in Parquet format, I use the following SQL command:

UNLOAD ('SELECT sales.*, date.*, total_price, percentile
           FROM sales, date,
                (SELECT eventid, total_price,
                        ntile(1000) over(order by total_price desc) / 10.0 as percentile
                   FROM (SELECT eventid, sum(pricepaid) total_price
                           FROM sales
                          GROUP BY eventid) as event_totals) as percentile_events
          WHERE sales.dateid = date.dateid
            AND percentile_events.eventid = sales.eventid')
TO 's3://MY-BUCKET/DataLake/Sales/'
FORMAT AS PARQUET
CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole';

To give Redshift write access to my S3 bucket, I am using an AWS Identity and Access Management (IAM) role. I can see the result of the UNLOAD command using the AWS Command Line Interface (AWS CLI). As expected, the output of the query is exported using the Parquet columnar data format:

$ aws s3 ls s3://MY-BUCKET/DataLake/Sales/
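Since the files are now in open Parquet format, this is the point where Redshift Spectrum, mentioned above, can pick them up. Here's a hedged sketch of how that could look; the external schema name, Glue database name, and column list are assumptions for illustration, not part of the original walkthrough:

-- Register an external schema backed by the AWS Glue Data Catalog
-- (schema and database names are hypothetical).
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'datalake'
IAM_ROLE 'arn:aws:iam::123412341234:role/myRedshiftRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Point an external table at the unloaded Parquet files
-- (only a few illustrative columns are listed; Parquet maps them by name).
CREATE EXTERNAL TABLE spectrum.sales_export (
  eventid     int,
  total_price decimal(10,2),
  percentile  decimal(10,1)
)
STORED AS PARQUET
LOCATION 's3://MY-BUCKET/DataLake/Sales/';

-- Query the data lake directly from the cluster, e.g. the top percentile
SELECT eventid, total_price
  FROM spectrum.sales_export
 WHERE percentile <= 1.0;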

To optimize access to data, I can specify one or more partition columns so that unloaded data is automatically partitioned into folders in my S3 bucket. For example, I can unload sales data partitioned by year, month, and day. This enables my queries to take advantage of partition pruning and skip scanning irrelevant partitions, improving query performance and minimizing cost.

To use partitioning, I need to add to the previous SQL command the PARTITION BY option, followed by the columns I want to use to partition the data in different directories. In my case, I want to partition the output based on the year and the calendar date (caldate in the query) of the sales:

UNLOAD ('SELECT sales.*, date.*, total_price, percentile
           FROM sales, date,
                (SELECT eventid, total_price,
                        ntile(1000) over(order by total_price desc) / 10.0 as percentile
                   FROM (SELECT eventid, sum(pricepaid) total_price
                           FROM sales
                          GROUP BY eventid) as event_totals) as percentile_events
          WHERE sales.dateid = date.dateid
            AND percentile_events.eventid = sales.eventid')
TO 's3://MY-BUCKET/DataLake/SalesPartitioned/'
FORMAT AS PARQUET
PARTITION BY (year, caldate)
CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole';

This time, the output of the query is stored in multiple partitions. For example, here's the content of a folder for a specific year and date:
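(The partition values below are hypothetical, drawn from the sample dataset's 2008 date range.)

$ aws s3 ls s3://MY-BUCKET/DataLake/SalesPartitioned/year=2008/caldate=2008-01-01/

The Hive-style year=.../caldate=.../ folder layout is what lets engines such as Redshift Spectrum and Athena skip folders that don't match a query's partition filters.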
