S3 Folder Structure and Its Impact on Redshift Tables and the Glue Data Catalog

Suppose we export a very large table into multiple CSV files with the same format, or split an existing large CSV file into multiple CSV files. All these files are stored in an S3 bucket folder or its subfolders. We call this S3 folder the root folder of the table, or root-folder for short. In this blog we will explore the best way to organize the files in the root-folder and its subfolders, so that we can easily access them from Redshift or discover them in the AWS Glue data catalog.

We start the experiments with four CSV files (test_file1, test_file2, test_file3, and test_file4). All the files have the same three fields (cust_id: integer, cust_type: text, cust_name: text) and contain no header row; a sample of the file contents is shown after the list below. The files will be organized in different S3 folder structures within the root folder “test”. Then we will observe the behaviors when we access them with Redshift and AWS Glue in the three ways below:

  • Reload the files into a Redshift table using the COPY command,
  • Create a Redshift Spectrum external table from the files,
  • Discover and add the files into the AWS Glue data catalog using a Glue crawler.
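
Each test file is a plain CSV like the minimal sample below (the values are made up for illustration):

1,Individual,Alice Smith
2,Business,Acme Corp
3,Individual,Bob Lee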

We set the root folder “test” as the S3 location in all three methods. For Redshift, below are the commands used:

  1. Reload the files into a Redshift table “test_csv”:
create table test_csv (cust_id integer, cust_type text, cust_name text);

copy test_csv
from 's3://redshift-colddata/test/' 
iam_role 'arn:aws:iam::[aws-acct]:role/RedshiftSpectrumRole'
delimiter ',';
  2. Create the Redshift Spectrum external table:
create external table spectrumtest.test_csv_ext(cust_id integer, cust_type text, cust_name text)
row format delimited fields terminated by ','
stored as textfile
location 's3://redshift-colddata/test/';
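
For the Glue crawler, we simply point it at the same root folder and run it. A minimal sketch using the AWS CLI (the crawler name, database name, and service role here are hypothetical placeholders):

# Create a crawler targeting the root folder, then run it.
aws glue create-crawler \
    --name test-crawler \
    --role AWSGlueServiceRole-test \
    --database-name testdb \
    --targets '{"S3Targets": [{"Path": "s3://redshift-colddata/test/"}]}'

aws glue start-crawler --name test-crawler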

Case 1: All four files are in the root folder “test”:

The S3 root folder structure:

..\test\file1
..\test\file2
..\test\file3
..\test\file4

Redshift Result:

Both the tables "test_csv" and "test_csv_ext" have all the data from the 4 files.

Glue Crawler Catalog Result:

Discovered one table: "test" (the root-folder name). It has all the data from the 4 files, and it is not partitioned.

Case 2: Some files are in the root folder and some are in subfolders:

The S3 root folder structure:

..\test\sbf1\file1
..\test\file2
..\test\file3
..\test\file4

Redshift Result:

Both the tables "test_csv" and "test_csv_ext" have all the data from the 4 files.

Glue Crawler Catalog Result:

Discovered four tables (named after the sub-folder and the file names):
    "sbf1"  --  Only has the data from "file1". 
    "file2" --  No data found. 
    "file3" --  No data found. 
    "file4" --  No data found. 

Case 3: Files are in subfolders at the same level:

The S3 root folder structure:

..\test\sbf1\file1, file2
..\test\sbf2\file3, file4

Redshift Result:

Both the tables "test_csv" and "test_csv_ext" have all the data from the 4 files.

Glue Crawler Catalog Result:

Discovered one table: "test" (the root-folder name). It has all the data from the 4 files, and it is partitioned on one column into two partitions, "sbf1" and "sbf2" (the sub-folder names become the partition values).

Case 4: Files are at different levels of the multi-level subfolders:

The S3 root folder structure:

..\test\sbf1\sbsbf11\file1
..\test\sbf1\sbsbf12\file2
..\test\sbf2\file3, file4

Redshift Result:

Both the tables "test_csv" and "test_csv_ext" have all the data from the 4 files.

Glue Crawler Catalog Result:

Discovered two tables:
   "sbf1" -- It has the data from the two files "file1" and "file2". It is partitioned into two partitions, "sbsbf11" and "sbsbf12".
   "sbf2" -- It has the data from the two files "file3" and "file4". It is not partitioned.

Case 5: Files are all at the same level of the multi-level subfolders:

S3 Folder structure:

..\test\sbf1\sbsbf11\file1
..\test\sbf1\sbsbf12\file2
..\test\sbf2\sbsbf21\file3
..\test\sbf2\sbsbf22\file4

Redshift Result:

Both the tables "test_csv" and "test_csv_ext" have all the data from the 4 files.

Glue Crawler Catalog Result:

Discovered one table: "test" (the root-folder name). It has all the data from the 4 files, and it is partitioned on two columns. The partitions are
        ["sbf1","sbf1-sbsbf11"]
        ["sbf1","sbf1-sbsbf12"]
        ["sbf2","sbf2-sbsbf21"]
        ["sbf2","sbf2-sbsbf22"]

Conclusions

From the test results above, we can conclude:

  1. Redshift “COPY” and external tables don’t care about the S3 subfolder structure or how the files are organized within the root folder and its subfolders. As long as the files have the same format, the data from all the files can be accessed from one single Redshift table or external table by specifying the root folder as the S3 location, no matter where the files are stored within the multi-level subfolders.
  2. However, for the Glue crawler to add the S3 files into the data catalog correctly, we have to follow the rules below to organize and plan the S3 folder structure (a layout that satisfies them is sketched after the list):
    • If you don’t want to utilize the partition feature, store all the files directly in the root folder.
    • If you want to utilize the partition feature, design the subfolder structure carefully, and always distribute the files among subfolders at the same level.
    • The number of subfolder levels should equal the number of columns to be partitioned on.
    • Never place files at different subfolder levels, and never mix files and folders as siblings at any level.
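
For example, a layout that satisfies all the rules above is the one from Case 3. A minimal sketch of building it with the AWS CLI (using the same bucket and file names as in the tests):

# Two subfolders at the same level; files live only at the leaf level.
aws s3 cp test_file1.csv s3://redshift-colddata/test/sbf1/
aws s3 cp test_file2.csv s3://redshift-colddata/test/sbf1/
aws s3 cp test_file3.csv s3://redshift-colddata/test/sbf2/
aws s3 cp test_file4.csv s3://redshift-colddata/test/sbf2/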

How to Assume an IAM Role From the AWS CLI

You can follow the three steps below to assume an IAM role from the AWS CLI:

  • Step 1: Grant an IAM user the permission to assume an IAM role (or all IAM roles)
  • Step 2: Grant a particular IAM role to the IAM user. That is, modify that IAM role’s trust relationship to allow the IAM user to assume the role
  • Step 3: Configure and use an AWS CLI profile to assume the granted role

1. Grant an IAM user the permission to assume a role:

  • Create a new IAM user or pick an existing one (say it is named “userTest”) with security credentials (Access Key ID and Secret Access Key). Note down its user ARN (like arn:aws:iam::123456789012:user/userTest).
  • Create an assume-role policy. It can be a generic policy (name it “Assume-Roles”) that allows assuming any IAM role, with the following custom policy code:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "*"
        }
    ]
}

Or you can specify a particular role’s ARN (like “arn:aws:iam::123456789012:role/roleTest”) as the Resource above to limit the policy to the IAM role named “roleTest”, and name the policy “Assume-Role-roleTest”.

  • Attach the policy to the user “userTest”.
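
If you prefer the CLI over the console for this step, a minimal sketch (assuming the policy JSON above is saved locally as assume-roles.json; the file name is just a placeholder):

# Create the managed policy, then attach it to the user.
aws iam create-policy \
    --policy-name Assume-Roles \
    --policy-document file://assume-roles.json

aws iam attach-user-policy \
    --user-name userTest \
    --policy-arn arn:aws:iam::123456789012:policy/Assume-Roles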

2. Grant an IAM role to an IAM user:

  • Create a new IAM role or pick an existing one (say it is named “roleTest”) with the required permissions. Note down its role ARN (like arn:aws:iam::123456789012:role/roleTest).
  • Modify the role’s trust relationship to add the IAM user to the trusted entities:
{
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::123456789012:user/userTest"
    },
    "Action": "sts:AssumeRole"
}
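
From the CLI, the whole trust relationship can be replaced in one call (assuming the complete trust policy document, with the statement above inside its Statement array, is saved as trust-policy.json):

aws iam update-assume-role-policy \
    --role-name roleTest \
    --policy-document file://trust-policy.json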

3. Configure and use the AWS CLI profile to assume the granted role:

  • Edit the config file at ~/.aws/config to add a profile for the user “userTest” and another for the role “roleTest”:
[profile userTest]
region=us-east-1
output=json

[profile roleTest]
region=us-east-1
output=json
role_arn=arn:aws:iam::123456789012:role/roleTest
source_profile=userTest

If the user “userTest” is using the default profile, the source_profile line can point to the default profile instead: “source_profile=default”.
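
For reference, the role profile works by calling sts:AssumeRole with the source profile’s credentials; you can make the same call manually to see the temporary credentials it returns (the session name is an arbitrary label):

aws sts assume-role \
    --role-arn arn:aws:iam::123456789012:role/roleTest \
    --role-session-name cli-test \
    --profile userTest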

  • Use the AWS CLI to assume the role: call any eligible AWS CLI command with the --profile option:
aws <any-eligible-cli-command> --profile roleTest

This will log in to AWS using userTest’s security credentials and then assume the IAM role “roleTest” to execute the CLI command.
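
A quick way to verify which identity a profile resolves to is sts get-caller-identity; with the role profile it should report the assumed-role ARN rather than the user ARN:

aws sts get-caller-identity --profile roleTest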

For example, suppose the user “userTest” uses the default profile and does not itself have Redshift access permission, but “roleTest” does.

If you run the AWS CLI command:

aws redshift describe-clusters

It will throw a “not authorized” error. But if you run the AWS CLI command:

aws redshift describe-clusters --profile roleTest

It will succeed.