S3 Folder Structure and Its Impacts for Redshift Table and Glue Data Catalog

Suppose we export a very large table into multiple CSV files with the same format, or split an existing large CSV file into multiple smaller files. All these files are stored in an S3 bucket folder or its subfolders. We call this S3 folder the root folder of the table, or root-folder for short. In this blog we will explore the best way to organize the files in the root-folder and its subfolders, so that we can easily access them from Redshift or discover them with the AWS Glue data catalog.

We start the experiments with four CSV files (test_file1, test_file2, test_file3, and test_file4). All the files have the same three fields (cust_id: integer, cust_type: text, cust_name: text) and contain no header row. They will be organized in different S3 folder structures within the root folder “test”. Then we will observe the behaviors when we access them with Redshift and AWS Glue in the three ways below:

  • Reload the files into a Redshift table using the COPY command
  • Create a Spectrum external table from the files
  • Discover and add the files to the AWS Glue data catalog using a Glue crawler

We set the root folder “test” as the S3 location in all three methods. For Redshift, below are the commands used:

  1. Reload the files into a Redshift table “test_csv”:
create table test_csv (cust_id integer, cust_type text, cust_name text);

copy test_csv
from 's3://redshift-colddata/test/' 
iam_role 'arn:aws:iam::[aws-acct]:role/RedshiftSpectrumRole'
delimiter ',';
  2. Create the Redshift Spectrum external table (note that CREATE EXTERNAL TABLE does not accept the text type, so varchar is used instead):
create external table spectrumtest.test_csv_ext(cust_id integer, cust_type varchar(256), cust_name varchar(256))
row format delimited fields terminated by ','
stored as textfile
location 's3://redshift-colddata/test/';
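After each experiment, a quick sanity check is to compare the row counts of the two tables; both should return the combined row count of all the files reachable under the S3 location (query sketch, using the table names defined above):

```
-- Both counts should equal the total number of rows across all
-- CSV files found under s3://redshift-colddata/test/.
select count(*) from test_csv;
select count(*) from spectrumtest.test_csv_ext;
```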

Case 1: All the 4 files are in the root folder “test”:

The S3 root folder structure:

test/file1
test/file2
test/file3
test/file4

Redshift Result:

Both the tables "test_csv" and "test_csv_ext" have all the data from the 4 files.

Glue Crawler Catalog Result:

Discovered one table: "test" (the root-folder name). It has all the data from the 4 files, and it is not partitioned.
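Once the crawler has cataloged the table, it can be queried from Redshift by mapping an external schema onto the Glue database. A sketch, where the database name "gluedb" and schema name "glue_test" are hypothetical (use whatever database the crawler wrote to):

```
-- "gluedb" and "glue_test" are made-up names for illustration.
create external schema glue_test
from data catalog
database 'gluedb'
iam_role 'arn:aws:iam::[aws-acct]:role/RedshiftSpectrumRole';

-- The crawled table is named after the root folder:
select count(*) from glue_test.test;
```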

Case 2: Some files are in the root folder and some files are in a subfolder:

The S3 root folder structure:

test/sbf1/file1
test/file2
test/file3
test/file4

Redshift Result:

Both the tables "test_csv" and "test_csv_ext" have all the data from the 4 files.

Glue Crawler Catalog Result:

Discovered four tables (named after the sub-folder and the individual files):
    "sbf1"  --  Only has the data from "file1". 
    "file2" --  No data found. 
    "file3" --  No data found. 
    "file4" --  No data found. 

Case 3: Files are in subfolders at the same level:

The S3 root folder structure:

test/sbf1/file1, file2
test/sbf2/file3, file4

Redshift Result:

Both the tables "test_csv" and "test_csv_ext" have all the data from the 4 files.

Glue Crawler Catalog Result:

Discovered one table: "test" (the root-folder name). It has all the data from the 4 files, and it is partitioned on one column into two partitions, "sbf1" and "sbf2" (the sub-folder names become the partition values).
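Because the crawler registered the sub-folder names as partition values, queries against the catalog table can prune by partition and scan only the matching folder. A sketch, assuming the crawler assigned its default partition column name partition_0 and the table is exposed through a hypothetical external schema glue_test (both names are assumptions):

```
-- Reads only the files under test/sbf1/ (partition pruning):
select cust_id, cust_name
from glue_test.test
where partition_0 = 'sbf1';
```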

Case 4: Files are at different levels of a multi-level subfolder structure:

The S3 root folder structure:

test/sbf1/sbsbf11/file1
test/sbf1/sbsbf12/file2
test/sbf2/file3, file4

Redshift Result:

Both the tables "test_csv" and "test_csv_ext" have all the data from the 4 files.

Glue Crawler Catalog Result:

Discovered two tables:
   "sbf1" -- It has the data from the two files "file1" and "file2", and it is partitioned into two partitions, "sbsbf11" and "sbsbf12".
   "sbf2" -- It has the data from the two files "file3" and "file4". It is not partitioned.

Case 5: Files are all at the same level of a multi-level subfolder structure:

S3 Folder structure:

test/sbf1/sbsbf11/file1
test/sbf1/sbsbf12/file2
test/sbf2/sbsbf21/file3
test/sbf2/sbsbf22/file4

Redshift Result:

Both the tables "test_csv" and "test_csv_ext" have all the data from the 4 files.

Glue Crawler Catalog Result:

Discovered one table: "test" (the root-folder name). It has all the data from the 4 files, and it is partitioned on two columns. The partitions are
        ["sbf1","sbf1-sbsbf11"]
        ["sbf1","sbf1-sbsbf12"]
        ["sbf2","sbf2-sbsbf21"]
        ["sbf2","sbf2-sbsbf22"]
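If we define the Spectrum external table ourselves instead of relying on the crawler, the same two-level layout maps onto two partition columns. A sketch, where the partition column names lvl1 and lvl2 are made up for illustration:

```
create external table spectrumtest.test_csv_part (
    cust_id integer, cust_type varchar(256), cust_name varchar(256))
partitioned by (lvl1 varchar(64), lvl2 varchar(64))
row format delimited fields terminated by ','
stored as textfile
location 's3://redshift-colddata/test/';

-- Each leaf folder is registered as one partition:
alter table spectrumtest.test_csv_part
add partition (lvl1='sbf1', lvl2='sbsbf11')
location 's3://redshift-colddata/test/sbf1/sbsbf11/';
```

One ALTER TABLE ... ADD PARTITION statement is needed per leaf folder, which is exactly the bookkeeping the Glue crawler automates.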

Conclusions

From the test results above, we can conclude:

  1. Redshift COPY and external tables don’t care about the S3 subfolder structure or how the files are organized within the root folder and its subfolders. As long as the files have the same format, the data from all the files can be accessed from a single Redshift table or external table by specifying the root folder as the S3 location, no matter where the files sit within the multi-level subfolders.
  2. However, in order for the Glue crawler to add the S3 files into the data catalog correctly, we have to follow the rules below when organizing and planning the S3 folder structure.
    • If you don’t want to use the partition feature, store all the files directly in the root folder.
    • If you want to use the partition feature, design the subfolder structure carefully, and always distribute the files among subfolders at the same level.
    • The number of subfolder levels should equal the number of partition columns you want.
    • Never place files at different subfolder levels, and never mix files and subfolders as siblings at any level.