Asides

Zero-Downtime Summary Table Refresh Using A/B Rotating Tables in Oracle Exadata and Redshift


1. Zero-Downtime Refresh Framework in Oracle Exadata

 

In an Enterprise Data Warehouse system, it is quite common that some summary tables need to be refreshed several times a day, or even hourly. Meanwhile, most of these summary tables are accessed globally in 24/7 mode; that is, no downtime window is allowed for the scheduled or on-demand refreshes. Therefore, how to implement a Zero-Downtime table refresh is a challenge.

In our Oracle Exadata-based Enterprise Data Warehouse system, we successfully developed a framework for the Summary Table Refresh process that uses an A/B Rotating strategy to implement the Zero-Downtime refresh automatically.

The Oracle-based framework is outlined below:

      1. Create A/B tables: for each summary table to refresh, say a sales summary fact table named “F_SLS_SUM”, create two identical A/B tables, “F_SLS_SUM_A” and “F_SLS_SUM_B”.
      2. Create a Public Synonym “F_SLS_SUM” that is pointed to one of the A/B tables appropriately and dynamically during the process.
      3. Configure and update the Refresh Control Table REFRESH_CNTRL_TAB (Synonym_Name*, Schema_Name*, Rotate_Table_Name*, Rotate_Status, Refresh_Status), where the columns marked with * form the unique key.
      4. The process refreshes the “Inactive” rotate table, repoints the synonym, and updates the Rotate and Refresh statuses accordingly. For example:
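As a minimal sketch of one refresh cycle (the EDW schema, the staging source, and the status values are illustrative assumptions), suppose “F_SLS_SUM_A” is currently the active table. The process reloads the inactive B table, repoints the synonym, and flips the statuses:

-- Rebuild the inactive rotate table while readers keep using F_SLS_SUM_A through the synonym
TRUNCATE TABLE EDW.F_SLS_SUM_B;
INSERT INTO EDW.F_SLS_SUM_B SELECT * FROM EDW.STG_SLS_SUM;  -- hypothetical staging source
COMMIT;

-- Repoint the public synonym to the freshly refreshed table
CREATE OR REPLACE PUBLIC SYNONYM F_SLS_SUM FOR EDW.F_SLS_SUM_B;

-- Flip the rotate/refresh statuses in the refresh control table
UPDATE REFRESH_CNTRL_TAB
   SET ROTATE_STATUS = 'ACTIVE', REFRESH_STATUS = 'REFRESHED'
 WHERE SYNONYM_NAME = 'F_SLS_SUM' AND ROTATE_TABLE_NAME = 'F_SLS_SUM_B';
UPDATE REFRESH_CNTRL_TAB
   SET ROTATE_STATUS = 'INACTIVE'
 WHERE SYNONYM_NAME = 'F_SLS_SUM' AND ROTATE_TABLE_NAME = 'F_SLS_SUM_A';
COMMIT;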

2. Zero-Downtime Refresh Implementation in Redshift

 

In Redshift there is no Public Synonym concept (at least at the time of this writing). Therefore we can’t migrate the existing Oracle-based Zero-Downtime refresh framework directly to Redshift by switching a Public Synonym.

As a workaround in Redshift, rather than switching a Public Synonym after the refresh process syncs the refresh control table, we can define a Redshift view on top of the Refresh Control Table:

CREATE OR REPLACE VIEW EDW.F_SLS_SUM_V  AS
(SELECT * FROM EDW.F_SLS_SUM_A
WHERE EXISTS ( SELECT 1 FROM EDW.REFRESH_CNTRL_TAB WHERE ROTATE_TABLE_NAME ='F_SLS_SUM_A' AND ROTATE_STATUS ='ACTIVE' ) )
UNION ALL
(SELECT * FROM EDW.F_SLS_SUM_B
WHERE EXISTS ( SELECT 1 FROM EDW.REFRESH_CNTRL_TAB WHERE ROTATE_TABLE_NAME ='F_SLS_SUM_B' AND ROTATE_STATUS ='ACTIVE' ) );

For example,


CREATE VIEW EDWADM.F_DLY_SLS_REPT_SUM_V AS
(SELECT * from EDWADM.F_DLY_SLS_REPT_SUM_A
WHERE EXISTS (SELECT 1 FROM EDWADM.EDW_REFRESH_CNTRL_TAB WHERE ROTATE_TABLE_NM = 'F_DLY_SLS_REPT_SUM_A' and ROTATE_STATUS = 'ACTIVE')
UNION ALL
SELECT * from EDWADM.F_DLY_SLS_REPT_SUM_B
WHERE EXISTS (SELECT 1 FROM EDWADM.EDW_REFRESH_CNTRL_TAB WHERE ROTATE_TABLE_NM = 'F_DLY_SLS_REPT_SUM_B' and ROTATE_STATUS = 'ACTIVE')
);

In this way, the view F_SLS_SUM_V returns the data from whichever rotating table (A or B) is currently active according to the rotate status in the refresh control table.
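A comparable sketch of one refresh cycle in Redshift (again, the staging source and status values are illustrative): the process reloads whichever rotate table is currently inactive, then flips the statuses in a single transaction so the view switches over atomically:

-- F_SLS_SUM_B is currently INACTIVE, so it can be rebuilt while readers use the A table
TRUNCATE TABLE EDW.F_SLS_SUM_B;
INSERT INTO EDW.F_SLS_SUM_B SELECT * FROM EDW.STG_SLS_SUM;  -- hypothetical staging source

-- Flip both statuses together; the view F_SLS_SUM_V now reads from the B table
BEGIN;
UPDATE EDW.REFRESH_CNTRL_TAB
   SET ROTATE_STATUS = 'ACTIVE', REFRESH_STATUS = 'REFRESHED'
 WHERE ROTATE_TABLE_NAME = 'F_SLS_SUM_B';
UPDATE EDW.REFRESH_CNTRL_TAB
   SET ROTATE_STATUS = 'INACTIVE'
 WHERE ROTATE_TABLE_NAME = 'F_SLS_SUM_A';
COMMIT;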

 

 

 

S3 Folder Structure and Its Impact on Redshift Tables and the Glue Data Catalog


Suppose we export a very large table’s data into multiple csv files with the same format, or split an existing large csv file into multiple csv files, and all these files are stored in an S3 bucket folder or its subfolders. We call this S3 folder the root folder of the table, or root-folder for short. In this blog we will explore the best way to organize the multiple files in the root-folder and its subfolders, so that we can easily access these files from Redshift or discover them in the AWS Glue data catalog.

We start the experiments with four csv files (test_file1, test_file2, test_file3, and test_file4). All the files have the same three fields (cust_id: integer, cust_type: text, cust_name: text) and do not contain a header row. They will be organized in different S3 folder structures within the root folder “test”. Then we will observe the behaviors when we access them with Redshift and AWS Glue in the three ways below:

  • Reload the files into a Redshift table using the “COPY” command,
  • Create a Spectrum external table from the files,
  • Discover and add the files into the AWS Glue data catalog using a Glue crawler.

We set the root folder “test” as the S3 location in all three methods. For Redshift, below are the commands used (a sample crawler setup for the Glue method follows them):

  1. Reload the files into a Redshift table “test_csv”:
create table test_csv (cust_id integer, cust_type text, cust_nme text);

copy test_csv
from 's3://redshift-colddata/test/' 
iam_role 'arn:aws:iam::[aws-acct]:role/RedshiftSpectrumRole'
delimiter ',';
  2. Create the Redshift External Table:
create external table spectrumtest.test_csv_ext(cust_id integer, cust_type text, cust_nme text)
row format delimited fields terminated by ','
stored as textfile
location 's3://redshift-colddata/test/';
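For the Glue crawler method, the crawler is simply pointed at the same root folder. A sketch using the AWS CLI (the crawler name, IAM role, and target database are assumptions):

aws glue create-crawler \
    --name test-csv-crawler \
    --role AWSGlueServiceRole-test \
    --database-name test_db \
    --targets '{"S3Targets": [{"Path": "s3://redshift-colddata/test/"}]}'

aws glue start-crawler --name test-csv-crawler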

Case 1: All the 4 files are in the root folder “test”:

The S3 root folder structure:

..\test\file1
..\test\file2
..\test\file3
..\test\file4

Redshift Result:

Both the tables "test_csv" and "test_csv_ext" have all the data from the 4 files.

Glue Crawler Catalog Result:

Discovered one table: "test" (the root-folder name). It has all the data from the 4 files, and it is not partitioned.

Case 2: Some files are in the root folder and some files in subfolders:

The S3 root folder structure:

..\test\sbf1\file1
..\test\file2
..\test\file3
..\test\file4

Redshift Result:

Both the tables "test_csv" and "test_csv_ext" have all the data from the 4 files.

Glue Crawler Catalog Result:

Discovered four tables (named after the sub-folder and the file names):
    "sbf1"  --  Only has the data from "file1". 
    "file2" --  No data found. 
    "file3" --  No data found. 
    "file4" --  No data found. 

Case 3: Files are in the same level subfolders:

The S3 root folder structure:

..\test\sbf1\file1, file2
..\test\sbf2\file3, file4

Redshift Result:

Both the tables "test_csv" and "test_csv_ext" have all the data from the 4 files.

Glue Crawler Catalog Result:

Discovered one table: "test" (the root-folder name). It has all the data from the 4 files, and it is partitioned on one column into two partitions, "sbf1" and "sbf2" (the sub-folder names become the partition values).

Case 4: Files are in the different level of the multi-level subfolders:

The S3 root folder structure:

..\test\sbf1\sbsbf11\file1
..\test\sbf1\sbsbf12\file2
..\test\sbf2\file3, file4

Redshift Result:

Both the tables "test_csv" and "test_csv_ext" have all the data from the 4 files.

Glue Crawler Catalog Result:

Discovered two tables:
   "sbf1" -- It has the data from the two files "file1" and "file2". It is partitioned into two partitions, "sbsbf11" and "sbsbf12".
   "sbf2" -- It has the data from the two files "file3" and "file4". It is not partitioned.

Case 5: Files are in the same level of multi-level subfolders:

S3 Folder structure:

..\test\sbf1\sbsbf11\file1
..\test\sbf1\sbsbf12\file2
..\test\sbf2\sbsbf21\file3
..\test\sbf2\sbsbf22\file4

Redshift Result:

Both the tables "test_csv" and "test_csv_ext" have all the data from the 4 files.

Glue Crawler Catalog Result:

Discovered one table: "test" (the root-folder name). It has all the data from the 4 files, and it is partitioned on two columns. The partitions are
        ["sbf1","sbf1-sbsbf11"]
        ["sbf1","sbf1-sbsbf12"]
        ["sbf2","sbf2-sbsbf21"]
        ["sbf2","sbf2-sbsbf22"]

Conclusions

From the test results above, we can conclude:

  1. Redshift “COPY” and external tables don’t care about the S3 subfolder structure or how the files are organized within the root folder and its subfolders. As long as the files have the same format, the data from all the files can be accessed through a single Redshift table or external table by specifying the root folder as the S3 location, no matter where the files are stored within the multi-level subfolders.
  2. However, in order for the Glue crawler to add the S3 files into the data catalog correctly, we have to follow the rules below when organizing and planning the S3 folder structure.
    • If you don’t want to use the partition feature, store all the files in the root folder.
    • If you want to use the partition feature, design the subfolder structure carefully and always distribute the files among subfolders at the same level (see the query example after this list).
    • The number of subfolder levels should equal the number of partition columns.
    • Never place the files in subfolders at different levels, and never mix files and folders as siblings at any level.
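For instance, with the Case 3 layout, the crawler’s table can be queried through Redshift Spectrum and pruned by the crawler’s auto-generated partition column (partition_0 is the crawler’s default column name when the folders are not in key=value form; the external schema glue_test and database test_db are assumptions):

create external schema glue_test
from data catalog database 'test_db'
iam_role 'arn:aws:iam::[aws-acct]:role/RedshiftSpectrumRole'
create external database if not exists;

select count(*)
from glue_test.test
where partition_0 = 'sbf1';   -- prunes the scan to the files under the sbf1 subfolder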

Redshift Naming Standard and Case Sensitivity


Everyone knows that Amazon Redshift is based on PostgreSQL. Most people therefore assume that Redshift and PostgreSQL have the same or similar case sensitivity for identifier names (table names and column names).

But in fact Redshift and PostgreSQL have a number of very important differences in identifier naming and case sensitivity that you must be aware of as you work with Redshift.

Basically, in Redshift there are two types of identifiers: Standard Identifiers and Double-quoted Identifiers. The two types are distinguished by whether the identifier is created with or without double-quote marks in the DDL script. The AWS document Names and Identifiers defines the naming rules for both types of identifiers (table names and column names).

In general, Redshift allows UTF-8 multibyte characters two to four bytes long in identifier names. In practice, let’s limit the identifier names in this blog to single-byte ASCII characters only.

Standard Identifiers Rule

There are three simple rules for the Standard Identifiers:

  1. Begin with an alphabetic character or underscore, followed by any ASCII alphanumeric characters, underscores, or dollar signs.
  2. Not be a reserved SQL keyword, such as group or user.
  3. Be between 1 and 127 characters in length.

The first rule implies that the identifier name should not contain any spaces or quotation marks.

Double-quoted Identifiers Rule

There are also three simple rules for the Double-quoted Identifiers:

  1. Begin and end with double-quote marks (“).
  2. The identifier can contain any printable ASCII characters, including spaces and even the double-quote mark (“) itself.
  3. Be between 1 and 127 characters in length.

Notes:

  • You can use a reserved SQL keyword such as group or user as a table name or column name, but it has to be double-quoted, as shown in the example below.
  • You can even include the double-quote mark (“) itself in a table or column name, but it must be escaped by preceding it with another double-quote mark (“).
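For example, creating and querying a table whose names are reserved words, or that contain spaces and embedded double quotes, works only with the double-quote marks:

create table "group" ("user" int, "my ""id""" int);
select "user", "my ""id""" from "group";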

Database Catalog and Reference of the Identifiers

After creating objects with Standard Identifiers, Double-quoted Identifiers, or a mixture of both, how do you reference them?

Different databases have different ways to store the identifiers in the database catalog and reference them.

In Redshift, all ASCII letters in the identifiers (for both the standard and double-quoted identifiers) are case-insensitive and are folded to lowercase in the Redshift Catalog.

The identifiers can be referenced in the following simple and convenient ways:

  • For the Standard Identifiers: they can be referenced in any letter case or mixture of cases, with or without double-quote marks.
  • For the Double-Quoted Identifiers,

    • If the quoted part meets the Standard Identifier rules, then it can be referenced in any letter case, with or without the paired double-quote marks, just like a Standard Identifier.
    • Otherwise, you must use the double quotation marks for every reference to that identifier; the ASCII letters can be in any case, but all other characters must be written exactly as they were defined.

Want to Return the Identifiers to Upper Case?

As said before, all ASCII letters in identifiers (both standard and double-quoted) are stored in the database catalog as lowercase. Therefore, in query results, all the column names are returned in lowercase by default. If you want to return all the column names in uppercase, set the describe_field_name_in_uppercase configuration parameter to true.

set describe_field_name_in_uppercase to on;
show describe_field_name_in_uppercase;

Then you will see that all the returned column names are in uppercase. Remember, the column names in the catalog are still lowercase.

Examples

The following is a valid DDL statement to create a table:

create table "TEst_TaBle1"(cOLuMN1 text, "coLuMn #2" text, "coLuMn ""3""" text);

The table name and column names in the catalog will then be

test_table1 (column1, column #2, column "3" )
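You can verify what was actually stored by querying the PG_TABLE_DEF catalog view (the table’s schema must be on your search_path for it to appear there):

select "column", "type"
from pg_table_def
where tablename = 'test_table1';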

The following queries are valid and will return the same result:

select   COLUMN1,  "COLumn #2",  "COLumn ""3""" from  TEst_TaBle1;
select "colUmn1",  "coLUmn #2",  "columN ""3""" from "teST_taBLE1";
select  column1,   "colUMN #2",  "COLumn ""3""" from "Test_taBLE1";

How to Assume an IAM Role from the AWS CLI


You can follow the 3 steps below to assume an IAM role from the AWS CLI:

  • Step 1: Grant an IAM user’s privilege (permission) to assume an IAM role or all IAM roles
  • Step 2: Grant a particular IAM role to the IAM user. That is, modify a particular IAM role’s trust relationship to allow the IAM user to assume this role
  • Step 3: Configure and use the AWS CLI profile to assume the granted role

1. Grant the IAM user the privilege to assume roles:

  • Create a new IAM user or pick an existing one (say it is named “userTest”) with security credentials (Access Key ID and Secret Access Key). Note down its user ARN (like arn:aws:iam::123456789012:user/userTest).
  • Create an assume-role policy: it can be a generic policy (name it “Assume-Roles”) that allows assuming any IAM role, with the following custom policy code:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "*"
    }
  ]
}

Or you can specify a particular role’s ARN (like “arn:aws:iam::123456789012:role/roleTest”) as the Resource above to limit the policy to the IAM role named “roleTest”, and name the policy “Assume-Role-roleTest”.

  • Attach the policy to the user “userTest”.
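If the policy was created as a customer managed policy, the attachment can also be done from the CLI; a sketch (the policy ARN is assumed from the names above):

aws iam attach-user-policy \
    --user-name userTest \
    --policy-arn arn:aws:iam::123456789012:policy/Assume-Roles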

2. Grant an IAM role to an IAM user:

  • Create a new IAM role or pick an existing one (say it is named “roleTest”) with the required permissions. Note down its role ARN (like arn:aws:iam::123456789012:role/roleTest).
  • Modify the role’s trust relationship to add the IAM user to the trusted entities:
{
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::123456789012:user/userTest"
  },
  "Action": "sts:AssumeRole"
}
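The block above is the statement to add inside the trust policy's "Statement" array. If you prefer the CLI, the full trust-policy document can be applied with update-assume-role-policy (the file name trust.json is an assumption):

aws iam update-assume-role-policy \
    --role-name roleTest \
    --policy-document file://trust.json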

3. Configure and use the AWS CLI profile to assume the granted role:

  • Edit the config file at ~/.aws/config to add a profile for the user “userTest” and another for the role “roleTest”:
[profile userTest]
region=us-east-1
output=json

[profile roleTest]
region=us-east-1
output=json
role_arn = arn:aws:iam::123456789012:role/roleTest
source_profile = userTest

If the user “userTest” is using the default profile, the source_profile line can be set to the default profile: “source_profile = default”.

  • Use the AWS CLI to assume the role: call an AWS CLI command using the profile option:
aws AnyEligibleCLICommand --profile roleTest

This will log in to AWS using userTest’s security credentials and then assume the IAM role “roleTest” to execute the CLI command.
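A quick way to confirm that the profile really assumes the role is to compare the caller identity with and without it:

aws sts get-caller-identity --profile userTest   # returns the IAM user ARN for userTest
aws sts get-caller-identity --profile roleTest   # returns an assumed-role ARN for roleTest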

For example, suppose the user “userTest” uses the default profile, and it does not itself have Redshift access permission but roleTest does.

If you run the AWS CLI command:

aws redshift describe-clusters

It will throw a “not authorized” error. But if you run the AWS CLI command:

aws redshift describe-clusters --profile roleTest

It will be successful.