Machine Learning, Predictive Analytics, Big Data
Post date Sept. 1, 2018
Author: gopalsharma2001
Description:
In this post, I will show you how to easily build an automated cross-region, cross-account data pipeline using AWS Data Pipeline and AWS Lambda.
Let us first define the scenario. There are two AWS accounts, X and Y. You want to create a data pipeline between an RDS database (for example, MySQL) in account X and a Redshift cluster in account Y. In other words, you want to move data from RDS in account X to Redshift in account Y, transform the data along the way, and, of course, automate the whole process. Currently, AWS does not support a single data pipeline that spans resources in two accounts. You also need to consider that AWS Data Pipeline is not available in every region, so it may happen that the region your resources live in does not support it, in which case those resources will not be automatically available to the pipeline.
This is achieved by creating two data pipelines, one in each account. The first pipeline exports the data from MySQL in account X to an S3 bucket in account Y. Once the data is in the bucket, the pipeline in account Y runs an S3 import to load it into Redshift. To wire the two pipelines together so that they run automatically, as required in a production setup, we will use AWS Lambda. The Lambda function is invoked by Amazon S3 whenever a new object (file) is created in the S3 bucket in account Y, and it in turn activates the second data pipeline.
Let's say you have your first data pipeline in account X and the S3 bucket in account Y.
Create an IAM policy as shown below in account X and attach it to the DataPipelineDefaultRole and DataPipelineDefaultResourceRole, so the pipeline's resources can reach the bucket in account Y:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket_name_in_account_Y>/*",
                "arn:aws:s3:::<bucket_name_in_account_Y>"
            ]
        }
    ]
}
Next, in account Y, add the following bucket policy to the destination S3 bucket, granting access to the Data Pipeline roles in account X:
{
    "Version": "2012-10-17",
    "Id": "",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "<ARN_of_the_data_pipeline_default_role_in_account_X>",
                    "<ARN_of_the_data_pipeline_role_in_account_X>"
                ]
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::<name_of_bucket_in_account_Y>",
                "arn:aws:s3:::<name_of_bucket_in_account_Y>/*"
            ]
        }
    ]
}
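The bucket policy can likewise be applied with boto3, run with credentials for the account that owns the bucket. A minimal sketch, assuming you pass in the role ARNs from account X:

```python
import json


def bucket_policy(bucket_name, role_arns):
    """Build the bucket policy granting the account X pipeline roles access.

    role_arns is a list of the Data Pipeline role ARNs from account X.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": role_arns},
                "Action": "s3:*",
                "Resource": [
                    "arn:aws:s3:::%s" % bucket_name,
                    "arn:aws:s3:::%s/*" % bucket_name,
                ],
            }
        ],
    }


def apply_bucket_policy(bucket_name, role_arns):
    import boto3  # imported here so the policy builder stays dependency-free

    s3 = boto3.client("s3")
    s3.put_bucket_policy(
        Bucket=bucket_name,
        Policy=json.dumps(bucket_policy(bucket_name, role_arns)),
    )
```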
With the above in place, the data pipeline in account X can export the MySQL data to the S3 bucket in account Y. Then you simply run the second data pipeline to import the data from S3 in account Y into the Redshift cluster.
Example Lambda function in Python:

import boto3

print('Loading function')

# AWS Data Pipeline is a regional service; adjust region_name as needed.
client = boto3.client('datapipeline', region_name='ap-southeast-2')

pipeline_id = 'xxxxxx'  # id of the data pipeline that needs to be activated


def lambda_handler(event, context):
    # S3 invokes this handler when a new object lands in the bucket;
    # activating the pipeline kicks off the S3-to-Redshift import.
    try:
        response = client.activate_pipeline(pipelineId=pipeline_id)
        return response
    except Exception as e:
        print(e)
        raise e
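The last piece of wiring is subscribing the Lambda function to the bucket's object-created events. One way to script this with boto3 is sketched below; the bucket name and function ARN are placeholders, and note that the function also needs a resource-based permission (via the Lambda add_permission API) allowing S3 to invoke it:

```python
def notification_config(function_arn):
    """Build the S3 notification configuration that triggers the Lambda."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    }


def wire_bucket_to_lambda(bucket_name, function_arn):
    import boto3  # run with credentials for account Y, which owns the bucket

    s3 = boto3.client("s3")
    # Note: this call replaces any existing notification configuration
    # on the bucket, so merge with the current one if you have other rules.
    s3.put_bucket_notification_configuration(
        Bucket=bucket_name,
        NotificationConfiguration=notification_config(function_arn),
    )
```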