
Building AWS Data Pipeline for cross-account resources


Description:

In this post, I will show you how to easily build an automated cross-region/cross-account data pipeline using AWS Data Pipeline and AWS Lambda.

Let us first define the scenario. There are two AWS accounts, X and Y. You want to create a data pipeline between an RDS instance (for example, MySQL) in account X and a Redshift cluster in account Y. In other words, you want to move data from RDS in account X to Redshift in account Y, transform the data along the way, and of course have it all automated. Currently, AWS does not support a single data pipeline that spans resources in two accounts. You also need to consider that AWS Data Pipeline is not available in every region, so it may happen that the region your resources live in does not support it, which means those resources will not be automatically available to the pipeline.

This is achieved by creating two data pipelines, one in each account. The first pipeline exports the data from MySQL in account X to an S3 bucket in account Y. Once the data lands in the bucket, the pipeline in account Y runs an S3 import to load it into Redshift. To wire both pipelines together so that they run automatically, as required in a production data pipeline setup, we will use AWS Lambda. The Lambda function is invoked by S3 whenever a new object (file) is created in the bucket in account Y, and it in turn activates the second data pipeline.

Let's say you have your first data pipeline in account X and the S3 bucket in account Y.

Create an IAM policy as shown below in account X and attach it to the DataPipelineDefaultRole and DataPipelineDefaultResourceRole used by that pipeline:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket_name_in_account_Y>/*",
                "arn:aws:s3:::<bucket_name_in_account_Y>"
            ]
        }
    ]
}
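
If you prefer to script this step, here is a minimal boto3 sketch, run with credentials for account X; the local file name pipeline-s3-policy.json and the inline policy name are assumed placeholders, and you can of course attach the policy from the IAM console instead:

import boto3

iam = boto3.client('iam')  # run with credentials for account X

# The JSON policy shown above, saved locally (assumed file name).
with open('pipeline-s3-policy.json') as f:
    policy_document = f.read()

# Attach the policy as an inline policy on both Data Pipeline roles.
for role in ('DataPipelineDefaultRole', 'DataPipelineDefaultResourceRole'):
    iam.put_role_policy(
        RoleName=role,
        PolicyName='cross-account-pipeline-s3-access',  # assumed policy name
        PolicyDocument=policy_document,
    )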

Next, in account Y, add the following bucket policy to the destination S3 bucket so that the Data Pipeline roles from account X are allowed to write to it:

{
    "Version": "2012-10-17",
    "Id": "",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "<ARN_of_the_data_pipeline_default_role_in_account_X>",
                    "<ARN_of_the_data_pipeline_default_resource_role_in_account_X>"
                ]
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::<name_of_bucket_in_account_Y>",
                "arn:aws:s3:::<name_of_bucket_in_account_Y>/*"
            ]
        }
    ]
}
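
This step can also be scripted; here is a minimal boto3 sketch, run with credentials for account Y, assuming the bucket policy above has been saved locally as bucket-policy.json (an assumed file name):

import boto3

s3 = boto3.client('s3')  # run with credentials for account Y

# The bucket policy shown above, saved locally (assumed file name).
with open('bucket-policy.json') as f:
    bucket_policy = f.read()

# Attach the policy to the destination bucket in account Y.
s3.put_bucket_policy(
    Bucket='<name_of_bucket_in_account_Y>',
    Policy=bucket_policy,
)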

With the above in place, the data pipeline in account X can export the MySQL data to the S3 bucket in account Y. Then you simply run a second data pipeline in account Y to import the data from S3 into the Redshift cluster.
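
The Lambda function shown next needs the ID of this import pipeline. If you do not have it handy, a small boto3 sketch like the one below can look it up by name; the pipeline name 'redshift-import-pipeline' and the region are assumed placeholders:

import boto3

client = boto3.client('datapipeline', region_name='ap-southeast-2')

def find_pipeline_id(name):
    # Page through the pipelines in this account/region and return the ID
    # of the first one whose name matches.
    marker = ''
    while True:
        resp = client.list_pipelines(marker=marker)
        for p in resp['pipelineIdList']:
            if p['name'] == name:
                return p['id']
        if not resp.get('hasMoreResults'):
            return None
        marker = resp['marker']

print(find_pipeline_id('redshift-import-pipeline'))  # assumed pipeline name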

Example Lambda function in Python:

import boto3

print('Loading function')

# Data Pipeline client in account Y; use the region your import pipeline lives in
client = boto3.client('datapipeline', region_name='ap-southeast-2')
pipeline_id = 'xxxxxx'  # id of the data pipeline that needs to be activated


def lambda_handler(event, context):
    # Invoked by S3 whenever the export pipeline creates a new object in the
    # bucket; all it needs to do is activate the import pipeline in account Y.
    try:
        response = client.activate_pipeline(pipelineId=pipeline_id)
        return response
    except Exception as e:
        print(e)
        raise e
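
Finally, the bucket in account Y has to be configured to invoke this Lambda function whenever a new object is created, which is what kicks off the import pipeline. Below is a minimal boto3 sketch of that wiring; the function name, account ID and region are assumed placeholders, and the same thing can be set up from the S3 console under event notifications:

import boto3

# Assumed placeholder values; replace with your own resources.
BUCKET = '<bucket_name_in_account_Y>'
ACCOUNT_Y_ID = '<account_Y_id>'
LAMBDA_ARN = 'arn:aws:lambda:ap-southeast-2:<account_Y_id>:function:activate-import-pipeline'

lambda_client = boto3.client('lambda', region_name='ap-southeast-2')
s3 = boto3.client('s3')

# Allow S3 to invoke the Lambda function for events from this bucket.
lambda_client.add_permission(
    FunctionName=LAMBDA_ARN,
    StatementId='allow-s3-invoke',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::' + BUCKET,
    SourceAccount=ACCOUNT_Y_ID,
)

# Fire the Lambda whenever a new object is created in the bucket.
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'Id': 'activate-import-pipeline-on-export',
                'LambdaFunctionArn': LAMBDA_ARN,
                'Events': ['s3:ObjectCreated:*'],
            }
        ]
    },
)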

 
