Import AWS Cloudwatch logs to Redshift

Web applications usually log important events, mostly for reference in case something goes wrong.  If those apps run inside AWS, most probably those logs live in Cloudwatch.  But as it turns out, sometimes applications store real data in Cloudwatch instead of just recording events.  It might sound like a strange option, but it works very well.  The issue arises when we want to retrieve the data for analysis: it's not as easy as writing a query against Redshift or a MySQL database.  In this post we will explore how to import AWS Cloudwatch logs to Redshift.

There are several options to read the data stored in Cloudwatch, but none of them let you actually manipulate it easily.  So, I resorted to creating a set of Python scripts using the boto3 library to achieve this task, and then set up a cron job to automate the daily import of the logs.

We will take a look at how to import data stored in JSON format from the logs.  Here's the logic for the whole process, which can be implemented in a variety of languages.  The snippets included are in Python, but you shouldn't have any trouble translating them to the language of your choice.

Short Version

1. Export logs from Cloudwatch to S3

To export Cloudwatch logs to S3, you can use the AWS CLI.  One of the CloudWatch Logs commands in the AWS CLI is create-export-task; the documentation covers all the options available for it.  Boto3 includes the equivalent create_export_task function, which triggers the same action as the CLI command.

Basically, you need to specify log stream info, time period and S3 bucket info, and run the command.
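
For example, with the AWS CLI the export could look something like this (the task name, log group, timestamps and bucket below are placeholders):

aws logs create-export-task \
    --task-name "cloudwatch-to-redshift-export" \
    --log-group-name "/my/log/group" \
    --log-stream-name-prefix "my-stream" \
    --from 1609459200000 \
    --to 1609545600000 \
    --destination "MY_S3_BUCKET" \
    --destination-prefix "cloudwatch-exports"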


2. Import logs from S3 to Redshift

Importing data from S3 to Redshift is usually simple: it can be done with the COPY command.  But log files usually contain a timestamp on every line, and if they didn't, what would be the point of a log?

Before being able to execute the COPY command, we need to download the files from S3, remove the timestamp and upload them back to S3.  These processed files will be ready to be imported to Redshift.

When the data in the S3 files is formatted as JSON objects, the FORMAT AS JSON option will do the work, automatically mapping each object's elements to the Redshift table columns (in the same order), or you can specify a jsonpaths mapping file.

And with that you are pretty much done.


Long Version

Now, let’s look at the process in more detail.  There are several things to consider in each of the 2 steps of the process mentioned above.

Before we do anything in our script, we need to instantiate 3 clients that will be used:

import boto3

cloudWatchLogs = boto3.client('logs')   # CloudWatch Logs, to create the export task
snsClient = boto3.client('sns')         # SNS, for notifications
s3Client = boto3.client('s3')           # S3, to download and upload the files

1. Define the time period to export from Cloudwatch to S3

The very first thing is to define the time period for the logs you want to export.  Specify a Start Time and an End Time.

For the task to work correctly, just make sure that the Start Time and the End Time are specified in milliseconds since the Unix epoch, which is what the export task expects.
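
For example, to export the previous full day, the two timestamps can be computed like this (just a sketch; adjust the window to whatever period you need):

from datetime import datetime, timedelta, timezone

# CloudWatch export tasks expect epoch milliseconds (UTC)
endOfPeriod = datetime.now(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
startOfPeriod = endOfPeriod - timedelta(days=1)

START_TIME = int(startOfPeriod.timestamp() * 1000)
END_TIME = int(endOfPeriod.timestamp() * 1000)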

2. Export logs from Cloudwatch to S3

Once you have the time period, there are several location parameters related to the Log Group and the S3 Bucket:

exportTaskParams = {}
exportTaskParams['taskName'] = 'TASK_NAME'
exportTaskParams['logGroupName'] = LOG_GROUP_NAME
exportTaskParams['logStreamNamePrefix'] = LOG_STREAM_PREFIX
exportTaskParams['fromTime'] = START_TIME
exportTaskParams['to'] = END_TIME
exportTaskParams['destination'] = S3_BUCKET
exportTaskParams['destinationPrefix'] = S3_FOLDER   # optional

After specifying the parameters, you can create and run the export task.

exportTaskResponse = cloudWatchLogs.create_export_task(**exportTaskParams)
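
The export task runs asynchronously, and an account can only have one active export task at a time, so it's worth polling until it finishes before moving on.  A simple polling sketch using the response's taskId:

import time

taskId = exportTaskResponse['taskId']
while True:
    taskStatus = cloudWatchLogs.describe_export_tasks(taskId=taskId)
    statusCode = taskStatus['exportTasks'][0]['status']['code']
    if statusCode in ('COMPLETED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(30)   # still PENDING or RUNNING, so wait and check again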

3. Download files from S3 to a local directory

At this point, the logs have been exported to the specified S3 location.  You will notice that there are several gz files.  We will come back to this in the last step of the process, when we load the data into Redshift.

If you take a look at one of the files (you will need to download it first), you will notice that each line has 2 components: 1. the log timestamp and 2. a JSON object with the actual data.  You will not be able to import the data into Redshift in this format.

We need to process these files in order to later on import the data to Redshift.  This step is all about downloading ALL the files from S3 to a local directory.  We will process them next.
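
Here's a sketch of that download step, assuming a local scratch directory of your choice and the same S3_BUCKET and S3_FOLDER used for the export:

import os

localDirectory = '/tmp/cloudwatch-export/'   # assumption: any local scratch directory works
os.makedirs(localDirectory, exist_ok=True)

# List every exported object under the prefix and download the gz files
paginator = s3Client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=S3_BUCKET, Prefix=S3_FOLDER):
    for s3Object in page.get('Contents', []):
        key = s3Object['Key']
        if key.endswith('.gz'):
            localFile = os.path.join(localDirectory, os.path.basename(key))
            s3Client.download_file(S3_BUCKET, key, localFile)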

4. Remove log timestamp from files

Once we have all the S3 files in a local directory, it's time to process them.  And by this, I mean remove the log timestamp from each line.  This part is easier than it seems, because the separator between the log timestamp and the JSON object is always a single space.

All we have to do here is read the file, go through each line, split it on that first space and keep the second element.  Specify a fileToRead and a fileToWrite so you don't overwrite the original file.

*By the way, you don't need to unzip the files to disk to process them; Python's gzip module can read and write them directly.

import gzip

with gzip.open(fileToRead, 'rt') as inputFile, gzip.open(fileToWrite, 'wt') as outputFile:
    for line in inputFile:
        lineSegments = line.split(' ', 1)   # split on the first space only
        cleanLine = lineSegments[1]         # keep just the JSON object
        outputFile.write(cleanLine)

5. Upload processed files back to S3

After processing the gz files, we need to upload them back, along with 2 additional files, resulting in the following:

  1. Data files
  2. Manifest file
  3. JSON paths file (mapping)

Data files

When the previous step is done, we need to upload the newly processed gz files to S3, to a different folder from the one where the original files are stored.

For this we loop through the local files and upload them using the s3Client created above:

uploadParams = {'Bucket': S3_BUCKET}
for processedFile in processedFiles:
    processedFileName = os.path.basename(processedFile)   # keep the original file name as the key suffix
    uploadParams['Key'] = processedFilesS3Path + processedFileName
    uploadParams['Body'] = open(processedFile, 'rb')
    uploadProcessedFileResult = s3Client.put_object(**uploadParams)

The files are now ready to be imported to Redshift.

Manifest file

Since we will be importing several files from S3, we need a list of them in what is called the manifest file, which the COPY command in Redshift uses to know which files to read.  We will build it after we finish processing and uploading the data files, so we have the complete list of files and their locations (a sketch for building it follows the sample below).

The contents of this file should look something like this:

{
     "entries": [
         {
              "mandatory": true,
              "url": "s3://BUCKET_NAME/FOLDER_NAME/processedFiles/000000.gz"
         },
         {
              "mandatory": true,
              "url": "s3://BUCKET_NAME/FOLDER_NAME/processedFiles/000001.gz"
         }
     ],
     "name": "MANIFEST_FILE_NAME",
     "version": 3
}
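
One way to build and upload this file from the script, reusing the processedFiles list and S3 names from the upload step (the manifest key here is a placeholder):

import json

# One entry per processed data file we uploaded
manifest = {
    'entries': [
        {'mandatory': True,
         'url': 's3://' + S3_BUCKET + '/' + processedFilesS3Path + os.path.basename(f)}
        for f in processedFiles
    ],
    'name': 'MANIFEST_FILE_NAME',
    'version': 3
}

s3Client.put_object(Bucket=S3_BUCKET,
                    Key=S3_FOLDER + '/MANIFEST_FILE_NAME',
                    Body=json.dumps(manifest))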

JSON paths file

This file provides the information needed to map the elements in each JSON object from the data files to the Redshift table columns.  If the JSON objects have a single level and their elements are in the same order as the columns, you won't need this file.  But if that's not the case, you will.

You can find the guidelines for this file in the AWS documentation for COPY from JSON format.

A sample file would look like this:

{
    "jsonpaths": [
        "$['venuename']",
        "$['venuecity']",
        "$['venuestate']",
        "$['venueseats']"
    ]
}

Each entry corresponds to an element of the JSON objects in the data files, listed in the same order as the Redshift table columns.

6. Delete processed files locally

But, before we go ahead and import the Cloudwatch logs into Redshift, let's first do some cleanup.  Since we already have the processed files in S3, we don't need them locally, so go ahead and delete the local copies.

If you plan to do this import on a regular basis, I recommend using the same local directory so you don't have to change any of the code.  And since we are cleaning it up every time after the files are uploaded, you don't have to worry about files overlapping or being overwritten.
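
A minimal cleanup sketch, assuming everything we downloaded and generated lives in the same local scratch directory used earlier:

import os

# Delete every file in the local working directory
for localFileName in os.listdir(localDirectory):
    os.remove(os.path.join(localDirectory, localFileName))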

7. Load data from S3 to Redshift

The data is now ready to be imported into Redshift.  But we still have a couple of things to go through: if you are not familiar with the COPY command, you may have trouble running it successfully unless you specify certain options.

Using everything we have done so far, the COPY command should look like this:

COPY REDSHIFT_TABLE
FROM 's3://BUCKET_NAME/FOLDER_NAME/MANIFEST_FILENAME'
CREDENTIALS 'aws_access_key_id=AWSACCESSKEY;aws_secret_access_key=AWSSECRETKEY'
MANIFEST
GZIP
TRIMBLANKS
TRUNCATECOLUMNS
FORMAT AS JSON 's3://BUCKET_NAME/FOLDER_NAME/JSONPATHSFILE'

All you have to do now is run the above command, and you should have the data in your Redshift table.  Keep in mind that COPY appends the data; it does not replace what's already in the table.
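
If you want the script itself to run the COPY instead of pasting it into a SQL client, one option is a Postgres driver such as psycopg2 (that's an assumption on my part; any Redshift-compatible client will do).  A sketch with placeholder connection details:

import psycopg2

connection = psycopg2.connect(host='REDSHIFT_ENDPOINT', port=5439,
                              dbname='DATABASE', user='USER', password='PASSWORD')

copyCommand = """
    COPY REDSHIFT_TABLE
    FROM 's3://BUCKET_NAME/FOLDER_NAME/MANIFEST_FILENAME'
    CREDENTIALS 'aws_access_key_id=AWSACCESSKEY;aws_secret_access_key=AWSSECRETKEY'
    MANIFEST
    GZIP
    TRIMBLANKS
    TRUNCATECOLUMNS
    FORMAT AS JSON 's3://BUCKET_NAME/FOLDER_NAME/JSONPATHSFILE'
"""

# The with block commits on success and rolls back on error
with connection, connection.cursor() as cursor:
    cursor.execute(copyCommand)
connection.close()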

Logging and Notifications

Finally, we should add logging and notifications for each step.  In my case, I used AWS SNS for notifications, as it is very simple to set up and works very well.  Setting up SNS is outside the scope of this post but, in short, you need to create an SNS topic and subscribe an email address to it.

Notifications give you peace of mind and let you know what happened every time the process runs.
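
Sending a notification is then a single call with the snsClient created earlier (the topic ARN and the message text are placeholders):

snsClient.publish(
    TopicArn='SNS_TOPIC_ARN',
    Subject='Cloudwatch to Redshift import',
    Message='The import finished successfully.'
)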

1. Logging

I'm not going to get into how to write log files, just the fact that we should log what happens in every step of the process in case we need to troubleshoot.  If you divide your scripts into several steps, log the start and outcome of each step every time it runs.
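
Python's standard logging module is enough for this; a minimal setup might look like the following (the file name is a placeholder):

import logging

logging.basicConfig(filename='cloudwatch_to_redshift.log',   # placeholder file name
                    level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

logging.info('Step 2: export task %s created', exportTaskResponse['taskId'])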

2. Success notifications

We will send a success notification at the very end of a successful run, just to know everything went well.

3. Failure notifications

On the other hand, if a step in the process fails, we will send a failure notification, log the error, and halt the process. This way we can go and fix whatever went wrong.

Troubleshooting

When something goes wrong, it shouldn't be that hard to troubleshoot.  If the process failed before any data was inserted into the Redshift table, it should be easy: the final table hasn't been affected, so just figure out what went wrong and in which step it happened.

But if some data was inserted into Redshift, some investigation will be needed to find out which rows were loaded and remove them.  This is why it's important to know the specific time period of the data, so we can remove only the partially loaded records.
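
If your table keeps an event timestamp, one way to remove a partial load is a DELETE bounded by the export window, reusing the connection from the COPY sketch above (event_time and the period values are hypothetical):

# Hypothetical cleanup; event_time is a placeholder for the column holding each record's timestamp
deleteCommand = """
    DELETE FROM REDSHIFT_TABLE
    WHERE event_time BETWEEN 'PERIOD_START' AND 'PERIOD_END'
"""
with connection, connection.cursor() as cursor:
    cursor.execute(deleteCommand)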

 

If you ever come across the task of importing AWS Cloudwatch logs to Redshift, this guide should be helpful.  And if it is, let me know!
