What People Don’t Tell You About Redshift

Two years ago, when I first read about Redshift and the claim that it was 10x faster than Hive, I was blown away. Incredible! I used to evangelize Redshift to everyone, inside my team and out. But using it for a year killed all the joy, and now I find myself going back to Hive as my preferred tool.

In this post I’ll be using Hive and Hadoop interchangeably. That’s because I use Hive for my everyday work at Amazon, and Hive queries get compiled into map-reduce jobs that are scheduled by Hadoop, so Hive supports most of what Hadoop does.

The Comparison

Yes, Redshift is fast, but nobody tells you about the effort you need to invest in maintaining a Redshift cluster versus how easy it is to maintain a Hive cluster. I’ll let you read up on the general differences between Redshift and Hive elsewhere; here I’ll focus on the things nobody warned me about.

Loading Data

Hadoop was built to consume any unstructured file and is not limited to HDFS. We at Amazon use S3 as the distributed file system. When using AWS EMR you can create tables whose location is an S3 bucket and prefix. The advantage of being able to use S3 is that it lets us use data generated by other teams without the overhead of copying (and maintaining) their data in our own data store.
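
Here’s a minimal sketch of what that looks like in HiveQL; the table, columns and bucket are made up for illustration:

    -- External table whose data lives directly in another team's S3 prefix.
    CREATE EXTERNAL TABLE clicks (
      user_id STRING,
      url     STRING,
      ts      BIGINT
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3://my-team-bucket/logs/clicks/';

No data is copied anywhere; Hive just reads whatever files show up under that prefix.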

The problem with Redshift is that you need to set up pipelines to copy every dataset into the Redshift cluster! The number of pipelines you need to manage grows pretty quickly; we have one pipeline per country for each table! And Redshift does not even enforce primary key constraints, which means you can end up with duplicate data in your table if you are not careful when copying data. We have run into a few nasty issues maintaining data in a Redshift cluster.
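
Because the primary key isn’t enforced, a typical load ends up looking something like the sketch below: COPY into a staging table, then de-duplicate by hand before merging. The table and bucket names are hypothetical and the credentials are elided:

    -- Load one country's export into a staging table.
    COPY staging_orders
    FROM 's3://my-team-bucket/exports/orders/de/'
    CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
    DELIMITER '\t' GZIP;

    -- Redshift won't reject duplicates for you, so remove rows that
    -- already exist in the target before inserting.
    DELETE FROM orders
    USING staging_orders
    WHERE orders.order_id = staging_orders.order_id;

    INSERT INTO orders SELECT * FROM staging_orders;

Multiply that by every table and every country and you can see how the pipeline count explodes.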

And oh, nobody tells you how long it takes to load data into Redshift!

Oh Crap! Preparing Data First

You can throw almost anything into Hive and process it, from very complicated JSON to application log files. If Hive doesn’t already support your file format you can write a SerDe (Serializer/Deserializer) for your data. If you are feeling lazy, you can just use Hadoop streaming and quickly write a Python script to process the data. Hive can handle just about anything.
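
For example, pointing Hive at raw JSON is roughly a one-liner with one of the common JSON SerDes (the hive-hcatalog one, here); table and bucket names are made up:

    -- Query raw JSON files in place; no preprocessing step needed.
    CREATE EXTERNAL TABLE raw_events (
      user_id STRING,
      event   STRING,
      payload MAP<STRING, STRING>
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION 's3://my-team-bucket/logs/events/';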

It’s not the same with Redshift. Redshift’s COPY command understands only structured JSON or delimited (comma, space or tab separated) files. Anything else and you are out of luck: you need to write a custom transformer that converts the data into one of the supported formats before you can COPY it into the Redshift cluster (there’s a minimal COPY sketch after the list below). Oh, and let me also mention that:

  • Data can contain only UTF-8 characters.
  • VARCHAR columns have a maximum length, and the COPY (or INSERT) command fails if a string exceeds the configured limit. There’s no equivalent of Hive’s string datatype.
  • There are more guidelines in Redshift’s Preparing Your Data documentation.
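
For the JSON case, a COPY ends up looking something like this sketch; the table, paths and jsonpaths file are hypothetical, and TRUNCATECOLUMNS is there so over-long strings get clipped instead of failing the load:

    COPY events
    FROM 's3://my-team-bucket/exports/events/'
    CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
    JSON 's3://my-team-bucket/jsonpaths/events.jsonpaths'
    TRUNCATECOLUMNS;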

Oh, and unlike Hive, Redshift doesn’t let you run custom transform functions over the data. You can’t hook up a Python script inside Redshift to transform the input data. No sir!
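
For contrast, here’s the kind of thing Hive lets you do with a streaming TRANSFORM; clean_logs.py and the table are made up, and I’m assuming the script is reachable from the cluster (on EMR you can usually ADD FILE straight from S3):

    ADD FILE s3://my-team-bucket/scripts/clean_logs.py;

    -- Pipe each row through the Python script and read back typed columns.
    SELECT TRANSFORM (line)
        USING 'python clean_logs.py'
        AS (user_id STRING, url STRING, ts BIGINT)
    FROM raw_logs;

There’s simply no equivalent hook in Redshift’s load path.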

Disk Space Usage and Vacuum

This evening, when I opened up the Redshift console to check the performance graphs, I found that we had exhausted the disk space completely! This was a shock because yesterday the utilization was around 75%. I ran a query to get the list of tables along with their space utilization.
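
If you’re curious, one way to get a rough per-table breakdown is to count 1 MB blocks in the system tables, something along these lines:

    -- Each entry in stv_blocklist is a 1 MB block, so counting blocks
    -- per table gives approximate disk usage.
    SELECT p.name AS table_name, COUNT(*) AS used_mb
    FROM stv_blocklist b
    JOIN stv_tbl_perm p ON b.tbl = p.id
    GROUP BY p.name
    ORDER BY used_mb DESC;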

I found that the size of one of our tables had grown considerably, to almost 9x what it was yesterday. I followed up with the AWS Support team and learned that I needed to VACUUM the table to reclaim the space. VACUUM!!

So here’s the deal. If you want your Redshift cluster to be happy and to give you quick results, you need to VACUUM and ANALYZE your tables regularly. How regularly? Well, VACUUM is an I/O intensive task, so the recommendation is to run it at night or whenever you expect minimal activity on the cluster. Great! I just have to figure out a downtime window for a cluster that has COPY pipelines running for countries in Asia, Europe and America every 4-6 hours.
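
The maintenance itself is just two statements per table (the table name here is the hypothetical one from earlier), but scheduling them is the hard part:

    -- Run in a low-traffic window; VACUUM is I/O heavy.
    VACUUM orders;
    ANALYZE orders;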

And if you don’t run VACUUM frequently enough, it takes much longer the next time you do run it. I even came across a horror story where one team’s cluster went down while trying to VACUUM a table. The table was ~400GB. During the VACUUM its size grew to over 1.7TB, at which point the cluster ran out of disk space and became unusable. Conveniently, the CANCEL command doesn’t work on VACUUM operations. So that team had to shut down their cluster, start a new one from the earlier cluster’s backup image (and data), and then perform a deep copy to get the table back into a usable state.
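
A deep copy, for reference, is roughly this (sketched with the hypothetical orders table; it also needs enough free disk space to hold a second copy while it runs):

    -- Rebuild the table instead of VACUUMing it.
    CREATE TABLE orders_new (LIKE orders);
    INSERT INTO orders_new SELECT * FROM orders;
    DROP TABLE orders;
    ALTER TABLE orders_new RENAME TO orders;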

We don’t have to worry about any of this with Hive, because adding data is as simple as creating directories for new partitions and saving the files there. That’s it! Repair the table (not as horrible as it sounds, really, see the sketch below) and the appended data is ready for use. I’ve never had to worry about VACUUM on my Hive tables.
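
Picking up the hypothetical clicks table from earlier, "repair" is just this:

    -- After dropping new files under .../dt=2014-11-20/ in S3:
    MSCK REPAIR TABLE clicks;

    -- or register the partition explicitly:
    ALTER TABLE clicks ADD IF NOT EXISTS PARTITION (dt='2014-11-20')
    LOCATION 's3://my-team-bucket/logs/clicks/dt=2014-11-20/';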

Yes, there are issues with loading too many partitions in Hive, or at least I ran into them when I tried loading an absurd number of partitions (tens of thousands) in Hive 0.11.0.2. But I believe there’s been a performance fix in Hive 0.13.1 that handles large numbers of partitions better. I haven’t tried it yet though.

End of My Rant

I wasted more than an hour of my evening on these stupid Redshift issues, hence this post. I wrote one on Hive + EMR + S3 a few weeks ago but didn’t publish it. I like Hive; for all the problems it throws up, I can almost always find good solutions. Maintaining a Redshift cluster, though, feels like such an operational burden at times.

Anyways, I feel much better now.

