Migrating Your Hadoop Workloads to the Cloud

Maintaining a Hadoop cluster by yourself is a lot of work. NameNodes fail, worker disks need replacing, and OS security patching is almost a daily task. Switching to the cloud might be a way out, as it takes away most of these chores. However, moving is not just a matter of modifying or rewriting your Hadoop jobs; it is also about changing the way you work altogether to suit the cloud. In this blog post, I'll explain what you need to think about when doing a Hadoop migration.

Setting up Your New Cloud

When choosing a cloud, you first need to think about what is important to you. Do you need clusters that start within seconds, or is Active Directory integration more important?

Moreover, besides having a place to run your Hadoop jobs, you might want to make use of the other shiny new features a cloud provides. Those TPUs are very interesting indeed, and so is that new way of running and scheduling your Docker containers. Some clouds provide a well-integrated Kubernetes setup, while others choose to provide their own variant.

Or are you going for a multi-cloud setup, taking advantage of the best of each cloud, while at the same time complicating the migration and possibly incurring egress costs?

Next, how are you going to connect your on-premises and cloud networks? A secure connection between them is a given, but are you going to use a site-to-site VPN, or will you pay for a dedicated connection?

And who is going to manage the cloud for you? Is your current IT department capable of doing so, or are you going to enlist a managed service provider?

Moving Data

A relatively simple DistCp command will copy all your local data to your new cloud. But do you need to copy all data, or can some of it be left behind? Are you going to keep a local copy (which might come in handy at a later time)? Also, uploading data takes time: 100 TB is roughly 800 million megabits, so over a 100 Mbps line the upload alone takes about 8 million seconds, or just over three months.

Decide where you are going to run your DistCp job. If it runs on-premises, the DistCp job itself will occupy resources on your local cluster. To avoid that, spinning up a cluster in the cloud just for this DistCp job might be a better idea. Moreover, if you want to verify the data after copying using the composite CRC feature, you probably need to run the DistCp job in the cloud, as your local cluster most likely isn't running Hadoop 3.1.1 or later.
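
As a minimal sketch, this is roughly what such a verified copy could look like when driven from Python, assuming a cloud-side Hadoop 3.1.1+ cluster with the Cloud Storage connector configured to expose object checksums; the NameNode address and bucket name are made up.

    import subprocess

    # Sketch: run DistCp from a cloud-side Hadoop 3.1.1+ cluster and let it compare
    # HDFS checksums with the object store's checksums (composite CRC mode).
    # The source NameNode and the target bucket below are placeholders.
    subprocess.run(
        [
            "hadoop", "distcp",
            "-Ddfs.checksum.combine.mode=COMPOSITE_CRC",  # enable composite CRC checks
            "-update",   # only copy files that are missing or have changed
            "-m", "50",  # number of parallel copy tasks
            "hdfs://onprem-namenode:8020/data/events",
            "gs://example-landing-bucket/data/events",
        ],
        check=True,
    )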

Think about how many buckets you are going to use. In your on-premises setup you have a single HDFS, but in the cloud you can have multiple storage buckets, which allows you to define the purpose of each bucket more clearly.
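
A purpose-driven layout could look something like the sketch below, using the google-cloud-storage client; the project, bucket names, and location are made-up examples, not a prescribed naming scheme.

    from google.cloud import storage

    # Sketch of a purpose-driven bucket layout: one bucket per stage of the data.
    # Project, names, and location are placeholders.
    client = storage.Client(project="example-project")
    for name in ("example-landing", "example-cleansed", "example-analytics"):
        client.create_bucket(name, location="europe-west1")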

Ingestion

Next, you need to change the way you ingest data. To start, decide where those jobs are going to run. If they stay on-premises, you need to modify them to send their data to the cloud instead of storing it locally, which is a relatively easy task. However, by doing so you might not make full use of all the features your new cloud provides. Alternatively, you can move your ingestion jobs to the cloud and pull data in, which allows you to take full advantage of the new ingestion methods.

Or are you going to split your jobs into two parts? For example, one part pushes data onto a bus such as AWS Kinesis or Google Pub/Sub, while the other part (running in the cloud) stores and transforms the data as it comes in.
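
To make that split concrete, here is a minimal sketch of the on-premises half publishing records to Pub/Sub; the project and topic names are placeholders, and the cloud-side half (subscribing and storing the data) is left out.

    import json

    from google.cloud import pubsub_v1

    # Sketch of the on-premises half of a split ingestion job: instead of writing
    # to HDFS, it publishes each record to a Pub/Sub topic. The cloud-side half
    # would subscribe to this topic and store/transform the data.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("example-project", "ingest-events")

    record = {"event_id": 42, "payload": "example"}
    future = publisher.publish(topic_path, data=json.dumps(record).encode("utf-8"))
    future.result()  # block until Pub/Sub has acknowledged the message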

Changing How You Work

So by now, you've chosen a cloud, connected it to your network, uploaded all your data, and modified the ingestion jobs. If everything went well, nobody noticed that your Hadoop workloads are now running in the cloud; but by the same token, upper management is still wondering why the cloud migration was a good idea.

By changing the way you work, you can start to incrementally improve your processes. On-demand Hadoop clusters allow you to give users a lot of flexibility, but giving everyone in the company access to their own auto-scaling Dataproc cluster might not be the best choice.
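
As an illustration of what "on demand" can mean, here is a rough sketch of creating an ephemeral Dataproc cluster that deletes itself after an hour of inactivity, using the google-cloud-dataproc client; the project, region, cluster name, and machine sizes are all made-up examples.

    from google.cloud import dataproc_v1

    # Sketch: an ephemeral, self-deleting Dataproc cluster. All names and sizes
    # below are placeholders.
    region = "europe-west1"
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    cluster = {
        "project_id": "example-project",
        "cluster_name": "ephemeral-analytics",
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
            # Delete the cluster automatically after one hour of inactivity.
            "lifecycle_config": {"idle_delete_ttl": {"seconds": 3600}},
        },
    }
    operation = cluster_client.create_cluster(
        request={"project_id": "example-project", "region": region, "cluster": cluster}
    )
    operation.result()  # wait for the cluster to be created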

Defining user personas, and how they should use the cloud, is going to help. For example, your BI analyst who is used to Hive might be better off with BigQuery. Optimizing who uses what allows you to control costs while at the same time increasing the overall happiness of your users.
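
For that analyst, the day-to-day experience could look like the sketch below: standard SQL through the BigQuery client instead of HiveQL on a cluster. The project, dataset, and table names are invented for illustration.

    from google.cloud import bigquery

    # Sketch: a Hive-style aggregation expressed as BigQuery standard SQL.
    # Project, dataset, and table names are placeholders.
    client = bigquery.Client(project="example-project")
    query = """
        SELECT country, COUNT(*) AS sessions
        FROM `example-project.analytics.web_sessions`
        WHERE event_date = '2020-01-01'
        GROUP BY country
        ORDER BY sessions DESC
    """
    for row in client.query(query).result():
        print(row.country, row.sessions)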

Also, some jobs might not even need a Hadoop cluster anymore. Because compute and storage are separated in the cloud, an ingestion job can write its files directly to object storage without the need for a Hadoop cluster.
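
A minimal sketch of what that could look like with the google-cloud-storage client, assuming the job produces local files first; the bucket and object names are placeholders.

    from google.cloud import storage

    # Sketch: an ingestion job writing its output straight to object storage,
    # with no Hadoop cluster involved. Bucket and object names are placeholders.
    client = storage.Client(project="example-project")
    bucket = client.bucket("example-landing")
    blob = bucket.blob("events/2020-01-01/part-0000.json")
    blob.upload_from_filename("/tmp/part-0000.json")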

Concluding

The cloud reduces the burden of maintaining your own Hadoop cluster, but getting there requires you to overcome both technical and organizational challenges. I hope this blog post has highlighted some of the challenges you might face along the way. If you have any further questions, feel free to reach out to me directly.
