Creating Time Machines with Google Compute Engine


Setting Up Google Compute Engine

Google Compute Engine (GCE) is a cloud service that provides virtual machines, giving us a scalable and highly parallel processing platform. Here, we want to use several instances on GCE to build a simple cluster, and then use this cluster for creating time machines. Building a time machine takes much less time when we are using hundreds of processors in parallel.

Before setting up your cluster, you first have to sign up for GCE and create a new account. You can find out how to sign up here: https://developers.google.com/compute/docs/signup. In this step, you will also define a project ID (e.g. my_project). Then, you have to install a command-line tool named gcutil, which you will use to communicate with GCE and your cluster. The installation process is explained at this link: https://developers.google.com/compute/docs/gcutil_setup. Make sure that you specify a default project ID by running this command (replace my_project with your project ID):

$ gcutil getproject --project_id=my_project --cache_flag_values

This way, you won't have to specify your project ID every time you run a new gcutil command.

Now that you have access to GCE and you have set up your gcutil tool, you are ready to create your own cluster.

Cluster Structure

When you sign up for GCE, you receive a maximum quota for the number of CPUs you can have. Let's say you have a quota of 12, meaning you can have at most 12 CPU cores up and running in total. You cannot use all of these cores in one computer, though; you have to define several virtual computers, each with a limited number of CPU cores (usually 1, 2, 4 or 8). Each of these virtual machines is called an "instance". To create an instance, run this command:

$ gcutil addinstance tmc-server --wait_until_running

Now you will see a list of zones. Choose one of them. Then you will see a list of available machine types. Each machine type has a certain number of cores and may also come with an additional hard drive. Choose n1-standard-4-d. This creates an instance named tmc-server running an Ubuntu operating system, with a 4-core CPU and an additional 1.7 TB hard drive. Repeat this process to create two more instances named tmc-1 and tmc-2. Remember to choose the same zone for all of your instances so they can see each other on their local network. Now we are using all of our 12 cores across 3 instances.
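If you prefer to skip the interactive prompts, you can pass the zone and machine type directly on the command line (the flag names follow gcutil's documented syntax; the zone us-central1-a is just an example, use whichever zone you chose for tmc-server):

$ gcutil addinstance tmc-1 --zone=us-central1-a --machine_type=n1-standard-4-d --wait_until_running
$ gcutil addinstance tmc-2 --zone=us-central1-a --machine_type=n1-standard-4-d --wait_until_running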

In our example, the cluster has one head node, always named tmc-server. This head node connects to the other nodes (i.e. tmc-1, tmc-2, ...) to form our cluster. We can then start our Time Machine Creator (TMC) tool from tmc-server and use all the other nodes in our GCE network to process our data.

Setting Up a Shared File System

The first step in setting up our cluster is to create a shared file system. When, in the next steps, you use the TMC tool to process your images and make video files, all the instances on GCE must have access to your data (pictures, etc.). Therefore, you first have to create a single directory shared between all GCE instances. For building this shared file system, we use GlusterFS: http://www.gluster.org/. GlusterFS is a free and open source file system that works over the network with distributed hard disks. It can join all of those hard disks together and create one huge directory accessible by all instances. To install GlusterFS, first connect to your instances. Let us start by connecting to tmc-server:

$ gcutil ssh tmc-server

Now you are inside the tmc-server instance. Install the GlusterFS server and client:

$ sudo apt-get install glusterfs-server glusterfs-client -y

Then close your connection:

$ exit

Now we have to set up our client instances (tmc-1 and tmc-2). First connect to tmc-1 and install GlusterFS:

$ gcutil ssh tmc-1
$ sudo apt-get install glusterfs-server glusterfs-client -y 

We also need to set up the additional hard drive on this instance so that in the next steps we can use it as part of our shared file system. To set up the hard drive, use these commands:

$ sudo mkdir /mnt/disk
$ sudo /usr/share/google/safe_format_and_mount /dev/disk/by-id/google-ephemeral-disk-0 /mnt/disk
$ sudo chmod a+w /mnt/disk

Now the hard drive is ready to use and mounted at /mnt/disk. Disconnect from tmc-1 ($ exit) and repeat this whole process for tmc-2.

Now that both tmc-1 and tmc-2 are ready, connect back to tmc-server so we can set up GlusterFS.

$ gcutil ssh tmc-server
$ sudo gluster peer probe tmc-1
$ sudo gluster peer probe tmc-2
$ sudo gluster volume create tmc-volume tmc-1:/mnt/disk tmc-2:/mnt/disk
$ sudo gluster volume start tmc-volume
$ mkdir ~/jobs
$ sudo mount -t glusterfs tmc-server:/tmc-volume ~/jobs
$ exit

The peer probe command tells GlusterFS that there are other instances on our network running the gluster server. Then, we create a volume named tmc-volume; it is a combination of all the local disks on our GCE instances. After that, we start this volume with the volume start command. Now that we have a GlusterFS volume up and running, we have to mount it on a local directory. We create a new directory in our home path and name it jobs, as TMC assumes this is the name of the shared directory. Then we mount our GlusterFS volume on ~/jobs. Now if you run the df -h command, you will see a new file system named tmc-server:/tmc-volume with a total size of 3.4 TB mounted on ~/jobs.

We have to mount this hard drive on other instances too:

$ gcutil ssh tmc-1
$ mkdir ~/jobs
$ sudo mount -t glusterfs tmc-server:/tmc-volume ~/jobs
$ exit

Repeat this task for tmc-2. Now all 3 instances have access to a shared hard drive mounted to the same path.

Setting Up the TORQUE Resource Manager

TORQUE is a free and open source resource manager (http://www.adaptivecomputing.com/products/open-source/torque/). By using TORQUE, we can fully utilize our cluster: it allows us to submit jobs to other instances, and it includes a basic scheduler so we can keep all of the CPUs in the cluster busy. TMC has a built-in ability to use TORQUE and distribute its workload among the other instances.

Install TORQUE using these commands:

$ gcutil ssh tmc-server
$ sudo apt-get install torque-server torque-scheduler torque-client torque-mom -y
$ exit

Repeat these commands for tmc-1 and tmc-2. Now we have to set up the TORQUE server on tmc-server:

$ gcutil ssh tmc-server
$ sudo killall pbs_server pbs_sched pbs_mom

First we connect to tmc-server and kill all TORQUE processes. Then we have to edit some config files in the TORQUE directory. TORQUE is installed by default in /var/spool/torque.

$ cd /var/spool/torque
$ sudo chmod a+w server_name
$ echo tmc-server > server_name

Now we start the TORQUE server for the first time and then configure it. In the configuration phase, we create a new queue and tell TORQUE to assign each new job to one CPU core on one of our GCE instances.

$ sudo pbs_server -t create
$ sudo qmgr
>> create queue default
>> set queue default queue_type = Execution
>> set queue default resources_default.nodes = 1
>> set queue default resources_default.neednodes = 1
>> set queue default enabled = True
>> set queue default started = True
>> set server scheduling = True
>> set server acl_host_enable = False
>> set server acl_hosts = tmc-server
>> set server default_queue = default
>> set server query_other_jobs = True
>> exit

Now we have to define our nodes. Each node is one of our GCE instances that is going to receive jobs from tmc-server, so our nodes are tmc-1 and tmc-2. We also have to define how many CPU cores there are in each node; in our example, each instance has 4 cores. We enter this information in the server_priv/nodes file.

$ sudo chmod a+w server_priv/nodes
$ echo -e "tmc-1 np=4\ntmc-2 np=4" > server_priv/nodes

We can check all of our settings with the qmgr -c 'p s' command. Now we have to restart the server-side TORQUE processes:

$ sudo qterm
$ sudo pbs_server
$ sudo pbs_sched
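After the restart, you can double-check the configuration by printing it ('p s' is the qmgr shorthand for "print server"):

$ sudo qmgr -c 'p s'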

Now it is time to set up TORQUE on the clients. Exit from your current connection to tmc-server and connect to tmc-1. We first kill all TORQUE processes and then configure the TORQUE client-side process, pbs_mom:

$ sudo killall pbs_server pbs_sched pbs_mom
$ cd /var/spool/torque
$ sudo chmod a+w server_name mom_priv/config
$ echo tmc-server > server_name
$ echo pbs_server = tmc-server > mom_priv/config

Now start TORQUE on tmc-1:

$ sudo pbs_mom

Repeat these commands for tmc-2. Now both the clients and the server are ready. You can check your job queue with the qstat -q command; it shows an empty queue ready to accept new jobs. You can also see the list of available instances with the pbsnodes -a command.
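For example, on tmc-server:

$ qstat -q
$ pbsnodes -a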

We can test our TORQUE resource manager by sending jobs from tmc-server. Connect back to tmc-server and run these commands:

$ echo sleep 100 | qsub
$ qstat

qsub is the main command for creating jobs, so the first command sends a simple job to TORQUE. The resource manager then chooses one of our GCE instances and sends the job to it (just remember that you cannot submit jobs as root). qstat shows the state of the current jobs in the queue. With qstat -n you can also see which jobs are assigned to each instance.
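If you prefer submitting a script instead of piping a command into qsub, a minimal test job could look like this (the file name test-job.sh and the output path are only examples; the #PBS directive is the standard TORQUE/PBS way of requesting one node):

#!/bin/bash
#PBS -l nodes=1
# write the name of the instance that ran this job into the shared directory
hostname > ~/jobs/test-output.txt

Save it as ~/jobs/test-job.sh and submit it with:

$ qsub ~/jobs/test-job.sh
$ qstat -n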

Now that our resource manager is up and running, we can install TMC and start creating time machines.

Setting Up Time Machine Creator

Now we install TMC on all of our instances. We explain how to do it on tmc-server; then you have to repeat these steps for tmc-1 and tmc-2 so that all of your nodes have access to TMC at the same path.

First, download the TMC package from this address: https://docs.google.com/file/d/0B2n3EeJJWXTBUDZpRDFPOUhPRzA/edit . Then upload it to tmc-server from your local machine:

$ gcutil push tmc-server tmc-cluster-pkg.tar.gz ~/jobs

We upload the TMC package to the ~/jobs shared folder so that we can also access it from tmc-1 and tmc-2.

Now connect to tmc-server (gcutil ssh tmc-server) and install some tools that will be required in the next steps:

$ sudo apt-get update
$ sudo apt-get install build-essential yasm xvfb pkg-config -y

Also, untar the TMC package:

$ cd ~/jobs
$ tar -xzvf tmc-cluster-pkg.tar.gz

Now go to the tmc-pkg directory and execute the setup script:

$ cd tmc-pkg
$ ./setup

It should install all the dependencies required for TMC, add some common scripts to the /bin directory, and put the compiled files in the tmc-pkg/tmc-linux directory. Now you can run ct.rb inside the tmc-pkg/tmc-linux directory and create time machines! Don't forget to repeat these steps on your other GCE instances too.

Creating a Time Machine Using the Cluster

To upload your photos to the cluster, use the push command from your local machine:

$ gcutil push tmc-server my_files.tar ~/jobs

It will upload my_files.tar to the shared directory ~/jobs. To download your final videos, use the pull command on your local machine:

$ gcutil pull tmc-server ~/jobs/my_videos.tar .

It will download a file named my_videos.tar from the shared folder on your cluster to your local machine.

To tell TMC to use the cluster, go to the directory containing ct.rb and run this command:

$ ruby ct.rb --remote run_remote --remote-json -j 8 -r 10 <path_to_definition.tmc> <path_to_timemachine_directory>

This command tells TMC to use TORQUE and submit jobs over the network. Since in our example the client nodes have 8 cores in total, we use the -j 8 option; if you have access to more CPU cores on GCE, you can use a higher number. The -r 10 option means that every time ct.rb creates a new job, that job includes 10 rules of the time machine pipeline. For a typical time machine, a value between 10 and 50 is sufficient.
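As a purely illustrative example, suppose my_files.tar extracts into a directory ~/jobs/my-timelapse containing a definition file named definition.tmc (both names are placeholders, not part of TMC itself). The full sequence on tmc-server could then look like this:

$ cd ~/jobs
$ tar -xvf my_files.tar
$ cd ~/jobs/tmc-pkg/tmc-linux
$ ruby ct.rb --remote run_remote --remote-json -j 8 -r 10 ~/jobs/my-timelapse/definition.tmc ~/jobs/my-timelapse/timemachine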

Congratulations! You have now created a cluster and processed your time machine :)

Creating a Cluster Using a Pre-loaded GCE Image and Shell Scripts

Doing all these steps every time is tedious and time consuming. Fortunately, you can set up your system once and then create a backup image from it. Then, you can tell GCE to load your image instead of a fresh empty instance. This page on the GCE website tells you how to create an image, save it, and load it into your new instances: https://developers.google.com/compute/docs/images

Now if you create images from your server and client instances and use them whenever you create new instances, the new instances will already have GlusterFS, TORQUE and TMC pre-installed. You still need to fine-tune your shared file system and resource manager, though. For example, based on the number of instances you want in your cluster, you have to update the nodes file in the TORQUE home directory. Or, when you want to start your shared file system, you first have to connect all the computers on your network with the gluster peer probe command.

To make these steps automatic, you can write shell scripts for creating and deleting your cluster. Here we introduce some scripts for starting/stopping a cluster; you can tune them to your specific needs. They are downloadable from the bottom of this page. We have also made our own GCE image public, so you can use it for your own cluster. It has all the required software pre-installed. To add it to your GCE project and use it with the start/stop scripts in the next steps, run this command on your local machine:

$ gcutil addimage my_GCE_image_name "http://commondatastorage.googleapis.com/timemachine4720/tmc_image_public_v0.0.image.tar.gz"

You can replace my_GCE_image_name with any other name you like.
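Once the image is registered, you can start new instances from it, for example (the --image flag follows gcutil's documented syntax; the machine type and instance name are the ones used earlier on this page):

$ gcutil addinstance tmc-server --image=my_GCE_image_name --machine_type=n1-standard-4-d --wait_until_running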

Shell Scripts for Starting/Stopping the Cluster

The start-cluster.sh script receives the number of client instances and starts your cluster. For example:

$ ./start-cluster.sh 2

It will create one server instance (always named tmc-server) and 2 client instances, tmc-1 and tmc-2. It assumes that each instance has 4 cores, so 12 CPU cores are used in total. It then configures TORQUE and GlusterFS, creates a shared file system, and mounts it on ~/jobs. Now you can upload your data to tmc-server and execute TMC on your cluster.

The first part of the start-cluster.sh script defines your cluster's parameters. zone is the GCE zone you want to use. To see the list of available zones, use:

$ gcutil listzones

The next parameters define the machine type for server and client nodes. You can see a list of available machine types by using this command:

$ gcutil listmachinetypes

We have used n1-standard-4-d, which is a 4-core instance with an additional 1.7 TB ephemeral hard drive. Then we define the number of CPU cores in each instance, so np=4, as we have 4 cores per instance. Next you have to define your image name; this is the image you created after installing all of your required software and saving the image on Google Cloud Storage. Finally, you have to specify your project ID.
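For reference, the parameter block at the top of start-cluster.sh might look roughly like this (the variable names are assumptions for illustration; check the actual script attached at the bottom of this page):

zone=us-central1-a              # GCE zone for all instances
server_machine_type=n1-standard-4-d
node_machine_type=n1-standard-4-d
np=4                            # CPU cores per client instance
image=my_GCE_image_name         # image added with gcutil addimage
project=my_project              # your GCE project ID
num_nodes=$1                    # number of client instances, passed on the command line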

In the next step, the script creates all the instances and waits for them to start. We then read the internal IP addresses of all our instances and update their /etc/hosts files. This is because sometimes there may be stale IP addresses in the DNS, which can confuse our cluster; by adding all the IP addresses to /etc/hosts, we make sure that all of our instances can find each other over the internal network. We then run the setup-node-1 script on all our client nodes. In our setup, this script is part of our GCE image (we have a pre-configured image and have put our cluster-side scripts in /home/tmc/scripts). It mounts the local hard disk on /mnt/disk so that in the next steps we can use it for our shared file system. We also do a fresh install of GlusterFS every time: we had some issues with loading GlusterFS from a saved image, but installing it on each instance while the cluster is being created resolves those issues.
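A simplified sketch of this part of the script, using the parameters above (the gcutil flags follow the syntax used elsewhere on this page; the /etc/hosts update is omitted for brevity, and the actual script may differ):

# create the head node and the client nodes from our saved image
gcutil addinstance tmc-server --project=$project --zone=$zone --machine_type=$server_machine_type --image=$image --wait_until_running
for i in $(seq 1 $num_nodes); do
  gcutil addinstance tmc-$i --project=$project --zone=$zone --machine_type=$node_machine_type --image=$image --wait_until_running
done

# run the first node-setup script on every client
# (it mounts the local disk on /mnt/disk and reinstalls GlusterFS)
for i in $(seq 1 $num_nodes); do
  gcutil ssh tmc-$i /home/tmc/scripts/setup-node-1
done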

Next, start-cluster.sh runs another script saved in our GCE image: setup-server. It creates a GlusterFS volume (our shared file system) and mounts it on ~/jobs. It then adds all our node names to the TORQUE nodes file and restarts TORQUE. Now our server is ready. Finally, we have to mount the shared file system on our client instances; this is done by another script saved in our GCE image: setup-node-2.
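Based on the manual GlusterFS and TORQUE steps from the earlier sections, setup-server could be sketched roughly as follows (the real script shipped in the image may differ in details):

#!/bin/bash
num_nodes=$1   # number of client instances
np=$2          # CPU cores per client instance

# build the GlusterFS volume from the clients' local disks and mount it on ~/jobs
bricks=""
for i in $(seq 1 $num_nodes); do
  sudo gluster peer probe tmc-$i
  bricks="$bricks tmc-$i:/mnt/disk"
done
sudo gluster volume create tmc-volume $bricks
sudo gluster volume start tmc-volume
mkdir -p ~/jobs
sudo mount -t glusterfs tmc-server:/tmc-volume ~/jobs

# regenerate the TORQUE nodes file and restart the server-side daemons
sudo chmod a+w /var/spool/torque/server_priv/nodes
> /var/spool/torque/server_priv/nodes
for i in $(seq 1 $num_nodes); do
  echo "tmc-$i np=$np" >> /var/spool/torque/server_priv/nodes
done
sudo qterm
sudo pbs_server
sudo pbs_sched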

Now everything is ready. We can upload our data to our cluster, connect to tmc-server and start TMC.

We have also created a script, stop-cluster.sh, that shuts down all of our instances and thus deletes our cluster. So after our time machine is done and we have downloaded it, we can disconnect from tmc-server and use this script to tear down the cluster.
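stop-cluster.sh can be little more than a loop over gcutil deleteinstance, for example (the --force flag, which skips the confirmation prompt, is an assumption here; drop it if your gcutil version does not support it):

#!/bin/bash
num_nodes=$1   # number of client instances in the cluster

# delete the head node and every client node
gcutil deleteinstance --force tmc-server
for i in $(seq 1 $num_nodes); do
  gcutil deleteinstance --force tmc-$i
done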


Attachment: scripts.tar.gz (1k)