Google Compute Engine (GCE) is a cloud service that provides virtual machines on demand. It can give us a scalable and highly parallel processing platform. Here, we want to start several instances on GCE and build a simple cluster. Then, we will use this cluster for creating time machines. Building a time machine takes much less time when hundreds of processors work in parallel.
Before setting up your cluster, you first have to sign up with GCE and create a new account. You can find out how to sign up here: https://developers.google.com/compute/docs/signup. In this step, you will also define a project ID (e.g. my_project). Then, you have to install a command-line tool named gcutil. You will use this tool to communicate with GCE and your cluster. The installation process is explained at this link: https://developers.google.com/compute/docs/gcutil_setup. Make sure that you specify a default project ID by running this command (replace my_project with your project ID):
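A sketch of that command, using gcutil's flag-caching option:

```shell
gcutil getproject --project=my_project --cache_flag_values
```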
In this way you don't have to specify your project ID every time you execute a new gcutil command.
Now that you have access to GCE and you have set up your gcutil tool, you are ready to create your own cluster.
When you sign up with GCE, you receive a quota for the maximum number of CPU cores you can use. Let's say you have a quota of 12, meaning you can have 12 CPU cores up and running in total. But you cannot use all of these cores in one computer. You have to define several virtual computers, each with a limited number of CPU cores (usually 1, 2, 4 or 8). Each one of these virtual machines is called an "instance". To create an instance, run this command:
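A minimal sketch; gcutil prompts you interactively for the zone and machine type, and the name tmc-server matches the head-node naming used throughout this guide:

```shell
gcutil addinstance tmc-server
```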
Now you will see a list of zones. Choose one of them. Then you will see a list of available machine types. Each machine type has a number of cores, and it may also come with an additional hard drive. Choose one with enough cores and an extra disk; in this guide we use 4-core instances with an additional ephemeral drive.
In our example, our cluster has one head node, always named tmc-server. This head node connects to the other nodes (i.e. tmc-1, tmc-2, ...) and forms our cluster. Then, we can start our Time Machine Converter (TMC) tool from tmc-server and use all the other nodes in our GCE network to process our data.
The first step of setting up our cluster is to create a shared file system. When, in the next steps, you use the TMC tool to process your images and make video files, all the instances on GCE must have access to your data (pictures, etc.). Therefore, you first have to create a single directory shared between all GCE instances. For building this shared file system, we use GlusterFS: http://www.gluster.org/. GlusterFS is a free and open source file system which works over the network with distributed hard disks. It can attach all of those hard disks together and create one huge directory accessible by all instances. To install GlusterFS, first you have to connect to your instances. Let us start by connecting to tmc-server:
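gcutil provides an ssh wrapper that handles the keys for you:

```shell
gcutil ssh tmc-server
```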
Now you are inside tmc-server instance. Then, install GlusterFS server and client:
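On a Debian-based image this can be done with apt-get (the package names below are the Debian ones; adjust them for your distribution):

```shell
sudo apt-get update
sudo apt-get install -y glusterfs-server glusterfs-client
```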
Then close your connection:
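Simply end the SSH session:

```shell
exit
```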
Now we have to setup our client instances (tmc-1 and tmc-2). First connect to tmc-1 and install GlusterFS:
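A sketch, mirroring the server installation (Debian package names assumed):

```shell
gcutil ssh tmc-1
sudo apt-get update
sudo apt-get install -y glusterfs-server glusterfs-client
```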
We also need to setup the additional hard drive on this instance so in the next steps we can use it as a part of our shared file system. For setting up the hard drive use these commands:
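A sketch, assuming the extra ephemeral drive shows up as /dev/sdb (the device name and the choice of ext4 are assumptions; check /dev/disk/by-id/ on your instance):

```shell
sudo mkdir -p /mnt/disk
sudo mkfs.ext4 /dev/sdb        # format the extra disk (this erases its contents)
sudo mount /dev/sdb /mnt/disk  # mount it where GlusterFS will use it as a brick
```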
Now the hard drive is ready to use and mounted at /mnt/disk. Disconnect from tmc-1 (exit) and repeat the same steps on tmc-2.
Now that both tmc-1 and tmc-2 are ready, connect back to tmc-server so we can set up GlusterFS.
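A sketch of the server-side GlusterFS setup. The volume name jobs and the ~/jobs mount point are assumptions, chosen to match the shared folder used later in this guide:

```shell
gcutil ssh tmc-server
sudo gluster peer probe tmc-1
sudo gluster peer probe tmc-2
# combine the two client disks into one distributed volume
sudo gluster volume create jobs transport tcp tmc-1:/mnt/disk tmc-2:/mnt/disk
sudo gluster volume start jobs
# mount the new volume on the server itself
mkdir -p ~/jobs
sudo mount -t glusterfs tmc-server:/jobs ~/jobs
```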
We have to mount this shared volume on the other instances too:
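For example on tmc-1 (assuming the same volume name and mount point as on the server):

```shell
gcutil ssh tmc-1
mkdir -p ~/jobs
sudo mount -t glusterfs tmc-server:/jobs ~/jobs
exit
```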
Repeat this task for tmc-2. Now all 3 instances have access to a shared file system mounted at the same path.
TORQUE is a free and open source resource manager (http://www.adaptivecomputing.com/products/open-source/torque/). By using TORQUE, we can fully utilize our cluster. It allows us to submit jobs to other instances, and it also has a basic scheduler, so we can always keep all of the CPUs in the cluster busy. TMC has a built-in ability to use TORQUE and distribute its workload between instances.
Install TORQUE using these commands:
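On a Debian-based image (package names assumed):

```shell
sudo apt-get install -y torque-server torque-scheduler torque-client torque-mom
```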
Repeat these commands for tmc-1 and tmc-2. Now we have to setup TORQUE server on tmc-server:
First we connect to tmc-server, and then we kill all running TORQUE processes. Then we have to edit some config files in the TORQUE directory. TORQUE is installed by default in /var/spool/torque.
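A sketch of these steps, following TORQUE's default layout under /var/spool/torque:

```shell
gcutil ssh tmc-server
sudo killall -q pbs_server pbs_sched pbs_mom   # stop any running TORQUE daemons
cd /var/spool/torque
echo tmc-server | sudo tee server_name          # tell TORQUE which host is the server
```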
Now we start TORQUE server for the first time and then we configure it. In configuration phase we are creating a new queue and we are telling it to assign each new job to one CPU core in one of our instances on GCE.
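Starting the server with a fresh configuration and opening the qmgr console might look like this (the -t create flag wipes any previous server database):

```shell
sudo pbs_server -t create   # first start: create a fresh server configuration
sudo pbs_sched              # start the basic scheduler
sudo qmgr                   # interactive console where the settings are entered
```

At the qmgr prompt, first create the queue (for example with create queue default queue_type = Execution; the exact line is an assumption), then enter the settings listed below.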
>> set queue default enabled = True
>> set queue default started = True
>> set server scheduling = True
>> set server acl_host_enable = False
>> set server acl_hosts = tmc-server
>> set server default_queue = default
>> set server query_other_jobs = True
Now we have to define our nodes. Each node is one of our instances in GCE which is going to receive jobs from tmc-server. So our nodes would be tmc-1 and tmc-2. We have to also define how many CPU cores are there in each node. In our example, each instance has 4 cores. Then, we have to enter this information in server_priv/nodes file.
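In our example, server_priv/nodes (relative to /var/spool/torque) would contain one line per client node with its core count:

```
tmc-1 np=4
tmc-2 np=4
```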
We can check all of our settings by running qmgr -c 'print server'.
Now it is time to set up TORQUE on the clients. Exit from your current connection to tmc-server and connect to tmc-1. We first kill all TORQUE processes and then configure the TORQUE client-side process, pbs_mom:
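A sketch; pbs_mom reads its configuration from mom_priv/config under /var/spool/torque, and the $pbsserver directive points it at the head node:

```shell
gcutil ssh tmc-1
sudo killall -q pbs_mom
echo '$pbsserver tmc-server' | sudo tee /var/spool/torque/mom_priv/config
```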
Now start TORQUE on tmc-1:
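The client daemon is started directly:

```shell
sudo pbs_mom
```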
Repeat these commands for tmc-2. Now both clients and the server are ready. You can check your job queue by invoking the qstat command.
We can test our TORQUE resource manager by sending jobs from tmc-server. Connect back to tmc-server and perform these commands:
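A simple smoke test: submit a trivial job from the head node and watch it get scheduled (the sleep job is just an example):

```shell
gcutil ssh tmc-server
echo 'sleep 30' | qsub   # submit a dummy 30-second job
qstat                    # the job should appear in the queue
pbsnodes -a              # the nodes should be listed as free or busy, not down
```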
Now that our resource manager is up and running, we can install TMC and start creating time machines.
Now we install TMC on all of our instances. We explain how to do it on tmc-server, and then you have to repeat these steps for tmc-1 and tmc-2 so that all of your nodes have access to TMC at the same path.
First download TMC package from this address: https://docs.google.com/file/d/0B2n3EeJJWXTBUDZpRDFPOUhPRzA/edit . Now upload it to tmc-server from your local machine:
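gcutil can copy files to an instance with its push subcommand. The archive name tmc-pkg.tar.gz is an assumption; use whatever name the downloaded file has:

```shell
gcutil push tmc-server tmc-pkg.tar.gz '~/jobs/'
```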
We are uploading TMC package to ~/jobs shared folder so we can also access it from tmc-1 and tmc-2.
Now connect to tmc-server (gcutil ssh tmc-server).
Also untar TMC package:
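Assuming the archive name from the upload step:

```shell
cd ~/jobs
tar xzf tmc-pkg.tar.gz
```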
Now go to tmc-pkg directory and execute the setup script:
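A sketch; the setup script's file name is an assumption, so check the package contents for the actual name:

```shell
cd ~/jobs/tmc-pkg
sudo ./setup.sh
```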
It should install all the dependencies required for TMC, add some common scripts to the /bin directory and put the compiled files in the tmc-pkg/tmc-linux directory. Now you can run ct.rb inside the tmc-pkg/tmc-linux directory and create time machines! Don't forget to repeat these steps on the other GCE instances too.
For uploading your photos to the cluster, use the gcutil push command:
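For example, using the archive and shared folder named in this guide:

```shell
gcutil push tmc-server my_files.tar '~/jobs/'
```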
It will upload your file my_files.tar to the shared directory ~/jobs. Similarly, for downloading your final videos, use the gcutil pull command:
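For example, pulling the finished archive into the current local directory:

```shell
gcutil pull tmc-server '~/jobs/my_videos.tar' .
```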
It will download a file named my_videos.tar from the shared folder on your cluster to your local machine.
For telling TMC to use the cluster, go to the directory with ct.rb and run this command:
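A sketch of the invocation with the options discussed below; the argument order and the definition-file path are assumptions, so check ct.rb's usage message:

```shell
ruby ct.rb -j 8 -r 10 /path/to/your/timemachine/definition
```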
This command tells TMC to use TORQUE and submit jobs over the network. As our example uses 8 cores in total across the client nodes, we pass the -j 8 option. If you have access to more CPU cores on GCE, you can use a higher number. Also, -r 10 means that every time ct.rb creates a new job, that job includes 10 rules of the time machine pipeline. For a typical time machine, a rule number between 10 and 50 is sufficient.
Congratulations! You have now created a cluster and processed your time machine :)
Doing all these steps every time is tedious and time-consuming. Fortunately, you can set up your system once and then create a backup image from it. Then, you can tell GCE to load your image instead of a new, empty instance. This page of the GCE website tells you how to create an image, save it and load it with your new instances: https://developers.google.com/compute/docs/images
Now, if you create images from your server and client instances and use them whenever you create new instances, they will also have GlusterFS, TORQUE and TMC pre-installed. But you still need to fine-tune your shared file system and resource manager. For example, based on the number of instances you want to have in your cluster, you have to update the nodes file in the TORQUE home directory. Or, when you want to start your shared file system, you first have to connect all the computers on your network with the gluster peer probe command.
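After saving your image to Google Cloud Storage, registering it with GCE might look like this (the bucket and file names are placeholders):

```shell
gcutil addimage my_GCE_image_name gs://my_bucket/my_image.image.tar.gz
```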
You can replace my_GCE_image_name with any other name you like.
The start-cluster.sh script receives the number of instances and starts your cluster, for example:
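For example, starting a cluster with 2 client nodes:

```shell
./start-cluster.sh 2
```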
It will create one server instance (always named tmc-server) and 2 client instances: tmc-1 and tmc-2. It assumes that each instance has 4 cores, so in total 12 CPU cores would be used. Then it configures TORQUE and GlusterFS and creates a shared file system and mounts it on ~/jobs. Now you can upload your data to tmc-server and execute TMC on your cluster.
The first part of the start-cluster.sh script defines your cluster's parameters. zone is the GCE zone you want to use. To see the list of available zones, use the gcutil listzones command.
The next parameters define the machine type for server and client nodes. You can see a list of available machine types by using this command:
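With gcutil, that is:

```shell
gcutil listmachinetypes
```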
We have used n1-standard-4-d, which is a 4-core instance with an additional 1.7 TB ephemeral hard drive. Then we have defined the number of CPU cores in each instance: np=4, as we have 4 cores per instance. Then you have to define your image name. This is the image you created after installing all your required software and saving it on Google Cloud Storage. Finally, you have to specify your project ID.
In the next step, the script creates all the instances and waits for them to start. We then read all the internal IP addresses of our instances and update their /etc/hosts files. This is because sometimes there may be stale IP addresses in the DNS, which can confuse our cluster. By adding all the IP addresses to /etc/hosts, we are always sure that all of our instances can find each other over their internal network. We then run the setup-node-1 script on all our client nodes. In our setting, this script is part of our GCE image (we pre-configured the image and put our cluster-side scripts in /home/tmc/scripts). It mounts the local hard disk at /mnt/disk so that in the next steps we can use it for our shared file system. We also make a fresh install of GlusterFS every time: we had some issues with loading GlusterFS from a saved image, but installing it on each instance while the cluster is being created resolves them.
Now the start-cluster.sh runs another script saved in our GCE image: setup-server. It creates a GlusterFS volume (our shared file system) and mounts it on ~/jobs. It then adds all our node names to TORQUE node file and restarts TORQUE. Now our server is ready. Finally, we have to mount our shared file system in our client instances. This is done by another script saved previously in our GCE image: setup-node-2.
Now everything is ready. We can upload our data to our cluster, connect to tmc-server and start TMC.
We have also created a script that shuts down all of our instances and thus deletes our cluster (stop-cluster.sh). So after our time machine is done and we have downloaded it, we can disconnect from tmc-server and use this script to tear down our cluster.