Why We Invested in

A smart compute management layer for optimizing, accelerating, and monitoring deep learning training workloads

GPU Performance Enhancement — by Knitz, Florian, last modified by Aykin, Can on 31.January 2017

When looking to optimize a deep neural network workload, there are multiple parameters to take into account.

No one knows what triggers the magic leap from special cases to general concepts in deep neural networks. The human creators of deep neural networks never expected the algorithms to work so well, as no underlying principle has guided their design other than vague inspiration from the architecture of the brain. The practice of deep learning is actually more like alchemy or black magic, an experimental science that continues to surprise developers.

(Just recently, Naftali Tishby, an Israeli computer scientist and neuroscientist from the Hebrew University of Jerusalem, presented a theory to explain deep learning generalization called the “information bottleneck”.)

What we do know is that deep neural networks have become a standard for supervised learning, are spreading like wildfire, and are evolving rapidly. We also know that deep neural networks feast on compute and memory. And lastly, we know that developing an AI system involves many “trial-and-adjust” cycles, and that its quality is highly correlated with the number of experiments done in a given period of time or budget.

Workload Optimization

In a typical NN, millions of parameters define the model, and training them requires large amounts of data and a lengthy computational process. Not surprisingly, a top challenge for developers nowadays is to optimize the time and cost of training these compute- and memory-hungry networks.

There are many attempts to accelerate deep neural network training. Some focus on the hardware side, others on the algorithm side. All these approaches are valid, but they frequently ignore two important aspects:

  1. Deep neural networks are rarely trained in an isolated and fully controlled environment.
  2. Data scientists and machine learning experts focus on the algorithm and analysis side of building NNs, and mostly lack the expertise needed to manage the challenges of running at distributed scale.

Let’s take a small team of five data scientists, each working on a different deep neural network model. They share multiple configurations of GPU and CPU servers in a hybrid cloud environment: some of the servers are hosted in their private cloud and some are reserved in the public cloud. The data they use is collected from connected devices and stored at multiple locations. Each model has its own algorithms, parameters, and feature sets, and some of the models are assembled from a few other models.

When trying to envision a way to optimize the time and cost of training all these models in parallel, you soon realize that you have to consider many parameters.

For example, you can accelerate the training of your model by parallelizing it across a machine with multiple GPUs, or across multiple machines with several GPUs each. But choosing the right parallelism model for each NN model (data parallelism, model parallelism, or hybrid data-model parallelism) can turn into a tricky and complex task. You have to consider the specific model architecture, the size of the training data, the location of the data, the currently available network bandwidth, the configuration of the compute resources, the availability and cost of the different compute resources, and the other models and iterations in the queue.
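To make the data-parallel option concrete, here is a minimal sketch in plain Python. It assumes a toy one-parameter model `y = w * x` and a batch that divides evenly among workers; the function names are hypothetical and no real framework API is used. It shows that averaging per-shard gradients gives the same result as computing the gradient on the full batch, which is the property data parallelism relies on.

```python
# Toy illustration of data parallelism (hypothetical names, no framework).
# Each "worker" computes a gradient on its own shard of the batch, and the
# shard gradients are then averaged, as an all-reduce step would do.

def gradient(w, shard):
    """Mean-squared-error gradient for the 1-D model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def split(batch, n_workers):
    """Split a batch into equal shards (assumes len(batch) % n_workers == 0)."""
    size = len(batch) // n_workers
    return [batch[i * size:(i + 1) * size] for i in range(n_workers)]

def data_parallel_gradient(w, batch, n_workers):
    """Average the per-shard gradients computed by each worker."""
    shard_grads = [gradient(w, shard) for shard in split(batch, n_workers)]
    return sum(shard_grads) / n_workers

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
w = 0.5
single = gradient(w, batch)                               # one device, full batch
parallel = data_parallel_gradient(w, batch, n_workers=2)  # two workers, two shards
print(single, parallel)
```

Model parallelism would instead split the model itself across devices, which this sketch does not cover.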

Choosing a parallelism method

The example below, from researchers at Facebook’s AI group who tested a standard image classification network on various server configurations and parallelism models, highlights both the benefits and the complexity of choosing the right parallelism method.

For a single model, the universe of options is every pairing of a multi-GPU/CPU compute configuration with a parallelism model. And in real life, unlike the example above, there are more than three possible parallelism models. Furthermore, to optimize your workload as a whole, you need to compare those pairings across all the NN models in your workload. Thus, your universe of options grows with each compute configuration and parallelism model you add, and exponentially with each NN model you develop or implement.
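One way to count these options is sketched below. It assumes each NN model independently picks one compute configuration and one parallelism model. The configuration names and model count are hypothetical, chosen only to illustrate how quickly the number of joint assignments grows.

```python
from itertools import product

# Hypothetical inputs, for illustration only.
configs = ["1 node x 4 GPUs", "2 nodes x 4 GPUs", "4 nodes x 8 GPUs"]
parallelism = ["data", "model", "hybrid"]  # the three options named in the text
n_models = 5                               # one model per data scientist

def options_per_model(configs, parallelism):
    """Every (compute configuration, parallelism model) pairing for one NN model."""
    return list(product(configs, parallelism))

def joint_assignments(n_models, configs, parallelism):
    """Number of ways to assign one pairing to each of n_models models."""
    return len(options_per_model(configs, parallelism)) ** n_models

per_model = len(options_per_model(configs, parallelism))
total = joint_assignments(n_models, configs, parallelism)
total_with_extra_config = joint_assignments(
    n_models, configs + ["8 nodes x 8 GPUs"], parallelism
)
print(per_model, total, total_with_extra_config)
```

Under these assumptions, adding a single compute configuration roughly quadruples the number of joint assignments, and that is before bandwidth, data location, and cost enter the picture.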

Network overhead limits scalability

What was not demonstrated in the above example is that although parallelism has clear benefits, network communications overhead can quickly limit scalability.

Arimo’s data science team tested the scaling capabilities of a single version of TensorFlow with a data-parallelism setup across different dimensions.

The graph below nicely demonstrates the impact of network overhead in their experiment:

The local GPU implementation used no data parallelism, yet as the dataset got larger the network became the bottleneck for the distributed implementations, making the local GPU the most efficient option. (In this experiment, even the largest dataset was small enough to fit in memory.)
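The effect can be illustrated with a toy cost model. This is a sketch under simplifying assumptions, not a reproduction of Arimo’s measurements: compute time per step is divided evenly among workers, while communication time grows linearly with the number of additional workers. The function names and the numbers are hypothetical, and real communication costs depend on the network topology and the synchronization scheme.

```python
def step_time(compute_s, n_workers, comm_s_per_worker):
    """Toy model: compute splits across workers, communication grows with them."""
    return compute_s / n_workers + comm_s_per_worker * (n_workers - 1)

def speedup(compute_s, n_workers, comm_s_per_worker):
    """Speedup relative to a single worker with no communication cost."""
    baseline = step_time(compute_s, 1, comm_s_per_worker)
    return baseline / step_time(compute_s, n_workers, comm_s_per_worker)

# Hypothetical numbers: 1.0 s of compute per step, 0.05 s of overhead per extra worker.
for n in (1, 2, 4, 8, 16):
    print(n, round(speedup(1.0, n, 0.05), 2))
```

In this toy model the speedup peaks at four workers and then declines, because the communication term eventually outweighs the compute saved by adding workers.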

I could continue with the examples but I think you get the idea.

Optimizing deep neural network workloads, within a single machine or across multiple machines, is a constantly moving target: it needs to be recalculated for each NN model and iteration, and whenever bandwidth and compute resources become available or unavailable.

Training never ends

Training your neural network is a delicate process that takes time and multiple iterations. Initially you set your weights randomly, and over the course of training you hope to wind up with high accuracy. The graph below is a good example of how accuracy in a deep neural network typically improves over time. You can reach good accuracy rather quickly, but from that point on any improvement is much harder and increases the risk of overfitting.

It’s easy to obtain a good accuracy, but every improvement is hard and expensive.

Thus, it’s no wonder that developers invest most of their time in the training phase. Changes in learning settings, sampling methods in training, convergence thresholds, and essentially every other possible tweak may alter the prediction behavior, either subtly or dramatically.

But training never ends, as companies already running inference at large scale in production have learned. A core feature of deep neural networks is their ability to generalize a behavior that cannot be implemented in software logic. The flip side is that it’s impossible to hold neural networks to a specific intended behavior. The performance of machine learning algorithms depends on the input taken from external data sources, and there is no way to separate abstract behavior from quirks of the data. This tight coupling of algorithm and data means that a change in the external data typically changes the way the algorithm behaves.

Unfortunately, external data is rarely stable. In practice, this often means that even if shipping the first version of a machine learning system seems rather easy, making subsequent adjustments and improvements is unexpectedly difficult and time-consuming. Companies have to constantly update and retrain their models.

If we look back at our team of five data scientists, we should now assume that each is not only working on a new model but also retraining tens or even hundreds of older models. Automatic retraining of models will become a must in the near future. This, of course, adds to the already immense complexity of the NN workload optimization task.

Collage.AI makes optimizing NN training workloads simple and automatic. It takes care of orchestrating the training while taking advantage of all available resources. With zero effort you can increase the speed of your NN training by 10x! It is a software-only layer that takes any iteration from your data science team as input and automatically executes the best optimization procedure.

The ‘magic’ lies in the analysis of the specific subtleties of each NN and the selection of the specific optimization and parallelization method. It uses a variety of parallelism models and automatically implements the best one, freeing the data scientist from dealing with compute and distribution issues. It transparently determines the most efficient and cost-effective way to run the NN training workload, taking into account the availability of network bandwidth; the availability, cost, and configuration of the compute resources; and the data pipeline and size. It may change the data pipeline or add or remove compute resources based on demand and pricing. For example, it supports GPU spot instances to reduce cost and increase efficiency. The company is already working with a select group of design partners and will open its beta soon. I strongly recommend that every company using deep neural networks explore the solution and join the beta program.

You can read more about it here.

About the Investment

As a VC I meet entrepreneurs all day, every day, but some meetings are memorable, and our first meeting with the founders was one of them. Omri Geller and Ronen Dar are not only exceptionally bright and professional, but also great people to work with. We have seen many AI companies in recent years, but Omri and Ronen brought a fresh look at solving the NN training optimization challenge. Two hours of brainstorming and discussing their solution convinced us that we should move quickly to explore an investment opportunity. Only three weeks later, we signed the term sheet, and Omri and Ronen started working from our offices.

For us, every investment is the beginning of a journey with unexpected turns and obstacles, one that we would never set out on without feeling fully confident in our partners. We feel lucky to have Omri and Ronen as partners in this journey.

We want to welcome Omri and Ronen and the team to our portfolio and wish them good luck!