I would like to announce a library that I have been working on with a few collaborators, called the Computation Graph Toolkit (CGT): GitHub / Documentation. For those of you who are familiar with Theano, the main upshot of CGT is this: CGT replicates Theano's API, but it has very short compilation times and supports multithreading. In particular, with CGT you can create unrolled recurrent networks with tens of thousands of operations, and compilation still takes just a few seconds. Multithreading is possible because CGT exports the computation graph to a data structure that can be executed by a C++ interpreter fully independently of Python.
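To make the "exported graph" idea concrete, here is an illustrative sketch (not CGT's actual data structures): a flat computation graph is just a list of nodes, each naming its operation and its parents. Because nodes are created after their parents, creation order is already a valid execution order, so a tiny interpreter, written here in Python but in C++ in CGT's case, can run the whole graph with a single loop and no recursion:

```python
import numpy as np

class Node:
    """One operation (or input placeholder) in a flat computation graph."""
    _all = []  # global registry, in creation order

    def __init__(self, op, parents=()):
        self.op, self.parents = op, list(parents)
        Node._all.append(self)

def run(inputs, input_values, output):
    """Execute the graph in one pass: creation order = execution order."""
    vals = dict(zip(inputs, input_values))
    for node in Node._all:
        if node.op is not None:  # skip input placeholders
            vals[node] = node.op(*[vals[p] for p in node.parents])
    return vals[output]

x = Node(None)                # placeholder input
y = Node(np.square, [x])      # y = x**2 (elementwise)
z = Node(np.sum, [y])         # z = sum(x**2)
print(run([x], [np.array([3.0, 4.0])], z))   # 25.0
```

Once the graph lives in a plain data structure like this, nothing about executing it requires the Python interpreter, which is what frees CGT's C++ backend to run nodes on multiple threads.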
CGT also brings various other new functionality, for example:
- An `nn` module that borrows some convenient API features from Torch, most importantly the `Module` class, representing a parameterized function that can be applied many times, which is particularly convenient with recurrent networks.
- No more setting `theano.config.floatX` or worrying about unwanted upcasts: CGT globally chooses one precision (single or double) and converts all data to it.
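The global-precision policy can be illustrated with a minimal NumPy sketch (the `FLOATX` name and `as_floatx` helper are hypothetical, not CGT's API): coerce every array to one chosen dtype on the way in, so mixed-precision upcasts cannot sneak into a computation.

```python
import numpy as np

# Hypothetical global setting: one float precision for everything.
FLOATX = np.float32   # or np.float64

def as_floatx(x):
    """Convert any array-like input to the globally chosen precision."""
    return np.asarray(x, dtype=FLOATX)

a = as_floatx([1.0, 2.0])       # Python float literals would default to float64
b = as_floatx(np.arange(2))     # integer array -> float32
assert (a + b).dtype == FLOATX  # no silent upcast to float64
```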
CGT is still a work in progress. However, I decided to release it at this stage for two reasons:
CGT makes it possible to work with large recurrent networks, unrolled across time, without having to worry about compilation time. In our examples directory, you can find a working implementation of the Neural Turing Machine. This implementation operates on batches of inputs, in contrast to all of the other open-source reimplementations I am aware of. In this file, the whole process of constructing the computation graph and compiling it to a callable function takes 8s for a computation graph with 15,000 nodes (each one corresponding to an operation). Also in the examples directory, you can find a reimplementation of Karpathy's char-rnn code, where an unrolled deep LSTM with 18,000 operations is compiled in 7s.
If you are interested in giving CGT a spin, check out the
examples directory, which includes the following:
- `demo_mnist.py`: shows how to build a fully-connected or convolutional neural network using the low-level API.
- `demo_cifar.py`: trains a convolutional neural net on the CIFAR dataset using `nn`'s Torch-like API.
- `demo_char_rnn.py`: based on Andrej Karpathy's char-rnn code, but all in one file, for building a deep LSTM or GRU model and generating text.
- `demo_neural_turing_machine.py`: an implementation of the Neural Turing Machine, with a feedforward controller.
Or, start simple with the Tutorial.
You may be wondering how CGT currently stacks up against Theano with regard to compilation time and runtime. Here are the results from some simple examples where I've implemented the exact same model in Theano and CGT. First, I'll show some benchmark results for feedforward networks that operate on MNIST-size inputs and run on the CPU.

Fully-connected network:

| Library / setting | Runtime |
| --- | --- |
| Theano | .22s |
| CGT, sequential | .24s |
| CGT, num_threads=4 | .18s |
Apparently Theano has very slow CPU convolutions, at least on the platform I am using.
CGT uses Caffe's im2col approach for convolutions on the CPU, and cuDNN on the GPU (though the API for GPU usage is not quite ready).
I ran another experiment with a gated recurrent unit (GRU) network unrolled across time. In the table below, T is the number of timesteps that the computation was unrolled for. Theano fails when the computation has more than 30 timesteps, throwing a "maximum recursion depth exceeded" exception. While Theano provides a Scan operator that could allow this GRU network to run for more timesteps, Theano takes prohibitively long to compile larger/deeper recurrent models when using Scan. CGT's graph optimization uses a non-recursive algorithm whose time is linear in the size of the graph. (CGT works for at least 2000 timesteps of this model.)

Benchmark with GRU (compile time / runtime, in seconds):

| T | Theano | CGT, sequential | CGT, num_threads=4 |
| --- | --- | --- | --- |
| 10 | 5.7 / .8 | .3 / 1.0 | .3 / .56 |
| 20 | 11.2 / 1.7 | .7 / 1.9 | .6 / 1.1 |
| 30 | 19.7 / 2.8 | 1.0 / 2.9 | 1.0 / 1.6 |
| 40 | FAIL | 1.3 / 4.2 | 1.3 / 2.2 |
| 80 | FAIL | 2.7 / 8.5 | 2.7 / 4.7 |
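The non-recursive, linear-time traversal mentioned above can be sketched with a Kahn-style topological sort (an illustration of the idea, not CGT's actual implementation). A naive recursive depth-first traversal of a 2000-step unrolled graph would blow through Python's default recursion limit of roughly 1000 frames, which is consistent with the "maximum recursion depth exceeded" failure; an explicit-queue traversal has no such limit and visits each edge once.

```python
from collections import deque

def topo_order(num_nodes, edges):
    """Iterative topological sort; edges are (parent, child) pairs.

    Runs in O(nodes + edges) with no recursion, so graph depth
    (e.g. unroll length) never limits it."""
    children = [[] for _ in range(num_nodes)]
    indegree = [0] * num_nodes
    for p, c in edges:
        children[p].append(c)
        indegree[c] += 1
    queue = deque(i for i in range(num_nodes) if indegree[i] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    return order

# A dependency chain far deeper than the default recursion limit:
n = 20000
chain = [(i, i + 1) for i in range(n - 1)]
assert topo_order(n, chain) == list(range(n))
```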
The benchmark results above were obtained with a single-threaded BLAS on my quad-core laptop. On my machine, CGT with num_threads=4 is also faster than Theano on these examples when using a multi-threaded BLAS (vecLib).
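The source of the num_threads speedup can be sketched as follows (a hypothetical illustration, not CGT's scheduler): nodes in the same "generation" of the graph depend only on earlier generations, not on one another, so an interpreter may execute them concurrently. Even in Python, threads can overlap here because NumPy releases the GIL inside large array operations; CGT's C++ interpreter has no GIL to contend with at all.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def run_generation(ops, args, num_threads=4):
    """Run the independent ops of one graph generation in parallel."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(lambda oa: oa[0](*oa[1]), zip(ops, args)))

# Two ops with no dependency between them -> safe to run concurrently.
a = np.ones((4, 4))
b = np.eye(4)
out = run_generation([np.dot, np.add], [(a, b), (a, b)])
```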
Here, I'll address some questions that people might ask about CGT and this endeavor.
You're reimplementing much of Theano's functionality as part of this effort. Why didn't you just contribute to Theano development instead?
We are making some changes at the very core of the software (such as the graph data structure itself), which will make new functionality possible and also allow for a much cleaner codebase. We explain some of these changes more thoroughly here: Why Not Build on Theano?
There have been new deep learning libraries announced every other week, and it's getting tiresome. Why another one?
CGT is not a deep learning library: it provides general functionality for automatic differentiation and efficient execution of computations involving tensors. Currently the only comparable software occupying this niche is Theano. We hope that libraries will be built on top of CGT, as they have been built on Theano (even outside of the realm of neural networks and deep learning, e.g., PyMC3).
Part of my motivation for developing CGT was to provide a base layer for developing a library implementing the algorithms from this paper on stochastic computation graphs, which provide a generalization of backpropagation that includes policy gradient and variational inference methods as special cases. These algorithms require various queries on the graph, which can only be implemented straightforwardly for a flat computation graph, i.e., one that doesn't contain composite operations like Scan. Theano can only handle recurrent networks via Scan, so it was unsuitable for implementing this library. I hope that the computation graph representation used by CGT will be helpful for implementing other algorithms that go beyond just computing gradients, for example this recent paper.
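One such query can be sketched concretely (a hypothetical example, not the paper's code): on a flat graph, asking "which stochastic nodes influence this loss?" is a plain reachability check over parent edges, straightforward precisely because no composite op like Scan hides a subgraph.

```python
def influencing_nodes(node, parents, is_stochastic):
    """Return the stochastic ancestors of `node`.

    parents: dict mapping each node to a list of its parent nodes."""
    seen, stack, result = set(), [node], set()
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        if is_stochastic(n):
            result.add(n)
        stack.extend(parents.get(n, []))
    return result

# Tiny example graph: loss depends on z and w; z (a sampled node)
# depends on mu; w and mu are leaves.
parents = {"loss": ["z", "w"], "z": ["mu"], "w": [], "mu": []}
assert influencing_nodes("loss", parents, lambda n: n == "z") == {"z"}
```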
I also believe that the usefulness of software tools is usually greatly underrated. Better tools can act as a significant multiplier on everyone's productivity, and the right tools can make it easier for researchers to share code with each other. Code should be concise, readable (closely resembling the underlying math and algorithms), and have light dependencies.
Can I help?
I downloaded your code and ran into problem XYZ.
Please post to the cgt-users discussion group.
Are you planning to turn CGT into a commercial product?
CGT is MIT-licensed, and I hope it is of interest to people in academia as well as industry. I personally have no plans to commercialize it.
What about GPU support?
GPU and multi-GPU computation has been a core consideration in CGT's design from day one. Usage of GPUs is currently not documented, and some work is needed to straighten out the API, but the basic scaffolding is in place for transporting data to and from the GPU, calling libraries like cuBLAS and cuDNN, and compiling kernels on the fly. We plan to substantially improve GPU and multi-GPU support in the coming weeks and months. So far, the GPU implementations use CUDA, but we are glad to accept code contributions providing OpenCL support, which should be doable given how CGT's code generation works.
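To give a flavor of why retargeting the code generation is plausible, here is a deliberately simplified sketch (hypothetical, not CGT's actual generator): an elementwise op is lowered to kernel source by filling in a string template, and supporting another backend such as OpenCL would largely mean supplying a different template.

```python
# Template for a CUDA elementwise kernel y[i] = f(x[i]).
CUDA_TEMPLATE = """\
__global__ void {name}(long n, const float* x, float* y) {{
    long i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = {expr};
}}
"""

def gen_elementwise_kernel(name, expr):
    """Return CUDA C source for an elementwise kernel computing `expr`."""
    return CUDA_TEMPLATE.format(name=name, expr=expr)

src = gen_elementwise_kernel("square_kernel", "x[i] * x[i]")
assert "__global__ void square_kernel" in src
```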