docs/source/framework/pytorch_integration/autograd_with_tc.rst: 5 additions & 8 deletions
@@ -6,14 +6,11 @@ a training layer with TC and be able to run backwards as well if the layer is part
of a network. In order to write a training layer with TC, you need to follow the
steps below:

- 1. Define your TC language that has two definitions: one for the forward layer
- and the other for the backward layer and pass it to :code:`tc.define` call. In
- addition, also pass :code:`training=True` and the name of the backward TC :code:`backward`.
+ 1. Define your TC language that has two definitions: one for the forward layer and the other for the backward layer, and pass it to the :code:`tc.define` call. In addition, also pass :code:`training=True` and the name of the backward TC as :code:`backward`.

- 2. Create the Input Variables and Parameters. For example, weights should be marked
- as Parameters and the inputs tensors as Variables.
+ 2. Create the Input Variables and Parameters. For example, weights should be marked as Parameters and the input tensors as Variables.

- 3. Run the layer and get the output of forward pass
+ 3. Run the layer and get the output of the forward pass.

4. To see that the backward call works fine, you can call backward on the outputs.
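As a rough illustration of these four steps, here is a minimal sketch in the spirit of the convolution example used in these docs; the TC definitions, tensor sizes, and exact call signatures below are illustrative assumptions, not a verbatim copy of the documented example:

    import torch
    from torch.autograd import Variable
    from torch.nn.parameter import Parameter
    import tensor_comprehensions as tc

    # Step 1: one TC string with both a forward and a backward definition
    # (placeholder definitions, not a tuned kernel).
    lang = """
    def convolution(float(N, C, H, W) I, float(M, C, KH, KW) W1) -> (O) {
        O(n, m, h, w) +=! I(n, c, h + kh, w + kw) * W1(m, c, kh, kw)
    }
    def convolution_grad(float(N, C, H, W) I, float(M, C, KH, KW) W1, float(N, M, H, W) O_grad)
    -> (I_grad, W1_grad) {
        I_grad(n, c, h, w) +=! O_grad(n, m, h - kh, w - kw) * W1(m, c, kh, kw)
        W1_grad(m, c, kh, kw) +=! O_grad(n, m, h - kh, w - kw) * I(n, c, h, w)
    }
    """
    # pass training=True and the name of the backward TC via `backward`
    convolution = tc.define(lang, training=True, name="convolution", backward="convolution_grad")

    # Step 2: inputs as Variables, weights as Parameters
    I = Variable(torch.randn(32, 4, 56, 56).cuda(), requires_grad=True)
    W1 = Parameter(torch.randn(16, 4, 3, 3).cuda())

    # Step 3: run the layer to get the forward output
    out = convolution(I, W1)

    # Step 4: check the backward pass by calling backward on the output
    # (if the layer returns a list of outputs, use e.g. out[0].sum() instead)
    out.sum().backward()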
@@ -79,7 +76,7 @@ them, the example for that would be:
In order to obtain options via autotuning for the backward and forward layers, keep reading further.

- Autotuning Training Layer
+ Autotuning training layer
-------------------------

You can autotune a training layer easily. The forward and backward layers will
@@ -114,7 +111,7 @@ You will find two cache files created: :code:`convolution_train.cuda/options` has
options for the forward layer and :code:`convolution_train_backward.cuda/options` file
has options for the grad layer.

- Reordering Grad Outputs
+ Reordering grad outputs
-----------------------

In the backward pass, TC uses the list of input tensors in the forward pass and appends
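A sketch of what autotuning this training layer and caching the tuned options could look like, continuing the snippet above; the tuner keyword arguments and values here are assumptions, and a real run is what produces the :code:`convolution_train.cuda/options` and :code:`convolution_train_backward.cuda/options` files mentioned above:

    # `convolution`, I and W1 come from the training-layer sketch above;
    # generations/pop_size are illustrative tuner settings.
    convolution.autotune(I, W1, cache="convolution_train.cuda",
                         options=tc.Options("conv"), generations=5, pop_size=20)

    # after tuning, the layer runs with the tuned options for both passes
    out = convolution(I, W1)
    out.sum().backward()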
docs/source/framework/pytorch_integration/getting_started.rst: 3 additions & 1 deletion
@@ -16,7 +16,9 @@ A **few cases** where TC can be useful:

* you are interested in fusing layers like group convolution, ReLU, FC *or*

- * if you have a different new layer, let's call it :code:`hconv` (a variant of convolution), for which you wish you had an efficient kernel available
+ * if you have a different new layer, let's call it :code:`hconv` (a variant of convolution), for which you wish you had an efficient kernel available *or*
+
+ * if you have standard operations on different data layouts that you didn't want to use because you couldn't get good kernels for them

TC makes it very trivial to get CUDA code for such cases and many more. By providing
TC integration with PyTorch, we hope to make it even easier for PyTorch users
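As a rough illustration of how little code such a case takes with the PyTorch bindings, here is a sketch of a fused fully-connected + ReLU layer; the :code:`fcrelu` definition and the tensor sizes are illustrative placeholders rather than a tuned kernel:

    import torch
    import tensor_comprehensions as tc

    # a small fused FC + ReLU comprehension, written just to show the workflow
    lang = """
    def fcrelu(float(B, M) I, float(N, M) W1, float(N) B1) -> (O1) {
        O1(b, n) +=! I(b, m) * W1(n, m)
        O1(b, n) = O1(b, n) + B1(n)
        O1(b, n) = fmax(O1(b, n), 0)
    }
    """
    fcrelu = tc.define(lang, name="fcrelu")
    I = torch.randn(100, 128).cuda()
    W1 = torch.randn(64, 128).cuda()
    B1 = torch.randn(64).cuda()
    out = fcrelu(I, W1, B1)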
docs/source/framework/pytorch_integration/writing_layers.rst: 21 additions & 16 deletions
@@ -57,13 +57,13 @@ There are two ways to set the :code:`Options`:

* **Autotuning**: You can autotune the kernel on certain input tensor sizes, cache the options and use them to run the layer. See :ref:`pytorch_autotune_layers` for how to autotune kernels.

- * **Default Mapping**: We provide various default options that can be chosen to closely represent the kernel. THe defaults provided are:
+ * **Default Mapping**: We provide various default options that can be chosen to closely represent the kernel. The defaults provided are:

* :code:`pointwise`: if kernel resembles a pointwise operation
* :code:`mlp`: if kernel resembles a Linear layer operation
* :code:`conv`: if kernel resembles a convolution operation
- * :code:`group_conv`: if kernel resembles a convolution operation
- * :code:`naive`: if none of the above, then chose naive Default
+ * :code:`group_conv`: if kernel resembles a group convolution operation
+ * :code:`naive`: if none of the above, then choose the naive default

An example for how to pass options:
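That example itself is not part of this hunk; a rough sketch of passing one of the default mappings, using a placeholder :code:`matmul` definition, could look like this (the exact :code:`tc.Options` spelling is an assumption based on the option names listed above):

    import torch
    import tensor_comprehensions as tc

    lang = """
    def matmul(float(M, K) A, float(K, N) B) -> (C) {
        C(m, n) +=! A(m, kk) * B(kk, n)
    }
    """
    matmul = tc.define(lang, name="matmul")
    A, B = torch.randn(32, 64).cuda(), torch.randn(64, 16).cuda()

    # pick the default mapping that most resembles the kernel, here `mlp`
    out = matmul(A, B, options=tc.Options("mlp"))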
@@ -126,9 +126,12 @@ happens only once and then you can keep running the layer.
Multiple TC definitions in language
-----------------------------------

- Let's say you want to define all of your TCs in one string and later keep running
- them. You an do so easily. Every time you want to run a different layer, you can
- make a :code:`tc.define` call and get the layer.
+ Let's say you want to define all of your TCs in one string and later use that string
+ for running different operations defined in the string. You can do so easily. You
+ can define a :code:`lang` variable that holds the TC definition for all your operations.
+ Every time you want to run a different operation, you can make a :code:`tc.define` call
+ on the :code:`lang` variable, specify the :code:`name` corresponding to the operation
+ definition and get the TC layer for it. Below is an example for how to do this:

.. code-block:: python
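The code block referenced above falls outside this hunk; a rough sketch of the pattern it describes, with placeholder :code:`matmul` and :code:`relu` definitions held in one :code:`lang` string, might be:

    import torch
    import tensor_comprehensions as tc

    # one `lang` string holding several TC definitions
    lang = """
    def matmul(float(M, K) A, float(K, N) B) -> (C) {
        C(m, n) +=! A(m, kk) * B(kk, n)
    }
    def relu(float(B, M) I) -> (O1) {
        O1(b, m) = fmax(I(b, m), 0)
    }
    """

    # select the operation to run by passing its `name` to tc.define
    matmul = tc.define(lang, name="matmul")
    out1 = matmul(torch.randn(3, 4).cuda(), torch.randn(4, 5).cuda())

    relu = tc.define(lang, name="relu")
    out2 = relu(torch.randn(3, 4).cuda())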
@@ -215,7 +218,7 @@ adopt whatever feels more convenient.
out = avgpool(inp)


- Manually Injecting external CUDA code
+ Manually injecting external CUDA code
-------------------------------------

If you have efficient external CUDA code that you want to use rather than
@@ -248,17 +251,19 @@ call. For example:
a, b = torch.randn(100).cuda(), torch.randn(100).cuda()
out = add(a, b, grid=[1, 1, 1], block=[100, 1, 1])

- In such cases, please note that TC doesn't modify the injected CUDA kernel. It will
- simply run the kernel injected as is and TC will also not guarantee the performance
- of the kernel. User needs to specify the :code:`grid` and :code:`block` values
- when running the layer and TC will simply use those settings.
+ .. note::
+
+     In such cases, please note that TC doesn't modify the injected CUDA kernel. It will
+     simply run the injected kernel as is, and TC will also not guarantee the performance
+     of the kernel. The user needs to specify the :code:`grid` and :code:`block` values
+     when running the layer and TC will simply use those settings.


- Builtin Functions
- -----------------
+ Built-in Functions
+ ------------------

- TC allows using some CUDA builtin functions as well when defining the TC language.
- During the execution, CUDA API will be called for those builtin functions. For example,
+ TC allows using some CUDA built-in functions as well when defining the TC language.
+ During the execution, the CUDA API will be called for those built-in functions. For example,
let's say we want to use the :code:`fmax` CUDA function in our TC language. An example
for how this would be done is below:
@@ -275,7 +280,7 @@ for how this would be done is below:
inp = torch.randn(100, 128).cuda()
out = relu(inp)

- TC supports only a few builtin CUDA functions and not all. You can find the documentation
+ TC only supports a subset of built-in CUDA functions. You can find the documentation
for these functions at the official CUDA documentation `here <http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__SINGLE.html#group__CUDA__MATH__SINGLE>`_.
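The TC definition behind that :code:`relu` example sits outside the hunk; a sketch of what an :code:`fmax`-based definition and the surrounding calls could look like (the definition is an assumed reconstruction, not quoted from the file):

    import torch
    import tensor_comprehensions as tc

    # a ReLU written with the CUDA built-in `fmax`, matching the usage shown above
    lang = """
    def relu(float(B, M) I) -> (O1) {
        O1(b, m) = fmax(I(b, m), 0)
    }
    """
    relu = tc.define(lang, name="relu")
    inp = torch.randn(100, 128).cuda()
    out = relu(inp)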
Tensor Comprehensions (TC) is a framework-agnostic library to **automatically**
- synthesize high-performance Machine Learning kernels. TC relies on
+ synthesize high-performance machine learning kernels. TC relies on
`Halide <https://github.com/halide/Halide>`_ IR to express computation and analysis
tools to reason about it. TC uses :code:`polyhedral` compilation techniques to
(semi-)automatically decide how to perform this computation efficiently and produce
fast code. We also provide TC integration with PyTorch and Caffe2.

+ To automatically tune the performance of the kernel, we provide a genetic-algorithm-based
+ **Autotuner**, details of which are available at :ref:`pytorch_autotune_layers`.
+
To read more about Tensor Comprehensions, see our documentation available
at https://facebookresearch.github.io/TensorComprehensions/ and C++ API documentation is
available at https://facebookresearch.github.io/TensorComprehensions/api.

We provide many **python examples** for expressing and running various ML layers
with TC. The examples can be found `here <https://github.com/facebookresearch/TensorComprehensions/tree/master/test_python/layers>`_.

- To read more about Framework integrations, checkout our documentation on `PyTorch <https://facebookresearch.github.io/TensorComprehensions/framework/pytorch_integration/getting_started.html>`_ integration
- and `Caffe2 <https://facebookresearch.github.io/TensorComprehensions/framework/caffe2_integration/integration_with_example.html>`_
- integration.
+ To read more about framework integrations, check out our documentation on `PyTorch integration <https://facebookresearch.github.io/TensorComprehensions/framework/pytorch_integration/getting_started.html>`_
+ and `Caffe2 integration <https://facebookresearch.github.io/TensorComprehensions/framework/caffe2_integration/integration_with_example.html>`_.

If you want to **integrate your framework** with TC, it's easy and the instructions are
available at https://facebookresearch.github.io/TensorComprehensions/integrating_any_ml_framework.html