Minor fix up to CUDA blog post based on feedback (#7361)
diff --git a/docs/website/docs/blog/2021-10-15-cuda-backend.md b/docs/website/docs/blog/2021-10-15-cuda-backend.md
index 6e5e8f7..b70bf88 100644
--- a/docs/website/docs/blog/2021-10-15-cuda-backend.md
+++ b/docs/website/docs/blog/2021-10-15-cuda-backend.md
@@ -1,5 +1,5 @@
- Tuesday, October 15, 2021<br>
+ Friday, October 15, 2021<br>
By Thomas Raoux
# CUDA Backend in IREE
@@ -17,7 +17,7 @@
## Bring up
-### Hal support
+### HAL support
IREE has a [HAL API](https://github.com/google/iree/blob/main/docs/developers/design_roadmap.md#hal-hardware-abstraction-layer-and-multi-architecture-executables)
that abstracts all the targets behind a common interface. The first step to
@@ -44,8 +44,8 @@
(CUDA LLVM variant) and use LLVM's backend to generate PTX. The CUDA driver
will do the "last mile compilation" at runtime to convert PTX into the GPU's native ISA.
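
For readers curious what that last step looks like in practice, here is a
minimal sketch using the public CUDA driver API (illustrative only, not IREE's
runtime code; `ptx_text` and the `dispatch_0` kernel name are placeholders):

```
// Minimal sketch: hand PTX text to the CUDA driver, which JIT-compiles
// it ("last mile compilation") to the GPU's native ISA at load time.
#include <cuda.h>
#include <stdio.h>

extern const char *ptx_text;  // PTX emitted by LLVM's NVPTX backend

int main(void) {
  CUdevice dev;
  CUcontext ctx;
  CUmodule mod;
  CUfunction fn;

  cuInit(0);
  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, 0, dev);

  // The PTX -> native ISA compilation happens inside this call.
  if (cuModuleLoadData(&mod, ptx_text) != CUDA_SUCCESS) {
    fprintf(stderr, "PTX JIT failed\n");
    return 1;
  }
  // "dispatch_0" is a hypothetical kernel name.
  cuModuleGetFunction(&fn, mod, "dispatch_0");

  // Launch one block of 128 threads; this kernel takes no arguments.
  cuLaunchKernel(fn, /*grid=*/1, 1, 1, /*block=*/128, 1, 1,
                 /*sharedMem=*/0, /*stream=*/NULL, /*params=*/NULL, NULL);
  cuCtxSynchronize();
  cuCtxDestroy(ctx);
  return 0;
}
```

A nice property of deferring this step to the driver is forward compatibility:
the same PTX can be JIT-compiled for GPU architectures newer than the one
targeted at build time.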
-IREE compiler pipeline starts from linalg on tensor representation. A large part
-of the compiler is independent of the target.
+The IREE compiler pipeline starts from [linalg](https://mlir.llvm.org/docs/Dialects/Linalg/)
+with tensor operands. A large part of the compiler is independent of the target.
The linalg on tensor representation of the graph is broken up into dispatch
regions that are processed by NVVM Codegen. A simple implementation of the
@@ -136,8 +136,8 @@
optimized to access 128 bits of data per thread. Therefore it is critical to
vectorize load/store operations.
After tiling to a suitable size, we vectorize the IR so that vector reads/writes map to
-load4/store4. This helps significantly improve the memory access pattern of the
-code generated.
+load4/store4. This significantly improves the memory access pattern of the
+generated code.
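
To make the target access pattern concrete, here is a hand-written CUDA sketch
of what 128-bit-per-thread accesses look like (an assumed illustration, not
code produced by IREE): each thread copies one `float4`, which compiles to the
vectorized `ld.global.v4.f32` / `st.global.v4.f32` ("load4/store4")
instructions.

```
// Each thread moves 128 bits (one float4) per load and per store,
// instead of four separate 32-bit transactions.
__global__ void copy_vec4(const float4 *in, float4 *out, int n4) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n4) {
    out[i] = in[i];  // one 128-bit load, one 128-bit store
  }
}
```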
The vectorization converts the previous IR to:
```