# TensorFlow e2e tests
This is a collection of e2e tests that save a TensorFlow model, compile it with
IREE, run it on multiple backends and crosscheck the results.
## Prerequisites
You will need a TensorFlow 2.0+ nightly build installed in your Python
environment: the Python binary in `$PYTHON_BIN` should be able to
`import tensorflow`, and that TensorFlow should be version 2.0+. This can be
checked with `tensorflow.version.VERSION`.
See [Install TensorFlow with pip](https://www.tensorflow.org/install/pip) for
instructions.
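For example, a quick check from Python (nothing here is IREE-specific):
```python
import tensorflow as tf

# Should print a 2.x version string, e.g. '2.4.0-dev20200803' for a nightly.
print(tf.version.VERSION)
```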
## Vulkan Setup
If you do not have your environment set up to use IREE with Vulkan (see
[the doc](../../../docs/vulkan_and_spirv.md)), then you can run the manual test
targets with `--target_backends=tf,iree_vmla,iree_llvmjit` (that is, by omitting
`iree_vulkan` from the list of backends to run the tests on).
The test suites can be run excluding Vulkan by specifying
`--test_tag_filters="-driver=vulkan"` in the `bazel test` invocation, or by
adding `test --test_tag_filters="-driver=vulkan"` to your `user.bazelrc`.
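Concretely, the two ways to exclude Vulkan look like this:
```shell
# Run a manual test target on every backend except iree_vulkan.
bazel run :math_test_manual -- --target_backends=tf,iree_vmla,iree_llvmjit

# Exclude tests tagged with the Vulkan driver from a test suite run.
bazel test :e2e_tests --test_tag_filters="-driver=vulkan"
```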
## Compiling `tf.Module`s
Compatible TensorFlow modules can be compiled to specific IREE backends using
`IreeCompiledModule`. This also optionally saves compilation artifacts to a
specified directory. These artifacts include: MLIR across various lowerings, a
TensorFlow SavedModel, and the compiled VM FlatBuffer. A basic example of
creating and calling an `IreeCompiledModule` can be found in
[`tf_utils_test.py`](https://github.com/google/iree/blob/main/integrations/tensorflow/bindings/python/pyiree/tf/support/tf_utils_test.py).
When using Keras models or `tf.Module`s with functions that IREE can't compile,
`exported_names` should be specified. For example:
```python
from pyiree.tf.support import tf_utils
vmla_module = tf_utils.IreeCompiledModule(
    module_class=KerasTFModuleClass,
    backend_info=tf_utils.BackendInfo('iree_vmla'),
    exported_names=['predict'])
vmla_module.predict(...)
```
By default, the TensorFlow SavedModels are not kept. This can be overridden
via the `--keep_saved_model` flag.
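For example (a sketch; `--keep_saved_model` is passed to the test binary after
`--`, like the other test flags shown below):
```shell
# Keep the TensorFlow SavedModel artifacts for later inspection.
bazel run :math_test_manual -- --keep_saved_model
```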
## Running Tests
For locally running tests and iterating on backend development, `bazel run` is
preferred.
```shell
# Run math_test on all backends.
bazel run :math_test_manual

# Run math_test comparing TensorFlow to itself (e.g. to debug randomization).
bazel run :math_test_manual -- --target_backends=tf

# Run math_test comparing the VMLA backend and TensorFlow.
bazel run :math_test_manual -- --target_backends=iree_vmla

# Run math_test comparing the VMLA backend to itself multiple times.
bazel run :math_test_manual -- \
  --reference_backend=iree_vmla --target_backends=iree_vmla,iree_vmla

# Run math_test and output on failure.
bazel test :math_test_manual --test_output=errors

# Run an individual test interactively.
bazel test :math_test_manual --test_output=streamed
```
For reproducibility of the unit tests, `CompiledModule()` sets the random seeds
of `tf`, `numpy` and `python` by calling `tf_utils.set_random_seed()` before
model creation.
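A minimal sketch of what seeding all three sources involves (the actual
`tf_utils.set_random_seed()` implementation may differ):
```python
import random

import numpy as np
import tensorflow as tf


def set_random_seed(seed=0):
  """Seeds the Python, NumPy and TensorFlow RNGs so runs are reproducible."""
  random.seed(seed)
  np.random.seed(seed)
  tf.random.set_seed(seed)
```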
## Writing Tests
Our tests use the `TracedModule` class to capture and store all of the inputs
and outputs of a `CompiledModule` in a `Trace`. Each unit test on a `TestCase`
uses the `compare_backends` method, which runs the function it is passed with a
`TracedModule` once for each reference and target backend. The inputs and
outputs to these modules are then checked for correctness, using the reference
backend as the source of truth. For example:
```python
# Compile a `tf.Module` named `SimpleArithmeticModule` into a `CompiledModule`.
@tf_test_utils.compile_module(SimpleArithmeticModule)
# Inherit from `TracedModuleTestCase`.
class SimpleArithmeticTest(tf_test_utils.TracedModuleTestCase):
  # Unit test.
  def test_simple_mul(self):

    # Trace function.
    def simple_mul(module):
      # A random seed is automatically set before each call to `simple_mul`.
      a = tf_utils.uniform([4])
      b = np.array([400., 5., 6., 7.], dtype=np.float32)

      # The inputs `a` and `b` are recorded along with the output `c`.
      c = module.simple_mul(a, b)

      # The inputs `a` and `c` are recorded along with the (unnamed) output
      # `module.simple_mul` returns.
      module.simple_mul(a, c)

    # Calls `simple_mul` once for each backend, recording the inputs and
    # outputs to `module` and then comparing them.
    self.compare_backends(simple_mul)
```
## Test Suites
Test targets are automatically generated for each test file and for each backend
to check numerical correctness against TensorFlow. Test targets that pass are
placed into the `e2e_tests` test suite. Tests that fail on particular backends
are recorded in lists in the `BUILD` files. For example, if
`experimental_new_test.py` fails on the `iree_llvmjit` and `iree_vulkan`
backends then the following lines should be added to the `BUILD` file:
```build
LLVM_FAILING = [
    ...
    "experimental_new_test.py",
    ...
]

VULKAN_FAILING = [
    ...
    "experimental_new_test.py",
    ...
]
```
Test targets for these backends are placed into the `e2e_tests_failing` test
suite. Test targets in these test suites can be run as follows:
```shell
# Run all e2e tests that are expected to pass.
bazel test :e2e_tests

# Run all e2e tests that are expected to fail.
bazel test :e2e_tests_failing

# Run a specific failing e2e test target.
# Note that generated test targets are prefixed with their test suite name.
bazel test :e2e_tests_failing_broadcasting_test__tf__iree_vulkan
```
## Generated Artifacts
By default, running an E2E test generates a number of compilation, debugging and
benchmarking artifacts in `/tmp/iree/modules/`. The location of these artifacts
can be changed via the `--artifacts_dir` flag. The generated directory structure
for each module is as follows:
```
/tmp/iree/modules/ModuleName
├── tf_input.mlir       # MLIR for ModuleName in TF's input dialect
├── iree_input.mlir     # tf_input.mlir translated to IREE MLIR
├── backend_name_1      # e.g. iree_vmla, tf or tf_ref
│   ├── compiled.vmfb   # flatbuffer of ModuleName compiled to this backend
│   ├── saved_model     # Only created if --keep_saved_model is specified.
│   └── traces
│       ├── trace_1     # Directory storing logs and serialization for each trace.
│       │   └── log.txt # A more detailed version of the test logs
│       └── trace_2
│           └── log.txt
└── backend_name_2
    └── ...
```
Traces for a particular test can be loaded via the `Trace.load(trace_dir)`
method. For example:
```python
ref_trace = Trace.load("/tmp/iree/modules/ModuleName/tf_ref/traces/predict/")
tar_trace = Trace.load("/tmp/iree/modules/ModuleName/iree_vmla/traces/predict/")
abs_diff = np.abs(ref_trace.calls[0].outputs[0] - tar_trace.calls[0].outputs[0])
print(np.mean(abs_diff))
```
Traces are named after the trace functions defined in their unittests. So in the
`SimpleArithmeticModule` example above, the `trace_dir` would be
`/tmp/iree/modules/SimpleArithmeticModule/iree_vmla/traces/simple_mul/`.
## Benchmarking E2E Modules
Abseil flagfiles containing all of the data that `iree-benchmark-module` needs
to run are generated for each `Trace` in our E2E tests. This allows any module
we test to be easily benchmarked on valid inputs. The process for
benchmarking a vision model can thus be reduced to the following:
```shell
# Generate benchmarking artifacts for all vision models:
bazel test integrations/tensorflow/e2e/keras:vision_external_tests

# Benchmark ResNet50 with cifar10 weights on vmla:
bazel run iree/tools:iree-benchmark-module -- \
  --flagfile=/tmp/iree/modules/ResNet50/cifar10/iree_vmla/traces/predict/flagfile

# Benchmark ResNet50 with cifar10 weights on llvmjit:
bazel run iree/tools:iree-benchmark-module -- \
  --flagfile=/tmp/iree/modules/ResNet50/cifar10/iree_llvmjit/traces/predict/flagfile
```
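To see exactly which flags a given trace will pass to `iree-benchmark-module`,
you can inspect the generated flagfile directly:
```shell
cat /tmp/iree/modules/ResNet50/cifar10/iree_vmla/traces/predict/flagfile
```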
Duplicate flags provided after the flagfile will take precedence. For example:
```shell
bazel run iree/tools:iree-benchmark-module -- \
  --flagfile=/tmp/iree/modules/ResNet50/cifar10/iree_llvmjit/traces/predict/flagfile \
  --input_file=/path/to/custom/compiled.vmfb
```
Currently, this only supports benchmarking the first module call in a trace. We
plan to extend this to support benchmarking all of the calls in the trace, and
also plan to support verifying outputs during the warm-up phase of the
benchmark.
## Debugging Tests
If the compiler fails to compile the program, it will create a crash reproducer
(see the [MLIR documentation](https://mlir.llvm.org/docs/WritingAPass/)), which
allows the bug to be reproduced with an appropriate "opt" tool. Further
debugging iteration can then happen in opt.
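As a rough sketch (the exact flag spelling depends on your MLIR/IREE build;
check `iree-opt --help`), replaying a reproducer looks something like:
```shell
# Hypothetical path; MLIR writes the failing IR and pass pipeline into the
# reproducer file, which an "opt" tool can then re-run.
iree-opt /path/to/crash-reproducer.mlir -run-reproducer
```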
TODO(silvasean): debugging miscompiles
## Test Harnesses
### Simple function tests
See `simple_arithmetic_test.py` for some basic examples.