docs: update sparkml doc; cleanups. (#559)
Signed-off-by: Jason Wang <[email protected]>
memoryz authored Jun 8, 2022
1 parent e298dfb commit f0fdf12
Showing 2 changed files with 75 additions and 57 deletions.
38 changes: 27 additions & 11 deletions README.md
@@ -1,14 +1,16 @@
<!--- SPDX-License-Identifier: Apache-2.0 -->
#

![ONNXMLTools_logo_main](docs/ONNXMLTools_logo_main.png)

<p align="center"><img width="40%" src="docs/ONNXMLTools_logo_main.png" /></p>
| Linux | Windows |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [![Build Status](https://dev.azure.com/onnxmltools/onnxmltools/_apis/build/status/onnxmltools-linux-conda-ci?branchName=master)](https://dev.azure.com/onnxmltools/onnxmltools/_build/latest?definitionId=3?branchName=master) | [![Build Status](https://dev.azure.com/onnxmltools/onnxmltools/_apis/build/status/onnxmltools-win32-conda-ci?branchName=master)](https://dev.azure.com/onnxmltools/onnxmltools/_build/latest?definitionId=3?branchName=master) |

| Linux | Windows |
|-------|---------|
| [![Build Status](https://dev.azure.com/onnxmltools/onnxmltools/_apis/build/status/onnxmltools-linux-conda-ci?branchName=master)](https://dev.azure.com/onnxmltools/onnxmltools/_build/latest?definitionId=3?branchName=master)| [![Build Status](https://dev.azure.com/onnxmltools/onnxmltools/_apis/build/status/onnxmltools-win32-conda-ci?branchName=master)](https://dev.azure.com/onnxmltools/onnxmltools/_build/latest?definitionId=3?branchName=master)|
## Introduction

# Introduction
ONNXMLTools enables you to convert models from different machine learning toolkits into [ONNX](https://onnx.ai). Currently the following toolkits are supported:

* Tensorflow (a wrapper of [tf2onnx converter](https://github.com/onnx/tensorflow-onnx/))
* scikit-learn (a wrapper of [skl2onnx converter](https://github.com/onnx/sklearn-onnx/))
* Apple Core ML
@@ -18,22 +20,30 @@ ONNXMLTools enables you to convert models from different machine learning toolki
* XGBoost
* H2O
* CatBoost
<p>Pytorch has its builtin ONNX exporter check <a href="https://pytorch.org/docs/stable/onnx.html">here</a> for details</p>

Pytorch has its builtin ONNX exporter check [here](https://pytorch.org/docs/stable/onnx.html) for details.

## Install

You can install latest release of ONNXMLTools from [PyPi](https://pypi.org/project/onnxmltools/):
```

```bash
pip install onnxmltools
```

or install from source:
```

```bash
pip install git+https://github.com/microsoft/onnxconverter-common
pip install git+https://github.com/onnx/onnxmltools
```

If you choose to install `onnxmltools` from its source code, you must set the environment variable `ONNX_ML=1` before installing the `onnx` package.

## Dependencies

This package relies on ONNX, NumPy, and ProtoBuf. If you are converting a model from scikit-learn, Core ML, Keras, LightGBM, SparkML, XGBoost, H2O, CatBoost or LibSVM, you will need an environment with the respective package installed from the list below:

1. scikit-learn
2. CoreMLTools (version 3.1 or lower)
3. Keras (version 2.0.8 or higher) with the corresponding Tensorflow version
@@ -47,9 +57,11 @@ This package relies on ONNX, NumPy, and ProtoBuf. If you are converting a model
ONNXMLTools is tested with Python **3.7+**.

# Examples

If you want the converted ONNX model to be compatible with a certain ONNX version, please specify the target_opset parameter upon invoking the convert function. The following Keras model conversion example demonstrates this below. You can identify the mapping from ONNX Operator Sets (referred to as opsets) to ONNX releases in the [versioning documentation](https://github.com/onnx/onnx/blob/master/docs/Versioning.md#released-versions).
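The release-to-opset relationship described above can be sketched as a small lookup. This is only an illustration: the table is a partial, hand-copied subset of the versioning documentation, and `target_opset_for` is a hypothetical helper name, not part of `onnxmltools`. The linked versioning table remains the authoritative source.

```python
# Partial mapping from ONNX release (major.minor) to the default-domain opset it ships.
# Assumption: copied by hand from the ONNX versioning table; verify before relying on it.
ONNX_RELEASE_TO_OPSET = {
    "1.2": 7,
    "1.3": 8,
    "1.4": 9,
    "1.5": 10,
    "1.6": 11,
    "1.7": 12,
    "1.8": 13,
    "1.9": 14,
    "1.10": 15,
    "1.11": 16,
}


def target_opset_for(onnx_release: str) -> int:
    """Return the opset to pass as target_opset for an ONNX release (hypothetical helper)."""
    # Keep only major.minor so that "1.2.3" resolves like "1.2".
    key = ".".join(onnx_release.split(".")[:2])
    try:
        return ONNX_RELEASE_TO_OPSET[key]
    except KeyError:
        raise ValueError(f"No opset recorded for ONNX release {onnx_release!r}") from None
```

The returned value would then be passed as the `target_opset` argument of the convert function, for example `target_opset=target_opset_for("1.2")`.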

## Keras to ONNX Conversion

Next, we show an example of converting a Keras model into an ONNX model with `target_opset=7`, which corresponds to ONNX release version 1.2.

```python
@@ -83,6 +95,7 @@ onnx_model = onnxmltools.convert_keras(keras_model, target_opset=7)
```

## CoreML to ONNX Conversion

Here is a simple code snippet to convert a Core ML model into an ONNX model.

```python
@@ -100,7 +113,8 @@ onnxmltools.utils.save_model(onnx_model, 'example.onnx')
```

## H2O to ONNX Conversion
Below is a code snippet to convert a H2O MOJO model into an ONNX model. The only pre-requisity is to have a MOJO model saved on the local file-system.

Below is a code snippet to convert a H2O MOJO model into an ONNX model. The only prerequisite is to have a MOJO model saved on the local file-system.

```python
import onnxmltools
@@ -122,7 +136,7 @@ backend of your choice.

You can check the operator set of your converted ONNX model using [Netron](https://github.com/lutzroeder/Netron), a viewer for Neural Network models. Alternatively, you could identify your converted model's opset version through the following line of code.

```
```python
opset_version = onnx_model.opset_import[0].version
```

@@ -138,7 +152,8 @@ All converter unit test can generate the original model and converted model to a
[onnxruntime](https://pypi.org/project/onnxruntime/) or
[onnxruntime-gpu](https://pypi.org/project/onnxruntime-gpu/).
The unit test cases are all the normal python unit test cases, you can run it with pytest command line, for example:
```

```bash
python -m pytest --ignore .\tests\
```

@@ -159,4 +174,5 @@ be added in *tests_backend* to compute the prediction
with the runtime.

# License

[Apache License v2.0](LICENSE)
94 changes: 48 additions & 46 deletions onnxmltools/convert/sparkml/README.md
@@ -1,6 +1,6 @@
<!--- SPDX-License-Identifier: Apache-2.0 -->

# Spark ML to Onnx Model Conversion
# Spark ML to ONNX Model Conversion

There is prep work needed above and beyond calling the API. In short these steps are:

@@ -9,72 +9,74 @@ There is prep work needed above and beyond calling the API. In short these steps
* taking the output Tensor(s) and converting it(them) back to a DataFrame if further processing is required.

## Instructions

For examples, please see the unit tests under `tests/sparkml`

1- Create a list of input types needed to be supplied to the `convert_sparkml()` call.
For simple cases you can use `buildInitialTypesSimple()` function in `convert/sparkml/utils.py`.
To use this function just pass your test DataFrame.
1. Create a list of input types needed to be supplied to the `convert_sparkml()` call.

For simple cases you can use `buildInitialTypesSimple()` function in `convert/sparkml/utils.py`.
To use this function just pass your test DataFrame.

Otherwise, the conversion code requires a list of tuples with input names and their corresponding Tensor types, as shown below:

Otherwise, the conversion code requires a list of tuples with input names and their corresponding Tensor types, as shown below:
```python
initial_types = [
("label", StringTensorType([1, 1])),
# (repeat for the required inputs)
]
```
Note that the input names are the same as columns names from your DataFrame and they must match the "inputCol(s)" values
```python
initial_types = [
("label", StringTensorType([1, 1])),
# (repeat for the required inputs)
]
```

you provided when you created your Pipeline.
Note that the input names are the same as columns names from your DataFrame and they must match the "inputCol(s)" values

2- Now you can create the ONNX model from your pipeline model like so:
```python
pipeline_model = pipeline.fit(training_data)
onnx_model = convert_sparkml(pipeline_model, 'My Sparkml Pipeline', initial_types)
```
you provided when you created your Pipeline.

3- (optional) You could save the ONNX model for future use or further examination by using the `SerializeToString()`
2. Now you can create the ONNX model from your pipeline model like so:

```python
pipeline_model = pipeline.fit(training_data)
onnx_model = convert_sparkml(pipeline_model, 'My Sparkml Pipeline', initial_types)
```

3. (optional) You could save the ONNX model for future use or further examination by using the `SerializeToString()`
method of ONNX model

```python
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```
```python
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```

4- Before running this model (e.g. using `onnxruntime`) you need to create a `dict` from the input data. This dictionay
4. Before running this model (e.g. using `onnxruntime`) you need to create a `dict` from the input data. This dictionary
will have entries for each input name and its corresponding TensorData. For simple cases you could use the function
`buildInputDictSimple()` and pass your testing DataFrame to it. Otherwise, you need to create something like the following:

```python
input_data = {}
input_data['label'] = test_df.select('label').toPandas().values
# ... (repeat for all desired inputs)
```
```python
input_data = {}
input_data['label'] = test_df.select('label').toPandas().values
# ... (repeat for all desired inputs)
```

5. (optional) You could save the converted input data for possible debugging or future reuse. See below:

5- (optional) You could save the converted input data for possible debugging or future reuse. See below:
```python
with open("input_data", "wb") as f:
    pickle.dump(input, f)
```
```python
with open("input_data", "wb") as f:
    pickle.dump(input, f)
```

6- And finally run the newly converted ONNX model in the runtime:
```python
sess = onnxruntime.InferenceSession(onnx_model)
output = sess.run(None, input_data)
6. And finally run the newly converted ONNX model in the runtime:

```
This output may need further conversion back to a DataFrame.
```python
sess = onnxruntime.InferenceSession(onnx_model)
output = sess.run(None, input_data)
```

This output may need further conversion back to a DataFrame.
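
As a rough illustration of that last conversion step, the sketch below turns a list of per-output columns into a list of per-row dictionaries, which is one shape from which a DataFrame can be rebuilt. It assumes every output holds exactly one entry per input row along its first axis; `outputs_to_rows` is a hypothetical helper, and the output names are placeholders (with `onnxruntime` they could be read from `sess.get_outputs()`).

```python
from typing import Any, Dict, List, Sequence


def outputs_to_rows(
    output_names: Sequence[str],
    outputs: Sequence[Sequence[Any]],
) -> List[Dict[str, Any]]:
    """Pair each output column with its name and regroup the values row by row."""
    if len(output_names) != len(outputs):
        raise ValueError("Expected one name per output")
    # Materialize each output so that its length (number of rows) can be checked.
    columns = [list(column) for column in outputs]
    if len({len(column) for column in columns}) > 1:
        raise ValueError("All outputs must contain the same number of rows")
    # zip(*columns) walks the columns in parallel, yielding one tuple per row.
    return [dict(zip(output_names, values)) for values in zip(*columns)]


# Example with plain lists standing in for the arrays returned by sess.run().
rows = outputs_to_rows(
    ["prediction", "probability"],
    [[1, 0], [[0.2, 0.8], [0.9, 0.1]]],
)
```

Depending on the Spark version, array values may need converting to plain Python types before the rows are turned back into a DataFrame.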

## Known Issues

1. Overall invalid data handling is problematic and not implemented in most cases.
Make sure your data is clean.
1. Overall invalid data handling is problematic and not implemented in most cases. Make sure your data is clean.

2. OneHotEncoderEstimator must not drop the last bit: OneHotEncoderEstimator has an option
which you can use to make sure the last bit is included in the vector: `dropLast=False`
2. When converting `OneHotEncoderModel` to ONNX, if `handleInvalid` is set to `"keep"`, then `dropLast` must be set to `True`. If `handleInvalid` is set to `"error"`, then `dropLast` must be set to `False`.

3. Use FloatTensorType for all numbers (instead of Int6t4Tensor or other variations)
3. Use `FloatTensorType` for all numbers (instead of `Int64Tensor` or other variations)

4. Some conversions, such as the one for Word2Vec, can only handle batch size of 1 (one input row)
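
The `OneHotEncoderModel` constraint in item 2 can be checked before conversion with a small guard. This is a minimal sketch of the rule exactly as stated above, not code from the converter; `required_drop_last` and `check_one_hot_settings` are hypothetical names, and `handleInvalid` values other than `"keep"` and `"error"` are left unchecked because the note does not cover them.

```python
from typing import Optional

# Rule as stated in the known issue: handleInvalid -> dropLast value the converter needs.
_REQUIRED_DROP_LAST = {"keep": True, "error": False}


def required_drop_last(handle_invalid: str) -> Optional[bool]:
    """Return the dropLast value required for this handleInvalid, or None if no rule is stated."""
    return _REQUIRED_DROP_LAST.get(handle_invalid)


def check_one_hot_settings(handle_invalid: str, drop_last: bool) -> None:
    """Raise ValueError when the encoder settings conflict with the stated rule."""
    required = required_drop_last(handle_invalid)
    if required is not None and drop_last != required:
        raise ValueError(
            f"handleInvalid={handle_invalid!r} requires dropLast={required}, "
            f"got dropLast={drop_last}"
        )
```

The two arguments would typically be read from the fitted encoder's parameters before calling `convert_sparkml()`.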
