[Doc] Graph Construction PR Refactor - PR1: GSProcessing Doc Refactor (awslabs#907)

*Issue #, if available:*

*Description of changes:*

Preview on Read the Docs:
https://jalencato-graphstorm-doc.readthedocs.io/en/gsprocessing-awsinfra-doc/graph-construction/index.html

* Refactor the doc structure for the existing GSProcessing doc.
* Rename the titles as previously discussed in the doc reorg plan.
* Fix a typo in the distributed setup doc: `--model-name` should be `--hf-model`.
* This is the first PR of the doc refactor.

By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

---------

Co-authored-by: Theodore Vasiloudis <[email protected]>
Co-authored-by: xiang song(charlie.song) <[email protected]>
3 people authored Jul 11, 2024
1 parent d5b1e03 commit 46c1331
Showing 14 changed files with 148 additions and 52 deletions.
@@ -1,7 +1,9 @@
.. _gsprocessing_sagemaker:

Running distributed jobs on Amazon SageMaker
============================================

Once the :doc:`Amazon SageMaker setup <distributed-processing-setup>` is complete, we can
Once the :ref:`Amazon SageMaker Setup<gsprocessing_distributed_setup>` is complete, we can
use the Amazon SageMaker launch scripts to launch distributed processing
jobs that use AWS resources.
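After a job has been submitted, a quick way to check on it from the command line is the AWS CLI;
the snippet below is only an illustrative sketch and assumes your credentials and default region are already configured.

.. code-block:: bash
# List the most recent SageMaker Processing jobs and their status.
aws sagemaker list-processing-jobs \
    --sort-by CreationTime --sort-order Descending --max-results 5 \
    --query 'ProcessingJobSummaries[].[ProcessingJobName,ProcessingJobStatus]' \
    --output table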

@@ -24,7 +26,7 @@ of up to 20 instances, allowing you to scale your processing to massive graphs,
using larger instances like ``ml.r5.24xlarge``.

Since we're now executing on AWS, we'll need access to an execution role
for SageMaker and the ECR image URI we created in :doc:`distributed-processing-setup`.
for SageMaker and the ECR image URI we created in :ref:`GSProcessing distributed setup<gsprocessing_distributed_setup>`.
For instructions on how to create an execution role for SageMaker
see the `AWS SageMaker documentation <https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html#sagemaker-roles-create-execution-role>`_.
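As a minimal sketch of gathering those two values from the CLI (the role name below is a placeholder
and the repository name is an assumption; substitute whatever you created during setup):

.. code-block:: bash
# Placeholder role and repository names -- replace with your own.
ROLE_ARN=$(aws iam get-role --role-name MySageMakerExecutionRole \
    --query 'Role.Arn' --output text)
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=$(aws configure get region)
IMAGE_URI="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:latest"
echo "Role: ${ROLE_ARN}  Image: ${IMAGE_URI}"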

@@ -75,7 +77,7 @@ enough to fit in the memory of the Spark leader.
For large graphs you will
want to launch that step as a separate job on an instance with more memory to avoid memory errors.
`ml.r5` instances should allow you to re-partition graph data with billions of nodes and edges.
For more details on the re-partitioning step see :doc:`row-count-alignment`.
For more details on the re-partitioning step see :ref:`row count alignment<row_count_alignment>`.

To run the re-partition job as a separate job use:

@@ -1,7 +1,9 @@
.. _gsprocessing_emr_serverless:

Running distributed jobs on EMR Serverless
==========================================

Once the :doc:`distributed processing setup <distributed-processing-setup>` is complete,
Once the :ref:`distributed processing setup<gsprocessing_distributed_setup>` is complete,
and we have built and pushed an EMR Serverless image tagged as ``graphstorm-processing-emr-serverless``, we can
set up our execution environment for EMR Serverless (EMR-S). If you're not familiar with EMR-S
we suggest going through its `introductory documentation <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_
@@ -99,7 +101,7 @@ from the image you just created. GSProcessing version ``0.2.2`` uses ``emr-6.13.
base image, so we need to ensure our application uses the same release.

Additionally, if you need to use text feature transformations with a Huggingface model, we suggest downloading the model cache inside the EMR Serverless
docker image: :doc:`distributed-processing-setup` to save cost and time. Please note that the maximum size for docker images in EMR Serverless is limited to 5GB:
docker image: :ref:`GSProcessing Distributed Setup<gsprocessing_distributed_setup>` to save cost and time. Please note that the maximum size for docker images in EMR Serverless is limited to 5GB:
`EMR Serverless Considerations and Limitations
<https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/application-custom-image.html#considerations>`_.
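To stay within that limit it can help to check the image size locally before pushing;
a rough sketch, assuming the image was tagged as described above:

.. code-block:: bash
# Report the local size of the EMR Serverless image.
docker images graphstorm-processing-emr-serverless \
    --format '{{.Repository}}:{{.Tag}}  {{.Size}}'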

@@ -160,7 +162,7 @@ With all the setup complete we should now have the following:
* An EMR-S application that uses our custom image.
* An execution role that our EMR-S jobs will use when we launch them.

To launch the same example job as we demonstrate in the :doc:`SageMaker Processing job guide <amazon-sagemaker>`
To launch the same example job as we demonstrate in the :ref:`SageMaker Processing job guide<gsprocessing_sagemaker>`
you can use the following ``bash`` snippet. Note that we use ``jq`` to wrangle JSON data,
which you can download from its `official website <https://jqlang.github.io/jq/download/>`_,
install using your package manager, or by running ``pip install jq``.
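As a small, illustrative example of the kind of JSON wrangling involved (the application name here is an assumption; adjust it to your own setup):

.. code-block:: bash
# Look up the EMR Serverless application id by name.
APPLICATION_ID=$(aws emr-serverless list-applications \
    | jq -r '.applications[] | select(.name == "gsprocessing-emr-s") | .id')
echo "EMR Serverless application: ${APPLICATION_ID}"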
@@ -239,7 +241,7 @@ on an instance with S3 access (where we installed GSProcessing):
gs-repartition --input-prefix ${OUTPUT_PREFIX}
Or if your data are too large for the re-partitioning job to run locally, you can
launch a SageMaker job as below after following the :doc:`distributed processing setup <distributed-processing-setup>`
launch a SageMaker job as below after following the :ref:`distributed processing setup<gsprocessing_distributed_setup>`
and building the GSProcessing SageMaker ECR image:

.. code-block:: bash
@@ -261,7 +263,7 @@ Note that ``${OUTPUT_PREFIX}`` here will need to match the value assigned when l
the EMR-S job, i.e. ``"s3://${OUTPUT_BUCKET}/gsprocessing/emr-s/small-graph/4files/"``

For more details on the re-partitioning step see
:doc:`row-count-alignment`.
:ref:`row count alignment<row_count_alignment>`.

Examine the output
------------------
@@ -1,7 +1,9 @@
.. _gsprocessing_emr_ec2:

Running distributed jobs on EMR on EC2
======================================

Once the :doc:`distributed processing setup <distributed-processing-setup>` is complete,
Once the :ref:`distributed processing setup<gsprocessing_distributed_setup>` is complete,
and we have built and pushed an EMR image tagged as ``graphstorm-processing-emr``, we can
set up our execution environment for EMR. If you're not familiar with EMR
we suggest going through its
@@ -191,7 +193,7 @@ successfully, we can run:
We should see the file ``updated_row_counts_metadata.json`` in the output,
which means our data are ready for distributed partitioning.
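One quick way to verify this, assuming the job's S3 output location is stored in ``${OUTPUT_PREFIX}`` (adjust to your own prefix):

.. code-block:: bash
# Confirm the re-partitioned metadata file exists in the output location.
aws s3 ls "${OUTPUT_PREFIX}" --recursive | grep updated_row_counts_metadata.json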

If the re-partitioning failed, we can run a separate job, see :doc:`row-count-alignment`
If the re-partitioning failed, we can run a separate job, see :ref:`row count alignment<row_count_alignment>`
for details.

Run distributed partitioning and training on Amazon SageMaker
37 changes: 37 additions & 0 deletions docs/source/graph-construction/gs-processing/aws-infra/index.rst
@@ -0,0 +1,37 @@
================================================
Running distributed processing jobs on AWS Infra
================================================

After successfully building the Docker image and pushing it to
`Amazon ECR <https://docs.aws.amazon.com/ecr/>`_,
you can now initiate GSProcessing jobs with AWS resources.

We support running GSProcessing jobs on different AWS execution environments including:
`Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_,
`EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_, and
`EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_.


Running distributed jobs on `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_:

.. toctree::
:maxdepth: 1
:titlesonly:

amazon-sagemaker.rst

Running distributed jobs on `EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_:

.. toctree::
:maxdepth: 1
:titlesonly:

emr-serverless.rst

Running distributed jobs on `EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_:

.. toctree::
:maxdepth: 1
:titlesonly:

emr.rst
@@ -1,3 +1,5 @@
.. _row_count_alignment:

Row count alignment
===================

@@ -1,3 +1,5 @@
.. _distributed_construction_example:

GraphStorm Processing Example
=============================

Expand Down Expand Up @@ -48,7 +50,7 @@ Apart from the data, GSProcessing also requires a configuration file that descri
data and the transformations we will need to apply to the features and any encoding needed for
labels.
We support both the `GConstruct configuration format <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`_
, and the library's own GSProcessing format, described in :doc:`/gs-processing/developer/input-configuration`.
, and the library's own GSProcessing format, described in :ref:`GSProcessing Input Configuration<gsprocessing_input_configuration>`.

.. note::
We expect end users to only provide a GConstruct configuration file,
@@ -61,7 +63,7 @@ We support both the `GConstruct configuration format <https://graphstorm.readthe
as we do with GConstruct.

For a detailed description of all the entries of the GSProcessing configuration file see
:doc:`/gs-processing/developer/input-configuration`.
:ref:`GSProcessing Input Configuration<gsprocessing_input_configuration>`.
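To give a feel for the shape of such a file, below is a rough GConstruct-style sketch with a single node type
and a single edge type; the type, file, and column names are made up, and the references above remain the
authoritative description of the schema.

.. code-block:: bash
# Write a minimal, illustrative GConstruct-style configuration (all names are made up).
cat > gconstruct_config.json <<'EOF'
{
    "nodes": [
        {
            "node_type": "user",
            "format": {"name": "parquet"},
            "files": ["nodes/user.parquet"],
            "node_id_col": "user_id"
        }
    ],
    "edges": [
        {
            "relation": ["user", "follows", "user"],
            "format": {"name": "parquet"},
            "files": ["edges/user_follows_user.parquet"],
            "source_id_col": "src_id",
            "dest_id_col": "dst_id"
        }
    ]
}
EOF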

.. _gsp-relative-paths:

@@ -196,7 +198,7 @@ we can run the step as a separate job:
gs-repartition --input-prefix /tmp/gsprocessing-example/
For more details on the re-partitioning step see
:doc:`row-count-alignment`.
:ref:`row count alignment<row_count_alignment>`.

.. _gsp-examining-output:

@@ -293,13 +295,14 @@ you to focus on model development. In particular you can follow the GraphStorm d


To run GSProcessing jobs on Amazon SageMaker we'll need to follow
:doc:`/gs-processing/usage/distributed-processing-setup` to set up our environment
and :doc:`/gs-processing/usage/amazon-sagemaker` to execute the job.
:ref:`GSProcessing distributed setup<gsprocessing_distributed_setup>` to set up our environment
and :ref:`Running GSProcessing on SageMaker<gsprocessing_sagemaker>` to execute the job.


.. rubric:: Footnotes


.. [#f1] Note that this is just a hint to the Spark engine, and it's
not guaranteed that the number of output partitions will always match
the requested value.
the requested value.
.. [#f2] This doc will be extended in the future to include a partitioning example.
22 changes: 22 additions & 0 deletions docs/source/graph-construction/gs-processing/index.rst
@@ -0,0 +1,22 @@
==============================
Distributed Graph Construction
==============================

Beyond single-machine graph construction, distributed graph construction offers enhanced scalability
and efficiency. This process involves two main steps: GraphStorm Distributed Data Processing (GSProcessing)
and GraphStorm Distributed Data Partitioning (GPartition). The documentation for GPartition will be released soon.

The following sections provide guidance on distributed graph construction.
The first section details the execution environment setup for GSProcessing.
The second section offers examples of drafting a configuration file for a GSProcessing job.
The third section explains how to deploy your GSProcessing job on AWS infrastructure.
The final section walks through a quick-start example for GSProcessing.

.. toctree::
:maxdepth: 1
:glob:

prerequisites/index.rst
input-configuration.rst
aws-infra/index.rst
example.rst
@@ -1,4 +1,4 @@
.. _input-configuration:
.. _gsprocessing_input_configuration:

GraphStorm Processing Input Configuration
=========================================
@@ -1,8 +1,12 @@
.. _gsprocessing_developer_guide:

Developer Guide
---------------

The project is set up using ``poetry`` to make easier for developers to
jump into the project.
This document helps developers set up their development environment.
It covers setting up the local environment with ``poetry``,
followed by guidelines on using linters and type checkers to ensure code quality.


The steps we recommend are:

@@ -1,3 +1,5 @@
.. _gsprocessing_distributed_setup:

GraphStorm Processing Distributed Setup
=======================================

Expand Down Expand Up @@ -125,8 +127,8 @@ You can find detailed instructions on creating a VPC for EMR Serverless in the A

.. code-block:: bash
bash docker/build_gsprocessing_image.sh --environment sagemaker --model-name bert-base-uncased
bash docker/build_gsprocessing_image.sh --environment emr-serverless --model-name bert-base-uncased
bash docker/build_gsprocessing_image.sh --environment sagemaker --hf-model bert-base-uncased
bash docker/build_gsprocessing_image.sh --environment emr-serverless --hf-model bert-base-uncased
Support for arm64 architecture
------------------------------
@@ -260,7 +262,7 @@ Launch a SageMaker Processing job using the example scripts.
------------------------------------------------------------

Once the setup is complete, you can follow the
:doc:`SageMaker Processing job guide <amazon-sagemaker>`
:ref:`SageMaker Processing job guide<gsprocessing_sagemaker>`
to launch your distributed processing job using Amazon SageMaker resources.

Launch an EMR Serverless job using the example scripts.
@@ -271,5 +273,5 @@ as an execution environment to allow you to scale to even larger datasets
(recommended when your graph has 30B+ edges).
Its setup is more involved than Amazon SageMaker, so we only recommend
it for experienced AWS users.
Follow the :doc:`EMR Serverless job guide <emr-serverless>`
Follow the :ref:`EMR Serverless job guide<gsprocessing_emr_serverless>`
to launch your distributed processing job using EMR Serverless resources.
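Once a job run has been submitted, you can check its state from the CLI; a minimal sketch,
assuming ``${APPLICATION_ID}`` holds the id of your EMR Serverless application:

.. code-block:: bash
# List recent job runs for the application and their states.
aws emr-serverless list-job-runs --application-id "${APPLICATION_ID}" \
    --query 'jobRuns[].[id,state]' --output table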
@@ -3,17 +3,6 @@
GraphStorm Processing Getting Started
=====================================


GraphStorm Distributed Data Processing (GSProcessing) allows you to process and prepare massive graph data
for training with GraphStorm. GSProcessing takes care of generating
unique ids for nodes, using them to encode edge structure files, process
individual features and prepare the data to be passed into the
distributed partitioning and training pipeline of GraphStorm.

We use PySpark to achieve
horizontal parallelism, allowing us to scale to graphs with billions of nodes
and edges.

.. _gsp-installation-ref:

Installation
@@ -100,7 +89,7 @@ To check if Java is installed you can use.
Example
-------

See the provided :doc:`usage/example` for an example of how to start with tabular
See the provided :ref:`example<distributed_construction_example>` for a walkthrough of how to start with tabular
data and convert them into a graph representation before partitioning and
training with GraphStorm.

Expand Down Expand Up @@ -137,7 +126,7 @@ partitioning pipeline.
See `this guide <https://graphstorm.readthedocs.io/en/latest/scale/sagemaker.html>`_
for more details on how to use GraphStorm distributed partitioning and training on SageMaker.

See :doc:`usage/example` for a detailed walkthrough of using GSProcessing to
See :ref:`example<distributed_construction_example>` for a detailed walkthrough of using GSProcessing to
wrangle data into a format that's ready to be consumed by the GraphStorm
distributed training pipeline.

@@ -148,18 +137,18 @@ Running on AWS resources
GSProcessing supports Amazon SageMaker, EMR on EC2, and EMR Serverless as execution environments.
To run distributed jobs on AWS resources we will have to build a Docker image
and push it to the Amazon Elastic Container Registry, which we cover in
:doc:`usage/distributed-processing-setup`. We can then run either a SageMaker Processing
job which we describe in :doc:`usage/amazon-sagemaker`, an EMR on EC2 job which
we describe in :doc:`usage/emr`, or an EMR Serverless
job that is covered in :doc:`usage/emr-serverless`.
:ref:`distributed processing setup<gsprocessing_distributed_setup>`. We can then run either a SageMaker Processing
job which we describe in :ref:`running GSProcessing on SageMaker<gsprocessing_sagemaker>`, an EMR on EC2 job which
we describe in :ref:`running GSProcessing on EMR EC2<gsprocessing_emr_ec2>`, or an EMR Serverless
job that is covered in :ref:`running GSProcessing on EMR Serverless<gsprocessing_emr_serverless>`.


Input configuration
-------------------

GSProcessing supports both the GConstruct JSON configuration format,
as well as its own GSProcessing config. You can learn about the
GSProcessing JSON configuration in :doc:`developer/input-configuration`.
GSProcessing JSON configuration in :ref:`GSProcessing Input Configuration<gsprocessing_input_configuration>`.

Re-applying feature transformations to new data
-----------------------------------------------
@@ -181,12 +170,12 @@ Currently, we only support re-applying transformations for categorical features.
Developer guide
---------------

To get started with developing the package refer to :doc:`developer/developer-guide`.
To get started with developing the package refer to :ref:`developer guide<gsprocessing_developer_guide>`.


.. rubric:: Footnotes

.. [#f1] DGL expects that every file produced for a single node/edge type
has matching row counts, which is something that Spark cannot guarantee.
We use the re-partitioning script to fix this where needed in the produced
output. See :doc:`usage/row-count-alignment` for details.
output. See :ref:`row count alignment<row_count_alignment>` for details.
@@ -0,0 +1,26 @@
===============================================
Distributed GraphStorm Processing
===============================================

GraphStorm Distributed Data Processing (GSProcessing) allows you to process
and prepare massive graph data for training with GraphStorm. GSProcessing takes
care of generating unique ids for nodes, using them to encode edge structure files,
processing individual features, and preparing the data to be passed into the distributed
partitioning and training pipeline of GraphStorm.

We use PySpark to achieve horizontal parallelism, allowing us to scale to graphs with billions of nodes and edges.

.. warning::
GraphStorm currently only supports running GSProcessing on AWS infrastructure, including `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_, `EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_, and `EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_.

The following sections outline essential prerequisites and provide a detailed guide to using
GSProcessing.
The first section introduces GSProcessing, explains how to install it locally, and gives a quick example of its input configuration.
The second section demonstrates how to set up GSProcessing for distributed processing, enabling scalable and efficient processing using AWS resources.

.. toctree::
:maxdepth: 1
:titlesonly:

gs-processing-getting-started.rst
distributed-processing-setup.rst
11 changes: 11 additions & 0 deletions docs/source/graph-construction/index.rst
@@ -0,0 +1,11 @@
==================
Graph Construction
==================

GraphStorm offers various methods to build graphs both on a single machine and on distributed clusters.

.. toctree::
:maxdepth: 2
:glob:

gs-processing/index.rst