[Doc] Graph Construction PR Refactor - PR1: GSProcessing Doc Refactor (awslabs#907)

*Issue #, if available:*

*Description of changes:*

Preview on Read the Docs:
https://jalencato-graphstorm-doc.readthedocs.io/en/gsprocessing-awsinfra-doc/graph-construction/index.html

* Refactor the doc structure for the existing GSProcessing doc.
* Rename the titles as previously discussed in the doc reorg plan.
* Fix a typo in the distributed setup doc: `--model-name` should be `--hf-model`.
* This is the first PR of the doc refactor.

By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

---------

Co-authored-by: Theodore Vasiloudis <[email protected]>
Co-authored-by: xiang song(charlie.song) <[email protected]>
3 people authored Jul 11, 2024
1 parent d5b1e03 commit 46c1331
Showing 14 changed files with 148 additions and 52 deletions.
@@ -1,7 +1,9 @@
.. _gsprocessing_sagemaker:

Running distributed jobs on Amazon SageMaker
============================================

Once the :doc:`Amazon SageMaker setup <distributed-processing-setup>` is complete, we can
Once the :ref:`Amazon SageMaker Setup<gsprocessing_distributed_setup>` is complete, we can
use the Amazon SageMaker launch scripts to launch distributed processing
jobs that use AWS resources.
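After a job has been submitted, a quick way to check on it from the command line is the AWS CLI;
the snippet below is only an illustrative sketch and assumes your credentials and default region are already configured.

.. code-block:: bash
# List the most recent SageMaker Processing jobs and their status.
aws sagemaker list-processing-jobs \
    --sort-by CreationTime --sort-order Descending --max-results 5 \
    --query 'ProcessingJobSummaries[].[ProcessingJobName,ProcessingJobStatus]' \
    --output table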

@@ -24,7 +26,7 @@ of up to 20 instances, allowing you to scale your processing to massive graphs,
using larger instances like ``ml.r5.24xlarge``.

Since we're now executing on AWS, we'll need access to an execution role
for SageMaker and the ECR image URI we created in :doc:`distributed-processing-setup`.
for SageMaker and the ECR image URI we created in :ref:`GSProcessing distributed setup<gsprocessing_distributed_setup>`.
For instructions on how to create an execution role for SageMaker
see the `AWS SageMaker documentation <https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html#sagemaker-roles-create-execution-role>`_.
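As a minimal sketch of gathering those two values from the CLI (the role name below is a placeholder
and the repository name is an assumption; substitute whatever you created during setup):

.. code-block:: bash
# Placeholder role and repository names -- replace with your own.
ROLE_ARN=$(aws iam get-role --role-name MySageMakerExecutionRole \
    --query 'Role.Arn' --output text)
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=$(aws configure get region)
IMAGE_URI="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:latest"
echo "Role: ${ROLE_ARN}  Image: ${IMAGE_URI}"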

@@ -75,7 +77,7 @@ enough to fit in the memory of the Spark leader.
For large graphs you will
want to launch that step as a separate job on an instance with more memory to avoid memory errors.
`ml.r5` instances should allow you to re-partition graph data with billions of nodes and edges.
For more details on the re-partitioning step see :doc:`row-count-alignment`.
For more details on the re-partitioning step see :ref:`row count alignment<row_count_alignment>`.

To run the re-partition job as a separate job use:

@@ -1,7 +1,9 @@
.. _gsprocessing_emr_serverless:

Running distributed jobs on EMR Serverless
==========================================

Once the :doc:`distributed processing setup <distributed-processing-setup>` is complete,
Once the :ref:`distributed processing setup<gsprocessing_distributed_setup>` is complete,
and we have built and pushed an EMR Serverless image tagged as ``graphstorm-processing-emr-serverless``, we can
set up our execution environment for EMR Serverless (EMR-S). If you're not familiar with EMR-S
we suggest going through its `introductory documentation <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_
@@ -99,7 +101,7 @@ from the image you just created. GSProcessing version ``0.2.2`` uses ``emr-6.13.
base image, so we need to ensure our application uses the same release.

Additionally, if you need to use text feature transformations with a Huggingface model, we suggest downloading the model cache inside the EMR Serverless
docker image: :doc:`distributed-processing-setup` to save cost and time. Please note that the maximum size for docker images in EMR Serverless is limited to 5GB:
docker image: :ref:`GSProcessing Distributed Setup<gsprocessing_distributed_setup>` to save cost and time. Please note that the maximum size for docker images in EMR Serverless is limited to 5GB:
`EMR Serverless Considerations and Limitations
<https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/application-custom-image.html#considerations>`_.
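To stay within that limit it can help to check the image size locally before pushing;
a rough sketch, assuming the image was tagged as described above:

.. code-block:: bash
# Report the local size of the EMR Serverless image.
docker images graphstorm-processing-emr-serverless \
    --format '{{.Repository}}:{{.Tag}}  {{.Size}}'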

@@ -160,7 +162,7 @@ With all the setup complete we should now have the following:
* An EMR-S application that uses our custom image.
* An execution role that our EMR-S jobs will use when we launch them.

To launch the same example job as we demonstrate in the :doc:`SageMaker Processing job guide <amazon-sagemaker>`
To launch the same example job as we demonstrate in the :ref:`SageMaker Processing job guide<gsprocessing_sagemaker>`
you can use the following ``bash`` snippet. Note that we use ``jq`` to wrangle JSON data,
which you can download from its `official website <https://jqlang.github.io/jq/download/>`_,
install using your package manager, or by running ``pip install jq``.
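As a small, illustrative example of the kind of JSON wrangling involved (the application name here is an assumption; adjust it to your own setup):

.. code-block:: bash
# Look up the EMR Serverless application id by name.
APPLICATION_ID=$(aws emr-serverless list-applications \
    | jq -r '.applications[] | select(.name == "gsprocessing-emr-s") | .id')
echo "EMR Serverless application: ${APPLICATION_ID}"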
@@ -239,7 +241,7 @@ on an instance with S3 access (where we installed GSProcessing):
gs-repartition --input-prefix ${OUTPUT_PREFIX}
Or if your data are too large for the re-partitioning job to run locally, you can
launch a SageMaker job as below after following the :doc:`distributed processing setup <distributed-processing-setup>`
launch a SageMaker job as below after following the :ref:`distributed processing setup<gsprocessing_distributed_setup>`
and building the GSProcessing SageMaker ECR image:

.. code-block:: bash
@@ -261,7 +263,7 @@ Note that ``${OUTPUT_PREFIX}`` here will need to match the value assigned when l
the EMR-S job, i.e. ``"s3://${OUTPUT_BUCKET}/gsprocessing/emr-s/small-graph/4files/"``

For more details on the re-partitioning step see
:doc:`row-count-alignment`.
:ref:`row count alignment<row_count_alignment>`.

Examine the output
------------------
@@ -1,7 +1,9 @@
.. _gsprocessing_emr_ec2:

Running distributed jobs on EMR on EC2
======================================

Once the :doc:`distributed processing setup <distributed-processing-setup>` is complete,
Once the :ref:`distributed processing setup<gsprocessing_distributed_setup>` is complete,
and we have built and pushed an EMR image tagged as ``graphstorm-processing-emr``, we can
set up our execution environment for EMR. If you're not familiar with EMR
we suggest going through its
@@ -191,7 +193,7 @@ successfully, we can run:
We should see the file ``updated_row_counts_metadata.json`` in the output,
which means our data are ready for distributed partitioning.
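One quick way to verify this, assuming the job's S3 output location is stored in ``${OUTPUT_PREFIX}`` (adjust to your own prefix):

.. code-block:: bash
# Confirm the re-partitioned metadata file exists in the output location.
aws s3 ls "${OUTPUT_PREFIX}" --recursive | grep updated_row_counts_metadata.json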

If the re-partitioning failed, we can run a separate job, see :doc:`row-count-alignment`
If the re-partitioning failed, we can run a separate job, see :ref:`row count alignment<row_count_alignment>`
for details.

Run distributed partitioning and training on Amazon SageMaker
37 changes: 37 additions & 0 deletions docs/source/graph-construction/gs-processing/aws-infra/index.rst
@@ -0,0 +1,37 @@
================================================
Running distributed processing jobs on AWS Infra
================================================

After successfully building the Docker image and pushing it to
`Amazon ECR <https://docs.aws.amazon.com/ecr/>`_,
you can now initiate GSProcessing jobs with AWS resources.

We support running GSProcessing jobs on different AWS execution environments including:
`Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_,
`EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_, and
`EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_.


Running distributed jobs on `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_:

.. toctree::
:maxdepth: 1
:titlesonly:

amazon-sagemaker.rst

Running distributed jobs on `EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_:

.. toctree::
:maxdepth: 1
:titlesonly:

emr-serverless.rst

Running distributed jobs on `EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_:

.. toctree::
:maxdepth: 1
:titlesonly:

emr.rst
@@ -1,3 +1,5 @@
.. _row_count_alignment:

Row count alignment
===================

@@ -1,3 +1,5 @@
.. _distributed_construction_example:

GraphStorm Processing Example
=============================

Expand Down Expand Up @@ -48,7 +50,7 @@ Apart from the data, GSProcessing also requires a configuration file that descri
data and the transformations we will need to apply to the features and any encoding needed for
labels.
We support both the `GConstruct configuration format <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`_
, and the library's own GSProcessing format, described in :doc:`/gs-processing/developer/input-configuration`.
, and the library's own GSProcessing format, described in :ref:`GSProcessing Input Configuration<gsprocessing_input_configuration>`.

.. note::
We expect end users to only provide a GConstruct configuration file,
@@ -61,7 +63,7 @@ We support both the `GConstruct configuration format <https://graphstorm.readthe
as we do with GConstruct.

For a detailed description of all the entries of the GSProcessing configuration file see
:doc:`/gs-processing/developer/input-configuration`.
:ref:`GSProcessing Input Configuration<gsprocessing_input_configuration>`.
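To give a feel for the shape of such a file, below is a rough GConstruct-style sketch with a single node type
and a single edge type; the type, file, and column names are made up, and the references above remain the
authoritative description of the schema.

.. code-block:: bash
# Write a minimal, illustrative GConstruct-style configuration (all names are made up).
cat > gconstruct_config.json <<'EOF'
{
    "nodes": [
        {
            "node_type": "user",
            "format": {"name": "parquet"},
            "files": ["nodes/user.parquet"],
            "node_id_col": "user_id"
        }
    ],
    "edges": [
        {
            "relation": ["user", "follows", "user"],
            "format": {"name": "parquet"},
            "files": ["edges/user_follows_user.parquet"],
            "source_id_col": "src_id",
            "dest_id_col": "dst_id"
        }
    ]
}
EOF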

.. _gsp-relative-paths:

@@ -196,7 +198,7 @@ we can run the step as a separate job:
gs-repartition --input-prefix /tmp/gsprocessing-example/
For more details on the re-partitioning step see
:doc:`row-count-alignment`.
:ref:`row count alignment<row_count_alignment>`.

.. _gsp-examining-output:

@@ -293,13 +295,14 @@ you to focus on model development. In particular you can follow the GraphStorm d


To run GSProcessing jobs on Amazon SageMaker we'll need to follow
:doc:`/gs-processing/usage/distributed-processing-setup` to set up our environment
and :doc:`/gs-processing/usage/amazon-sagemaker` to execute the job.
:ref:`GSProcessing distributed setup<gsprocessing_distributed_setup>` to set up our environment
and :ref:`Running GSProcessing on SageMaker<gsprocessing_sagemaker>` to execute the job.


.. rubric:: Footnotes


.. [#f1] Note that this is just a hint to the Spark engine, and it's
not guaranteed that the number of output partitions will always match
the requested value.
the requested value.
.. [#f2] This doc will be extended in the future to include a partitioning example.
22 changes: 22 additions & 0 deletions docs/source/graph-construction/gs-processing/index.rst
@@ -0,0 +1,22 @@
==============================
Distributed Graph Construction
==============================

Beyond single-machine graph construction, distributed graph construction offers enhanced scalability
and efficiency. This process involves two main steps: GraphStorm Distributed Data Processing (GSProcessing)
and GraphStorm Distributed Data Partitioning (GPartition). The documentation for GPartition will be released soon.

The following sections provide guidance on distributed graph construction.
The first section details the execution environment setup for GSProcessing.
The second section offers examples of drafting a configuration file for a GSProcessing job.
The third section explains how to deploy your GSProcessing job on AWS infrastructure.
The final section walks through a quick-start example for GSProcessing.

.. toctree::
:maxdepth: 1
:glob:

prerequisites/index.rst
input-configuration.rst
aws-infra/index.rst
example.rst
@@ -1,4 +1,4 @@
.. _input-configuration:
.. _gsprocessing_input_configuration:

GraphStorm Processing Input Configuration
=========================================
@@ -1,8 +1,12 @@
.. _gsprocessing_developer_guide:

Developer Guide
---------------

The project is set up using ``poetry`` to make easier for developers to
jump into the project.
This document helps developers set up their development environment.
It covers setting up the local environment with ``poetry``,
followed by guidelines on using linters and type checkers to ensure code quality.


The steps we recommend are:

@@ -1,3 +1,5 @@
.. _gsprocessing_distributed_setup:

GraphStorm Processing Distributed Setup
=======================================

Expand Down Expand Up @@ -125,8 +127,8 @@ You can find detailed instructions on creating a VPC for EMR Serverless in the A

.. code-block:: bash
bash docker/build_gsprocessing_image.sh --environment sagemaker --model-name bert-base-uncased
bash docker/build_gsprocessing_image.sh --environment emr-serverless --model-name bert-base-uncased
bash docker/build_gsprocessing_image.sh --environment sagemaker --hf-model bert-base-uncased
bash docker/build_gsprocessing_image.sh --environment emr-serverless --hf-model bert-base-uncased
Support for arm64 architecture
------------------------------
@@ -260,7 +262,7 @@ Launch a SageMaker Processing job using the example scripts.
------------------------------------------------------------

Once the setup is complete, you can follow the
:doc:`SageMaker Processing job guide <amazon-sagemaker>`
:ref:`SageMaker Processing job guide<gsprocessing_sagemaker>`
to launch your distributed processing job using Amazon SageMaker resources.

Launch an EMR Serverless job using the example scripts.
@@ -271,5 +273,5 @@ as an execution environment to allow you to scale to even larger datasets
(recommended when your graph has 30B+ edges).
Its setup is more involved than Amazon SageMaker, so we only recommend
it for experienced AWS users.
Follow the :doc:`EMR Serverless job guide <emr-serverless>`
Follow the :ref:`EMR Serverless job guide<gsprocessing_emr_serverless>`
to launch your distributed processing job using EMR Serverless resources.
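Once a job run has been submitted, you can check its state from the CLI; a minimal sketch,
assuming ``${APPLICATION_ID}`` holds the id of your EMR Serverless application:

.. code-block:: bash
# List recent job runs for the application and their states.
aws emr-serverless list-job-runs --application-id "${APPLICATION_ID}" \
    --query 'jobRuns[].[id,state]' --output table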
@@ -3,17 +3,6 @@
GraphStorm Processing Getting Started
=====================================


GraphStorm Distributed Data Processing (GSProcessing) allows you to process and prepare massive graph data
for training with GraphStorm. GSProcessing takes care of generating
unique ids for nodes, using them to encode edge structure files, process
individual features and prepare the data to be passed into the
distributed partitioning and training pipeline of GraphStorm.

We use PySpark to achieve
horizontal parallelism, allowing us to scale to graphs with billions of nodes
and edges.

.. _gsp-installation-ref:

Installation
@@ -100,7 +89,7 @@ To check if Java is installed you can use.
Example
-------

See the provided :doc:`usage/example` for an example of how to start with tabular
See the provided :ref:`example<distributed_construction_example>` for a walkthrough of how to start with tabular
data and convert them into a graph representation before partitioning and
training with GraphStorm.

Expand Down Expand Up @@ -137,7 +126,7 @@ partitioning pipeline.
See `this guide <https://graphstorm.readthedocs.io/en/latest/scale/sagemaker.html>`_
for more details on how to use GraphStorm distributed partitioning and training on SageMaker.

See :doc:`usage/example` for a detailed walkthrough of using GSProcessing to
See :ref:`example<distributed_construction_example>` for a detailed walkthrough of using GSProcessing to
wrangle data into a format that's ready to be consumed by the GraphStorm
distributed training pipeline.

@@ -148,18 +137,18 @@ Running on AWS resources
GSProcessing supports Amazon SageMaker, EMR on EC2, and EMR Serverless as execution environments.
To run distributed jobs on AWS resources we will have to build a Docker image
and push it to the Amazon Elastic Container Registry, which we cover in
:doc:`usage/distributed-processing-setup`. We can then run either a SageMaker Processing
job which we describe in :doc:`usage/amazon-sagemaker`, an EMR on EC2 job which
we describe in :doc:`usage/emr`, or an EMR Serverless
job that is covered in :doc:`usage/emr-serverless`.
:ref:`distributed processing setup<gsprocessing_distributed_setup>`. We can then run either a SageMaker Processing
job which we describe in :ref:`running GSProcessing on SageMaker<gsprocessing_sagemaker>`, an EMR on EC2 job which
we describe in :ref:`running GSProcessing on EMR EC2<gsprocessing_emr_ec2>`, or an EMR Serverless
job that is covered in :ref:`running GSProcessing on EMR Serverless<gsprocessing_emr_serverless>`.


Input configuration
-------------------

GSProcessing supports both the GConstruct JSON configuration format,
as well as its own GSProcessing config. You can learn about the
GSProcessing JSON configuration in :doc:`developer/input-configuration`.
GSProcessing JSON configuration in :ref:`GSProcessing Input Configuration<gsprocessing_input_configuration>`.

Re-applying feature transformations to new data
-----------------------------------------------
@@ -181,12 +170,12 @@ Currently, we only support re-applying transformations for categorical features.
Developer guide
---------------

To get started with developing the package refer to :doc:`developer/developer-guide`.
To get started with developing the package refer to :ref:`developer guide<gsprocessing_developer_guide>`.


.. rubric:: Footnotes

.. [#f1] DGL expects that every file produced for a single node/edge type
has matching row counts, which is something that Spark cannot guarantee.
We use the re-partitioning script to fix this where needed in the produced
output. See :doc:`usage/row-count-alignment` for details.
output. See :ref:`row count alignment<row_count_alignment>` for details.
@@ -0,0 +1,26 @@
===============================================
Distributed GraphStorm Processing
===============================================

GraphStorm Distributed Data Processing (GSProcessing) allows you to process
and prepare massive graph data for training with GraphStorm. GSProcessing takes
care of generating unique ids for nodes, using them to encode edge structure files,
processing individual features, and preparing the data to be passed into the distributed
partitioning and training pipeline of GraphStorm.

We use PySpark to achieve horizontal parallelism, allowing us to scale to graphs with billions of nodes and edges.

.. warning::
GraphStorm currently only supports running GSProcessing on AWS infrastructure, including `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_, `EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_, and `EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_.

The following sections outline essential prerequisites and provide a detailed guide to using
GSProcessing.
The first section introduces GSProcessing, explains how to install it locally, and gives a quick example of its input configuration.
The second section demonstrates how to set up GSProcessing for distributed processing, enabling scalable and efficient processing using AWS resources.

.. toctree::
:maxdepth: 1
:titlesonly:

gs-processing-getting-started.rst
distributed-processing-setup.rst
11 changes: 11 additions & 0 deletions docs/source/graph-construction/index.rst
@@ -0,0 +1,11 @@
==================
Graph Construction
==================

GraphStorm offers various methods to build graphs both on a single machine and on distributed clusters.

.. toctree::
:maxdepth: 2
:glob:

gs-processing/index.rst