Merge pull request #15 from gotzmann/server
Server Mode
gotzmann committed Apr 28, 2023
2 parents ea45a8a + e274511 commit bf2bddd
Showing 13 changed files with 1,179 additions and 452 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,4 +1,5 @@
.env
*.bin
.idea
.vscode
*.pprof
9 changes: 8 additions & 1 deletion Makefile
@@ -1,5 +1,6 @@
TARGET = llama
VERSION = $(shell cat VERSION)
# $(shell cat VERSION)
VERSION = v1.4.0
OS = linux
ARCH = amd64
PACKAGE = github.com/gotzmann/$(TARGET)
@@ -140,3 +141,9 @@ fp16:
pprof:
go tool pprof -pdf cpu.pprof > cpu.pdf

.PHONY: builds
builds:
GOOS=windows GOARCH=amd64 go build -o ./builds/llama-go-$(VERSION).exe -ldflags "-s -w" main.go
GOOS=darwin GOARCH=amd64 go build -o ./builds/llama-go-$(VERSION)-macos -ldflags "-s -w" main.go
GOOS=linux GOARCH=amd64 go build -o ./builds/llama-go-$(VERSION)-linux -ldflags "-s -w" main.go

201 changes: 158 additions & 43 deletions README.md
@@ -2,83 +2,198 @@

![](./assets/images/terminal.png?raw=true)

## The Goal
## Motivation

We dream of a world where ML hackers are able to grok with **REALLY BIG GPT** models without having GPU clusters consuming a shit ton of **$$$** - using only machines in their own homelabs.
We dream of a world where fellow ML hackers are grokking **REALLY BIG GPT** models in their homelabs without having GPU clusters consuming a shit ton of **$$$**.

The code of the project is based on the legendary **[ggml.cpp](https://github.com/ggerganov/llama.cpp)** framework of Georgi Gerganov written in C++
The code of the project is based on the legendary **[ggml.cpp](https://github.com/ggerganov/llama.cpp)** framework by Georgi Gerganov, written in C++ with the same attitude toward performance and elegance.

We hope using our beloved Golang instead of *soo-powerful* but *too-low-level* language will allow much greater adoption of the **NoGPU** ideas.

The V1 supports only FP32 math, so you'll need at least 32GB RAM to work even with the smallest **LLaMA-7B** model. As a preliminary step you should have binary files converted from original LLaMA model locally.
We hope using Golang instead of a *soo-powerful* but *too-low-level* language will allow much greater adoption.

## V0 Roadmap

- [x] Run tensor math in pure Golang based on C++ source
- [x] Tensor math in pure Golang
- [x] Implement LLaMA neural net architecture and model loading
- [x] Run smaller LLaMA-7B model
- [x] Be sure Go inference works EXACT SAME way as C++
- [x] Let Go shine! Enable multi-threading and boost performance
- [x] Test with smaller LLaMA-7B model
- [x] Be sure Go inference works exactly the same way as C++
- [x] Let Go shine! Enable multi-threading and messaging to boost performance

## V1 Roadmap
## V1 Roadmap - Spring'23

- [x] Cross-platform compatibility with Mac, Linux and Windows
- [x] Release first stable version for ML hackers
- [x] Support bigger LLaMA models: 13B, 30B, 65B
- [x] ARM NEON support on Apple Silicon (modern Macs) and ARM servers
- [x] Performance boost with x64 AVX2 support for Intel and AMD
- [x] Release first stable version for ML hackers - v1.0
- [x] Enable bigger LLaMA models: 13B, 30B, 65B - v1.1
- [x] ARM NEON support on Apple Silicon (modern Macs) and ARM servers - v1.2
- [x] Performance boost with x64 AVX2 support for Intel and AMD - v1.2
- [x] Better memory use and GC optimizations - v1.3
- [x] Introduce Server Mode (embedded REST API) for use in real projects - v1.4
- [x] Release converted models for free access over the Internet - v1.4
- [ ] INT8 quantization to allow x4 bigger models fit same memory
- [ ] Benchmark LLaMA.go against some mainstream Python / C++ frameworks
- [ ] Enable some popular models of LLaMA family: Vicuna, Alpaca, etc
- [ ] Speed-up AVX2 with memory-aligned tensors
- [ ] INT8 quantization to allow x4 bigger models to fit in the same memory
- [ ] Enable interactive mode for real-time chat with GPT
- [ ] Allow automatic download of converted model weights from the Internet
- [ ] Extensive logging for production monitoring
- [ ] Interactive mode for real-time chat with GPT

## V2 Roadmap - Summer'23

- [ ] Automatic CPU / GPU features detection
- [ ] Implement metrics for RAM and CPU usage
- [ ] Server Mode for use in Clouds as part of Microservice Architecture
- [ ] Standalone GUI or web interface for better access to the framework
- [ ] Support popular open models: Open Assistant, StableLM, BLOOM, Anthropic, etc.
- [ ] AVX512 support - yet another performance boost for AMD Epyc and Intel Sapphire Rapids
- [ ] Nvidia GPU support (CUDA or Tensor Cores)

## V2 Roadmap
## V3 Roadmap - Fall'23

- [ ] Allow plugins and external APIs for complex projects
- [ ] AVX512 support - yet another performance boost for AMD Epyc
- [ ] FP16 and BF16 support when hardware support there
- [ ] Support INT4 and GPTQ quantization
- [ ] Allow model training and fine-tuning
- [ ] Speed up execution on GPU cards and clusters
- [ ] FP16 and BF16 math if hardware support is there
- [ ] INT4 and GPTQ quantization
- [ ] AMD Radeon GPU support with OpenCL

## How to Run?

First, obtain and convert original LLaMA models on your own, or just download ready-to-rock ones:

**LLaMA-7B:** [llama-7b-fp32.bin](https://nogpu.com/llama-7b-fp32.bin)

**LLaMA-13B:** [llama-13b-fp32.bin](https://nogpu.com/llama-13b-fp32.bin)

Both models store FP32 weights, so you'll need at least 32GB of RAM (not VRAM or GPU RAM) for LLaMA-7B. Double that to 64GB for LLaMA-13B.

## How to Run
Next, build the app binary from sources (see instructions below), or just download an already built one:

**Windows:** [llama-go-v1.4.0.exe](./builds/llama-go-v1.4.0.exe)

**MacOS:** [llama-go-v1.4.0-macos](./builds/llama-go-v1.4.0-macos)

**Linux:** [llama-go-v1.4.0-linux](./builds/llama-go-v1.4.0-linux)

Now that you have both the executable and the model, try it for yourself:

```shell
go run main.go \
--model ~/models/7B/ggml-model-f32.bin \
--temp 0.80 \
--context 128 \
--predict 128 \
--prompt "Why Golang is so popular?"
llama-go-v1.4.0-macos \
--model ~/models/llama-7b-fp32.bin \
--prompt "Why Golang is so popular?" \
```

Or build it with the Makefile and then run the binary.

## Useful CLI parameters:
## Useful command line flags:

```shell
--prompt Text prompt from user to feed the model input
--model Path and file name of converted .bin LLaMA model
--model Path and file name of converted .bin LLaMA model [ llama-7b-fp32.bin, etc ]
--server Start in Server Mode acting as REST API endpoint
--host Host to allow requests from in Server Mode [ localhost by default ]
--port Port listen to in Server Mode [ 8080 by default ]
--pods Maximum pods or units of parallel execution allowed in Server Mode [ 1 by default ]
--threads Adjust to the number of CPU cores you want to use [ all cores by default ]
--predict Number of tokens to predict [ 64 by default ]
--context Context size in tokens [ 64 by default ]
--temp Model temperature hyper parameter [ 0.8 by default ]
--silent Hide welcome logo and other output [ show by default ]
--context Context size in tokens [ 1024 by default ]
--predict Number of tokens to predict [ 512 by default ]
--temp Model temperature hyper parameter [ 0.5 by default ]
--silent Hide welcome logo and other output [ shown by default ]
--chat Chat with user in interactive mode instead of compute over static prompt
--profile Profile CPU performance while running and store results to [cpu.pprof] file
--profile Profile CPU performance while running and store results to cpu.pprof file
--avx Enable x64 AVX2 optimizations for Intel and AMD machines
--neon Enable ARM NEON optimizations for Apple Macs and ARM servers
```
## Going Production
LLaMA.go embeds a standalone HTTP server exposing a REST API. To enable it, run the app with these flags:
```shell
llama-go-v1.4.0-macos \
--model ~/models/llama-7b-fp32.bin \
--server \
--host 127.0.0.1 \
--port 8080 \
--pods 4 \
--threads 4
```
Depending on the model size, how many CPU cores are available, how many requests you want to process in parallel, and how fast you'd like to get answers, choose the **pods** and **threads** parameters wisely.
**Pods** is the number of inference instances that may run in parallel.
The **threads** parameter sets how many cores will be used for tensor math within each pod.
For example, if you have a machine with 16 hardware cores capable of running 32 hyper-threads in parallel, you might end up with something like this (4 pods x 8 threads = 32 worker threads, matching the hardware):
```shell
--server --pods 4 --threads 8
```
When there is no free pod to handle an arriving request, it will be placed into the waiting queue and started as soon as some pod finishes its current job.
# REST API examples
## Place new job
Send a POST request (with Postman, for example) to your server address with a JSON body containing a unique UUID v4 and the prompt:
```json
{
"id": "5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc3",
"prompt": "Why Golang is so popular?"
}
```
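For convenience, here is a minimal Go sketch of the same request, using the github.com/google/uuid package already listed in go.mod. The POST route (/jobs) is an assumption, since only the JSON body is shown above; adjust it to whatever endpoint your build actually exposes.
```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"

	"github.com/google/uuid"
)

// job mirrors the JSON body shown above.
type job struct {
	ID     string `json:"id"`
	Prompt string `json:"prompt"`
}

func main() {
	// Build the request body with a fresh UUID v4.
	body, err := json.Marshal(job{
		ID:     uuid.NewString(),
		Prompt: "Why Golang is so popular?",
	})
	if err != nil {
		panic(err)
	}

	// NOTE: the /jobs route is an assumption, not taken from the docs above.
	resp, err := http.Post("http://localhost:8080/jobs", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("job submitted, HTTP status:", resp.Status)
}
```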
## Check job status
Send a GET request (with Postman or a browser) to a URL like http://host:port/jobs/status/:id
```shell
GET http://localhost:8080/jobs/status/5fb8ebd0-e0c9-4759-8f7d-35590f6c9fcb
```
## Get the results
Send a GET request (with Postman or a browser) to a URL like http://host:port/jobs/:id
```shell
GET http://localhost:8080/jobs/5fb8ebd0-e0c9-4759-8f7d-35590f6c9fcb
```
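And a matching Go sketch for the two GET endpoints above. The response schema is not documented here, so the sketch simply polls the status endpoint a few times and prints the raw JSON, then fetches the final result.
```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// get fetches a URL and returns the response body as a string.
func get(url string) string {
	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	return string(data)
}

func main() {
	base := "http://localhost:8080"
	id := "5fb8ebd0-e0c9-4759-8f7d-35590f6c9fcb" // the UUID you submitted with the POST request

	// Poll the status endpoint a few times, printing whatever JSON comes back.
	for i := 0; i < 5; i++ {
		fmt.Println("status:", get(base+"/jobs/status/"+id))
		time.Sleep(2 * time.Second)
	}

	// Finally, fetch the result for the job.
	fmt.Println("result:", get(base+"/jobs/"+id))
}
```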
# How to build
First, install **Golang** and **git** (on Windows you'll need to download the installers).
```shell
brew install git
brew install golang
```
Then clone the repo and enter the project folder:
```
git clone https://github.com/gotzmann/llama.go.git
cd llama.go
```
Some Go magic to install external dependencies:
```
go mod tidy
go mod vendor
```
Now we are ready to build the binary from the source code:
```shell
go build -o llama-go-v1.exe -ldflags "-s -w" main.go
```
## FAQ
**1] Where might I get original LLaMA model files?**
**1) Where might I obtain the original LLaMA models?**
Contact Meta directly or look around for some torrent alternatives
Contact Meta directly or just look around for some torrent alternatives.
**2] How to convert original LLaMA files into supported format?**
**2) How to convert the original LLaMA files into the supported format?**
You'll need the original FP16 files placed into the **models** directory, then convert with the command:
Place the original PyTorch FP16 files into the **models** directory, then convert with this command:
```shell
python3 ./scripts/convert.py ~/models/LLaMA/7B/ 0
Binary file modified assets/images/terminal.png
Binary file added builds/llama-go-v1.4.0-linux
Binary file not shown.
Binary file added builds/llama-go-v1.4.0-macos
Binary file not shown.
Binary file added builds/llama-go-v1.4.0.exe
Binary file not shown.
16 changes: 14 additions & 2 deletions go.mod
@@ -3,6 +3,9 @@ module github.com/gotzmann/llama.go
go 1.20

require (
github.com/gofiber/fiber/v2 v2.44.0
github.com/google/uuid v1.3.0
github.com/gotzmann/llama.go/llama v0.0.0-20230412160549-c20730f209a3
github.com/gotzmann/llama.go/ml v0.0.0-20230412160549-c20730f209a3
github.com/jessevdk/go-flags v1.5.0
github.com/mattn/go-colorable v0.1.13
@@ -14,11 +17,20 @@
)

require (
github.com/andybalholm/brotli v1.0.5 // indirect
github.com/felixge/fgprof v0.9.3 // indirect
github.com/google/pprof v0.0.0-20211214055906-6f57359322fd // indirect
github.com/mattn/go-isatty v0.0.17 // indirect
github.com/klauspost/compress v1.16.3 // indirect
github.com/mattn/go-isatty v0.0.18 // indirect
github.com/mattn/go-runewidth v0.0.14 // indirect
github.com/philhofer/fwd v1.1.2 // indirect
github.com/rivo/uniseg v0.2.0 // indirect
golang.org/x/sys v0.6.0 // indirect
github.com/savsgio/dictpool v0.0.0-20221023140959-7bf2e61cea94 // indirect
github.com/savsgio/gotils v0.0.0-20230208104028-c358bd845dee // indirect
github.com/tinylib/msgp v1.1.8 // indirect
github.com/valyala/bytebufferpool v1.0.0 // indirect
github.com/valyala/fasthttp v1.45.0 // indirect
github.com/valyala/tcplisten v1.0.0 // indirect
golang.org/x/sys v0.7.0 // indirect
golang.org/x/term v0.6.0 // indirect
)
