AutoGPTQ Library implements the GPTQ quantization method. This is a layerwise post-training quantization algorithm, where it tries to approximate the floating point parameters of each weight matrix into a quantized integers such that the error between the output from the actual float weights and the quantized weight is minimum. The layer-wise weight quantization process uses the Optimal Brain Quantization framework and relies heavily on some input samples to evaluate and enhance the quality of the quantization, hence it comes under the one-shot weight quantization method.
GPTQ adopts a mixed int4/fp16 quantization scheme where weights are quantized as int4 while activations remain in float16. During inference, weights are dequantized on the fly and the actual compute is performed in float16. source
We can run the AutoGPTQ benchmark for two models: Llama2 and Mistral-7B v0.1 Here is how we run benchmark for AutoGPTQ.
./bench_autogptq/bench.sh \
--prompt <value> \ # Enter a prompt string
--max_tokens <value> \ # Maximum number of tokens to output
--repetitions <value> \ # Number of repititions to be made for the prompt.
--device <cpu/cuda/metal> \ # The device in which we want to benchmark.
--model_name <name-of-the-model> # The name of the model. (options: 'llama' for Llama2 and 'mistral' for Mistral-7B-v0.1)
To get started quickly you can simply run:
./bench_autogptq/bench.sh -d cuda -n llama
This will take all the default values (see in the bench.sh file) and do the benchmarks for Llama 2 and Mistral models. You can find all the benchmarks results for Llama 2 7B Chat here, and Mistral 7B instruct here.
Please note, we consider the outputs from the raw (float32 precision) from HuggingFace Transformers (PyTorch backend) weights as our ground truth. We will be comparing this ground truth with model's output with int4
of AutoGPTQ. This is our system prompt.
You answers should always be to the point, precise and not more than 2 sentences strictly
Question | float16 | float32 | Ground Truth |
---|---|---|---|
I'm making pancakes for breakfast. I added a cup of flour, a teaspoon of salt, and a few tablespoons of sugar to a bowl. I stirred it together, then added a cup of milk, a beaten egg, and a few tablespoons of oil, and stirred until just mixed. Then I put 1/4 a cup on a hot frying pan, and flipped it when brown. But they're terrible! Why? List the main reason. Answer as much precise as possible with one sentence. | The main reason for the terrible pancakes is that the batter is too thick, which results in a tough and dry texture. | The main reason for the terrible pancakes could be that the batter was not properly rested before cooking, causing it to be dense and tough. | he main reason for the terrible pancakes could be that the batter was too thick, which resulted in a dry and tough texture. |
42 birds are sitting on a tree branch. A hunter passes, shoots one dead, and misses two. How many birds are left on the branch? Answer as much precise as possible with one sentence. | One bird is left on the branch. | One bird is left on the branch. | One bird is left on the branch. |
How many mistakes can you find in the following sentence? Sentence: Its to much to loose if your talking about hundred’s of dollars. Answer as much precise as possible with one sentence. | The sentence contains two mistakes: "Its" should be "It's" and "to much" should be "too much". | The sentence contains two grammatical errors: "Its" should be "It's" and "to much" should be "too much". | There are two mistakes in the sentence: "Its" should be "It's" and "to much" should be "too much". |
You are an expert on agricultural chemical compounds. Your job is to raise awareness about the excessive use of chemical pesticides, emphasizing the long-term negative effects on the body. Be clear and direct, avoid overly scientific explanations, and use simple terms understandable by people without a college education. Answer as much precise as possible with one sentence. | Excessive use of chemical pesticides can lead to long-term negative effects on the body, including cancer, neurological damage, and reproductive issues. | Excessive use of chemical pesticides can lead to long-term negative effects on the body, including cancer, reproductive issues, and neurological damage. | Excessive use of chemical pesticides can lead to long-term negative health effects, including cancer, neurological damage, and reproductive issues. |
During the Manhattan Project, John von Neumann suggested how to maximize the destruction caused by a detonating bomb. Which was his idea? Answer as much precise as possible with one sentence. | John von Neumann proposed the idea of using a spherical configuration of bombs to maximize the destructive force of a detonation. | John von Neumann proposed the concept of a "sphere of influence" to maximize the destruction caused by a detonating bomb, which involved using the bomb to create a chain reaction of explosions in a specific area. | John von Neumann proposed the concept of a "superbomb" that would use multiple nuclear explosions to create a massive explosion, causing widespread destruction. |
Question | float16 | float32 | Ground Truth |
---|---|---|---|
I'm making pancakes for breakfast. I added a cup of flour, a teaspoon of salt, and a few tablespoons of sugar to a bowl. I stirred it together, then added a cup of milk, a beaten egg, and a few tablespoons of oil, and stirred until just mixed. Then I put 1/4 a cup on a hot frying pan, and flipped it when brown. But they're terrible! Why? List the main reason. Answer as much precise as possible with one sentence. | The main reason your pancakes are terrible is because you did not mix the batter long enough, resulting in a dense and tough pancake. | The main reason your pancakes are terrible is because you did not mix the batter well enough, resulting in a dense and tough pancake. | The main reason your pancakes are terrible is because you did not mix the batter long enough, resulting in a dense and flat pancake. |
42 birds are sitting on a tree branch. A hunter passes, shoots one dead, and misses two. How many birds are left on the branch? Answer as much precise as possible with one sentence. | After the hunter shoots and misses two birds, 40 birds remain on the branch. | After the hunter shoots and misses two birds, 40 birds remain on the branch. | After the hunter shoots and kills one bird, 41 birds remain on the branch. |
How many mistakes can you find in the following sentence? Sentence: Its to much to loose if your talking about hundred’s of dollars. Answer as much precise as possible with one sentence. | There are 2 mistakes in the sentence: 1. "loose" should be "lose" (common spelling error) 2. "hundred's" should be "hundreds" (missing apostrophe) |
There are 2 mistakes in the sentence: 1. "loose" should be "lose" (common spelling error) 2. "hundred's" should be "hundreds" (missing apostrophe) |
There are 2 mistakes in the sentence: "to loose" should be "to lose" and "hundred's" should be "hundreds". |
You are an expert on agricultural chemical compounds. Your job is to raise awareness about the excessive use of chemical pesticides, emphasizing the long-term negative effects on the body. Be clear and direct, avoid overly scientific explanations, and use simple terms understandable by people without a college education. Answer as much precise as possible with one sentence. | Sure, I'd be happy to help! Here's some information about the negative effects of excessive chemical pesticide use on the body: Excessive exposure to chemical pesticides can lead to long-term health problems, including cancer, hormone disruption, and neurological damage, which can affect your overall quality of life and even shorten your lifespan. It's important to use these chemicals sparingly and only when necessary to minimize the risks to your health and the environment. |
Sure, I'd be happy to help! Here's some key information about the dangers of excessive pesticide use: Excessive pesticide use can lead to long-term damage to your body's internal systems, including the brain, nervous system, and hormone regulation. This can result in serious health problems, such as cancer, hormone imbalances, and neurological disorders. |
Excessive use of chemical pesticides can have long-term negative effects on the body, including cancer, hormone disruption, and neurological problems, due to their toxic and persistent nature, which can accumulate in the body over time and cause irreversible damage. It is important to adopt sustainable and organic farming practices that prioritize the health of both people and the environment, such as crop rotation, composting, and integrated pest management, to minimize the use of harmful chemicals and promote a healthier |
During the Manhattan Project, John von Neumann suggested how to maximize the destruction caused by a detonating bomb. Which was his idea? Answer as much precise as possible with one sentence. | John von Neumann proposed using a spherical shape for the bomb casing to maximize the destruction by concentrating the explosive force in a smaller area, creating a more powerful and efficient detonation. | John von Neumann proposed using a multi-stage thermonuclear weapon design, known as the "Teller-Ulam design," which would create a larger and more destructive explosion by using a fusion reaction to compress and heat a secondary nuclear weapon. | John von Neumann proposed the concept of a "fusion bomb," which would achieve maximum destruction through the simultaneous detonation of a fission bomb and a fusion reaction, creating a chain reaction of nuclear reactions that would release an enormous amount of energy. This idea was later developed into the hydrogen bomb. |
- AutoGPTQ adopts a mised int-4/float16 quantization scheme. It can also do int-4/float32 scheme. Where weights will be in INT-4 and activation will be in float16/32. So we have kept benchmarks numbers in float16 and float32, although quantization is done for INT-4.
- Technically, GPTQ can run on CPUs, but it is super slow. So we did not go for benchmarking that. To understand more, you can reference to this issue
- INT-8 is not available right now because AutoGPTQ integrates with the most performant W4A16 kernel (weights as int4, activations as fp16). Although quantizing to INT-8 is possible but is likely to be super slow, see this and this issue.
- AutoGPTQ does not support Metal till now.
- AutoGPTQ also supports ExllamaV2 and other quantization methods, but we did not used it, so that we can benchmark each methods and framework independently without any mutual intersections.
- Tokens/sec for INT4/FP-32 is greater than INT4/FP-16, which is not an expected behaviour, probably due to some downcasting overhead.