With artificial intelligence tools continuing to make inroads into the localization industry, language service providers (LSPs) need to re-evaluate integration methods to reduce turnaround times while preserving translation quality.
Implementing AI localization effectively can be challenging, especially when the goal is more complex than simply requesting an LLM-powered translation. Popular LLMs sourced through APIs (such as OpenAI's) often come with restrictions, and running larger, more capable open-source models on-site can be expensive.
Open-source models can offer businesses a much greater degree of customization and the ability to avoid recurring per-use costs, but they require high-grade hardware to run: specifically a capable graphics card (GPU) to handle the calculations, though it is possible, albeit slower, to use the central processing unit (CPU) instead.
Running Llama 3 at full precision means running it at 32-bit floating point (FP32), a computer number format that occupies 32 bits (four bytes) of memory per value. The lightest Llama 3 model contains 8 billion parameters, each stored at FP32 precision. Thus:
8,000,000,000 parameters x 4 bytes (memory usage per parameter at FP32) = 32,000,000,000 bytes (roughly 30GB) of graphics card memory required.
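As a quick sanity check, the same arithmetic can be expressed in a couple of lines of Python (weights only; activations, the KV cache and framework overhead push the real figure higher):

```python
# Rough weight-memory estimate for an 8-billion-parameter model at FP32.
params = 8_000_000_000
bytes_per_param_fp32 = 4

weight_bytes = params * bytes_per_param_fp32
print(f"{weight_bytes / 1024**3:.1f} GiB for the weights alone")  # ~29.8 GiB
```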
A graphics card with that much memory would be prohibitively expensive for most businesses to run on internal hardware. If businesses want to run open-source models such as Llama to perform more varied and nuanced AI experiments, they need to look at quantization techniques.
In this context, quantization refers to the process of reducing the precision of a model’s weights, which compresses the model by reducing the number of bits required to represent each parameter.
Quantization aims to reduce the memory footprint required to run resource-intensive open-source LLMs, even enabling use of edge devices such as phones. Reducing precision also results in faster computation and lower energy consumption.
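To make the idea concrete, here is a minimal, illustrative sketch of per-tensor quantization in PyTorch, using a symmetric INT8 mapping as the example. Production toolchains do this far more carefully (per-channel or per-block scales, outlier handling), but the principle is the same:

```python
import torch

def quantize_int8(weights: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: store the weights as 8-bit
    integers plus a single floating-point scale factor."""
    scale = weights.abs().max() / 127.0  # map the largest weight to 127
    q = torch.clamp((weights / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original FP32 weights."""
    return q.float() * scale

w = torch.randn(4, 4)                               # stand-in for a weight matrix
q, scale = quantize_int8(w)
print((w - dequantize_int8(q, scale)).abs().max())  # small rounding error
```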
There is a range of potential quantization methods suitable for AI localization, including post-training quantization, quantization-aware training and dynamic quantization, all of which lower the hardware requirements for running open-source models.
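As an illustration of the last of these, PyTorch ships post-training dynamic quantization out of the box. The sketch below applies it to a toy network rather than a full LLM (which would normally go through dedicated tooling such as bitsandbytes, GGUF/llama.cpp or AutoGPTQ), but it shows the shape of the workflow:

```python
import torch
import torch.nn as nn

# A toy stand-in for a model; only the Linear layers are quantized.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# Post-training dynamic quantization: weights are stored as INT8 and
# activations are quantized on the fly at inference time.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized_model)
```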
Going back to Llama 3, quantization offers a reduction in precision from FP32 to FP16 or even 8-bit floating point (FP8). Running Llama 3’s lightest model at FP16 would mean:
8,000,000,000 parameters x 2 bytes (memory usage per parameter at FP16) = 16,000,000,000 bytes (roughly 15GB) of graphics card memory required.
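In practice, dropping to FP16 can be as simple as requesting half-precision weights at load time. The sketch below uses Hugging Face transformers; the Llama 3 repository on the Hub is gated, so this assumes access has already been granted:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # gated repo: requires approved access

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # FP16 weights: roughly 2 bytes per parameter
    device_map="auto",          # spread layers across the available GPU(s)
)
```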
Gains of this size can result in massive savings for LSPs looking to experiment with AI integration more freely than is possible through APIs from companies such as OpenAI.
However, quantization is not without trade-offs. To start with, FP16 has a smaller dynamic range and lower precision than FP32, which can result in numerical instability. Gradients can also become less accurate, opening up the potential for slower convergence. This would, of course, be more of an issue during training than inference.
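The gap in range and granularity between the two formats can be inspected directly:

```python
import torch

# FP32 vs FP16: representable range and smallest relative step size.
print(torch.finfo(torch.float32).max)  # ~3.4e38
print(torch.finfo(torch.float16).max)  # 65504.0
print(torch.finfo(torch.float32).eps)  # ~1.2e-07
print(torch.finfo(torch.float16).eps)  # ~9.8e-04
```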
That said, the practical consequences at inference time shouldn’t be underestimated. In a translation environment, reduced precision could mean subtle nuances or infrequent word associations are lost, or output could become increasingly inaccurate.
Let’s take a look at some potential outcomes of quantization at each step of the LLM-based translation process.
There are four key stages to the LLM-based translation process, each of which could experience issues stemming from quantization if not appropriately addressed.
For AI localization, it’s vital that accuracy and context are preserved, meaning that the potential impacts of quantization must be mitigated. There are several methods that can aid with this.
While quantization can significantly reduce hardware requirements, it doesn’t need to be applied across the entire model. Mixed-precision quantization keeps critical layers at higher precision levels such as FP32, while reducing other layers to lower precisions such as FP16.
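One way to express this with off-the-shelf tooling is the bitsandbytes integration in transformers, which can skip named modules during quantization. The configuration below is an illustrative sketch; module choices would need tuning for a real deployment:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Most linear layers are loaded in 8-bit, while the output projection
# (and any other sensitivity-critical modules) stays at higher precision.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["lm_head"],  # modules to keep un-quantized
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    torch_dtype=torch.float16,          # non-quantized layers run in FP16
    device_map="auto",
)
```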
Businesses can also consider training the model with quantization in mind, known as quantization-aware training. By simulating the effects of quantization during training, the model can learn how best to compensate for the lower precision levels. This typically results in better performance than raw post-training quantization.
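PyTorch’s eager-mode quantization-aware training API illustrates the mechanism: fake-quantization ops are inserted so the training loop “sees” the rounding error it will face after conversion. The toy model below is only a sketch; QAT for a full LLM is a much heavier undertaking:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert
)

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # marks where inputs get quantized
        self.fc = nn.Linear(128, 128)
        self.dequant = DeQuantStub()  # marks where outputs return to float

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyModel().train()
model.qconfig = get_default_qat_qconfig("fbgemm")
prepare_qat(model, inplace=True)      # insert fake-quant observers

# ... run the usual training loop here so the weights adapt to quantization ...

model.eval()
quantized_model = convert(model)      # produce the actual INT8 model
```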
By fine-tuning on a relevant dataset after the quantization process is complete, the model can better adjust to its lower precision and recover some of its lost performance.
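A common recipe for this recovery step is QLoRA-style fine-tuning: the quantized base weights stay frozen while small low-rank adapters are trained on in-domain translation data. The sketch below uses transformers, bitsandbytes and peft; the target module names assume a Llama-style architecture, and dataset handling is omitted:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ... then fine-tune on a parallel corpus or post-edited translation memory ...
```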
Implementing error-feedback mechanisms, such as the involvement of human specialists, can help the model correct quantization errors during the inference stage of translation. Over time, the feedback generated helps maintain the quality of translated or localized content.
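At its simplest, such a feedback loop is a matter of capturing reviewer corrections in a form that can be folded back into later fine-tuning, glossaries or prompts. The helper below is purely illustrative; the field names are not a fixed schema:

```python
import json
from datetime import datetime, timezone

def log_correction(source: str, machine_output: str, human_correction: str,
                   path: str = "feedback.jsonl") -> None:
    """Append one reviewer correction to a JSONL file for later fine-tuning."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "mt_output": machine_output,
        "post_edit": human_correction,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```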
LSPs have been exploring potential applications of AI since 2022, but high hardware requirements and specific regulations present issues, whether running models internally or relying on APIs from the likes of OpenAI. Quantization techniques go some way towards helping, provided they are applied with care and any negative impacts are mitigated through the approaches outlined above.
A quantized model can help reduce internal content production and adaptation costs, as access to open-source models eliminates the need to use APIs from major LLM providers, ultimately resulting in a more widely adopted AI localization workflow. Lower hardware requirements also mean that the technology can be applied more readily throughout a company’s workflows.
Open-source models are also more flexible than those provided through API access, offering more potential for fine-tuning; and the ability to run them on internal hardware enhances security and reduces the risk of IP leakage.
Alpha CRC offers clients high-quality localization services that blend the best of human creativity with the speed and power of technology.
From translation to content creation, Alpha CRC enables clients to engage with their global customer bases and improve their reach.