With artificial intelligence tools continuing to make inroads into the localization industry, language service providers (LSPs) need to re-evaluate integration methods to reduce turnaround times while preserving translation quality.
Implementing AI localization effectively can be challenging, especially when the goal is more complex than simply requesting an LLM-powered translation. Popular LLMs sourced through APIs (such as OpenAI's) often come with restrictions, and running larger, more capable open-source models on-site can be expensive.
Open-source models can offer businesses a much greater degree of customization and the ability to avoid recurring per-use costs, but they require high-grade hardware to run: specifically a capable graphics card (GPU) to handle the calculations, though it is possible, albeit slower, to use the central processing unit (CPU) instead.
Running Llama 3 at full precision means running it at 32-bit floating point (FP32), a computer number format that occupies 32 bits (four bytes) of memory per value. The lightest Llama 3 model contains 8 billion parameters, each stored at FP32 precision. Thus:
8,000,000,000 parameters x 4 bytes (memory usage per parameter at FP32) = 32,000,000,000 bytes (roughly 30GB) of graphics card memory required.
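As a quick sanity check, the same arithmetic can be expressed in a couple of lines of Python (weights only; activations, the KV cache and framework overhead push the real figure higher):

```python
# Rough weight-memory estimate for an 8-billion-parameter model at FP32.
params = 8_000_000_000
bytes_per_param_fp32 = 4

weight_bytes = params * bytes_per_param_fp32
print(f"{weight_bytes / 1024**3:.1f} GiB for the weights alone")  # ~29.8 GiB
```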
A graphics card with that much memory would be prohibitively expensive for most businesses to run on internal hardware. If businesses want to run open-source models such as Llama to perform more varied and nuanced AI experiments, they need to look at quantization techniques.
In this context, quantization refers to the process of reducing the precision of a model’s weights, which compresses the model by reducing the number of bits required to represent each parameter.
Quantization aims to reduce the memory footprint required to run resource-intensive open-source LLMs, even enabling use of edge devices such as phones. Reducing precision also results in faster computation and lower energy consumption.
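To make the idea concrete, here is a minimal, illustrative sketch of per-tensor quantization in PyTorch, using a symmetric INT8 mapping as the example. Production toolchains do this far more carefully (per-channel or per-block scales, outlier handling), but the principle is the same:

```python
import torch

def quantize_int8(weights: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: store the weights as 8-bit
    integers plus a single floating-point scale factor."""
    scale = weights.abs().max() / 127.0  # map the largest weight to 127
    q = torch.clamp((weights / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original FP32 weights."""
    return q.float() * scale

w = torch.randn(4, 4)                               # stand-in for a weight matrix
q, scale = quantize_int8(w)
print((w - dequantize_int8(q, scale)).abs().max())  # small rounding error
```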
There is a range of potential quantization methods suitable for AI localization, including post-training quantization, quantization-aware training and dynamic quantization, all of which lower the hardware requirements for running open-source models.
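As an illustration of the last of these, PyTorch ships post-training dynamic quantization out of the box. The sketch below applies it to a toy network rather than a full LLM (which would normally go through dedicated tooling such as bitsandbytes, GGUF/llama.cpp or AutoGPTQ), but it shows the shape of the workflow:

```python
import torch
import torch.nn as nn

# A toy stand-in for a model; only the Linear layers are quantized.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# Post-training dynamic quantization: weights are stored as INT8 and
# activations are quantized on the fly at inference time.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized_model)
```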
Going back to Llama 3, quantization offers a reduction in precision from FP32 to FP16 or even 8-bit floating point (FP8). Running Llama 3’s lightest model at FP16 would mean:
8,000,000,000 parameters x 2 bytes (memory usage per parameter at FP16) = 16,000,000,000 bytes (roughly 15GB) of graphics card memory required.
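In practice, dropping to FP16 can be as simple as requesting half-precision weights at load time. The sketch below uses Hugging Face transformers; the Llama 3 repository on the Hub is gated, so this assumes access has already been granted:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # gated repo: requires approved access

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # FP16 weights: roughly 2 bytes per parameter
    device_map="auto",          # spread layers across the available GPU(s)
)
```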
Gains of this size can result in massive savings for LSPs looking to experiment with AI integration more freely than is possible through APIs from companies such as OpenAI.
However, quantization is not without trade-offs. To start with, FP16 has a smaller dynamic range and lower precision than FP32, which can result in numerical instability. Gradients can also become less accurate, opening up the potential for slower convergence. This would, of course, be more of an issue during training than inference.
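The gap in range and granularity between the two formats can be inspected directly:

```python
import torch

# FP32 vs FP16: representable range and smallest relative step size.
print(torch.finfo(torch.float32).max)  # ~3.4e38
print(torch.finfo(torch.float16).max)  # 65504.0
print(torch.finfo(torch.float32).eps)  # ~1.2e-07
print(torch.finfo(torch.float16).eps)  # ~9.8e-04
```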
That said, the practical consequences at inference time shouldn’t be underestimated. In a translation environment, reduced precision could mean subtle nuances or infrequent word associations are lost, or output could become increasingly inaccurate.
Let’s take a look at some potential outcomes of quantization at each step of the LLM-based translation process.
There are four key stages to the LLM-based translation process, each of which could experience issues stemming from quantization if not appropriately addressed.
For AI localization, it’s vital that accuracy and context are preserved, meaning that the potential impacts of quantization must be mitigated. There are several methods that can aid with this.
While quantization can significantly reduce hardware requirements, it doesn’t need to be applied across the entire model. Mixed-precision quantization keeps critical layers at higher precision levels such as FP32, while reducing other layers to lower precisions such as FP16.
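One way to express this with off-the-shelf tooling is the bitsandbytes integration in transformers, which can skip named modules during quantization. The configuration below is an illustrative sketch; module choices would need tuning for a real deployment:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Most linear layers are loaded in 8-bit, while the output projection
# (and any other sensitivity-critical modules) stays at higher precision.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["lm_head"],  # modules to keep un-quantized
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    torch_dtype=torch.float16,          # non-quantized layers run in FP16
    device_map="auto",
)
```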
Businesses can also consider training the model with quantization in mind, known as quantization-aware training. By simulating the effects of quantization during training, the model can learn how best to compensate for the lower precision levels. This typically results in better performance than raw post-training quantization.
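PyTorch’s eager-mode quantization-aware training API illustrates the mechanism: fake-quantization ops are inserted so the training loop “sees” the rounding error it will face after conversion. The toy model below is only a sketch; QAT for a full LLM is a much heavier undertaking:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert
)

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # marks where inputs get quantized
        self.fc = nn.Linear(128, 128)
        self.dequant = DeQuantStub()  # marks where outputs return to float

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyModel().train()
model.qconfig = get_default_qat_qconfig("fbgemm")
prepare_qat(model, inplace=True)      # insert fake-quant observers

# ... run the usual training loop here so the weights adapt to quantization ...

model.eval()
quantized_model = convert(model)      # produce the actual INT8 model
```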
By fine-tuning on a relevant dataset after the quantization process is complete, the model can better adjust to its lower precision and recover some of its lost performance.
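A common recipe for this recovery step is QLoRA-style fine-tuning: the quantized base weights stay frozen while small low-rank adapters are trained on in-domain translation data. The sketch below uses transformers, bitsandbytes and peft; the target module names assume a Llama-style architecture, and dataset handling is omitted:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ... then fine-tune on a parallel corpus or post-edited translation memory ...
```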
Implementing error-feedback mechanisms, such as the involvement of human specialists, can help the model correct quantization errors during the inference stage of translation. Over time, the feedback generated helps maintain the quality of translated or localized content.
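At its simplest, such a feedback loop is a matter of capturing reviewer corrections in a form that can be folded back into later fine-tuning, glossaries or prompts. The helper below is purely illustrative; the field names are not a fixed schema:

```python
import json
from datetime import datetime, timezone

def log_correction(source: str, machine_output: str, human_correction: str,
                   path: str = "feedback.jsonl") -> None:
    """Append one reviewer correction to a JSONL file for later fine-tuning."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "mt_output": machine_output,
        "post_edit": human_correction,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```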
LSPs have been exploring potential applications of AI since 2022, but high hardware requirements and specific regulations present issues, whether running models internally or relying on APIs from the likes of OpenAI. Quantization techniques go some way towards helping, provided they are applied with care and any negative impacts are mitigated through the approaches outlined above.
A quantized model can help reduce internal content production and adaptation costs, as access to open-source models eliminates the need to use APIs from major LLM providers, ultimately resulting in a more widely adopted AI localization workflow. Lower hardware requirements also mean that the technology can be applied more readily throughout a company’s workflows.
Open-source models are also more flexible than those provided through API access, offering more potential for fine-tuning; and the ability to run them on internal hardware enhances security and reduces the risk of IP leakage.
Alpha CRC offers clients high-quality localization services that blend the best of human creativity with the speed and power of technology.
From translation to content creation, Alpha CRC enables clients to engage with their global customer bases and improve their reach.