Blog:
Why is GPT-3 15.77x more expensive for certain languages? | by Denys Linkov | Medium
As LLMs become more widely used, the disparity between English and non-English writing will only grow. Accuracy has been a standard concern [3], since non-English models draw on smaller corpora of text and most benchmarks measure English performance [4]. Bias and hate speech have been other concerns [5], with fewer native speakers available to read through training data and confirm its validity for use.
Further:
If we put accuracy aside and look purely at increased token usage, we get four additional impacts: higher costs, longer wait times, less expressive prompts and more limited responses. Many underrepresented languages are spoken and written in the Global South, and with token usage currently priced in US Dollars, LLM API access will be financially inaccessible in many parts of the world. This likely means an inability to benefit from developments in the space until costs come down. For this reason, startups that prompt users in English, French, Spanish or Chinese will likely undercut local companies that prompt users in a local language.
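The cost impact above can be sketched with a short calculation. A minimal sketch, assuming hypothetical token counts and a hypothetical per-token price (the article's 15.77x figure comes from real tokenizer measurements; the numbers below are placeholders for illustration only):

```python
# Illustrative sketch: the same message costs more through an LLM API
# when the tokenizer splits a language into more tokens.
# All numbers below are hypothetical placeholders, not the article's data.

PRICE_PER_1K_TOKENS = 0.02  # hypothetical USD price per 1,000 tokens

# Hypothetical token counts for the same short message. BPE vocabularies
# trained mostly on English text tend to need many more tokens for
# scripts underrepresented in the training corpus.
tokens_per_message = {
    "English": 10,
    "Underrepresented language": 70,
}

def cost_usd(tokens: int) -> float:
    """Cost in USD for a message of the given token count."""
    return tokens / 1000 * PRICE_PER_1K_TOKENS

for lang, toks in tokens_per_message.items():
    print(f"{lang}: {toks} tokens -> ${cost_usd(toks):.4f}")

ratio = (tokens_per_message["Underrepresented language"]
         / tokens_per_message["English"])
print(f"Cost ratio: {ratio:.1f}x")
```

The same multiplier applies to latency and to how much content fits in a fixed context window, which is why token inflation hits prompts and responses as well as cost.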
This is a good blog post detailing how tokenization affects large language models (LLMs) such as GPT. Essential reading.