Research helps preserve endangered language for future generations

A Chinese language at risk of extinction is being kept alive for future generations with the help of Department of Computer Science research.

Gyalrong textbook

Using Natural Language Processing (NLP) - computational processes designed to understand speech and text as humans can - the Gyalrong language and the rich cultural history it carries are being preserved.

Gyalrong, which is spoken by a very limited population in China’s Sichuan Province, is estimated to date back but is now thought to have fewer than 33,000 speakers.

Most native speakers are elderly and, with many young people leaving the villages where it is spoken to seek work in urban areas, fewer and fewer people have the opportunity to learn the language from their elders.

It is estimated that the decline of the language - which has little in the way of written records and is considered very difficult to learn - will become irreversible over the next few decades.

Xutan Peng, a PhD student at the University’s Department of Computer Science, is using his research to speed up the production of a textbook to teach the endangered language to local schoolchildren.

“Many people say language is the DNA of a culture”, said Xutan.

“If the language dies, the memory of this rich culture is in danger of being lost forever. Things such as old stories passed to their children and grandchildren by elders will be no more, and it will be impossible for future generations to learn the culture and traditions.”

His technique takes Gyalrong texts and summarises them into Mandarin using an automated process. As a result, language documentation work that could take a linguist months or years of immersion in the culture can be done far more rapidly.

Xutan Peng

“One way to imagine it is that there are two libraries, side by side, with the same architecture and layout but with one exclusively supplying Mandarin texts, and the other Gyalrong,” said Xutan.

“If two similar books, covering similar subject matter, are in the corresponding location in both libraries and you move both buildings into one location, you can align the two to identify patterns.

“So, as long as we’re able to master certain frequently used words, we can use this technique to make educated guesses to piece the jigsaw together.”

You can read more about the process, known as Cross-Lingual Word Embedding (CLWE), in the papers and . The technique used in documenting Gyalrong also draws on research from Xutan’s earlier paper, .
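The library analogy can be made concrete with a small sketch. The code below is not Xutan's method; it illustrates one standard way of aligning two embedding spaces, orthogonal Procrustes, on toy data. The random vectors, the 20-word seed dictionary, and the function names `learn_mapping` and `nearest` are all assumptions made for illustration. A rotation is learned from a handful of known word pairs (the "frequently used words") and then used to guess translations for the remaining words.

```python
import numpy as np

def learn_mapping(src_seed, tgt_seed):
    """Orthogonal Procrustes: rotation W minimising ||src_seed @ W - tgt_seed||."""
    u, _, vt = np.linalg.svd(src_seed.T @ tgt_seed)
    return u @ vt

def nearest(vec, matrix):
    """Index of the row of `matrix` most similar to `vec` by cosine similarity."""
    sims = (matrix @ vec) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec))
    return int(np.argmax(sims))

rng = np.random.default_rng(0)
n_words, dim = 50, 8

# Toy "Mandarin" space: random vectors standing in for real word embeddings.
tgt = rng.normal(size=(n_words, dim))

# Toy "Gyalrong" space: the same layout, rotated, with a little noise added.
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src = tgt @ rotation + 0.01 * rng.normal(size=(n_words, dim))

# Assume the first 20 word pairs are already known (the seed dictionary).
seed = np.arange(20)
W = learn_mapping(src[seed], tgt[seed])

# Map the remaining words into the target space and guess their translations.
held_out = np.arange(20, n_words)
guesses = np.array([nearest(src[i] @ W, tgt) for i in held_out])
accuracy = float(np.mean(guesses == held_out))
print(f"Held-out translation accuracy: {accuracy:.2f}")
```

Real embedding spaces for two languages are only approximately the same shape, so a practical system would face far noisier data than this toy setup, and a much smaller pool of known word pairs for a language like Gyalrong.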

Xutan’s work is already bearing fruit, with a small group of Chinese schoolchildren, whose families can speak at least some Gyalrong, learning from and providing feedback on a textbook. It is hoped this first version will be followed by further volumes as more data is collected.

Its success has even caught the attention of documentary makers, who’ve featured it on China Central Television.

“It’s a unique and very satisfying project to work on”, Xutan added.

“And although it may be limited in scope, we’re making a real impact on society. It also suggests a very bright future for this type of technique in helping to preserve endangered languages.”

Xutan plans to explore how the technique could be adapted to help document other endangered languages.

Dr Mark Stevenson, a Senior Lecturer in the Natural Language Processing research group, said: “Endangered languages, like Gyalrong, face a real risk of extinction. This project shows how NLP, including work carried out within the University’s NLP research group, can help preserve them for future generations.”
