Paper

"Chi nas dal soch el sent de legn" -- Auditing Text Corpora for Lombard

arXiv:2606.06349v1 Announce Type: new Abstract: Several of the world's languages are still under-resourced in terms of Natural Language Processing (NLP) tools. This is mostly due to the lack of high-quality datasets to train, develop, and evaluate systems and models for several tasks, such as Machine Translation (MT). We conduct a manual audit of the parallel and monolingual corpora available for Lombard, an under-resourced language continuum from Italy. Our analysis reveals that the perceived abundance of web-scraped data is an illusion, with massive datasets plagued by severe language misid…

arXiv cs.CLPublished 2026-06-05Paper link

Authors:

Topics

Relevant entities

People

Linked people will appear here.

Related coverage

Linked coverage will appear here.

Related events

Linked events will appear here.

Related discussions

Related discussion nodes will appear here.