Skip to main content

Search

Items tagged with: Language


At Indigenous Sacred Sites, Seeing Things I’m Not Supposed to See

Western journalism tends to value transparency as a public good. But as an Indigenous reporter, I face a unique set of challenges: Include too-specific cultural details, and I risk endangering my community.

propub.li/3zMkG9z

#News #Journalism #Indigenous #Culture #Community #Native #Transparency #Privacy #Storytelling #Language


#AI is thwarting the study of human #language.
404media.co/project-analyzing-…
(#paywalled)

"The creator of an #OpenSource project that scraped the internet to determine the ever-changing popularity of different words in human language usage says that they are sunsetting the project because generative AI spam has poisoned the internet…“Now the web at large is full of #slop generated by #LLMs, written by no one to communicate nothing. Including this slop in the data skews the word frequencies.”


If you're a #language nerd like I am, then you won't have missed the @mozilla #CommonVoice v19 #speech #dataset release - which now features 131 languages! Here's my #dataviz, done in @observablehq of the v19 #metadata coverage.

I've updated the visualisation this time around with human-readable language names instead of their ISO-639 or BCP-47 language codes to make it it easier to read.

There's some interesting observations:

▶ Catalan (ca) continues to be leader in terms of data - speaking volumes about the efforts to revitalise culture and language in Catalunya. It's also one of the few languages that has data for all age groups, particularly older speakers - this sort of data is missing for most other languages.

▶ Kiswahili (sw) is one of the languages where there is more data for female-identifying speakers than for male-identifying speakers ♀ - although Japanese (ja), Western Mari (mrj) and Luganda (lg) do pretty well here, too!

▶ Sentence domains can now be categorised, and although most new sentences are "general", Albanian (sq) has a lot of sentences related to law and government.

▶ Tsonga (ts), a Bantu language spoken in Southern Africa, has dethroned Icelandic (is) as the language with the highest average utterance duration. I don't know enough about Tsonga to speculate why - it's a somewhat agglutinative language, but many Tsonga works are generally short.

▶ Bengali / Bangla (bn) has a significant amount of data that is not yet validated, and therefore does not appear in training / dev / test splits. There is a similar case for many languages new to Common Voice - it takes time to validate.

▶ The language with the highest number of average contributions per speaker is Taita (dav), a Bantu language from Kenya.

What do you make of the data visualisation? Are there any other insights you can see?

Big thanks to the CV team for all their efforts - EM, Jessica Rose, Dmitrij Feller and Justin Grant.

#linguistics

observablehq.com/@kathyreid/mo…