Friendica Social Network

Items tagged with: Language

ProPublica

3 weeks ago

At Indigenous Sacred Sites, Seeing Things I’m Not Supposed to See

Western journalism tends to value transparency as a public good. But as an Indigenous reporter, I face a unique set of challenges: Include too-specific cultural details, and I risk endangering my community.

propub.li/3zMkG9z

#News #Journalism #Indigenous #Culture #Community #Native #Transparency #Privacy #Storytelling #Language

How Do I Cover Sacred Sites as an Indigenous Journalist?

Western journalism values transparency as a public good. But for Indigenous reporters, including too-specific cultural details can harm Native communities.

^ProPublica

petersuber

4 weeks ago

#AI is thwarting the study of human #language.
404media.co/project-analyzing-…
(#paywalled)

"The creator of an #OpenSource project that scraped the internet to determine the ever-changing popularity of different words in human language usage says that they are sunsetting the project because generative AI spam has poisoned the internet… “Now the web at large is full of #slop generated by #LLMs, written by no one to communicate nothing. Including this slop in the data skews the word frequencies.”"

Project Analyzing Human Language Usage Shuts Down Because ‘Generative AI Has Polluted the Data’

Wordfreq shuts down because “I don’t think anyone has reliable information about post-2021 language usage by humans.”

^Jason Koebler (404 Media)

Kathy Reid

4 weeks ago

If you're a #language nerd like I am, then you won't have missed the @mozilla #CommonVoice v19 #speech #dataset release - which now features 131 languages! Here's my #dataviz, done in @observablehq, of the v19 #metadata coverage.

I've updated the visualisation this time around with human-readable language names instead of their ISO-639 or BCP-47 language codes to make it easier to read.

There are some interesting observations:

▶ Catalan (ca) continues to be the leader in terms of data - speaking volumes about the efforts to revitalise culture and language in Catalunya. It's also one of the few languages that has data for all age groups, particularly older speakers - this sort of data is missing for most other languages.

▶ Kiswahili (sw) is one of the languages where there is more data for female-identifying speakers than for male-identifying speakers ♀ - although Japanese (ja), Western Mari (mrj) and Luganda (lg) do pretty well here, too!

▶ Sentence domains can now be categorised, and although most new sentences are "general", Albanian (sq) has a lot of sentences related to law and government.

▶ Tsonga (ts), a Bantu language spoken in Southern Africa, has dethroned Icelandic (is) as the language with the highest average utterance duration. I don't know enough about Tsonga to speculate why - it's a somewhat agglutinative language, but many Tsonga words are generally short.

▶ Bengali / Bangla (bn) has a significant amount of data that is not yet validated, and therefore does not appear in training / dev / test splits. There is a similar case for many languages new to Common Voice - it takes time to validate.

▶ The language with the highest average number of contributions per speaker is Taita (dav), a Bantu language from Kenya.

What do you make of the data visualisation? Are there any other insights you can see?

Big thanks to the CV team for all their efforts - EM, Jessica Rose, Dmitrij Feller and Justin Grant.

#linguistics

observablehq.com/@kathyreid/mo…

Mozilla Common Voice v19 dataset metadata coverage

This visualisation uses "@d3/stacked-horizontal-bar-chart" to visualise the Common Voice metadata coverage.

^Observable

