Image

Multilanguage Word Embeddings for Social Scientists: Estimation, Inference and Validation Resources for 157 Languages

Author

Elisa Wirsching, Pedro Rodriguez, Arthur Spirling, Brandon M. Stewart

Publication Year

2025

Type

Journal Article

Abstract

Word embeddings are now a vital resource for social science research. However, obtaining high-quality training data for non-English languages can be difficult, and fitting embeddings therein may be computationally expensive. In addition, social scientists typically want to make statistical comparisons and do hypothesis tests on embeddings, yet this is nontrivial with current approaches. We provide three new data resources designed to ameliorate the union of these issues: (1) a new version of fastText model embeddings, (2) a multilanguage “a la carte” (ALC) embedding version of the fastText model, and (3) a multilanguage ALC embedding version of the well-known GloVe model. All three are fit to Wikipedia corpora. These materials are aimed at “low-resource” settings where the analysts lack access to large corpora in their language of interest or to the computational resources required to produce high-quality vector representations. We make these resources available for 40 languages, along with a code pipeline for another 117 languages available from Wikipedia corpora. We extensively validate the materials via reconstruction tests and other proofs-of-concept. We also conduct human crowdworker tests for our embeddings for Arabic, French, (traditional Mandarin) Chinese, Japanese, Korean, Russian, and Spanish. Finally, we offer some advice to practitioners using our resources.

Journal

Political Analysis

Volume

Start Page

156

Issue

Pages

156-163

URL

Publisher's Version

ALC embeddings