Multilanguage Word Embeddings for Social Scientists: Estimation, Inference and Validation Resources for 157 Languages
Word embeddings are now a vital resource for social science research. But high-quality embeddings for non-English languages can be difficult to obtain and computationally expensive to produce. In addition, social scientists typically want to make statistical comparisons and conduct hypothesis tests on embeddings, yet this is non-trivial with current approaches. We provide three new data resources designed to address these issues jointly: (1) a new version of fastText model embeddings; (2) a multi-language “a la carte” (ALC) embedding version of the fastText model; (3) a multi-language ALC embedding version of the well-known GloVe model. All three are fit to Wikipedia corpora. These materials are aimed at “low resource” settings where analysts lack access to large corpora in their language of interest, or lack the computational resources required to produce high-quality vector representations. We make these resources available for 30 languages, along with a code pipeline for another 127 languages available from Wikipedia corpora. We provide extensive validation of the materials, via reconstruction tests and other proofs of concept. We also conduct human crowdworker tests of our embeddings for Arabic, French, (traditional, Mandarin) Chinese, Japanese, Korean, Russian and Spanish. Finally, we offer some advice to practitioners using our resources.