The hanyupinyin package converts Chinese characters into Hanyu Pinyin. It is designed to be fast, vectorized, and easy to use for both interactive analysis and data-cleaning pipelines.
The workhorse function is to_pinyin():
to_pinyin("春眠不觉晓")
#> [1] "chun1_mian2_bu4_jue2_xiao3"
to_pinyin(c("你好", "世界"))
#> [1] "ni3_hao3" "shi4_jie4"

By default the separator is "_" and tones are rendered
as numeric suffixes. You can change either:
For situations where you only need the alphabetic form without tones,
use to_pinyin_toneless():
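
As a rough illustration of what the toneless form looks like, the numeric-tone strings shown earlier can be stripped of their tone digits in base R. This is only a sketch and not the package's implementation: the helper name `strip_tones_sketch` is hypothetical, and it assumes the default "_" separator and numeric tone suffixes.

```r
# Hypothetical helper (not part of hanyupinyin): drop a tone digit (1-5)
# that sits at the end of a syllable, i.e. right before "_" or the string end.
strip_tones_sketch <- function(x) {
  gsub("[1-5](?=_|$)", "", x, perl = TRUE)
}

strip_tones_sketch("chun1_mian2_bu4_jue2_xiao3")
#> [1] "chun_mian_bu_jue_xiao"
strip_tones_sketch(c("ni3_hao3", "shi4_jie4"))
#> [1] "ni_hao" "shi_jie"
```

The function is vectorized because `gsub()` is, so it applies to a whole column in one call.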
To extract just the first letter of each syllable:
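
The idea can be sketched in base R by splitting the pinyin string on the separator and keeping the first letter of each syllable. The helper name `initials_sketch` is hypothetical, and joining the letters without a separator is an assumption; the package's own function may format its result differently.

```r
# Hypothetical helper (not part of hanyupinyin): first letter of each syllable.
# Assumes syllables are separated by `sep`, as in the default to_pinyin() output.
initials_sketch <- function(x, sep = "_") {
  syllables <- strsplit(x, sep, fixed = TRUE)
  vapply(
    syllables,
    function(s) paste(substr(s, 1, 1), collapse = ""),
    character(1)
  )
}

initials_sketch("chun1_mian2_bu4_jue2_xiao3")
#> [1] "cmbjx"
initials_sketch(c("ni3_hao3", "shi4_jie4"))
#> [1] "nh" "sj"
```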
Many Chinese characters have more than one reading. By default
to_pinyin() converts each character independently, which works
well for isolated characters but can pick the wrong reading in context. Enable the
built-in phrase table with polyphone = TRUE:
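
To see why a phrase table helps, consider 行, which is read xing2 in 行人 but hang2 in 银行. The toy sketch below contrasts per-character lookup with a phrase-first lookup. Both tables and the function `sketch_convert` are hypothetical and hold only a few entries; they are not the package's data or algorithm.

```r
# Hypothetical per-character table: one reading per character.
char_table <- c("银" = "yin2", "行" = "xing2", "人" = "ren2")
# Hypothetical phrase table: whole-phrase readings override character readings.
phrase_table <- c("银行" = "yin2_hang2")

sketch_convert <- function(x, use_phrases = FALSE) {
  chars <- strsplit(x, "")[[1]]
  out <- character(0)
  i <- 1
  while (i <= length(chars)) {
    # Look at the two-character window starting at position i, if one exists.
    pair <- if (i < length(chars)) paste0(chars[i], chars[i + 1]) else NA_character_
    if (use_phrases && !is.na(pair) && pair %in% names(phrase_table)) {
      out <- c(out, phrase_table[[pair]])
      i <- i + 2
    } else {
      out <- c(out, char_table[[chars[i]]])
      i <- i + 1
    }
  }
  paste(out, collapse = "_")
}

sketch_convert("银行")
#> [1] "yin2_xing2"
sketch_convert("银行", use_phrases = TRUE)
#> [1] "yin2_hang2"
sketch_convert("行人", use_phrases = TRUE)
#> [1] "xing2_ren2"
```

The sketch only checks two-character phrases; a real phrase table would also need to handle longer entries.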
You can also add your own phrases:
A common use case is turning Chinese column names from imported data (e.g. SAS, Excel) into valid R variable names:
to_varname(c("姓名", "年龄", "性别"))
#> [1] "xing_ming" "nian_ling" "xing_bie"
to_varname("1开始")
#> [1] "X1_kai_shi"

For URL slugs:
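
A slug is conventionally lowercase with hyphens between words. As a minimal base-R sketch of that target format, the variable-name style strings shown above can be normalized as follows. The helper name `slug_sketch` is hypothetical and is not the package's own slug function.

```r
# Hypothetical helper (not part of hanyupinyin): turn an ASCII pinyin string
# into a lowercase, hyphen-separated slug.
slug_sketch <- function(x) {
  x <- tolower(x)
  # Collapse every run of characters other than a-z and 0-9 into one hyphen.
  x <- gsub("[^a-z0-9]+", "-", x)
  # Trim hyphens left at either end.
  gsub("^-+|-+$", "", x)
}

slug_sketch(c("xing_ming", "nian_ling"))
#> [1] "xing-ming" "nian-ling"
slug_sketch("X1_kai_shi")
#> [1] "x1-kai-shi"
```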
The package bundles the kMandarin field from the Unicode
Unihan Database (Version 17.0), which covers more than 44,000 unique
Chinese characters. Because the data come from an open standard, you can
be confident in their provenance and stability.