Converting Chinese to Pinyin with hanyupinyin

library(hanyupinyin)

The hanyupinyin package converts Chinese characters into Hanyu Pinyin. It is designed to be fast, vectorized, and easy to use for both interactive analysis and data-cleaning pipelines.

Basic conversion

The workhorse function is to_pinyin():

to_pinyin("春眠不觉晓")
#> [1] "chun1_mian2_bu4_jue2_xiao3"
to_pinyin(c("你好", "世界"))
#> [1] "ni3_hao3"  "shi4_jie4"

By default the separator is "_" and tones are rendered as numeric suffixes. Use sep to change the separator and other_replace to substitute a placeholder for non-Chinese characters:

to_pinyin("Hello 世界", sep = " ", other_replace = "?")
#> [1] "????? shi4 jie4"

Toneless output and initials

When you only need the letters without tones, use to_pinyin_toneless():

to_pinyin_toneless("中华人民共和国")
#> [1] "zhong_hua_ren_min_gong_he_guo"

To extract just the first letter of each syllable:

to_pinyin_initials("中华人民共和国")
#> [1] "zhrmghg"

Handling polyphones

Many Chinese characters are polyphones: their reading depends on the word they appear in. For example, 行 is read háng in 银行 (bank) but xíng in 行走 (to walk). By default to_pinyin() converts each character independently, which can pick the wrong reading in context. Set polyphone = TRUE to consult the built-in phrase table:

to_pinyin("银行行长", polyphone = TRUE)
#> [1] "yin2_hang2_hang2_zhang3"

You can also add your own phrases:

add_phrase("测试短语", "ce4 shi4 duan3 yu3")
to_pinyin("测试短语", polyphone = TRUE)
#> [1] "ce4_shi4_duan3_yu3"
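
The idea behind a phrase table can be sketched in base R. This is a toy illustration and not the package's implementation: char_table, phrase_table, and toy_pinyin() are hypothetical names, and the readings are hand-picked for the example. The function scans left to right and prefers the longest matching phrase before falling back to a per-character lookup.

```r
# Toy data for illustration only (not taken from the package).
# One fixed reading per character:
char_table <- c("银" = "yin2", "行" = "xing2", "长" = "chang2")
# Phrase -> space-separated readings:
phrase_table <- c("银行" = "yin2 hang2", "行长" = "hang2 zhang3")

toy_pinyin <- function(x, use_phrases = FALSE, sep = "_") {
  chars <- strsplit(x, "")[[1]]
  # Candidate phrase lengths, longest first.
  lens <- sort(unique(nchar(names(phrase_table))), decreasing = TRUE)
  out <- character(0)
  i <- 1
  while (i <= length(chars)) {
    matched <- FALSE
    if (use_phrases) {
      for (len in lens) {
        if (i + len - 1 > length(chars)) next
        cand <- paste(chars[i:(i + len - 1)], collapse = "")
        if (cand %in% names(phrase_table)) {
          # Emit the phrase's readings and skip past the whole phrase.
          out <- c(out, strsplit(phrase_table[[cand]], " ")[[1]])
          i <- i + len
          matched <- TRUE
          break
        }
      }
    }
    if (!matched) {
      # Fall back to the single-character reading; keep unknown characters.
      ch <- chars[i]
      out <- c(out, if (ch %in% names(char_table)) char_table[[ch]] else ch)
      i <- i + 1
    }
  }
  paste(out, collapse = sep)
}

toy_pinyin("银行行长")
#> [1] "yin2_xing2_xing2_chang2"
toy_pinyin("银行行长", use_phrases = TRUE)
#> [1] "yin2_hang2_hang2_zhang3"
```

Without the phrase table, every 行 receives the same reading regardless of its neighbours. With it, 银行 and 行长 are each matched as a unit.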

Data-cleaning helpers

A common use case is turning Chinese column names in imported data (e.g. from SAS or Excel files) into valid R variable names:

to_varname(c("姓名", "年龄", "性别"))
#> [1] "xing_ming" "nian_ling" "xing_bie"
to_varname("1开始")
#> [1] "X1_kai_shi"
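
In practice the conversion is usually applied to a whole data frame. The following is a minimal sketch, assuming to_varname() behaves as shown above; the data frame df is made up for illustration. Because toneless romanization can map different headers to the same string (姓名 and 性命 are both "xing ming" without tones), the sketch passes the result through base R's make.unique() as a precaution.

```r
library(hanyupinyin)

# A small data frame with Chinese column names, standing in for imported data.
df <- data.frame(
  "姓名" = c("张三", "李四"),
  "年龄" = c(28, 35),
  "性别" = c("男", "女"),
  check.names = FALSE
)

# Romanize the headers, then make sure no two columns share a name.
new_names <- to_varname(names(df))
names(df) <- make.unique(new_names, sep = "_")

names(df)
```

make.unique() leaves names that are already distinct unchanged and only appends a numeric suffix to repeats.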

For URL slugs:

to_slug("2026年报告")
#> [1] "2026-nian-bao-gao"

Dictionary source

The package bundles the kMandarin field of the Unicode Unihan Database (version 17.0), which covers more than 44,000 Chinese characters. Because the data come from an openly published, versioned standard, every reading can be traced to a specific Unihan release.
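
To give a sense of the underlying data, the sketch below parses a few lines written in the tab-separated layout of Unihan_Readings.txt (code point, field name, value) into a character-to-reading lookup. It is an illustration in base R only: the variable names are hypothetical, and nothing here is meant to describe how the package builds or stores its own dictionary.

```r
# Sample lines in the Unihan_Readings.txt layout: code point, field, value.
unihan_lines <- c(
  "U+4F60\tkMandarin\tnǐ",
  "U+597D\tkMandarin\thǎo",
  "U+4F60\tkCantonese\tnei5"
)

# Split each line on tabs and collect the three columns.
fields <- strsplit(unihan_lines, "\t", fixed = TRUE)
parsed <- data.frame(
  codepoint = vapply(fields, `[`, character(1), 1),
  field     = vapply(fields, `[`, character(1), 2),
  value     = vapply(fields, `[`, character(1), 3),
  stringsAsFactors = FALSE
)

# Keep only the Mandarin readings.
mandarin <- parsed[parsed$field == "kMandarin", ]

# Turn "U+4F60" into the character it denotes.
chars <- vapply(
  mandarin$codepoint,
  function(cp) intToUtf8(strtoi(sub("^U\\+", "", cp), base = 16L)),
  character(1),
  USE.NAMES = FALSE
)

lookup <- setNames(mandarin$value, chars)
lookup[["你"]]
#> [1] "nǐ"
```

Unihan stores readings with tone marks, so producing the numeric-suffix form shown earlier (for example "ni3") would require an additional conversion step that this sketch does not cover.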