A crawler for Cantonese pronunciation data from the Chinese Character Database: With Word-formations Phonologically Disambiguated According to the Cantonese Dialect (粵語審音配詞字庫).
If you are interested in a more up-to-date word list, see rime/rime-cantonese.
See the releases page.
Sample data:
{
  "ch": "一",
  "initial": "j",
  "rhyme": "at",
  "tone": "1",
  "words": [ "一致", "統一", "一枝獨秀", "一般", "一切", "一樣", "專一", "劃一", "一視同仁", "一觸即發", "一落千丈", "長短不一" ]
},
{
  "ch": "丈",
  "initial": "z",
  "rhyme": "oeng",
  "tone": "6",
  "words": [ "丈夫", "丈人", "丈母", "清丈", "丈量", "岳丈", "一落千丈", "丈二金剛" ]
},
{
  "ch": "丙",
  "initial": "b",
  "rhyme": "ing",
  "tone": "2",
  "words": [ "丙等", "丙夜", "付丙", "丙吉問牛" ]
}
(1) Get a list of all Chinese characters on the website from its classified character table.
(2) Fetch the result page for each character with Scrapy.
Detailed explanation:
(1) Get a list of all Chinese characters
Download the classified character table.
Decode the response as `big5hkscs`.
Iterate through the text with the regex `<a href="search.php\?q=([%0-9A-Za-z_]+)">(.)</a>` to extract the characters and their corresponding links.
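The decode-and-match step can be sketched as follows. This is a minimal illustration, assuming the table has already been downloaded as raw bytes; the function name `extract_characters` and the two-link sample are hypothetical and not taken from the crawler's own code. The sample bytes are built from `q=` values that appear in links on the site.

```python
import re
from urllib.parse import unquote_to_bytes

# The regex quoted above: group 1 is the percent-encoded query string,
# group 2 is the single character shown as the link text.
LINK_PATTERN = re.compile(r'<a href="search.php\?q=([%0-9A-Za-z_]+)">(.)</a>')


def extract_characters(raw: bytes) -> list:
    """Decode the page as big5hkscs and return (character, query) pairs."""
    text = raw.decode("big5hkscs")
    return [(m.group(2), m.group(1)) for m in LINK_PATTERN.finditer(text)]


# Hypothetical two-link excerpt of the table, as undecoded bytes.
sample_page = (
    b'<a href="search.php?q=%D6r">\xd6r</a>\n'
    b'<a href="search.php?q=%A7t">\xa7t</a>'
)

characters = extract_characters(sample_page)
for ch, query in characters:
    # In this sample the query is the percent-encoded form of the
    # character's own bytes, so it decodes back to the same character.
    assert unquote_to_bytes(query).decode("big5hkscs") == ch
    print(ch, query)
```

Keeping the query string next to each character means the per-character request in step (2) can reuse the link exactly as the site wrote it, without re-encoding.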
(2) Get the result of each character
Nomenclature: page > data row > field
2.1 Get data rows in a single page
form > table:first-child > tr:not(:first-child)
2.2 Get fields in a single data row
- Initial:
td:nth-child(1) > font[color="red"]::text
- Rhyme:
td:nth-child(1) > font[color="green"]::text
- Tone:
td:nth-child(1) > font[color="blue"]::text
- Words and explanation:
- Words:
- Explanation:
- Words:
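The color-coded `font` selectors for initial, rhyme and tone can be illustrated without Scrapy. The sketch below is a standard-library stand-in, not the crawler's actual selector code: the class name `PronunciationParser`, the helper `parse_pronunciation`, and the `<td>` wrapper around the sample are assumptions made for the example.

```python
from html.parser import HTMLParser

# Mapping taken from the selectors above: red = initial, green = rhyme, blue = tone.
COLOR_TO_FIELD = {"red": "initial", "green": "rhyme", "blue": "tone"}


class PronunciationParser(HTMLParser):
    """Collect the text of <font> tags whose color marks a pronunciation field."""

    def __init__(self):
        super().__init__()
        self.fields = {"initial": "", "rhyme": "", "tone": ""}
        self._current = None  # field being filled, or None outside a known font

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self._current = COLOR_TO_FIELD.get(dict(attrs).get("color"))

    def handle_endtag(self, tag):
        if tag == "font":
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data.strip()


def parse_pronunciation(cell_html: str) -> dict:
    """Return the initial, rhyme and tone found in one table cell."""
    parser = PronunciationParser()
    parser.feed(cell_html)
    parser.close()
    return parser.fields


# Hypothetical first cell of a data row, using the site's font markup.
cell = (
    '<td><font size="+1" color="red">s</font>'
    '<font size="+1" color="green">im</font>'
    '<font size="+1" color="blue">4</font></td>'
)
result = parse_pronunciation(cell)
print(result)
```

In a Scrapy spider the same three values would come from the `::text` selectors listed above; the parser here only shows which piece of markup carries which field.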
Tip: If you run into decoding problems, try decoding as `big5hkscs` instead of `big5`.
$ python3
$ scrapy crawl lexi -s LOG_ENABLED=False -o data3.json
$ python3
The code for building the data is distributed under the MIT license.
The dictionary data follows its original license.
<div nowrap>枝指</div>
<div nowrap>
枝葉, 荔枝, 花枝招展
<a href="#1" onclick="xid_down('zi1_detial')">
<font size="-1">[11..]</font>
<div id="zi1_detial" style="display: none">
枝幹, 枝椏, 樹枝, 折枝, 接枝, 比翼連枝, 同氣連枝, 金枝玉葉, 節外生枝, 細枝末節, 枝葉扶疏
<div nowrap>
<font size="-1" color="maroon">(連任)</font>
, 蟬蛻, 寒蟬
<a href="#1" onclick="xid_down('sim4_detial')">
<font size="-1">[1..]</font>
<div id="sim4_detial" style="display: none">
<div nowrap></div>
<font size="-1" color="forestgreen">
<font size="+1" color="red">s</font>
<font size="+1" color="green">im</font>
<font size="+1" color="blue">4</font>
<div nowrap></div>
<font size="-1" color="forestgreen">
<a href="search.php?q=%D6r">琀</a>
<div nowrap></div>
<font size="-1" color="forestgreen">
<a href="search.php?q=%A7t">含</a>
<div nowrap>
<font size="-1" color="maroon">(殉葬時置死者口中的蟬形玉石)</font>
唅以槁骨, 羹藜唅糗
<font size="-1" color="forestgreen">嘴裡銜著食物</font>
(7) 車、媽、唳、涌、牏
<div nowrap>
車次, 車廂, 車禍
<a href="#1" onclick="xid_down('ce1_detial')">
<font size="-1">[5..]</font>
<div id="ce1_detial" style="display: none">
車裂, 車間, 車輛, 火車, 汽車
<font size="-1" color="forestgreen">
<font size="+1" color="red">g</font>
<font size="+1" color="green">eoi</font>
<font size="+1" color="blue">1</font>
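The word cells in the fragments above mix visible words, an expand link such as `[5..]`, and a hidden `div` holding the remaining words. One way to collect all the words from such a cell is sketched below with the standard library. It rests on assumptions: the closing tags in the sample are filled in because the fragments above are abbreviated, text inside `font` and `a` tags is treated as explanation or widget text and skipped, and the names `WordCellParser` and `extract_words` are hypothetical.

```python
from html.parser import HTMLParser


class WordCellParser(HTMLParser):
    """Gather comma-separated words, ignoring text inside <font> and <a>."""

    SKIPPED_TAGS = {"font", "a"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def words(self) -> list:
        # Join the visible and hidden runs, then split on the comma separator.
        text = ",".join(self._chunks)
        return [w.strip() for w in text.split(",") if w.strip()]


def extract_words(cell_html: str) -> list:
    parser = WordCellParser()
    parser.feed(cell_html)
    parser.close()
    return parser.words()


# Sample cell; the closing tags are assumed, not copied from the site.
cell = (
    "<div nowrap>車次, 車廂, 車禍"
    "<a href=\"#1\" onclick=\"xid_down('ce1_detial')\">"
    "<font size=\"-1\">[5..]</font></a>"
    "<div id=\"ce1_detial\" style=\"display: none\">"
    "車裂, 車間, 車輛, 火車, 汽車</div></div>"
)
words = extract_words(cell)
print(words)
```

Skipping `font` text also drops parenthesised explanations such as the `maroon` ones, so they would need a separate pass if the explanation field is wanted.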