For text mining, the package I originally learned was Jieba, but lately I have found BosonNLP quite handy.
BosonNLP part-of-speech tag reference:
http://docs.bosonnlp.com/tag.html
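Sentiment analysis: nlp.sentiment(s)
Named entity recognition: nlp.ner(s)
Dependency parsing: nlp.depparser(s)
Keyword extraction: nlp.extract_keywords(s, top_k=10)
News classification: nlp.classify(s)
Semantic association: nlp.suggest(term, top_k=10)
Word segmentation and POS tagging: nlp.tag(s)
Time conversion: nlp.convert_time("今天晚上8点到明天下午3点", datetime.datetime(2015, 9, 1))
News summarization: nlp.summary(title, content)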
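As a rough sketch of how the tagging call is typically consumed: nlp.tag returns one result per input text, and each result holds parallel lists. The 'word' key is used in the code later in this post; the 'tag' key is assumed here from the BosonNLP documentation. The helper name pair_words_with_tags is hypothetical, and the sample data is hand-written to stand in for a real API response so the snippet runs without a token.

```python
def pair_words_with_tags(tag_results):
    """Turn nlp.tag-style results into lists of (word, tag) pairs, one list per text."""
    paired = []
    for result in tag_results:
        # Each result is assumed to hold parallel 'word' and 'tag' lists.
        paired.append(list(zip(result["word"], result["tag"])))
    return paired


# Hand-written sample shaped like an nlp.tag(s) response (not real API output).
sample = [{"word": ["東京", "溫泉", "旅遊"], "tag": ["ns", "n", "v"]}]

for word, tag in pair_words_with_tags(sample)[0]:
    print(f"{word}/{tag}")
```

With a real client, the same helper could be called as pair_words_with_tags(nlp.tag(s)).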
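I also wanted to visualize the text as a word cloud. I tried pytagcloud first, but with both English and Chinese input it failed with the same error message, so I put it aside for now and switched to wordcloud. The error was:
'cp950' codec can't decode byte 0xc3 in position 223: illegal multibyte sequence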
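An error like this usually means UTF-8 bytes are being decoded with cp950, the default codec on Traditional Chinese Windows, for example when a file is opened without an explicit encoding. Whether that is exactly what happens inside pytagcloud is not confirmed here. The sketch below only reproduces the general mechanism; the can_decode helper and the sample string are hypothetical.

```python
def can_decode(data: bytes, codec: str) -> bool:
    """Return True if the bytes decode cleanly with the given codec."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# UTF-8 bytes containing 0xc3, the byte named in the error message.
data = "À la carte".encode("utf-8")

print(can_decode(data, "utf-8"))   # the bytes are valid UTF-8
print(can_decode(data, "cp950"))   # cp950 may reject or mis-decode them
```

When reading files yourself, passing encoding="utf-8" to open() avoids depending on the platform default.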
    import re

    import matplotlib.pyplot as plt
    import numpy as np
    from bosonnlp import BosonNLP
    from PIL import Image
    from wordcloud import WordCloud

    # group_data_pd is a DataFrame defined earlier (not shown here);
    # '行程內容' is the itinerary-content column
    iti = group_data_pd['行程內容'].iloc[0]

    # Replace punctuation and bullet symbols with spaces
    split_word = r",|。|、|●|◎|◆|★|:|/|\(|\)|《|》|-|!|~|『|』|「|」|【|】|XXX"
    iti_s = re.split(split_word, iti)
    iti_ss = " ".join(iti_s)

    bosonnlp_token = "MufqZvMn.17180.1TXDWsWO0_qk"
    nlp = BosonNLP(bosonnlp_token)
    nlp_tag = nlp.tag(iti_ss)
    word_join = ' '.join(nlp_tag[0]['word'])
    result = nlp.extract_keywords(iti_ss)

    # Mask image for the word cloud
    abel_mask = np.array(Image.open("lion_icon2.png"))

    # Generate the word cloud with wordcloud
    my_wordcloud = WordCloud(background_color="white", mask=abel_mask).generate(word_join)
    plt.figure(figsize=(15,10))
    plt.imshow(my_wordcloud)
    plt.axis("off")
    plt.show()
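The cleanup step in the code above can be tried in isolation. This is a minimal sketch using the same delimiter pattern; the clean_text helper and the sample strings are hypothetical, and unlike the original it drops the empty pieces that re.split leaves behind next to delimiters.

```python
import re

# Same alternation of delimiters as in the post, written as a raw string.
SPLIT_PATTERN = r",|。|、|●|◎|◆|★|:|/|\(|\)|《|》|-|!|~|『|』|「|」|【|】|XXX"


def clean_text(text):
    """Split on the delimiters and rejoin the non-empty pieces with spaces."""
    pieces = re.split(SPLIT_PATTERN, text)
    return " ".join(p.strip() for p in pieces if p.strip())


print(clean_text("東京/淺草寺、晴空塔。"))
```

Filtering the empty pieces keeps runs of spaces out of the text that is later sent to nlp.tag.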
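Words kept showing up more than once in the cloud. I then found the following note in wordcloud.py, and after adding collocations=False the output was normal: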
The input "text" is expected to be a natural text. If you pass a sorted
list of words, words will appear in your output twice. To remove this
duplication, set collocations=False.
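Putting the fix into practice might look like the sketch below. collocations, background_color, mask and font_path are existing WordCloud parameters; the helper names join_tokens and make_wordcloud are hypothetical. WordCloud is imported inside the function so the token-joining part runs even where wordcloud is not installed.

```python
def join_tokens(words):
    """Join pre-segmented tokens with spaces, skipping empty or whitespace-only ones."""
    return " ".join(w for w in words if w.strip())


def make_wordcloud(words, mask=None, font_path=None):
    """Build a word cloud from already-segmented tokens without bigram duplicates."""
    # Imported lazily so join_tokens works without wordcloud installed.
    from wordcloud import WordCloud

    cloud = WordCloud(
        background_color="white",
        mask=mask,
        font_path=font_path,   # for Chinese text, typically a CJK-capable font file
        collocations=False,    # do not add two-word phrases on top of single words
    )
    return cloud.generate(join_tokens(words))


print(join_tokens(["東京", " ", "溫泉", "", "東京"]))
```

In the code from this post, the call would be roughly make_wordcloud(nlp_tag[0]['word'], mask=abel_mask), followed by the same plt.imshow steps.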