[1] WU Lei, LI Shu. An Attempt on Data Preprocessing for Text Mining in TCM Prescription Database[J]. Chinese Journal of Library and Information Science for Traditional Chinese Medicine, 2015, 39(3): 8-11. [doi:10.3969/j.issn.2095-5707.2015.03.003]
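An Attempt on Data Preprocessing for Text Mining in TCM Prescription Database
Chinese Journal of Library and Information Science for Traditional Chinese Medicine [ISSN: 2095-5707 / CN: 10-1113/R]
Volume: 39
Issue: 2015, No. 3
Pages: 8-11
Column: TCM Information Research
Publication date: 2015-06-01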
- Title: An Attempt on Data Preprocessing for Text Mining in TCM Prescription Database
- Document code: A
- Abstract:
- Objective: To propose a data preprocessing method, centered on data cleaning, for data mining of TCM prescriptions, so that the data become standardized, accurate, and ordered for subsequent processing. Methods: Source text data were retrieved from prescription databases using search techniques. Non-normalized data were cleaned through auxiliary word group line processing, regular expression substitution, and alias (synonym) processing to improve data quality. Results: A total of 1758 records were retrieved from the Chinese Prescription Database and 91 records from the Prescription Modern Application Database. After preprocessing, the source text yielded 6913 valid herb records, which were successfully imported into the relevant information mining system for extraction of prescription names and Chinese herb names. Conclusion: The method is suitable for text mining and knowledge discovery based on TCM prescription databases. It cleans source text data successfully, yields uniformly standardized, noise-free data, and enables effective extraction of the required prescription and herb information, providing a useful reference for research on the analysis and mining of TCM prescription text data.
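The cleaning steps named in the abstract (removing auxiliary words, regular expression substitution, alias processing) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the patterns, the alias table, and the function name `clean_prescription` are hypothetical, and the abstract does not give the authors' actual rules.

```python
import re
from typing import List

# Hypothetical alias table (variant name -> canonical name), illustrative only.
ALIASES = {
    "国老": "甘草",
    "川军": "大黄",
}

# Assumed patterns; the paper's real rules are not stated in the abstract.
# Parenthetical preparation notes such as "(后下)".
NOTE = re.compile(r"[（(][^）)]*[）)]")
# Dosages: Arabic or Chinese numerals followed by a unit.
DOSAGE = re.compile(
    r"(?:[0-9.]+|[一二三四五六七八九十百半]+)\s*(?:g|克|两|钱|分|枚|片)"
)
# Separators between herbs in a composition string.
SEPARATORS = re.compile(r"[、，,；;\s]+")


def clean_prescription(raw: str) -> List[str]:
    """Return canonical herb names from one raw composition string."""
    text = NOTE.sub("", raw)        # drop preparation notes
    text = DOSAGE.sub("", text)     # drop dosages
    herbs = [h for h in SEPARATORS.split(text) if h]
    # Map known aliases to a canonical name; keep unknown names unchanged.
    return [ALIASES.get(h, h) for h in herbs]


if __name__ == "__main__":
    print(clean_prescription("国老6g、川军(后下)三钱,桂枝 9克"))
```

A naive dosage pattern like this one can misfire on herb names that themselves contain numerals followed by a unit character, so a real pipeline would likely need a curated exception list.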
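Memo
Received: 2014-08-24. Funding: Research Project of the Education Department of Liaoning Province (L2012345). First author: WU Lei, associate professor; research interest: TCM informatics. E-mail: l.wu-sy@qq.com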
Last Update: 2015-05-28