Text Mining for Causes of Ship Accidents Based on PMI and BTM

YU Weihong (于卫红); FU Piaoyun (付飘云); REN Yue (任月); WANG Qingwu (王庆武)

doi:10.3963/j.jssn.1674-4861.2021.01.0005

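1. Maritime Economics and Management College, Dalian Maritime University, Dalian 116026, Liaoning, China
2. Navigation College, Dalian Maritime University, Dalian 116026, Liaoning, China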

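Funding:

National Key R&D Program of China (2019YFB1600602)

Fundamental Research Funds for the Central Universities (3132020139)

Corresponding author:
YU Weihong (born 1972), Ph.D., associate professor. Research interests: intelligent information processing, text mining. E-mail: yuwhdmu@dlmu.edu.cn

CLC number: U698.6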
Publication history
- Received: 2020-10-23
- Published: 2021-02-28

Abstract: To automatically extract water traffic safety knowledge from large volumes of ship accident investigation reports, this paper proposes a semantic mining method that works at two levels, words and topics, and applies it to a corpus of 100 investigation reports on ship foundering accidents. At the word level, the pointwise mutual information (PMI) algorithm is used to mine frequently co-occurring word patterns from the accident-cause texts, and the co-occurrence of feature words reveals associations between accident-causing factors. At the topic level, the biterm topic model (BTM) algorithm is used to build a topic model of the accident-cause texts, and the quality of the model is evaluated by topic log-likelihood and topic coherence. Topic modeling clusters the feature words that characterize the causes of foundering accidents, and the distribution of topics over the document collection gives a preliminary estimate of the probability of each cause. In a test of the model's predictive ability on 500 new data samples, the topic model identified and automatically ignored 100% of the out-of-domain words. For 85.6% of the words in the corpus, the model could clearly assign the word to a topic representing a specific cause; for the remaining 14.4%, the topic boundary was unclear, and the word could not be assigned to any single topic with high probability.

Keywords: traffic safety; ship accident investigation reports; text mining; topic model; word co-occurrence; PMI algorithm; BTM algorithm
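
BTM, the biterm topic model, learns topics from biterms, that is, unordered pairs of words that co-occur in the same short text, rather than from per-document word counts. The sketch below illustrates only this extraction step. The tokens are invented English stand-ins for segmented accident-cause sentences, and `extract_biterms` is a hypothetical helper name, not code from the paper.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short text.

    Each pair is sorted so that (a, b) and (b, a) count as the same biterm.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


# Invented toy corpus: each inner list is one tokenized cause sentence.
toy_corpus = [
    ["hull", "damage", "flooding"],
    ["cargo_hold", "flooding"],
]

# BTM is trained on the pooled biterms of the whole corpus.
biterm_counts = Counter()
for tokens in toy_corpus:
    biterm_counts.update(extract_biterms(tokens))

first_doc_biterms = extract_biterms(toy_corpus[0])
print(first_doc_biterms)
print(biterm_counts)
```

Pooling biterms across the corpus is what lets the model cope with short texts, where a single document contains too few words to estimate a topic mixture reliably.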

Figure 1. Research approach
Figure 2. Word co-occurrence semantic network
Figure 3. Log-likelihood curve of the topic model
Figure 4. Topic coherence curve
Figure 5. Visualization of topic-modeling results
Figure 6. Bar chart of the topic probability distribution

Table 1. Storage formats of the accident investigation reports

Storage format	Count
.pdf	78
.docx	1
.doc	19
.html	2

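Table 2. Sources of the accident investigation reports

Source	Count
Guangdong Maritime Safety Administration	21
Zhejiang Maritime Safety Administration	14
Jiangsu Maritime Safety Administration	13
Shanghai Maritime Safety Administration	11
Hebei Maritime Safety Administration	9
Maritime Safety Administration of the Ministry of Transport	8
Fujian Maritime Safety Administration	7
Liaoning Maritime Safety Administration	7
Lianyungang Maritime Safety Administration	3
Tianjin Maritime Safety Administration	3
Shandong Maritime Safety Administration	2
Changjiang Maritime Safety Administration	2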

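Table 3. Frequently co-occurring word pairs

Co-occurring word pair	Co-occurrence count	Pointwise mutual information (PMI)
货舱进水 (cargo-hold flooding)	39	4.455168
大量进水 (heavy water ingress)	31	6.06541
恶劣天气 (severe weather)	27	7.278362
船员不适任 (crew not competent)	27	6.516095
船舶稳性 (ship stability)	27	3.176843
安全意识淡薄 (weak safety awareness)	25	9.250386
船舶储备浮力 (ship reserve buoyancy)	21	3.409883
冒险航行 (risky sailing)	20	6.600984
船舶右倾 (ship listing to starboard)	20	3.246384
超航区航行 (sailing beyond the permitted navigation area)	19	6.246875
船舶丧失 (ship loses)	19	2.757346
甲板上浪 (seas shipped on deck)	18	6.660978
船舶不适航 (ship unseaworthy)	18	4.287026
船舶倾覆 (ship capsizing)	18	3.591881
货舱大量 (cargo hold, large amount)	17	5.232425
丧失稳性 (loss of stability)	16	5.917811
船舶倾斜 (ship heeling)	16	3.151867
应急处置不当 (improper emergency response)	15	7.456837
破损进水 (flooding through damage)	15	5.348253
船舶横摇 (ship rolling)	15	3.772453
丧失储备浮力 (loss of reserve buoyancy)	14	6.320775
稳性丧失 (stability lost)	14	5.683346
配员不足 (insufficient manning)	13	7.705713
船体破损 (hull damage)	12	6.2515
大风浪天气 (heavy wind-and-wave weather)	12	5.223914
大风浪航行 (sailing in heavy wind and waves)	11	3.96088
船舶超航区 (ship beyond the permitted navigation area)	11	2.881387
严重超载 (severe overloading)	10	7.125078
货物移位 (cargo shifting)	10	6.738487
船舶左倾 (ship listing to port)	10	3.661422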
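
The PMI column in Table 3 measures how much more often two words occur together than they would if they were independent: PMI(x, y) = log( p(x, y) / (p(x) p(y)) ). The following is a minimal sketch of that calculation, not the paper's implementation. It assumes that "co-occurrence" means two adjacent tokens in a segmented sentence and that the logarithm is base 2; this section states neither the window nor the base. The toy corpus and the function name `pmi_table` are invented for illustration.

```python
import math
from collections import Counter


def pmi_table(sentences, min_count=1):
    """Map each adjacent word pair to (co-occurrence count, PMI in bits).

    Assumptions (not from the source): pairs are adjacent tokens, and
    probabilities are plain relative frequencies with no smoothing.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        pair_counts.update(zip(tokens, tokens[1:]))

    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())

    table = {}
    for (x, y), count in pair_counts.items():
        if count < min_count:
            continue  # drop rare pairs, whose PMI is unreliable
        p_xy = count / n_pairs
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        table[(x, y)] = (count, math.log2(p_xy / (p_x * p_y)))
    return table


# Invented toy corpus of tokenized cause sentences.
toy_sentences = [
    ["cargo_hold", "flooding"],
    ["cargo_hold", "flooding"],
    ["ship", "capsizing"],
]

result = pmi_table(toy_sentences)
for pair, (count, pmi) in sorted(result.items(), key=lambda kv: -kv[1][0]):
    print(pair, count, round(pmi, 3))
```

Raw PMI favors rare pairs, which is why Table 3 is more informative when count and PMI are read together: a pair such as 安全意识淡薄 has both a high count and a high PMI, while 船舶丧失 is frequent mainly because 船舶 is frequent.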

Table 4. Topic-model evaluation measures for different numbers of topics

Number of topics K	Log-likelihood of the topic model	Topic coherence
5	-1800879.932	-59.0797
6	-1793935.425	-60.54118
⋮	⋮	⋮
20	-1752488.35	-48.7739
21	-1750153.414	-55.45451
⋮	⋮	⋮
49	-1720233.109	-69.82296
50	-1718244.737	-70.87152
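
Among the rows shown in Table 4, the log-likelihood keeps rising with K, while coherence is highest at K = 20 (-48.7739), which agrees with the 20 topics Z1 to Z20 reported in the later tables. This section does not state which coherence formula was used. One common choice that yields negative values of this kind is the UMass measure of Mimno et al., sketched below as an assumption. The function name `umass_coherence` and the toy documents are invented for illustration.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of one topic.

    top_words: the topic's top words, most probable first.
    documents: tokenized reference documents.
    Score = sum over word pairs of log((D(w_m, w_l) + 1) / D(w_l)),
    where D counts the documents containing all the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # word never appears; skip to avoid dividing by zero
            d_ml = doc_freq(top_words[m], top_words[l])
            score += math.log((d_ml + 1) / d_l)
    return score


# Invented toy documents.
toy_docs = [["a", "b"], ["a", "b"], ["a", "c"]]

coherent = umass_coherence(["a", "b"], toy_docs)    # "b" usually appears with "a"
incoherent = umass_coherence(["a", "c"], toy_docs)  # "c" rarely appears with "a"
print(coherent, incoherent)
```

Scores closer to zero indicate top words that tend to appear in the same documents, so when comparing candidate values of K the model with the highest average score is preferred.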

Table 5. Probability distribution of words in topics

Word	Z1	Z2	…	Z20
安全 (safety)	0.006728	0.008146	…	0.007857
岸基 (shore-based)	5.65×10^-7	1.09×10^-6	…	0.008007
暴雨 (rainstorm)	0.001753	1.09×10^-6	…	7.48×10^-7
⋮	⋮	⋮		⋮
溢出 (overflow)	5.65×10^-7	1.09×10^-6	…	0.001946
自由液面 (free surface)	0.002997	0.000653	…	7.48×10^-7
纵倾 (trim)	0.000509	0.003693	…	7.48×10^-7
⋮	⋮	⋮		⋮

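Table 6. Examples of the 10 most probable words under each topic

Z5		Z6		Z14
Word	Probability	Word	Probability	Word	Probability
船舶 (ship)	0.1596	事发水域 (accident waters)	0.0566	船舶 (ship)	0.0904
内河 (inland river)	0.0368	船舶 (ship)	0.042	大风浪 (heavy wind and waves)	0.0364
航行 (sailing)	0.0365	复杂 (complex)	0.0236	进水 (water ingress)	0.0358
不满足 (not satisfied)	0.021	水流 (water flow)	0.0215	航行 (sailing)	0.0332
超载 (overloading)	0.0199	潮流 (tidal current)	0.0196	货舱 (cargo hold)	0.0294
舱盖 (hatch cover)	0.0196	驾驶员 (navigating officer)	0.0194	风浪 (wind and waves)	0.0234
干舷 (freeboard)	0.0193	不当 (improper)	0.0191	上浪 (shipping seas)	0.0217
稳性 (stability)	0.016	操纵 (maneuvering)	0.0181	天气 (weather)	0.0166
超航区 (beyond the permitted navigation area)	0.0155	大潮汛 (spring tide)	0.0172	不当 (improper)	0.015
没有 (without)	0.0153	紊乱 (turbulent)	0.0171	恶劣 (severe)	0.0148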

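Table 7. Meaning of topics

Topic	Meaning
Z1	Weak safety awareness among the crew
Z2	Insufficient hull strength
Z3	Effects of sediment and free surface
Z4	Negligent watchkeeping; anomalies not detected in time
Z5	Sailing beyond the permitted navigation area; overloading
Z6	Complex navigation environment in the accident waters
Z7	Inadequate shore-based support from the shipping company
Z8	Loading operations at the terminal in violation of regulations
Z9	Insufficient ship stability
Z10	Hull damage and flooding
Z11	Crew not competent
Z12	Insufficient reserve buoyancy; heeling after flooding
Z13	Inadequate safety management
Z14	Effects of heavy wind and waves
Z15	Improper maneuvering by the officer on watch
Z16	Cargo hold not weathertight
Z17	Reckless towing
Z18	Unauthorized, illegal modification of the ship
Z19	Cargo prone to liquefaction; cargo shifting
Z20	Improper emergency response by the master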

Table 8. Probability distribution of topics

Topic	Probability
Z1	0.06104792
Z2	0.03174713
Z3	0.04475972
Z4	0.03259022
Z5	0.05263085
Z6	0.03109063
Z7	0.05848410
Z8	0.02903819
Z9	0.06457922
Z10	0.03764875
Z11	0.06840767
Z12	0.03325363
Z13	0.12157063
Z14	0.12080356
Z15	0.03030282
Z16	0.04970077
Z17	0.02686827
Z18	0.03319144
Z19	0.02617721
Z20	0.04610728
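
As a quick reading aid, the short script below checks that the Table 8 probabilities form a distribution and ranks the causes. The probabilities are copied from Table 8 and the three labels are English renderings of the Table 7 meanings; the variable names are illustrative only.

```python
# Topic probabilities copied from Table 8.
topic_probs = {
    "Z1": 0.06104792, "Z2": 0.03174713, "Z3": 0.04475972, "Z4": 0.03259022,
    "Z5": 0.05263085, "Z6": 0.03109063, "Z7": 0.05848410, "Z8": 0.02903819,
    "Z9": 0.06457922, "Z10": 0.03764875, "Z11": 0.06840767, "Z12": 0.03325363,
    "Z13": 0.12157063, "Z14": 0.12080356, "Z15": 0.03030282, "Z16": 0.04970077,
    "Z17": 0.02686827, "Z18": 0.03319144, "Z19": 0.02617721, "Z20": 0.04610728,
}

# English renderings of the Table 7 meanings for the leading topics.
labels = {
    "Z13": "inadequate safety management",
    "Z14": "effects of heavy wind and waves",
    "Z11": "crew not competent",
}

total = sum(topic_probs.values())
ranked = sorted(topic_probs, key=topic_probs.get, reverse=True)
top3 = ranked[:3]
top3_share = sum(topic_probs[z] for z in top3)

print(f"sum of probabilities: {total:.6f}")
for z in top3:
    print(z, labels[z], f"{topic_probs[z]:.4f}")
print(f"share of the top three causes: {top3_share:.4f}")
```

The two leading topics, Z13 and Z14, each carry about 12% of the probability mass, roughly twice the share of any other single cause.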

Table 9. Examples from the new test dataset

No.	Word
1	满载排水量 (full-load displacement)
2	潮汐 (tide)
3	精准营销 (precision marketing)
4	岗前培训 (pre-job training)
5	抢滩 (beaching)

Table 10. Prediction results

Topic	满载排水量 (full-load displacement)	潮汐 (tide)	岗前培训 (pre-job training)	抢滩 (beaching)
Z1	0.000	0.000	0.000	0.153
Z2	0.000	0.000	0.000	0.000
Z3	0.000	0.000	0.000	0.000
Z4	0.000	0.000	0.000	0.000
Z5	0.953	0.213	0.000	0.000
Z6	0.000	0.780	0.000	0.000
Z7	0.000	0.000	0.996	0.000
Z8	0.000	0.000	0.000	0.206
Z9	0.040	0.000	0.000	0.000
Z10	0.000	0.000	0.000	0.000
Z11	0.000	0.000	0.000	0.000
Z12	0.000	0.000	0.000	0.000
Z13	0.000	0.000	0.000	0.000
Z14	0.000	0.000	0.000	0.000
Z15	0.000	0.000	0.000	0.000
Z16	0.000	0.000	0.000	0.000
Z17	0.000	0.000	0.000	0.000
Z18	0.000	0.000	0.000	0.000
Z19	0.000	0.000	0.000	0.000
Z20	0.000	0.000	0.000	0.640
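
Table 10 gives, for each new word, a probability over the 20 topics. One standard way to obtain such a distribution from a fitted topic model is Bayes' rule, P(z | w) ∝ P(z) · P(w | z), combining topic probabilities like those in Table 8 with word-in-topic probabilities like those in Table 5. This section does not say how the predictions were actually computed, so the sketch below is an assumption: the numbers are invented, the model has only three topics, and `topic_posterior` is a hypothetical name.

```python
def topic_posterior(theta, phi_w):
    """Posterior over topics for one word via Bayes' rule.

    theta: P(z) for each topic.
    phi_w: P(w | z) for the same word under each topic.
    Returns None when the word has zero probability under every topic,
    i.e. the model has no basis for assigning it (the word is ignored).
    """
    joint = [t * p for t, p in zip(theta, phi_w)]
    total = sum(joint)
    if total == 0:
        return None
    return [j / total for j in joint]


# Invented three-topic example.
theta = [0.5, 0.3, 0.2]
phi_in_domain = [0.001, 0.02, 0.0005]  # word concentrated in the second topic
phi_unknown = [0.0, 0.0, 0.0]          # word absent from the vocabulary

posterior = topic_posterior(theta, phi_in_domain)
best_topic = max(range(len(posterior)), key=posterior.__getitem__)
print([round(p, 3) for p in posterior], "-> topic index", best_topic)
print(topic_posterior(theta, phi_unknown))
```

Read this way, a column of Table 10 such as 岗前培训 (0.996 under Z7) is a sharply peaked posterior, whereas 抢滩 (0.153, 0.206 and 0.640 across Z1, Z8 and Z20) is an example of a word with a less distinct topic boundary. The word 精准营销 appears in Table 9 but has no column in Table 10, consistent with out-of-domain words being ignored.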
