Text Mining for Causes of Ship Accidents Based on PMI and BTM

Abstract: To automatically extract water traffic safety knowledge from large numbers of ship accident investigation reports, this paper proposes a semantic mining method that works at two levels, words and topics, and applies it to a corpus of 100 investigation reports on ship foundering accidents. At the word level, the pointwise mutual information (PMI) algorithm is used to mine frequently co-occurring word patterns from the texts describing accident causes, so that the co-occurrence of feature words reveals the associations between accident-causing factors. At the topic level, the biterm topic model (BTM) is used to model the accident-cause texts, and the modeling results are evaluated by topic log-likelihood and topic coherence. Topic modeling clusters the feature words that characterize the causes of foundering accidents, and the distribution of topics over the document collection gives a preliminary estimate of the occurrence probability of each cause. In a test of the model's predictive ability on 500 new data sets, the topic model recognized and automatically ignored 100% of the domain-independent words. For 85.6% of the words in the corpus, the model clearly assigned the word to a topic representing a specific cause; for the remaining 14.4%, the topic boundary was not distinct, and the word could not be assigned to any single topic with high probability.

Key words:
- traffic safety
- ship accident investigation reports
- text mining
- topic model
- word co-occurrence
- PMI algorithm
- BTM algorithm

Table 1. Different storage formats of accident investigation reports

| Storage format | Number of reports |
| --- | --- |
| .pdf | 78 |
| .docx | 1 |
| .doc | 19 |
| .html | 2 |

Table 2. Different sources of accident investigation reports
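
| Source of reports | Number of reports |
| --- | --- |
| Guangdong Maritime Safety Administration (广东海事局) | 21 |
| Zhejiang Maritime Safety Administration (浙江海事局) | 14 |
| Jiangsu Maritime Safety Administration (江苏海事局) | 13 |
| Shanghai Maritime Safety Administration (上海海事局) | 11 |
| Hebei Maritime Safety Administration (河北海事局) | 9 |
| Maritime Safety Administration of the Ministry of Transport (交通运输部海事局) | 8 |
| Fujian Maritime Safety Administration (福建海事局) | 7 |
| Liaoning Maritime Safety Administration (辽宁海事局) | 7 |
| Lianyungang Maritime Safety Administration (连云港海事局) | 3 |
| Tianjin Maritime Safety Administration (天津海事局) | 3 |
| Shandong Maritime Safety Administration (山东海事局) | 2 |
| Changjiang Maritime Safety Administration (长江海事局) | 2 |
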
Table 3. Frequent co-occurrence terms
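
| Co-occurring word pair | Co-occurrence count | Pointwise mutual information (PMI) |
| --- | --- | --- |
| cargo hold / water ingress (货舱 进水) | 39 | 4.455168 |
| large amount / water ingress (大量 进水) | 31 | 6.06541 |
| severe / weather (恶劣 天气) | 27 | 7.278362 |
| crew / incompetent (船员 不适任) | 27 | 6.516095 |
| ship / stability (船舶 稳性) | 27 | 3.176843 |
| safety awareness / weak (安全意识 淡薄) | 25 | 9.250386 |
| ship / reserve buoyancy (船舶 储备浮力) | 21 | 3.409883 |
| risky / navigation (冒险 航行) | 20 | 6.600984 |
| ship / starboard list (船舶 右倾) | 20 | 3.246384 |
| beyond approved navigation area / navigation (超航区 航行) | 19 | 6.246875 |
| ship / loss (船舶 丧失) | 19 | 2.757346 |
| deck / shipping of water (甲板 上浪) | 18 | 6.660978 |
| ship / unseaworthy (船舶 不适航) | 18 | 4.287026 |
| ship / capsize (船舶 倾覆) | 18 | 3.591881 |
| cargo hold / large amount (货舱 大量) | 17 | 5.232425 |
| loss / stability (丧失 稳性) | 16 | 5.917811 |
| ship / heel (船舶 倾斜) | 16 | 3.151867 |
| emergency response / improper (应急处置 不当) | 15 | 7.456837 |
| damage / water ingress (破损 进水) | 15 | 5.348253 |
| ship / rolling (船舶 横摇) | 15 | 3.772453 |
| loss / reserve buoyancy (丧失 储备浮力) | 14 | 6.320775 |
| stability / loss (稳性 丧失) | 14 | 5.683346 |
| manning / insufficient (配员 不足) | 13 | 7.705713 |
| hull / damage (船体 破损) | 12 | 6.2515 |
| strong wind and waves / weather (大风浪 天气) | 12 | 5.223914 |
| strong wind and waves / navigation (大风浪 航行) | 11 | 3.96088 |
| ship / beyond approved navigation area (船舶 超航区) | 11 | 2.881387 |
| serious / overloading (严重 超载) | 10 | 7.125078 |
| cargo / shifting (货物 移位) | 10 | 6.738487 |
| ship / port list (船舶 左倾) | 10 | 3.661422 |
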
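The co-occurrence counts and PMI scores in Table 3 can be illustrated with a minimal sketch. It assumes the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))), counts co-occurrence at the level of a whole tokenized document, and uses a toy corpus with made-up tokens; the paper's actual co-occurrence window, logarithm base, and preprocessing are not stated here, so the function `pmi_table` and its parameters are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents, min_count=1):
    """Return {(w1, w2): (co-occurrence count, PMI)} for word pairs.

    Assumptions (not taken from the paper): two words co-occur when both
    appear in the same tokenized document, and the logarithm is base 2.
    """
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing each word pair
    for tokens in documents:
        vocab = sorted(set(tokens))  # sorted so each pair has one canonical key
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    result = {}
    for (w1, w2), n_xy in pair_df.items():
        if n_xy < min_count:
            continue
        p_xy = n_xy / n_docs
        p_x = word_df[w1] / n_docs
        p_y = word_df[w2] / n_docs
        result[(w1, w2)] = (n_xy, math.log2(p_xy / (p_x * p_y)))
    return result


# Toy corpus (illustrative tokens only, not the paper's data).
docs = [
    ["cargo_hold", "water_ingress", "ship"],
    ["cargo_hold", "water_ingress"],
    ["ship", "stability"],
    ["ship", "weather"],
]
table = pmi_table(docs)
count, pmi = table[("cargo_hold", "water_ingress")]
print(count, pmi)  # 2 1.0
```

A pair that always appears together scores high, while a pair involving a very common word such as "ship" scores lower for the same count. This is consistent with the pattern in Table 3, where pairs containing 船舶 (ship) have comparatively low PMI values.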
Table 4. Topic-model evaluation measures for different numbers of topics

| Number of topics K | Log-likelihood of the topic model | Topic coherence |
| --- | --- | --- |
| 5 | -1800879.932 | -59.0797 |
| 6 | -1793935.425 | -60.54118 |
| ⋮ | ⋮ | ⋮ |
| 20 | -1752488.35 | -48.7739 |
| 21 | -1750153.414 | -55.45451 |
| ⋮ | ⋮ | ⋮ |
| 49 | -1720233.109 | -69.82296 |
| 50 | -1718244.737 | -70.87152 |

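Among the rows shown in Table 4, the log-likelihood keeps rising as K grows, while topic coherence is highest at K = 20, which matches the 20 topics listed in Table 7. The sketch below assumes that K is chosen by the highest coherence; this selection rule is an assumption, the helper `best_k_by_coherence` is hypothetical, and the rows elided in the table are not included.

```python
# (K, log-likelihood, coherence) for the rows visible in Table 4.
# Rows elided in the table are omitted here.
rows = [
    (5, -1800879.932, -59.0797),
    (6, -1793935.425, -60.54118),
    (20, -1752488.35, -48.7739),
    (21, -1750153.414, -55.45451),
    (49, -1720233.109, -69.82296),
    (50, -1718244.737, -70.87152),
]


def best_k_by_coherence(rows):
    """Return the K whose coherence is highest (assumed selection rule)."""
    return max(rows, key=lambda row: row[2])[0]


def best_k_by_log_likelihood(rows):
    """Return the K whose log-likelihood is highest, for comparison."""
    return max(rows, key=lambda row: row[1])[0]


best_k = best_k_by_coherence(rows)
print(best_k)                          # 20
print(best_k_by_log_likelihood(rows))  # 50
```

On these rows, log-likelihood alone would favor the largest K, which is why a second measure such as coherence is useful for choosing the number of topics.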
Table 5. Probability distribution of words in topics
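
| Word | Z1 | Z2 | … | Z20 |
| --- | --- | --- | --- | --- |
| safety (安全) | 0.006728 | 0.008146 | … | 0.007857 |
| shore-based (岸基) | 5.65×10⁻⁷ | 1.09×10⁻⁶ | … | 0.008007 |
| rainstorm (暴雨) | 0.001753 | 1.09×10⁻⁶ | … | 7.48×10⁻⁷ |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| overflow (溢出) | 5.65×10⁻⁷ | 1.09×10⁻⁶ | … | 0.001946 |
| free surface (自由液面) | 0.002997 | 0.000653 | … | 7.48×10⁻⁷ |
| trim (纵倾) | 0.000509 | 0.003693 | … | 7.48×10⁻⁷ |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
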
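Table 6. Examples of the 10 most probable words under each topic

| Z5 word | Z5 probability | Z6 word | Z6 probability | Z14 word | Z14 probability |
| --- | --- | --- | --- | --- | --- |
| ship (船舶) | 0.1596 | accident waters (事发水域) | 0.0566 | ship (船舶) | 0.0904 |
| inland river (内河) | 0.0368 | ship (船舶) | 0.042 | strong wind and waves (大风浪) | 0.0364 |
| navigation (航行) | 0.0365 | complex (复杂) | 0.0236 | water ingress (进水) | 0.0358 |
| not satisfied (不满足) | 0.021 | water flow (水流) | 0.0215 | navigation (航行) | 0.0332 |
| overloading (超载) | 0.0199 | tidal current (潮流) | 0.0196 | cargo hold (货舱) | 0.0294 |
| hatch cover (舱盖) | 0.0196 | officer (驾驶员) | 0.0194 | wind and waves (风浪) | 0.0234 |
| freeboard (干舷) | 0.0193 | improper (不当) | 0.0191 | shipping of water (上浪) | 0.0217 |
| stability (稳性) | 0.016 | maneuvering (操纵) | 0.0181 | weather (天气) | 0.0166 |
| beyond approved navigation area (超航区) | 0.0155 | spring tide (大潮汛) | 0.0172 | improper (不当) | 0.015 |
| without (没有) | 0.0153 | turbulent (紊乱) | 0.0171 | severe (恶劣) | 0.0148 |
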
Table 7. Meaning of topics
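
| Topic | Meaning |
| --- | --- |
| Z1 | Weak safety awareness among the crew |
| Z2 | Insufficient hull strength |
| Z3 | Effects of silt and free surface |
| Z4 | Negligent watchkeeping; abnormalities not detected in time |
| Z5 | Sailing beyond the approved navigation area and overloaded |
| Z6 | Complex navigation environment in the accident waters |
| Z7 | Inadequate shore-based support from the shipping company |
| Z8 | Loading operations at the terminal in violation of regulations |
| Z9 | Insufficient ship stability |
| Z10 | Hull damage and water ingress |
| Z11 | Incompetent crew |
| Z12 | Insufficient reserve buoyancy; listing after water ingress |
| Z13 | Inadequate safety management |
| Z14 | Effects of strong wind and waves |
| Z15 | Improper maneuvering by the officer on watch |
| Z16 | Cargo hold not meeting weathertightness requirements |
| Z17 | Reckless towing |
| Z18 | Unauthorized, illegal modification of the ship |
| Z19 | Cargo liable to liquefy; cargo shifting |
| Z20 | Improper emergency response by the master |
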
Table 8. Probability distribution of topics

| Topic | Topic probability |
| --- | --- |
| Z1 | 0.06104792 |
| Z2 | 0.03174713 |
| Z3 | 0.04475972 |
| Z4 | 0.03259022 |
| Z5 | 0.05263085 |
| Z6 | 0.03109063 |
| Z7 | 0.05848410 |
| Z8 | 0.02903819 |
| Z9 | 0.06457922 |
| Z10 | 0.03764875 |
| Z11 | 0.06840767 |
| Z12 | 0.03325363 |
| Z13 | 0.12157063 |
| Z14 | 0.12080356 |
| Z15 | 0.03030282 |
| Z16 | 0.04970077 |
| Z17 | 0.02686827 |
| Z18 | 0.03319144 |
| Z19 | 0.02617721 |
| Z20 | 0.04610728 |

Table 9. New dataset for testing
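
| No. | Word |
| --- | --- |
| 1 | full-load displacement (满载排水量) |
| 2 | tide (潮汐) |
| 3 | precision marketing (精准营销) |
| 4 | pre-job training (岗前培训) |
| 5 | beaching (抢滩) |
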
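Table 10. Prediction results

| Topic | full-load displacement (满载排水量) | tide (潮汐) | pre-job training (岗前培训) | beaching (抢滩) |
| --- | --- | --- | --- | --- |
| Z1 | 0.000 | 0.000 | 0.000 | 0.153 |
| Z2 | 0.000 | 0.000 | 0.000 | 0.000 |
| Z3 | 0.000 | 0.000 | 0.000 | 0.000 |
| Z4 | 0.000 | 0.000 | 0.000 | 0.000 |
| Z5 | 0.953 | 0.213 | 0.000 | 0.000 |
| Z6 | 0.000 | 0.780 | 0.000 | 0.000 |
| Z7 | 0.000 | 0.000 | 0.996 | 0.000 |
| Z8 | 0.000 | 0.000 | 0.000 | 0.206 |
| Z9 | 0.040 | 0.000 | 0.000 | 0.000 |
| Z10 | 0.000 | 0.000 | 0.000 | 0.000 |
| Z11 | 0.000 | 0.000 | 0.000 | 0.000 |
| Z12 | 0.000 | 0.000 | 0.000 | 0.000 |
| Z13 | 0.000 | 0.000 | 0.000 | 0.000 |
| Z14 | 0.000 | 0.000 | 0.000 | 0.000 |
| Z15 | 0.000 | 0.000 | 0.000 | 0.000 |
| Z16 | 0.000 | 0.000 | 0.000 | 0.000 |
| Z17 | 0.000 | 0.000 | 0.000 | 0.000 |
| Z18 | 0.000 | 0.000 | 0.000 | 0.000 |
| Z19 | 0.000 | 0.000 | 0.000 | 0.000 |
| Z20 | 0.000 | 0.000 | 0.000 | 0.640 |
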
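One way to read Table 10 is to assign each test word to its most probable topic, and to treat a word as having an unclear topic boundary when no topic is probable enough. The sketch below takes the non-zero entries of Table 10 as input. The function `assign_topic` and the cut-off values 0.5 and 0.7 are hypothetical illustrations, not the criterion used in the paper.

```python
def assign_topic(topic_probs, threshold=0.5):
    """Return (topic, probability) for the most probable topic.

    If the highest probability is below `threshold` (a hypothetical cut-off),
    or the word has no topic probabilities at all, the topic is None.
    """
    if not topic_probs:
        return None, 0.0
    topic, prob = max(topic_probs.items(), key=lambda item: item[1])
    return (topic, prob) if prob >= threshold else (None, prob)


# Non-zero entries of Table 10, keyed by the English gloss of each test word.
predictions = {
    "full-load displacement": {"Z5": 0.953, "Z9": 0.040},  # 满载排水量
    "tide": {"Z5": 0.213, "Z6": 0.780},                    # 潮汐
    "pre-job training": {"Z7": 0.996},                     # 岗前培训
    "beaching": {"Z1": 0.153, "Z8": 0.206, "Z20": 0.640},  # 抢滩
}

assigned = {word: assign_topic(probs) for word, probs in predictions.items()}
for word, (topic, prob) in assigned.items():
    print(word, topic, prob)

# With a stricter, equally hypothetical cut-off, "beaching" stays unassigned.
strict = assign_topic(predictions["beaching"], threshold=0.7)
```

With the 0.5 cut-off, all four words are assigned (Z5, Z6, Z7, and Z20 respectively). "Beaching" is the only word whose probability mass is spread over three topics, so it is the first to become unassigned as the cut-off rises.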