A Method to Identify Traffic Incidents Based on Social Network Data
Abstract: To identify traffic incidents from social network data, a machine-learning-based text classification method is studied. Raw texts are located by keyword and place and crawled with the web crawler "Beautiful Soup". The texts are preprocessed by regular expression matching, duplicate removal based on a duplication-degree calculation, and "0-1" labeling. Based on the features of the preprocessed texts, a method is proposed for selecting feature words by feature weight. The feature weight combines a word's frequency of occurrence with the proportion of texts that contain the word: the two quantities are normalized and merged by a weighted sum, which gives a feature weight for each unique word in the traffic-incident texts of the training set. Feature words are selected according to this weight and used as the inputs of the classifiers. Tests compare different classifiers and different feature-word selection methods. The results show that the proposed selection method combined with the XGBoost classifier achieves the best overall performance in traffic incident identification, with a precision of 0.6796, a recall of 0.6481, an F1 value of 0.6635, and an AUC of 0.7594.
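The feature-weight calculation described in the abstract can be sketched in Python as follows. The abstract does not give the exact normalization or combination formula, so this sketch assumes min-max normalization and a convex combination `w * tf_norm + (1 - w) * ratio_norm`; the function names `feature_weights` and `select_feature_words` are hypothetical. The defaults `w=0.5` and `k=150` follow the reported parameter settings (Table 7).

```python
from collections import Counter


def _min_max(values):
    """Min-max normalize a dict of numbers to [0, 1] (assumed normalization)."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    # If every value is identical, treat all words as equally strong.
    return {k: (v - lo) / span if span else 1.0 for k, v in values.items()}


def feature_weights(incident_docs, w=0.5):
    """Weight each unique word in the incident texts of the training set.

    incident_docs: list of token lists (already segmented, stop words removed).
    w: share given to word frequency; (1 - w) goes to the text proportion.
    """
    term_counts = Counter(tok for doc in incident_docs for tok in doc)
    if not term_counts:
        return {}
    # Number of texts containing each word, divided by the number of texts.
    doc_counts = Counter(tok for doc in incident_docs for tok in set(doc))
    ratio = {tok: c / len(incident_docs) for tok, c in doc_counts.items()}

    tf_norm = _min_max(dict(term_counts))
    ratio_norm = _min_max(ratio)
    return {tok: w * tf_norm[tok] + (1 - w) * ratio_norm[tok] for tok in term_counts}


def select_feature_words(incident_docs, k=150, w=0.5):
    """Return the k words with the highest feature weight (ties broken by word)."""
    weights = feature_weights(incident_docs, w)
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:k]]


if __name__ == "__main__":
    docs = [["accident", "road", "jam"], ["accident", "accident", "car"], ["jam", "accident"]]
    print(select_feature_words(docs, k=2))
```

A word that is both frequent and spread across many incident texts ranks highest; a word that is frequent in only one text is pulled down by the text-proportion term.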
Table 1. Examples of "0-1" labeled samples
Table 2. An example of word segmentation and stop-word filtering
Table 3. Comparison of precision rates

| Feature selection | KNN | SVM | AdaBoost | XGBoost |
|---|---|---|---|---|
| Proposed method | 0.5839 | 0.6474 | 0.7205 | 0.6796 |
| TF-IDF | 0.5282 | 0.7302 | 0.6404 | 0.5942 |
| CHI | 0.6552 | 0.6467 | 0.6554 | 0.6258 |
| LDA | 0.3970 | 0.5273 | 0.4029 | 0.4551 |

Table 4. Comparison of recall rates

| Feature selection | KNN | SVM | AdaBoost | XGBoost |
|---|---|---|---|---|
| Proposed method | 0.3704 | 0.5185 | 0.5370 | 0.6481 |
| TF-IDF | 0.3472 | 0.4259 | 0.6019 | 0.5694 |
| CHI | 0.2639 | 0.4491 | 0.5370 | 0.4722 |
| LDA | 0.3657 | 0.2685 | 0.3843 | 0.3750 |

Table 5. Comparison of F1 values

| Feature selection | KNN | SVM | AdaBoost | XGBoost |
|---|---|---|---|---|
| Proposed method | 0.4533 | 0.5758 | 0.6154 | 0.6635 |
| TF-IDF | 0.4190 | 0.5380 | 0.6205 | 0.5816 |
| CHI | 0.3762 | 0.5301 | 0.5903 | 0.5383 |
| LDA | 0.3807 | 0.3558 | 0.3934 | 0.4112 |

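As a consistency check on Tables 3 to 5, the F1 values can be recomputed from the reported precision and recall. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall, which the page does not state explicitly; the helper names `f1_score` and `max_f1_gap` are illustrative only.

```python
# Values copied from Tables 3, 4, and 5; column order is KNN, SVM, AdaBoost, XGBoost.
PRECISION = {
    "Proposed method": (0.5839, 0.6474, 0.7205, 0.6796),
    "TF-IDF": (0.5282, 0.7302, 0.6404, 0.5942),
    "CHI": (0.6552, 0.6467, 0.6554, 0.6258),
    "LDA": (0.3970, 0.5273, 0.4029, 0.4551),
}
RECALL = {
    "Proposed method": (0.3704, 0.5185, 0.5370, 0.6481),
    "TF-IDF": (0.3472, 0.4259, 0.6019, 0.5694),
    "CHI": (0.2639, 0.4491, 0.5370, 0.4722),
    "LDA": (0.3657, 0.2685, 0.3843, 0.3750),
}
REPORTED_F1 = {
    "Proposed method": (0.4533, 0.5758, 0.6154, 0.6635),
    "TF-IDF": (0.4190, 0.5380, 0.6205, 0.5816),
    "CHI": (0.3762, 0.5301, 0.5903, 0.5383),
    "LDA": (0.3807, 0.3558, 0.3934, 0.4112),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def max_f1_gap():
    """Largest absolute difference between recomputed and reported F1 values."""
    gaps = []
    for method, reported in REPORTED_F1.items():
        for p, r, f1 in zip(PRECISION[method], RECALL[method], reported):
            gaps.append(abs(f1_score(p, r) - f1))
    return max(gaps)


if __name__ == "__main__":
    print(f"largest gap: {max_f1_gap():.5f}")
```

Under this definition, every recomputed value agrees with Table 5 to within rounding of the four-decimal inputs.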
Table 6. Comparison of AUC values

| Feature selection | KNN | SVM | AdaBoost | XGBoost |
|---|---|---|---|---|
| Proposed method | 0.6293 | 0.6995 | 0.7244 | 0.7594 |
| TF-IDF | 0.6079 | 0.6796 | 0.7294 | 0.7023 |
| CHI | 0.6025 | 0.6726 | 0.7087 | 0.6763 |
| LDA | 0.5652 | 0.5833 | 0.5715 | 0.5924 |

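Table 7. Parameter settings

| Component | Parameter | Value |
|---|---|---|
| Feature-word selection | W | 0.5 |
| Feature-word selection | Number of feature words | 150 |
| XGBoost | max_depth | 5 |
| XGBoost | booster | gbtree |
| XGBoost | objective | binary:logistic |
| XGBoost | scale_pos_weight | 3 |
| XGBoost | min_child_weight | 1 |
| XGBoost | learning_rate | 0.1 |
| XGBoost | n_estimators | 100 |
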
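The settings in Table 7 map directly onto keyword arguments of the scikit-learn style wrapper in the `xgboost` package. The sketch below is one possible way to express them; the page does not say which XGBoost interface was used, so the use of `XGBClassifier` is an assumption, and the names `FEATURE_SELECTION`, `XGB_PARAMS`, and `build_classifier` are hypothetical.

```python
# Values copied from Table 7.
FEATURE_SELECTION = {"W": 0.5, "n_feature_words": 150}

XGB_PARAMS = {
    "max_depth": 5,
    "booster": "gbtree",
    "objective": "binary:logistic",
    "scale_pos_weight": 3,   # up-weights the positive (incident) class
    "min_child_weight": 1,
    "learning_rate": 0.1,
    "n_estimators": 100,
}


def build_classifier(params=None):
    """Return an untrained classifier configured with the Table 7 settings.

    Assumes the scikit-learn wrapper XGBClassifier; requires the xgboost package.
    """
    # Imported lazily so the configuration can be inspected without xgboost installed.
    from xgboost import XGBClassifier

    return XGBClassifier(**(XGB_PARAMS if params is None else params))
```

With feature vectors `X_train` and "0-1" labels `y_train` (both placeholders here), training would be `build_classifier().fit(X_train, y_train)`.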