CTI Dataset Construction from Telegram

Arikkat, Dincy R., T., Sneha B., Nicolazzo, Serena, Nocera, Antonino, P., Vinod, A., Rafidha Rehiman K., R, Karthika

arXiv.org Artificial Intelligence 

Cyber Threat Intelligence (CTI) has become indispensable for security analysts, enabling them to identify, collect, manage, and disseminate information on vulnerabilities and attacks, and to respond proactively to emerging threats [6]. Within the CTI lifecycle, data collection, which draws on sources such as security alerts and threat intelligence reports from the web, is a critical foundational stage [3]. One challenge is that not all threat intelligence is published in standard CTI databases or integrated into commercial security platforms. Valuable CTI is often disseminated through unstructured channels such as blogs, social media posts, or reports from security companies and independent experts. To capture these dispersed insights, multiple online sources can be leveraged as early signals of emerging cyber threats. Information gathering thus becomes the first and most critical step, enabling the collection of relevant data on newly discovered vulnerabilities, active exploits, security alerts, threat intelligence reports, and security tool configurations. Curating CTI datasets requires addressing key challenges, including sourcing data from heterogeneous streams, ensuring data reliability, preserving privacy, and mitigating bias. A well-designed CTI dataset not only accelerates the advancement of automated threat intelligence systems but also strengthens global cyber defense capabilities through knowledge sharing and standardized evaluation frameworks. While platforms like Twitter [20] have been widely explored for their CTI potential, other communication ecosystems remain underexamined.