Illustration: an AI brain analyzing a phishing email alongside a human analyst's magnifying glass.

AI vs. Human Analysts: Unmasking the Truth Behind Large Language Models' Accuracy in Cybersecurity

"Can AI Truly Replace Human Expertise in Analyzing Phishing Attacks? A Deep Dive into LLMs' Capabilities and Limitations"


Large Language Models (LLMs) have revolutionized various fields with their impressive ability to generate human-quality text and code. While LLMs excel at tasks like composing emails and essays, their capacity for statistically driven descriptive analysis, particularly on user-specific data, remains largely unexplored. This is especially true for users with limited background knowledge who are seeking domain-specific insights.

This article delves into the accuracy of LLMs, specifically Generative Pre-trained Transformers (GPTs), in performing descriptive analysis within the cybersecurity domain. We examine their resilience and limitations in identifying hidden patterns and relationships within a dataset of phishing emails.

We explore whether LLMs can serve as generative AI-based personal assistants that help users with minimal or limited background knowledge in an application domain carry out basic as well as advanced statistical and domain-specific analysis. By comparing LLM-generated results with analyses performed by human cybersecurity experts, we aim to provide a clear understanding of AI's current capabilities and the continued importance of human expertise.

LLMs vs. Human Analysts: A Comparative Analysis of Phishing Email Detection


This study investigates the effectiveness and precision of LLMs in data transformation, visualization, and statistical analysis on user-specific data. Unlike models trained on general datasets, this research focuses on LLMs' ability to analyze data not included in their original training set. It involves descriptive statistical analysis and Natural Language Processing (NLP)-based investigations on a dataset of phishing emails.

The experimental setup involved analyzing phishing emails with both human analysts and an LLM pipeline (GPT-4 orchestrated through LangChain). Human analysts utilized mainstream tools and libraries such as Python and NLTK. The LLM pipeline was tasked with the same analysis, allowing for a direct comparison of performance.
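To give a feel for the LLM side of this setup, the sketch below hands one email to GPT-4 through LangChain and asks for the same descriptive features the analysts compute. The prompt wording, the model identifier string, and the analyze_email helper are illustrative assumptions rather than the study's actual code, and LangChain import paths vary by version.

```python
# Minimal sketch: asking GPT-4 (via LangChain) to describe a phishing email.
# The prompt text and helper function are illustrative assumptions; they are
# not the study's actual setup. Requires `pip install langchain-openai` and
# an OPENAI_API_KEY in the environment.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0)

def analyze_email(subject: str, body: str) -> str:
    """Ask the model for basic descriptive features of one email (hypothetical helper)."""
    prompt = (
        "You are assisting a cybersecurity analyst.\n"
        f"Email subject: {subject}\n"
        f"Email body: {body}\n"
        "Report the subject length, word count, number of verbs and nouns, "
        "and the dominant emotional tone of the message."
    )
    return llm.invoke(prompt).content

print(analyze_email("Your account is suspended!",
                    "Click here immediately to restore access."))
```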
Key areas of comparison include (a compact sketch of the analyst-side workflow follows this list):

- Feature Engineering: LLMs and analysts were compared on their ability to extract key features from email subject lines and bodies, such as length, word count, and verb/noun counts.
- Emotional Affect Analysis: The study assessed the ability of LLMs and human analysts to identify and interpret the emotional tone of phishing emails, utilizing tools like NRCLex.
- Correlation Matrix Generation: Both LLMs and human analysts were tasked with creating correlation matrices to identify relationships between different variables within the phishing email data.
- Temporal Analysis: The research examined how well LLMs and human analysts could identify patterns in phishing email activity over time (e.g., peak months, days, and hours for attacks).
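To make the list above concrete, here is a compact sketch of the analyst-side workflow using Python, NLTK, NRCLex, and pandas. The column names (subject, body, timestamp) and the two sample records are assumptions made for illustration; they do not reflect the study's dataset schema.

```python
# Sketch of the human-analyst pipeline: feature engineering, emotional affect,
# correlation matrix, and temporal analysis. Column names and sample rows are
# illustrative assumptions, not the study's actual dataset.
# Requires: pip install nltk nrclex pandas
import nltk
import pandas as pd
from nrclex import NRCLex

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

emails = pd.DataFrame({
    "subject": ["Urgent: verify your account", "Invoice overdue - act now"],
    "body": [
        "Your account will be locked unless you confirm your password today.",
        "Please pay the attached invoice immediately to avoid legal action.",
    ],
    "timestamp": pd.to_datetime(["2023-03-06 09:15", "2023-03-07 14:40"]),
})

def text_features(text: str) -> dict:
    """Length, word count, and verb/noun counts for one piece of text."""
    tokens = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    return {
        "length": len(text),
        "word_count": len(tokens),
        "verbs": sum(tag.startswith("VB") for tag in tags),
        "nouns": sum(tag.startswith("NN") for tag in tags),
    }

def dominant_emotion(text: str) -> str:
    """Most frequent NRC emotion category in the text (empty string if none)."""
    scores = NRCLex(text).raw_emotion_scores
    return max(scores, key=scores.get) if scores else ""

features = emails["body"].apply(text_features).apply(pd.Series)
features["dominant_emotion"] = emails["body"].apply(dominant_emotion)

# Correlation matrix over the numeric engineered features.
print(features.corr(numeric_only=True))

# Temporal analysis: when do the phishing emails arrive?
print(emails["timestamp"].dt.month.value_counts().sort_index())
print(emails["timestamp"].dt.hour.value_counts().sort_index())
```

On a real dataset, the same feature table would be large enough for the correlation matrix and the month/hour counts to reveal the kinds of temporal and structural patterns the study compares against GPT-4's answers.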
While LLMs demonstrated strong performance in numerical reasoning tasks and feature engineering, they encountered difficulties with domain-specific analysis. For instance, GPT-4 struggled to classify the polarity (sentiment) of phishing emails and to generate correlation matrices from textual data, largely because it could not invoke specialized libraries such as NRCLex within its own environment. These gaps are where human analysts remain essential.
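As a point of reference for the polarity task GPT-4 struggled with, the snippet below scores sentiment with NLTK's VADER analyzer. VADER is not named in the study; it is simply a common lexicon-based stand-in for how an analyst might perform this step.

```python
# Lexicon-based polarity scoring with NLTK's VADER analyzer, shown as one
# conventional baseline for the sentiment step. VADER itself is an assumption
# here; the study does not name the analysts' sentiment tool.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

analyzer = SentimentIntensityAnalyzer()

body = "Your account will be suspended unless you act immediately!"
scores = analyzer.polarity_scores(body)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

# A common convention: compound >= 0.05 is positive, <= -0.05 is negative, else neutral.
if scores["compound"] >= 0.05:
    polarity = "positive"
elif scores["compound"] <= -0.05:
    polarity = "negative"
else:
    polarity = "neutral"

print(polarity, scores)
```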

The Future of AI in Cybersecurity Analysis

LLMs are already transforming cybersecurity analysis, but this study covers only one slice of the picture; how these tools perform across other areas of professional work still needs systematic research. Pairing LLM pipelines with additional domain-specific libraries, for example for emotional affect scoring and correlation analysis, is one promising way to close the gaps identified here.
