Illustration: an AI brain analyzing a phishing email alongside a human analyst's magnifying glass.

AI vs. Human Analysts: Unmasking the Truth Behind Large Language Models' Accuracy in Cybersecurity

"Can AI Truly Replace Human Expertise in Analyzing Phishing Attacks? A Deep Dive into LLMs' Capabilities and Limitations"


Large Language Models (LLMs) have revolutionized various fields with their impressive ability to generate human-quality text and code. While LLMs excel at tasks like composing emails and essays, their capacity for statistically driven descriptive analysis, particularly on user-specific data, remains largely unexplored. This is especially true for users with limited background knowledge who are seeking domain-specific insights.

This article delves into the accuracy of LLMs, specifically Generative Pre-trained Transformers (GPTs), in performing descriptive analysis within the cybersecurity domain. We examine their resilience and limitations in identifying hidden patterns and relationships within a dataset of phishing emails.

We explore whether LLMs can serve as generative AI-based personal assistants that help users with minimal or limited background knowledge in an application domain carry out basic as well as advanced statistical and domain-specific analysis. By comparing LLM-generated results with analyses performed by human cybersecurity experts, we aim to provide a clear understanding of AI's current capabilities and the continued importance of human expertise.

LLMs vs. Human Analysts: A Comparative Analysis of Phishing Email Detection


This study investigates the effectiveness and precision of LLMs in data transformation, visualization, and statistical analysis on user-specific data. Unlike models trained on general datasets, this research focuses on LLMs' ability to analyze data not included in their original training set. It involves descriptive statistical analysis and Natural Language Processing (NLP)-based investigations on a dataset of phishing emails.

The experimental setup involved analyzing phishing emails with both human analysts and an LLM pipeline (GPT-4 orchestrated through LangChain). Human analysts utilized mainstream tools and libraries such as Python and NLTK. The LLM pipeline was tasked with the same analysis, allowing for a direct comparison of performance.
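To give a feel for the LLM side of this setup, the sketch below hands one email to GPT-4 through LangChain and asks for the same descriptive features the analysts compute. The prompt wording, the model identifier string, and the analyze_email helper are illustrative assumptions rather than the study's actual code, and LangChain import paths vary by version.

```python
# Minimal sketch: asking GPT-4 (via LangChain) to describe a phishing email.
# The prompt text and helper function are illustrative assumptions; they are
# not the study's actual setup. Requires `pip install langchain-openai` and
# an OPENAI_API_KEY in the environment.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0)

def analyze_email(subject: str, body: str) -> str:
    """Ask the model for basic descriptive features of one email (hypothetical helper)."""
    prompt = (
        "You are assisting a cybersecurity analyst.\n"
        f"Email subject: {subject}\n"
        f"Email body: {body}\n"
        "Report the subject length, word count, number of verbs and nouns, "
        "and the dominant emotional tone of the message."
    )
    return llm.invoke(prompt).content

print(analyze_email("Your account is suspended!",
                    "Click here immediately to restore access."))
```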
Key areas of comparison include (a compact sketch of the analyst-side workflow follows this list):

- Feature Engineering: LLMs and analysts were compared on their ability to extract key features from email subject lines and bodies, such as length, word count, and verb/noun counts.
- Emotional Affect Analysis: The study assessed the ability of LLMs and human analysts to identify and interpret the emotional tone of phishing emails, utilizing tools like NRCLex.
- Correlation Matrix Generation: Both LLMs and human analysts were tasked with creating correlation matrices to identify relationships between different variables within the phishing email data.
- Temporal Analysis: The research examined how well LLMs and human analysts could identify patterns in phishing email activity over time (e.g., peak months, days, and hours for attacks).
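To make the list above concrete, here is a compact sketch of the analyst-side workflow using Python, NLTK, NRCLex, and pandas. The column names (subject, body, timestamp) and the two sample records are assumptions made for illustration; they do not reflect the study's dataset schema.

```python
# Sketch of the human-analyst pipeline: feature engineering, emotional affect,
# correlation matrix, and temporal analysis. Column names and sample rows are
# illustrative assumptions, not the study's actual dataset.
# Requires: pip install nltk nrclex pandas
import nltk
import pandas as pd
from nrclex import NRCLex

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

emails = pd.DataFrame({
    "subject": ["Urgent: verify your account", "Invoice overdue - act now"],
    "body": [
        "Your account will be locked unless you confirm your password today.",
        "Please pay the attached invoice immediately to avoid legal action.",
    ],
    "timestamp": pd.to_datetime(["2023-03-06 09:15", "2023-03-07 14:40"]),
})

def text_features(text: str) -> dict:
    """Length, word count, and verb/noun counts for one piece of text."""
    tokens = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    return {
        "length": len(text),
        "word_count": len(tokens),
        "verbs": sum(tag.startswith("VB") for tag in tags),
        "nouns": sum(tag.startswith("NN") for tag in tags),
    }

def dominant_emotion(text: str) -> str:
    """Most frequent NRC emotion category in the text (empty string if none)."""
    scores = NRCLex(text).raw_emotion_scores
    return max(scores, key=scores.get) if scores else ""

features = emails["body"].apply(text_features).apply(pd.Series)
features["dominant_emotion"] = emails["body"].apply(dominant_emotion)

# Correlation matrix over the numeric engineered features.
print(features.corr(numeric_only=True))

# Temporal analysis: when do the phishing emails arrive?
print(emails["timestamp"].dt.month.value_counts().sort_index())
print(emails["timestamp"].dt.hour.value_counts().sort_index())
```

On a real dataset, the same feature table would be large enough for the correlation matrix and the month/hour counts to reveal the kinds of temporal and structural patterns the study compares against GPT-4's answers.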
While LLMs demonstrated strong performance in numerical reasoning tasks and feature engineering, they encountered difficulties with domain-specific analysis. For instance, GPT-4 struggled to classify the polarity (sentiment) of phishing emails and to generate correlation matrices from textual data, largely because it could not invoke specialized libraries such as NRCLex within its own environment. These gaps are where human analysts remain essential.
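As a point of reference for the polarity task GPT-4 struggled with, the snippet below scores sentiment with NLTK's VADER analyzer. VADER is not named in the study; it is simply a common lexicon-based stand-in for how an analyst might perform this step.

```python
# Lexicon-based polarity scoring with NLTK's VADER analyzer, shown as one
# conventional baseline for the sentiment step. VADER itself is an assumption
# here; the study does not name the analysts' sentiment tool.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

analyzer = SentimentIntensityAnalyzer()

body = "Your account will be suspended unless you act immediately!"
scores = analyzer.polarity_scores(body)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

# A common convention: compound >= 0.05 is positive, <= -0.05 is negative, else neutral.
if scores["compound"] >= 0.05:
    polarity = "positive"
elif scores["compound"] <= -0.05:
    polarity = "negative"
else:
    polarity = "neutral"

print(polarity, scores)
```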

The Future of AI in Cybersecurity Analysis

LLMs are already transforming cybersecurity analysis, but this study covers only one slice of the picture; how these tools perform across other areas of professional work still needs systematic research. Pairing LLM pipelines with additional domain-specific libraries, for example for emotional affect scoring and correlation analysis, is one promising way to close the gaps identified here.
