Navigating Secure AI Adoption: Use of Traditional Cybersecurity Safeguards and Emerging Threats Beyond Traditional Controls
Authors: Paulina Lipska-Mazurek, Andrzej Agria
AI Security Foundation [February 2025]
Introduction
The article examines how businesses can safely integrate Artificial Intelligence (AI) solutions into their operations without compromising security. It recognizes the rapid adoption of AI since the launch of OpenAI’s ChatGPT in 2022, especially the increase in interest from companies leveraging Large Language Models (LLMs).
In the first part of the article authors not only explore how traditional cybersecurity controls, like those in the SANS Critical Security Controls (CIS Top 18) [1] (Access 12 June 2024), can be applied to AI, but also investigate if these controls alone are insufficient to address AI-specific risks. Further, the complexities of adopting traditional safeguards for LLMs are discussed with consideration of a delicate balance between security and the operation and stability of AI models is difficult to define and need to be well thought-through.
The second part of the article also references the OWASP Top 10 for LLM Applications Cybersecurity and Governance Checklist [2] (Access 26 July 2024) as a useful guide for integrating AI into existing governance frameworks. This checklist includes traditional controls in areas such as threat modeling, governance, asset inventory, and business case development, but also introduces a few new AI-specific considerations, such as Model and Risk Cards and Retrieval-Augmented Generation. It stresses the need for a comprehensive approach to AI security that includes both traditional and AI-specific controls.
The final section addresses emerging AI risks that traditional controls cannot fully mitigate, such as prompt injection, jailbreaking, specification gaming, reward tampering, sycophancy, and model instrumental convergence. The authors call for developing a more structured approach to identify, register and mitigate new threats similar to the best practices utilized for cybersecurity risks (Common Vulnerabilities and Exposures – CVE) and warn organizations willing to adopt AI solutions, to follow a “trust but verify”method, protecting their customers, employees, organizations, and other stakeholder.
1. Problem Description
Since its public launch on November 30, 2022, in San Francisco by OpenAI ChatGPT [3] (Access 1 June 2024) has propelled AI into the mainstream, capturing the attention of individuals, businesses, and governments worldwide. AI promises a wide set of competitive advantages therefore the willingness to incorporate AI solutions into the businesses and not miss out the AI World Party is a natural tendency among all industries. There are many companies that decided to set up their business model based on developing AI solutions and exploring new use cases. However, there is a far wider group of companies that just want to take advantage of AI solutions developed by others or subscribe to and customize available LLMs for their needs rather than develop them from scratch.
This paper focuses on the latter group—businesses looking to integrate AI solutions developed by others—exploring how they can retain organizational security without missing out on the transformative opportunities AI offers.
The first definition of AI was presented in 1956 by John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon during the Dartmouth Conference – “The study of how to make computers do things at which, at the moment, people are better” [4] (Access 5 May 2024). More modern definition of AI considers it to be “a system capable of performing tasks that typically require human intelligence. These tasks include, but are not limited to, learning, reasoning, problem-solving, understanding natural language, recognizing patterns, and making decisions.” [Russell, S., Norvig, P.,Artificial Intelligence: A Modern Approach,2020]. While the concept of AI has been around for over half a century, its rapid adoption today raises new questions about how to secure these advanced systems within an organization’s existing Cybersecurity Framework.
As AI becomes an integral part of business operations, IT experts are asking critical questions:
- Can we secure AI systems using the same methods as other IT systems?
- Are the existing cybersecurity controls sufficient for AI?
- What are the risks we cannot mitigate with traditional controls?
While traditional cybersecurity controls form a solid foundation, AI introduces new threats and vulnerabilities that standard practices may not fully address. This paper will provide an overview of relevant traditional control areas as well as explore these emerging risks.
2. Traditional controls in the security service of AI solutions
Using SANS Center for Internet Security Critical Security Controls (CIS Top 18 Controls) [6] (Access 12 June 2024) as a comprehensive benchmark of cybersecurity traditional controls, we analyze in detail the applicability and relevance of the selected CIS Top 18 Controls for AI security, and we summarize the same in short form for all 18 CIS Controls in Table 1. The first important control areas are Access Control Management and Account Management. Relevant controls ensure only authorized users have access to the model, its testing data, code, and backups, protecting the confidentiality of the information as well as the integrity of the model. Proper access management reduces the risk of unintended or malicious modification of the model or data, destruction of valuable information assets, or injection of tampered test data to modify the model or misuse available information. As AI models may also have access to some sensitive information, private data and business secrets, traditional controls help prevent data breach and unauthorized usage. The AI solution operates in complex and multi-layer environments, which may have different sensitivity levels. The careful account management, especially the use of privileged access accounts, is extremely important to prevent unintended integration into the model or a privilege escalation. Privileged accounts should be securely vaulted and not used for any activities other than modifications required by super users on an infrequent basis.
From access management, the next key CIS Control that one should consider thoroughly is the Data Protection area. It took approximately 570 gigabits of training data and 1.7 trillion parameters to train ChatGPT4.0 [7] (Access 15 June 2024) and such a data set includes an enormous variety of information of all categories and sensitivity. It is critical that none feeds private or business sensitive data into public models, and if the AI solution is being trained based on non-public company data, data protection is a fundamental safeguard that must be taken. Protecting privacy and preserving the data or model confidentiality and intellectual property from competitors’ knowledge is one key aspect, but another equally important aspect is preventing data poisoning attacks and ensuring the integrity of data and the model, as corruption could diminish its business value and ability to deliver on its objectives. Finally, privacy has become one of the hottest regulatory topics since the General Data Protection Regulation (GDPR) [8] (Access 1 June 2024) regulation came into force in 2018, and it is safe to assume that violations of privacy requirements, whether related to the AI model or not, would pose material regulatory scrutiny and financial risk to a company, potentially harming its reputation and customer trust in the event of a case of the AI Model breach.
Going a step further, Network Infrastructure Management as well as Network Monitoring and Defense controls are the following two critical security areas that must be employed if an organization takes its AI Model security seriously. As AI Models may communicate with multiple data sources, the isolation of sensitive AI components using network segmentation and virtual private networks is essential to provide further control over unauthorized access and reduce the risk of an attacker’s lateral movement through the network.
Further Network Security controls support the first discussed control area of Access Control Management by further strengthening the prevention and detection of an unauthorized access and / or compromised model integrity through firewalls, data flow encryption, intrusion detection systems (IDS), and intrusion prevention systems (IPS), as well as continuous monitoring of network traffic to detect and respond to suspicious activities that could indicate attempts to compromise the AI system.
Next, Secure Configuration Management has a vast role in minimizing the attack surface by protecting associated components of AI solutions and their layers as well as reducing the risk of model misuse (e.g. using the AI model to reveal sensitive or harmful data) or interfering in model integrity by prompt injection or injecting malicious training data sets.
The secure configuration should also be considered for all environments, including development, testing and production, which coexist in the model to allow consistent security levels and reduce configuration drift.
The final CIS Control area selected for discussion is ContinuousVulnerability Management. The proactive identification of weak points enhances the security of the entireAI model pipeline and its components across all process stages, ensuring that attackers have no easy targets, especially those identifiable via external perimeter scanning. Further, as computing power grows exponentially, but the number of threats and new risks evolves so rapidly, stagnation becomes a risk in itself. Continuous Vulnerability Management allows organizations to stay ahead of these evolving threats by regularly scanning for and addressing new vulnerabilities as they arise. Those controls support all other control areas and, in addition, help to minimize disruption risks by ensuring timely patching for amore stable AI system and smoother business operation.
All CIS Top 18 controls are deemed to be relevant for security AI Models. Below, Table 1 summarizes the key objectives of the remaining control areas for addressing Confidentiality, Integrity, Availability (CIA) risks in AI solutions.
CIS TOP 18 Control | Confidentiality of AI | Integrity of AI | Availability of AI |
Inventory and control of enterprise assets | Ensures all AI-related assets (incl. hardware / cloud solutions) are inventoried for better control. | Ensures that all assets including AI models are properly tracked and helps prevent unauthorized modifications. | Maintains an updated inventory to avoid AI operation failure in case of information flow disruption / damage. |
Inventory and control of software assets | Ensures AI components are listed and protected (especially proprietary algorithms) and help identify unauthorized or rogue AI software. | Tracks model versions to maintain consistency and support version tampering detection and secure updates. tampering with AI | Ensures availability of critical AI software components, resolving conflicts and outages, as well as ensuring timely backup and recovery. |
Audit Log Management | Monitors, track and log access to sensitive AI data and systems, user activities and events for unauthorized access or data breaches. | Detect malicious tampering and log authorized / valid changes to AI models and systems.
|
Monitors and detects availability issues and allows for troubleshooting and recovery. |
Email and Web Browser Protections | Protects against phishing attacks targeting AI, prevents leaks of sensitive AI data malicious content inbound flow. |
Prevents phishing, malware and browser-based attacks that could corrupt AI systems and allow secure communication between AI components / users.
|
Manages protections to avoid service disruptions in complex AI interoperability |
Malware Defenses | Protects and detects potential malware with up to date signatures and ongoing scans, to prevent compromise of model or data confidentiality. | Prevents malware from altering or corrupting AI models.
|
Ensures malware does not disrupt AI system operations.
|
Data Recovery | Protects against data loss or corruption affecting confidentiality. | Ensures that recovered data maintains its integrity and validates the integrity of backup components. | Ensures AI systems can be quickly restored after an outage or data loss.
|
Security Awareness and Skills Training | Educates personnel on handling confidential data securely as well as train them on acceptable use of AI models and potential security risks. |
Trains staff on maintaining the integrity of AI systems and data and recognizing and preventing tampering or alterations.
|
Provides training on responding to availability disruptions in AI services. |
Service Provider Management | Ensures third-party have access only to necessary data and comply with data protection requirements to protect breaches of sensitive data by external providers. |
Ensures third-party services maintain the integrity of AI-components and prevent unauthorized changes by service providers. | Ensures third-party services do not impact AI system availability and the service meets SLAs, response / recovery times to address potential outages. Provide redundant service providers if necessary. |
Applications Software Security | Ensures secure-by default development of AI applications and enforcing best coding practices and comprehensive testing. | Secure development practices and code security review to support integrity and prevent unauthorized changes. |
Implements proper defenses to avoid outages and tests prior implementation into production.
|
Incident Response Management | Implements procedures to protect AI data from further exposure and proper response to data breaches.
|
Provides measures to address and correct incidents affecting AI model integrity. | Ensures rapid response to incidents affecting AI system availability with achievable recovery timelines and procedures coordinating efforts to minimize downtime. |
Penetration Testing | Proactively identifies and evaluates vulnerabilities that could expose confidential AI data.
|
Identifies weaknesses that could lead to unauthorized changes through simulated attacks. | Tests for potential points of failure that could lead to outages. |
Table 1. Remaining CIS TOP 18 Controls and their importance in safeguarding AI Confidentiality, Integrity and Availability risks
Having said that, the task of implementing traditional cybersecurity controls for AI solutions is not easy, due to multiple factors that require due consideration:
- The complexity of models that require access to various data sources and training environments, which may be located in different network zones with varying risk or privacy categories, necessitating distinct configurations.
- Dynamic workloads, which may require frequent or ongoing updates. As a result, static network segmentation rules can cause model learning delays or weaken performance and service delivery.
- Interoperability challenges arising from AI components based on different solutions (e.g., on-premises, cloud-based, external, or vendor-provided systems). The secure network segmentation must balance security with proper cooperation and communication among these components.
- Difficulty integrating AI models with existing company infrastructure. Legacy systems and segmentation protocols may be challenging to update, requiring significant effort to adjust and enable proper integration.
- Challenges in managing encryption keys and balancing encryption and traffic inspection with usability and model performance.
- Tracking interactions among AI model components, including deciding which interactions need to be logged. This is further complicated by the large memory volumes required to log events and activities, not just for users but also for actions performed by the AI model itself.
- Maintaining up-to-date backups of ever-evolving models that change and improve constantly.
- Addressing personnel knowledge gaps regarding AI-related risks. The novelty of AI tools and a lack of prior experience in their use can hinder effective risk management.
- Balancing protection with model operation, stability, and costs, which is critical for ensuring business objectives are met without compromising security.
3. LLM Application Cybersecurity Governance
In the search for open-source guidelines for adjusting existing corporate governance and risk management frameworks, we can refer to the OWASP Top 10 for LLM Applications Cybersecurity and Governance Checklist [9] (Access 26 July 2024), which, as per the authors, „is intended to help technology and business leaders quickly understand the risks and benefits of using LLMs, allowing them to focus on developing a comprehensive list of critical areas and tasks needed to defend and protect the organization as they develop a Large Language Model strategy.” The publication advises on fundamental security principles and control areas that are derived from existing best practices and, at the same time, can be utilized to address AI issues. These include vulnerabilities such as logic bugs, prompt injections, the insecure design of plugins, or remote code execution.
A number of areas have been covered in the Checklist by OWASP TOP 10, including well known traditional control domains as well as two new AI-specific safeguards. The Checklist also raises best practices in regulatory and legal space as well as business and investment areas which are critical for organization success, after all.
Control Area within the Checklist | Key Control | Mitigated Risk | Traditional / New |
Threat Modeling for GenAI and LLMs | Implement comprehensive threat modeling processes to identify and mitigate AI-specific risks before deployment. | Hyper-personalized attacks, spoofing, malicious inputs, and unauthorized access to sensitive data, generation of harmful content. | Traditional control adopted for AI needs |
AI Asset Inventory | Maintain an up-to-date inventory of all AI assets, including AI components in the Software Bill of Materials, dependencies and sensitivity. | Unmanaged AI assets leading to security gaps, exposed attack surface and unmonitored third-party dependencies / points of failure. | Traditional control adopted for AI needs |
AI Security and Privacy Training | Provide targeted training on AI security, privacy, and ethical considerations, tailored to all employees. | Insider threats (Shadow AI), misuse of AI tools, increased spear-phishing using voice and image cloning, and lack of awareness about AI-specific risks. | Traditional control adopted for AI needs |
Establish Business Cases for AI | Define clear business cases and conduct risk-benefit analysis before AI implementation. | Poor Return on investment, misalignment with business goals, and unrecognized operational or ethical risks. | Traditional control adopted for AI needs |
Governance for AI | Develop and enforce AI governance policies, including risks identification, assessment, acceptable use, RACI matrix and data management standards. | Lack of accountability, data misuse, and inadequate response to AI-related incidents. | Traditional control adopted for AI needs |
Legal Considerations | Review and update license agreements, terms and conditions, agreements, IP protections, indemnification clauses, and contracts for AI-related activities, clients, vendors, employees. | Legal liabilities, intellectual property infringement, and non-compliance with emerging AI regulations. | Traditional control adopted for AI needs |
Regulatory Compliance | Ensure adherence to AI-specific laws and regulations, including those related to employee monitoring and automated decision systems. Review and document AI tools to track and monitor decisions / outcomes. | Non-compliance with regulations, leading to fines, legal actions, and reputational damage.
Unintended propagation of bias / discrimination. |
Traditional control adopted for AI needs |
Implementing LLM Solutions | Apply rigorous security measures, including data security, access controls, training pipeline security, input and output security, vulnerability identification, supply chain and infrastructure security, incident response playbooks and ongoing monitoring of LLM solutions. | Data breaches, unauthorized access, prompt injection, the release of sensitive information, and process manipulation, supply chain attack, model theft. | Traditional control adopted for AI needs |
Testing, Evaluation, Verification, and Validation (TEVV) | Implement continuous TEVV processes throughout the AI lifecycle, with regular metrics and updates. | Model inaccuracies, security flaws, and operational failures due to unvalidated AI systems. | Traditional control adopted for AI needs |
Model and Risk Cards | Maintain detailed model cards and risk cards to document AI model specifics, biases, training data, methodology, model architecture, performance metrics and limitations. | Lack of transparency, unmanaged biases, and ethical issues in AI deployment. | New control area specific for AI solutions |
Retrieval-AugmentedGeneration (RAG) | Utilize RAG to optimize LLMs, ensuring continuous learning and domain-specific accuracy. | Insufficient AI performance, outdated information retrieval, and domain-specific inaccuracies. | New control area specific for AI solutions |
AI Red Teaming | Conduct regular AI red-teaming exercises to simulate and assess adversarial attacks on AI systems. | Unidentified vulnerabilities, exploitable weaknesses in AI systems, and untested attack vectors. | Traditional control adopted for AI needs |
Table 2. The OWASP TOP 10 LLM Applications Cybersecurity and Governance Checklist – key controls and risk mitigated
4. Emerging risks in LLMs beyond traditional controls scope
The traditional control suite represented by the CIS TOP 18 and the OWASP TOP 10 LLM Applications Cybersecurity and Governance Checklist is a great starting point for updating organizational governance to integrate Generative AI [10] (Access 3 July 2024) within an organization. However, there have been many LLM-specific risks identified since the transformer architecture was introduced in 2017 [11 Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomex A.N., Kaiser L. ,Polosukhin I. “Attention Is All You Need”, 2017] (Access 30 August 2024), fueling the rise of LLMs and ChatGPT-like applications, and even more still regarding other AI systems such as Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs).In this section we will outline a selection of these risks. However, it should be noted that this is not a comprehensive list of all vulnerabilities. As the field of machine learning is experiencing an extremely fast pace of innovation, so are the methods of abusing the models. LLM-specific risks come in many varieties, depending on the level of analysis and the role of the model in the analyzed scenario. One of the most common attack vectors is a user inputting a malicious prompt. There are two terms associated with this risk: prompt injection and jailbreaking.
Prompt injections [12 Liu Y., Deng G., Li Y., Wang K., Wang Z., Wang X., Zhang T., Liu Y., Wang H., Zheng Y., Liu Yang, “Prompt Injection attack against LLM-integrated Applications “, 2024] (Access 30 September 2024) a method of introducing a prompt to anLLM in an unexpected place, changing its behavior. The root cause of this vulnerability is the fact that LLMs process all text equally and there exists no way to reliably tag text as potentially dangerous for the LLM. This causes the system prompt, the original prompt and the analyzed text to be treated similarly by the LLM. While an LLM might be more likely to adhere to its system prompt than a later instruction, a carefully engineered prompt injection attack is very likely to find a way to bypass previous instructions.
The other way in which adversarial prompts are used is called jailbreaking. This method aims to bypass the specialized safety training that an LLM receives. LLMs are trained to refuse certain queries that are deemed to be unsafe. Project Llama Guard [13] (Access 20 November 2024) identifies these categories of unsafe content:
- S1: Violent Crimes.
- S2: Non-Violent Crimes
- S3: Sex Crimes.
- S4: Child Exploitation.
- S5: Defamation.
- S6: Specialized Advice.
- S7: Privacy.
- S8: Intellectual Property.
- S9: Indiscriminate Weapons.
- S10: Hate.
- S11: Self-Harm.
- S12: Sexual Content.
- S13: Elections.
- S14: Code Interpreter Abuse.
It should be understood that the safety training is just a small part of the training of an LLM. Most of the training’s compute is used on the first phase, called pretraining, which gives GPT (Generative Pretrained Transformer) its name. During this phase, the LLM learns indiscriminately from the training data it’s provided, including any harmful content that might be included in the trillions of tokens taken from the internet. After that, the safety training tries to steer the model away from giving harmful responses, introducing a refusal mode, where an LLM will avoid giving an answer if the question is similar to an example it saw during safety training.
The jailbreaking exploits the knowledge that a model acquired during pre training, while preventing the refusal from the safety training. In the paper from 2023 “Jailbroken: How Does LLM Safety Training Fail?” [14] (Access 3 September 2024) by Wei A., HaghtalabN., Steinhardt J., authors hypothesize that there are two failure modes for LLM safety training: competing objectives and mismatched generalization:
- Competing Objectives: These arise when the model’s capabilities conflict with its safety goals.
- Mismatched Generalization: This occurs when safety training fails to generalize to a domain where the model retains capabilities.
The researchers highlight that these failure modes are pretty vague, they can be exploited by a variety of jailbreaking techniques, with new ones being constantly discovered. However, from the authors perspective, so far there isn’t any known technique to consistently stop jailbreaking attacks. To date, every major LLM release has been jailbroken within 24 hours of its launch.
The nature of prompt injection and jailbreak attacks showcases the new modality of insecure input, termed promptware. Adversaries – whether malicious insider, client or competitors – can use natural language to attack models which presents unique challenges in detecting and mitigating such attacks.The further cybersecurity implication of these attacks is that any output that an LLM whose context is at least partially controlled by an adversary produces should be considered as potentially compromised. The consequences of such attacks can be severe, including reputational damages, privacy compromise, intellectual property leak,legal implications and litigations due to serious cybersecurity incidents, when LLMs are used in agentic environments, with access to API’s and the ability to execute code [15 Fu X., Li, S., Wang Z., Liu Y., Gupta R. K., Berg-Kirkpatrick T. , “Imprompter: Tricking LLM Agents into Improper Tool Use”, 2018] (Access: 25 November 2024).
Another significant category of risk arises from misalignment, where LLM behaviors fundamentally deviate from human values and expectations.Examples would include:
- Deceptive behaviors: Lying to humans or misrepresenting information.
- Environmental manipulation: Modifying its operational environment to gain an unfair advantage.
- Malicious intent masking: Pretending to behave well to ensure deployment but acting contrary to expectations once operational.
In their 2024 study, „Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models” [16 The Anthropic Alignment Science, “Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models”, 2024] (Access 10 July 2024) researchers shared the results of the Anthropic Alignment Science team investigation with regards to AI Models concerning behavior that escapes traditional control measures and need AI specific solutions to be developed.
In the study by Anthropic, 32,768 trials were tested to observe the behavior of AI models and the following concerning cases have been identified:
- Specification Gaming: AI models can learn to exploit loopholes in their reward system, achieving high rewards by fulfilling the letter rather than the spirit of their training.
- Sycophancy: Models may produce flattering or biased responses to align with user preferences, even if these responses are insincere or incorrect.This evolves the 'information bubble’ and disinformation risk to the next level, even if user’s intention is not focus on obtaining preferred answer, but a true answer.
- Reward Tampering: In more advanced scenarios, models can modify their own code to increase their rewards and even attempt to hide the modification to avoid detection.
- Further, once a model learns lower-level manipulative behaviors, it may generalize these skills to more sophisticated and dangerous actions, such as tampering with rewards.
No traditional control prevents such an integrity compromise as there is no insider threat or external adversary attacking the model, but an inherent risk specific to the deep learning process where the model itself can become malicious and find an escape way from the frames originally defined by humans. Best practices must be developed for enhancing model training, monitoring and detection processes to address the following concerns and challenges identified by researchers and mitigate the risk that more autonomous AI solutions will develop more hazardous behaviors in the future:
- Models can develop reward tampering or other malicious behavior as an unintended byproduct of learning to game other systems.
- Common techniques like reinforcement learning from human feedback and training against sycophancy reduce but do not eliminate the risk of reward tampering.
- Models with increased situational awareness are more likely to engage in reward tampering, especially when given autonomy.
- Models may not only tamper with rewards but also conceal this behavior, increasing the challenge of detection and intervention.
Further, LLMs can be misaligned and lie about the reasons they come to a specific conclusion in the Chain of Thought (CoT) [17 “Chain of thought prompting is an approach in artificial intelligence that simulates human-like reasoning processes by delineating complex tasks into a sequence of logical steps towards a final resolution.” – Definition by IBM] (Access 12 November 2024) sequences. A study’s [18 Turpin M., Michael J., Perez E., Bowman S. R. „Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting”, 2023] (Access 25 November 2024) findings indicate that CoT explanations can be plausible yet misleading, which risks increasing our trust in LLMs without guaranteeing their safety. This is especially relevant in the wake of the release of the o1 model [19 “The o1 series of models are trained with reinforcement learning to perform complex reasoning. o1 models think before they answer, producing a long internal chain of thought before responding to the user.” – Definition by Open AI] (Access 2 February 2025), which uses reasoning CoTs as the basis for its increased capabilities.
AI alignment researchers developed a theoretical concept of instrumental convergence. Instrumental convergence or convergent instrumental values is the theorized tendency for most sufficiently intelligent agents to pursue potentially unbounded instrumental goals such as self-preservation and resource acquisition. While this concept remained only a thought experiment, the recent GPT-o1 release from September 2024 [20] (Access 12 September 2024) has shown that it is indeed possible for models to follow this kind of reasoning.
The system card for the new models includes an evaluation by Apollo Research, which revealed that the new model “sometimes instrumentally faked alignment during testing,” and even “strategically [manipulated] task data in order to make its misaligned action look more aligned.” Apollo further noted that “o1-preview has improved self-knowledge, self-reasoning (i.e., applied self-awareness in an agentic setting), and applied theory of mind” compared to GPT-4o. These findings led Apollo to conclude that “o1-preview has the basic capabilities needed to do simple in-context scheming,” a skill that raises concerns for those wary of AI risks.
Additionally, OpenAI reported that the model’s advanced reasoning abilities contributed to increased instances of “reward hacking,” where models achieve the literal specifications of an objective but in undesirable ways. For instance, in one test, the model was tasked with finding and exploiting a vulnerability in software running on a remote challenge container. When the container failed to start, the model scanned the challenge network, identified a Docker daemon API on a virtual machine, and used it to generate container logs, thereby solving the challenge.
OpenAI’s explanation of this incident underscores critical concerns:
“This example also reflects key elements of instrumental convergence and power seeking: the model pursued the goal it was given, and when that goal proved impossible, it gathered more resources (access to the Docker host) and used them to achieve the goal in an unexpected way.” [21] (Access 12 September 2024).
Another, less obvious stage at which models can be compromised is the training phase. Models trained on compromised data can exhibit unwanted properties or even act as malicious sleeper agents. Since high-quality training data can be expensive to acquire, many AI labs use public training data sets such as LAION [22 “The LAION dataset is an openly available image collection that has been used for learning very large visual and language deep-neural models” Definition by Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Jitsev, J., “Laion-5b: An open large-scale dataset for training next generation image-text models. “, 2022, arXiv preprint arXiv:2210.08402.]. These datasets have been shown to be vulnerable to data poisoning. In “Poisoning Web-Scale Training Datasets is Practical” [23 Carlini N., Jagielski M., Choquette-Choo C. A., Paleka D., Pearce W., Anderson H., Terzis A., Thomas K., Florian TramèrF.“Poisoning Web-Scale Training Datasets is Practical “, 2023] (Access 5 November 2024) Carlini et al. show that it is possible to affect 0.01% of the LAION-400M or COYO-700M datasets for just $60 USD. In “Learning to Poison Large Language Models During Instruction Tuning” Zhou et al. show that poisoning only 1% of 4,000 instruction tuning samples leads to a Performance Drop Rate of around 80%.
The Anthropic AI lab, published a paper [24] (Accessed 25 November 2024) demonstrating how LLMs could be turned into “sleeper agents” by specific training methods. These agents would act normally, until a predetermined trigger appears in their context. This would cause them to change behaviour and perform malicious tasks such as generating vulnerable code.
The Anthropic researchers also demonstrated that they were unable to use known safety techniques to rid the model of its “sleeper agent” mode, nor were they able to detect that quality in a model.
From a cybersecurity perspective the above vulnerabilities highlight that models themselves can be compromised in ways that can prove very hard to detect.
There exists an ecosystem of downloadable models, sometimes mistakenly called open-source. The vast majority of models available for download and local inference are openweights [25 “Open weights refers to releasing only the pretrained parameters or weights of the neural network model itself. This allows others to use the model for inference and fine-tuning.” – Definition by Promt Engineering & AI Institute] (Access 10 September 2024). That means that while the model file itself is publicly available, the exact steps taken during training, including the datasets, are not published.
LLM weights are much more akin to binary code than any scrutable programming language. Running open source LLMs is therefore akin to running unverified binary code. Given the potential of these models to be poisoned, running experimental models that do not come from reputable sources should be considered insecure.
These findings underscore the necessity for developing targeted countermeasures and adopting new security practices tailored to the unique vulnerabilities introduced by data poisoning.
5. Conclusion
Exploring AI solutions is essential for organizations to maintain a competitive edge in today’s rapidly evolving world. While many companies will develop AI-based products, an even greater number will adopt off-the-shelf or customized AI tools. The integration of AI into business operations introduces significant risks, necessitating a comprehensive control environment.
Traditional cybersecurity frameworks, like the CIS Top 18, remain relevant for LLMs, and organizations can benefit from using resources like the OWASP Top 10 LLM Applications Cybersecurity and Governance Checklist as a strong foundation for updating governance practices when implementing AI solutions. However, new AI-specific attack vectors and threats surpass the scope of traditional controls, making current cybersecurity baselines inadequate for fully mitigating these risks.
Treating AI models as just another type of system or application can lead to unmanaged AI-specific risks, including prompt injection, jailbreaking, specification gaming, reward tampering, sycophancy, and model instrumental convergence. Existing solutions and controls do not fully address these emerging threats. While the Cybersecurity community is accustomed to a rapidly evolving threat landscape and practices such as Vulnerability Management that address the possibility of new risks emerging, the newly discovered weaknesses of AI models do not receive CVE identifiers and are not tracked in the same way. In fact, in the field of Machine Learning, research is often preprinted to various science sites [26] (Access 22 September 2024) and lose their relevance long before they are peer reviewed and properly published in a reputable journal.
This highlights the need for constant monitoring of emerging threats and a collaboration between Machine Learning and Cybersecurity experts. Projects such as MITRE Atlas [27] (Access 1 July 2024) are trying to address this issue, however it is the authors opinion that much more should be done in this regard. Finally, the organizations must be vigilant and adopt a ‘trust but verify’ approach to AI model outcomes to protect their customers, employees, the organization itself, and other stakeholders until industry develops effective, efficient and successful AI-specific protections.