Model Safety

Model safety refers to the set of practices, principles, and safeguards designed to ensure that an artificial intelligence (AI) or machine learning model operates reliably, ethically, and without causing unintended harm. It involves preventing harmful outputs, misuse, bias, or unsafe behavior, as well as ensuring robustness, interpretability, and compliance with relevant regulations and values throughout the model’s design, training, deployment, and monitoring phases.
  1. Longer AI Reasoning Makes Models Easier to Jailbreak, Study Finds

    Longer AI Reasoning Makes Models More Vulnerable to Jailbreaks, Researchers Warn. A new joint study by Anthropic, Stanford University, and the University of Oxford challenges one of the core assumptions in modern AI safety: that extending a model’s reasoning time makes it harder to exploit...