Interpretability

Interpretability refers to the degree to which a human can understand the internal mechanics or decision-making process of a model, system, or algorithm. In machine learning and artificial intelligence, it describes how easily one can explain why a model made a particular prediction or decision, often by linking its behavior to understandable features or rules.
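
For example, a linear model is often considered interpretable because each prediction decomposes exactly into per-feature contributions that a human can inspect. The sketch below illustrates this, assuming scikit-learn is available; the feature names and data are hypothetical and chosen only for illustration.

```python
# A minimal sketch of interpretability via a linear model.
# Assumes scikit-learn; feature names and data are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy dataset with two human-understandable features.
feature_names = ["square_footage", "num_bedrooms"]
X = np.array([[1000, 2], [1500, 3], [2000, 3], [2500, 4]], dtype=float)
y = np.array([200_000, 280_000, 340_000, 420_000], dtype=float)

model = LinearRegression().fit(X, y)

# Because the model is linear, each prediction is the intercept plus
# a per-feature contribution (coefficient * feature value), so one can
# read off *why* the model predicted what it did.
x_new = np.array([1800.0, 3.0])
contributions = model.coef_ * x_new
prediction = model.intercept_ + contributions.sum()

print(f"prediction: {prediction:,.0f}")
for name, contrib in zip(feature_names, contributions):
    print(f"  {name}: {contrib:+,.0f}")
print(f"  intercept: {model.intercept_:+,.0f}")
```

More complex models (deep networks, large ensembles) lack such an exact decomposition, which is why explaining their predictions requires dedicated interpretability techniques.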
  1. Anthropic reports emergent introspective awareness in leading LLMs

    Anthropic researchers report that state-of-the-art language models can recognize and describe aspects of their own internal processing, and, in controlled setups, even steer it, hinting at a nascent form of "introspective...