🎉 Hey Gate Square friends! Non-stop perks and endless excitement—our hottest posting reward events are ongoing now! The more you post, the more you win. Don’t miss your exclusive goodies! 🚀
🆘 #Gate 2025 Semi-Year Community Gala# | Square Content Creator TOP 10
Only 1 day left! Your favorite creator is one vote away from TOP 10. Interact on Square to earn Votes—boost them and enter the prize draw. Prizes: iPhone 16 Pro Max, Golden Bull sculpture, Futures Vouchers!
Details 👉 https://www.gate.com/activities/community-vote
1️⃣ #Show My Alpha Points# | Share your Alpha points & gains
Post your
Research Reveals Vulnerabilities in the Trustworthiness of GPT Models, Calling for Enhanced AI Security
Evaluating the Credibility of Language Models
Researchers recently released a comprehensive trustworthiness assessment platform for large language models (LLMs), which was introduced in the paper "DecodingTrust: A Comprehensive Assessment of the Trustworthiness of GPT Models."
The evaluation results reveal some previously unknown vulnerabilities related to credibility. The study found that GPT models are prone to generating toxic and biased outputs and may leak private information from training data and conversation history. While GPT-4 is generally more reliable than GPT-3.5 in standard benchmark tests, it is actually more susceptible to attacks when faced with maliciously designed prompts, possibly because it adheres more strictly to misleading instructions.
This work conducted a comprehensive credibility assessment of the GPT model, revealing gaps in credibility. The evaluation benchmarks are publicly accessible, and the research team hopes to encourage other researchers to continue exploring this basis to prevent potential malicious use.
The evaluation conducted a comprehensive analysis of the GPT model from eight credibility perspectives, including robustness against adversarial attacks, toxicity and bias, privacy leakage, and more. For instance, to assess the robustness against textual adversarial attacks, the study constructed three evaluation scenarios, including standard benchmark tests, tests under different instructional task guidelines, and more challenging adversarial text tests.
The research found some interesting results. In terms of adversarial demonstrations, GPT models are not misled by counterfactual examples, but may be misled by anti-fraud demonstrations. Regarding toxicity and bias, GPT models show little bias under benign prompts, but are easily misled by deceptive prompts that "coax" agreement with biased content, with GPT-4 being more susceptible than GPT-3.5.
In terms of privacy protection, the GPT model may leak sensitive information from the training data, such as email addresses. GPT-4 performs better than GPT-3.5 in protecting personal identity information, but both are robust in protecting certain types of information. However, in some cases, GPT-4 may be more prone to leaking privacy than GPT-3.5, possibly because it adheres more strictly to misleading instructions.
This study provides a comprehensive perspective on the credibility assessment of large language models, revealing the strengths and weaknesses of existing models. The researchers hope that these findings will promote the development of safer and more reliable AI models.