How to identify AI-generated text. Adversarial attacks
OpenAI’s generated text detection. Play with Google’s text AI model.
With the increasing popularity of ChatGPT and GPT-3, more and more text-generation services are appearing. Although these tools are very powerful, sometimes it is essential to know whether the text you are reading was written by a human or generated. This is especially important in education, where teachers expect essays to be written by students. It is also valuable for search engines to be able to distinguish human-created text from AI-generated text.
For this reason, OpenAI has released a model that can identify text generated by artificial intelligence. It uses a pre-trained version of GPT that predicts one of five labels: "very unlikely", "unlikely", "unclear if it is AI-generated", "possibly AI-generated", or "likely AI-generated". In other words, OpenAI's model is a classifier. Almost every classifier can be deceived.
There exists a field of research called Adversarial Attacks. Usually, an adversarial attack is a process of deceiving an existing classifier (or another type of model).
Let's consider an example. Imagine there exists a classifier to which you don’t have access. Say it distinguishes different animal breeds in photos (dogs, cats, etc.). The only thing you can change is the input image.
Let’s assume that you have an image of a cat, and for some reason you want this classifier to predict the class "dog" for your image. To perform such an adversarial attack, people typically craft a small perturbation that pushes the model toward the "dog" class — for example, by using the gradients of another neural network they do have access to — and add these numbers to the pixel values of the cat photo. The change is usually invisible to a human, but if everything is done correctly, the attacked neural network will start predicting the "dog" class for the photo of a cat.
Adversarial attack example
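To make this concrete, here is a minimal sketch of a gradient-sign (FGSM-style) attack. Everything here is a toy of my own construction: the "classifier" is a tiny logistic regression over made-up weights, not a real image model, but the mechanics — step each pixel in the direction that increases the target-class score — are the same idea.

```python
import numpy as np

# Toy stand-in for an image classifier: logistic regression over
# flattened "pixels". The weights are made up for illustration.
rng = np.random.default_rng(0)
w = rng.normal(size=16)          # hypothetical model weights
b = 0.0

def predict_dog_prob(x):
    """Probability that input x is a 'dog' under the toy model."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

# A "cat" image: an input the model confidently labels as not-dog.
x_cat = -w / np.linalg.norm(w)   # points away from the 'dog' direction

def fgsm_attack(x, epsilon):
    """Nudge every pixel in the direction that raises the 'dog' score."""
    p = predict_dog_prob(x)
    grad = p * (1 - p) * w       # d(prob)/dx for logistic regression
    return x + epsilon * np.sign(grad)

x_adv = fgsm_attack(x_cat, epsilon=0.5)
print(predict_dog_prob(x_cat))   # low: classified as 'cat'
print(predict_dog_prob(x_adv))   # much higher: pushed toward 'dog'
```

In a real black-box attack you cannot compute this gradient directly, so people estimate it on a substitute network and rely on the perturbation transferring to the target model.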
While there exist methods for attacking existing neural networks, there are also methods for defending against such attacks. I even remember data science challenges from the world's top AI conference, NeurIPS, about attacking classifiers and defending them.
How does this relate to generated text?
Well, besides the OpenAI model, we now see new methods that help detect whether a text was generated. For example, a recently released scientific paper describes how to add watermarks to generated text so that the content can be detected without a neural network.
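The watermarking idea can be sketched roughly like this. Note that this is my own heavily simplified toy, not the paper's actual algorithm: imagine the generator is biased toward "green" next words, where greenness is determined by a hash seeded with the previous word. A detector then just counts the fraction of green transitions — no neural network needed.

```python
import hashlib

def is_green(prev_word, word):
    """Toy rule: a word is 'green' after prev_word if their joint
    hash is even. A watermarking generator would prefer such words."""
    h = hashlib.sha256(f"{prev_word}|{word}".encode()).digest()
    return h[0] % 2 == 0

def green_fraction(text):
    """Fraction of word transitions that land on the green list."""
    words = text.lower().split()
    if len(words) < 2:
        return 0.0
    hits = sum(is_green(a, b) for a, b in zip(words, words[1:]))
    return hits / (len(words) - 1)
```

Ordinary human text should score near 0.5 on average, while text from a generator that deliberately picks green words scores much higher, so the detector can flag text whose green fraction is statistically far above chance.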
This may lead to the emergence of new methods for attacking such classifiers. People who use generated content may want it to be indistinguishable from real text, not only to humans but also to AI detectors. Methods for text attacks already exist in the form of scientific papers and GitHub repositories. However, I expect new services to appear that either generate indistinguishable text or rewrite existing generated text, removing all hidden watermarks.
What should one do about it?
There are several potential directions to explore:
Developing new services for education that detect whether an essay or thesis was generated.
Creating plugins for anti-plagiarism services with generated text detection capabilities.
Creating services that modify existing generated text to make it indistinguishable from human-written text.
News of the week
Google is set to showcase a version of its search engine with AI chatbot capabilities and introduce 20 new products, after ChatGPT's launch triggered a "code red" inside the company.
If you would like to participate in the testing of Google’s AI models for text generation, you can join Google’s AI Test Kitchen. They grant access to their AI model LaMDA (Language Model for Dialogue Applications).
Tool of the week
Cleanvoice is an artificial-intelligence platform designed to remove filler sounds, stuttering, and mouth noises from podcasts and other audio recordings.
Thank you for reading my newsletter. I would greatly appreciate any feedback you have. Additionally, if there are any topics you would like to see covered, please let me know. You can just reply to this email.