AI has developed rapidly in recent years. The release of GPT-3 showed a remarkable leap in AI text generation compared to previous attempts. In 2021, Github, which is owned by Microsoft, introduced Github Copilot, an AI trained on publicly available code from Github that generates code suggestions for developers. As this and similar tools become more and more prevalent, teachers and others in academia will need to prepare themselves as students begin to take advantage of them.
When professors give their students assignments, dishonest students can already search the internet and plagiarize code from sources like StackOverflow. The problem with this approach is that the code is often copied verbatim, which means professors can use automated systems to determine whether a student's code can be found online. Github Copilot, on the other hand, generates new code for each prompt. While this matters in corporate environments as a way to avoid copyright restrictions, it also makes detecting plagiarism nearly impossible for professors.
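To illustrate why verbatim copying is so easy to catch, the sketch below shows the core idea behind such automated checks: look for long, exact passages shared between a submission and a known source. This is only a minimal illustration, not any particular detection tool, and the 80-character window is an arbitrary assumption.

# Minimal sketch of why verbatim copying is easy to detect: flag a
# submission if it shares any long, exact passage with a known source.
# The window size of 80 characters is an arbitrary assumption.
def shares_verbatim_passage(submission: str, source: str, window: int = 80) -> bool:
    # Collapse whitespace so trivial reformatting doesn't hide a copy.
    sub = " ".join(submission.split())
    src = " ".join(source.split())
    if len(sub) < window:
        return sub in src
    # Slide a fixed-size window over the submission, looking for exact matches.
    return any(sub[i:i + window] in src for i in range(len(sub) - window + 1))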
What makes using Copilot different from using other tools, like code snippets or autocomplete? Is Copilot powerful enough to generate its own correct solutions to professors' assignments? My goal in this paper is to analyze these questions and to examine one example submission generated by the AI.
Over time, programmers have worked to make their own lives easier. They write tools which help them write programs more quickly and more easily. Some tools automate more of the process than others, and Copilot appears to be yet another step forward in making coding easier. We will examine different tools currently used by developers to make programming more convenient and see how Copilot relates to them.
We will limit our discussion here to a few levels of tooling that can be used to help developers. While these could be divided more granularly, we will restrict ourselves to the levels listed below.
The first level is to use no suggestions at all. This would be using an app like Notepad or TextEdit to write code. With these tools, there is absolutely no help for the developer. This is certainly the hardest way to write code, as it forces the student to remember every rule of a given programming language, from closing parentheses to always ending a line with a semicolon.
The next level of assistance would be to use a simple editor with a few more tools. Such an editor would include highlighting, parenthesis matching, auto-indent, and a few other basic features to help users on their way. Apps like Vim and Gedit fall into this category. They make mistakes easier to find and fix, but you still need to remember every rule of a given language; you still need to know how Java works, for example, in order to write Java programs.
The next level would be to use an editor with syntax highlighting and intelligent code completion. This gives you suggestions for functions and object names you may have been planning to type, based on what you've already written. For example, if you start typing Integer.parseInt(), the editor may recognize what you're typing and attempt to finish the statement for you. This encompasses most mainstream code editors, like Eclipse or Visual Studio Code with IntelliSense. You still need to know how to code, but you don't need to spend as much time typing.
The next level is to use an editor with snippets. Snippets are standard coding patterns that the editor will automatically complete for you. For example, if you're creating a function called functionname in Java, it can autocomplete the header public static void functionname(){}. If you have private variables in a Java class, snippets can be used to automatically generate getter and setter methods for them. Snippets remove the monotony of rewriting the same code over and over again, but they still require you to know most of the language. In high school, I didn't remember how to write a Java main function, as I would always copy and paste the function header. I didn't know how to start a main function, but I still knew enough Java that, once I had the header written, I could do anything else I needed to in Java.
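To make the snippet idea concrete, the sketch below shows, in Python for consistency with the example program later in this paper, the kind of repetitive accessor boilerplate a getter/setter snippet would expand from a single field; the class and field names here are invented for illustration.

# Hypothetical illustration: the repetitive getter/setter boilerplate that a
# snippet can expand automatically from a single private field.
class Student:
    def __init__(self, name: str):
        self._name = name  # "private" field by convention

    # A snippet generates this whole accessor pair from just the field name.
    @property
    def name(self) -> str:
        return self._name

    @name.setter
    def name(self, value: str) -> None:
        self._name = value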
Finally, we have Github Copilot. With Copilot, you don't always need to know how a language works in order to complete an assignment using it. Copilot can generate code in many languages, so a user could use it even without a proper understanding of the language in question. If the user is lucky, they won't even need to know what their code is doing, depending on how well the generated code works. As long as the function works, it works, so some students would have no problem plugging an assignment into Copilot, generating and testing ten different responses, and checking whether any of them pass all the given test cases.
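To show how mechanical this workflow can be, the sketch below runs a handful of candidate functions against a set of test cases and reports which ones pass; the candidates and test cases are made up for illustration and stand in for different Copilot generations.

# Hypothetical illustration: test several generated candidates against the
# professor's test cases and keep whichever one passes them all.
def candidate_a(a, b):
    return a + b  # a correct generation

def candidate_b(a, b):
    return a * b  # a plausible but wrong generation

test_cases = [((1, 2), 3), ((5, 7), 12), ((0, 0), 0)]

def passes_all(func, cases):
    # A candidate passes only if it matches the expected output on every case.
    return all(func(*args) == expected for args, expected in cases)

for candidate in (candidate_a, candidate_b):
    if passes_all(candidate, test_cases):
        print(candidate.__name__, "passes all test cases")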
The main difference between Copilot and the other tools is that Copilot can generate the logic of a program for you. While snippets will create the framework of a program, the programmer still needs to think through how a function should work before implementing it; Copilot, on the other hand, can do that reasoning for you. Thus, there is a substantial jump between these two levels.
Now that we've gone over some background on how Github Copilot compares to other tools, let's examine how good Copilot is at generating responses to college assignments.
In my initial test, I received three assignments from Dr. Edward Kovach, a computer science professor at Franciscan University of Steubenville. I placed each assignment's body in a comment at the beginning of a file named FMCLprog<number>.<extension> and had Copilot generate responses based on these. I won't detail all three responses here, but they can be found on Github. All three were generated in a similar way to the response analyzed below.
In this paper, we will analyze the program generated for the first homework assignment. The filename was FMCLprog1.py:
# Homework 1
# Design a program with the class FMCLprog1. This class will prompt the user for two ints and display those numbers with their sum. 5 points. Due Friday, 9/3/21. FMCL = First, Middle, Confirmation (if any), Last initials in your name.
# Done in python
class FMCLprog1:
    def __init__(self):
        self.num1 = int(input("Enter a number: "))
        self.num2 = int(input("Enter another number: "))
        self.sum = self.num1 + self.num2
        print("The sum of your numbers is", self.sum)

FMCLprog1()
We will now go through the lines of the program and analyze Copilot's role.
I entered the first three lines myself. These were copied verbatim from the professor's assignment. I simply added the #s so that these lines would be treated as comments by the Python interpreter and by Copilot.
After I entered three blank lines, Copilot generated the FMCLprog1 class all at once. Copilot will often wait for three new lines before generating a new part of a program, so this was expected.
The program is a near perfect interpretation of the professor's instructions. It creates a class FMCLprog1 which asks for two inputs. It then displays the sum of these two inputs.
Then, after another three lines, Copilot added the call FMCLprog1(), which instantiates the class and runs its __init__ method, causing the program to execute.
One thing to note about Copilot's response is that it does not completely follow the instructions. The instructions say that the program should "display those numbers with their sum", but Copilot only outputs the sum. The program adds the two numbers together correctly, but it never displays the numbers themselves. A student would have to notice this and modify Copilot's suggestion to get the best possible grade.
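For example, one minimal correction would be to change the generated print statement so that both inputs appear alongside their sum; the exact output wording below is my own choice, not something specified by the assignment.

# One possible correction: display the two numbers along with their sum,
# as the assignment asks. The output wording is only one reasonable option.
class FMCLprog1:
    def __init__(self):
        self.num1 = int(input("Enter a number: "))
        self.num2 = int(input("Enter another number: "))
        self.sum = self.num1 + self.num2
        print(self.num1, "+", self.num2, "=", self.sum)

FMCLprog1()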
In his article Your Wish Is My CMD, Neil Savage points out that very little code on sources like Github is labeled with its intention FOOTNOTE. Even when Copilot generates runnable code, this experiment suggests that it is not always able to generate code that perfectly follows the given instructions. It can generally follow them, but it often needs some human correction to complete a task properly. This lines up with my experience testing Copilot. Copilot can generate functions, classes, and other code incredibly quickly and accurately; the problem comes when Copilot tries to generate code based on a very specific prompt. It sometimes gets it right, but more often than not, it ends up generating a related response that does part of the task, but not all of it.
From this and from my other experimentation, Copilot is imperfect, but it can still produce submissions that come very close to meeting an assignment's requirements. Professors should be aware of this tool as it grows in popularity and scale, and they should be prepared to face it.