Thursday 13 April 2017

education - Use of automated assessment of programming assignments


(This is a reworking of Software for submitting and testing programming assignments.)


First, a bit of background. Many courses in applied mathematics have a programming component, where students are asked to implement algorithms (say, in Matlab) and possibly test them on a given set of interesting data. Although these assignments are a valuable part of the education, they usually receive little love from the students (who, moreover, have rarely received rigorous training in programming). The result is -- with rare exceptions -- lazy hacks at best and "at least it looks like code" (often followed by my favorite, "it worked on my machine") at worst.


So I am thinking about having students submit their programming assignments via automated assessment software. The idea is to give them instant feedback on their (repeated) submissions, with the goal of




  1. saving the TAs from having to check every submission and (if they are generous) inserting all the missing semicola to make the code run, and





  2. trying to increase the students' motivation to do more than the bare minimum by introducing some gamification elements (giving points for the fastest/most accurate code, for example, or a live ranking to encourage resubmitting improved solutions).




Hence my question (which is hopefully relevant to other disciplines as well): Has anybody tried such a thing? Did it actually lead to less work and/or more student involvement? Any hints on what to do, and what to avoid?


(I know there is the VPL module for Moodle, and we have an in-house system that can provide the necessary functionality, so I'm not asking for software recommendations here. That said, if some software provides a specific feature you've used successfully, by all means mention it -- bonus points if it's open source.)



Answer



TL;DR


We did something along these lines for Java-based programming assignments. Students and TAs generally like it. It was a lot of work for us.


(Very) Long Version


I used to teach software engineering at a large public university in Austria. We implemented a similar approach for two courses there: a 400+ student bachelor-level distributed systems course and a 150-student master-level course on middleware and enterprise application engineering. Both courses included large, time-consuming Java-based programming assignments, which students had to solve individually three times per semester.



Traditionally, students would submit their solutions for both courses as Maven projects via our university-wide Moodle system. After the deadline, TAs would download and execute the submissions, and grade manually based on extensive check lists that we provided. There was usually a bit of huffing and puffing among students about this grading. Sometimes, TAs would not understand correct solutions (after looking at dozens of similar programs, your mind tends to get sloppy). Sometimes, different TAs would grade similar programs differently (the sheer size of the courses required some parallelization of grading, and the tasks were complex, so it was impossible to write check lists that covered all possible cases). Sometimes, the assignments were actually under-specified or unclear, and students lost points for simple misunderstandings. Sometimes, applications that worked on the student's machine failed on the TA's machine. Generally, it was hard for students to estimate in advance how many points their submissions would be worth. Given that these two courses are amongst the most difficult and most time-consuming in the entire SE curriculum, this was far from optimal.


Hence, we decided to move to a more automated solution. Basically, we codified our various check lists into a set of hundreds of JUnit test cases, which we gave to our students in source code. Additionally, we held back a smaller set of tests, which were similar but used different test data. The tests also served as a reference specification: if the assignment text did not specify how, e.g., a given component should behave in a borderline case, whatever the tests expected was the required behavior. Our promise to the students was that, if all the tests for a given task passed, and the student did not game our test system (more on this below), not much could go wrong anymore with the grading (minor point deductions were still possible for things impossible to test automatically, but those could not account for more than a small percentage of all points).
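
To give a flavor of what such a codified check-list item might look like, here is a minimal sketch using JUnit 4. The class Task1Handler and its methods are invented for illustration and stand in for whatever the assignment asks students to implement; a held-back variant would look the same but use different test data.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    // Hypothetical example of a grading rule turned into an executable test.
    // Task1Handler is a placeholder for a class the assignment asks students to implement.
    public class Task1PublicTest {

        @Test
        public void emptyRequestIsRejectedWithStatus400() {
            Task1Handler handler = new Task1Handler();
            // The assignment text only says "reject empty requests"; the test pins
            // down the exact expected behavior for this borderline case.
            assertEquals(400, handler.handle("").getStatusCode());
        }

        @Test
        public void wellFormedRequestIsEchoedBack() {
            Task1Handler handler = new Task1Handler();
            assertEquals("hello", handler.handle("hello").getBody());
        }
    }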


EDIT after comments: Note that I said more automated, not fully automated. Human TAs still look at the code. There is no grading Web service or anything like that (this usually does not work, as keshlam correctly notes). At a high level, what we did is provide the requirements not only as written text, but also in the form of formal, executable unit tests. The new process is roughly as follows:



  1. TA downloads submission from Moodle and starts tests (takes about 10 minutes).

  2. While the tests are running, the TA skims the code, does some spot checks, and checks a few things not covered by the tests.

  3. When the tests are done, the TA notes down the results of the tests and their own observations.

  4. If something severe happens (e.g., compile error), the TA sighs, takes a deep breath, and falls back to manually grading (and, likely, complaining a bit in our mailing list).
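
For illustration, the "starts tests" step can be as simple as the following sketch, which uses JUnit 4's JUnitCore API to run the grading test classes for one submission and print a summary. The test class names are placeholders from the hypothetical example above; in practice this would typically be driven through the Maven build, but the idea is the same.

    import org.junit.runner.JUnitCore;
    import org.junit.runner.Result;
    import org.junit.runner.notification.Failure;

    // Hypothetical batch runner: executes the grading test classes for one
    // submission and prints a summary the TA can copy into the grading sheet.
    public class GradingRunner {

        public static void main(String[] args) {
            Result result = JUnitCore.runClasses(Task1PublicTest.class, Task1HiddenTest.class);

            System.out.printf("Tests run: %d, failed: %d, time: %d ms%n",
                    result.getRunCount(), result.getFailureCount(), result.getRunTime());

            for (Failure failure : result.getFailures()) {
                System.out.println("FAILED: " + failure.getTestHeader());
                System.out.println("        " + failure.getMessage());
            }
        }
    }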


Advantages:




  • Students now have a good sense of how many points they will get in the end. No nasty surprises anymore.

  • The TA effort for grading is now much, much lower, maybe 1/3 of the previous time. Hence, they have more time to look at problematic solutions and help students actually produce assignments.

  • The tests, good or bad (more on this later), are the same for everybody. Lenient and strict TAs are much less of a problem.

  • Working solutions now work no matter if the grading TA understands the solution or not.

  • Students get rapid feedback on the quality of their solution, but not so rapid that it makes sense to just program against the tests without thinking. One good side effect of the way we built our tests is that simply executing the tests for many tasks takes 10+ minutes (starting an application server, deploying code, ...; see the sketch after this list). Students need to think before starting the tests, just as they would when testing against, e.g., a real staging server.

  • Students no longer need to waste time writing test applications / test data. Before, one problem that made the student effort in these courses skyrocket was that students not only had to program the assignments themselves, but also needed to write various demo programs / test data sets to test and showcase their solutions. This has become entirely obsolete with the automated test framework, significantly reducing the amount of boilerplate code that students need to write. In the end, we can now focus more on the actual content of the labs, and not on writing stupid test code.

  • All in all, our personal impression as well as student evaluations have shown that the majority of students appreciates the automated test system.
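
As an aside on the slow-test point above: the 10+ minutes come mostly from class-level setup that boots the required infrastructure before any assertion runs. The sketch below shows the general shape, with an embedded Jetty server standing in, purely for illustration, for the much heavier application server and deployment steps involved in the actual course.

    import org.eclipse.jetty.server.Server;
    import org.junit.AfterClass;
    import org.junit.Assert;
    import org.junit.BeforeClass;
    import org.junit.Test;

    // Illustration only: expensive infrastructure is started once per test class,
    // which is where most of the wall-clock time of a full test run goes.
    public class Task3IntegrationTest {

        private static Server server;

        @BeforeClass
        public static void startServer() throws Exception {
            server = new Server(8080); // in the real setup: boot the app server and deploy the student's code
            server.start();
        }

        @AfterClass
        public static void stopServer() throws Exception {
            server.stop();
        }

        @Test
        public void serverIsUp() {
            // placeholder assertion; the real tests exercise the deployed components
            Assert.assertTrue(server.isStarted());
        }
    }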


Problems:




  • The initial effort for us was certainly non-trivial. Coming up with the first version of the tests required a concerted effort of 6 or more TAs and 2 junior faculty over multiple months during the summer (not full-time, of course, but still). Note that this was even though we used a widely adopted, standard testing framework - the effort was in codifying our grading rules, finding and defining all corner cases, and writing everything down as useful tests. Further, as the assignments are complex distributed systems, so were the tests - we had to start application servers in-memory, hot-deploy and undeploy code, and make sure that all of this works out of the box on all major operating systems. All in all, writing the tests was an actual project that required a lot of leadership and time commitment.

  • While, overall, our total time spent arguing about grading has decreased, it is not zero. Most importantly, as students can now see the grading guidelines clearly written down in source code, they seem to feel more compelled to question requirements and grading decisions than before. Hence, we now get a lot of "But why should I do it like that? Doing it in X-Y-other-way would be much better!" questions and complaints (incidentally, in practically all cases, the "much better" way is also much less work / much easier for the student).

  • While we were able to cover most of our original assignments, some things are impossible to cover in tests. In these cases, TAs still need to grade manually. In addition, in order to make our test framework technically work, students are now a bit more restricted in how they can solve the assignments than before.

  • Some students feel compelled to try to game our grading system. They spend significant effort on finding solutions that do not actually solve the assignments but still get full points. In general, these efforts fail, as we have pretty sophisticated backend tests that not only check whether the behavior is correct at the interface level, but also look "under the hood" (see the sketch after this list). Occasionally, they succeed, leading us to improve our tests.

  • Initially, we received some flak from a few students for a small number of bugs in our tests. The bugs were easy to fix and mostly not particularly severe, and most students understood that if you roll out a couple of thousand lines of Java code for the first time, bugs will happen. However, some took the opportunity to complain. A lot.

  • We had the impression that the number of copied solutions (plagiarism) was up the first time we rolled out the automated tests. Note that we had automated plagiarism checks long before, and nothing had changed in this regard, but apparently students assumed that cheating would go unnoticed with the automated testing.
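
To illustrate what looking "under the hood" can mean, here is a minimal sketch. Everything in it (the caching requirement, CountingDataStore, Task2Service) is invented for illustration; the point is simply that a correct-looking result at the interface level is not enough, and the test also verifies that the implementation behaves internally as the assignment demands.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    // Hypothetical "under the hood" test: a hard-coded fake could return the
    // right answer, so the test also checks how the backing store was used.
    public class Task2BackendTest {

        @Test
        public void repeatedLookupsHitTheCacheNotTheStore() {
            CountingDataStore store = new CountingDataStore(); // test double that counts queries
            Task2Service service = new Task2Service(store);    // class the students implement

            assertEquals("Alice Example", service.lookup("alice"));
            assertEquals("Alice Example", service.lookup("alice"));
            assertEquals("Alice Example", service.lookup("alice"));

            // Interface-level behavior alone would also pass for a non-caching
            // solution; the internal check below is what makes gaming harder.
            assertEquals("only the first lookup may hit the store", 1, store.queryCount());
        }
    }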

