Volunteer contributions are a potentially powerful tool for performing scientific tasks like image classification that are difficult for computers yet easy for humans. Volunteers can be motivated to take part in image classification campaigns by “gamification,” which turns the tasks into a fun activity. Volunteer input may be particularly valuable for tasks such as land cover mapping where automated classification schemes struggle to attain high levels of accuracy. The effective use of volunteers requires an understanding of how well individual volunteers perform the tasks assigned to them. Previous work on this topic has assumed that individual tasks are relatively uniform in terms of difficulty. We investigate the impact of tasks of non-uniform difficulty and, in particular, of extreme difficulty on quality assessment. Understanding these interrelated factors can ultimately inform better game design and incentives for participation.
We assessed different measures of work quality and task difficulty in a dataset from the “Cropland Capture” game, which contains over 4.5 million classifications of 165,000 images by about 2,700 volunteers. The game has a simple mechanic: users are shown an image, either a satellite image or a ground-based photograph, and asked whether or not it contains cropland. If unsure, they can respond “maybe.” Volunteers were occasionally shown images they had previously rated, as a test of their consistency. To provide an external reference, 342 images from the game were validated by land cover classification experts. Together, these data allow calculation of three metrics of user quality: i) agreement with the majority vote of other volunteers; ii) self-agreement on previously rated images; and iii) agreement with experts.
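The first and third of these metrics can be sketched in a few lines of code. The snippet below is an illustration only: the record format, identifiers, and toy data are hypothetical, not the actual Cropland Capture schema. Self-agreement would be computed analogously, by comparing a volunteer's repeated ratings of the same image.

```python
from collections import Counter

# Hypothetical records: (volunteer_id, image_id, rating), rating in {"yes", "no", "maybe"}.
ratings = [
    ("v1", "img1", "yes"), ("v2", "img1", "yes"), ("v3", "img1", "no"),
    ("v1", "img2", "no"),  ("v2", "img2", "no"),  ("v3", "img2", "maybe"),
]
# Expert validations, available for only a small subset of images.
expert = {"img1": "yes", "img2": "no"}

def majority_vote(image_id):
    """Most common rating for an image across all volunteers."""
    votes = Counter(r for _, img, r in ratings if img == image_id)
    return votes.most_common(1)[0][0]

def agreement_with_majority(volunteer_id):
    """Fraction of a volunteer's ratings that match the majority vote."""
    own = [(img, r) for v, img, r in ratings if v == volunteer_id]
    hits = sum(1 for img, r in own if r == majority_vote(img))
    return hits / len(own)

def agreement_with_experts(volunteer_id):
    """Fraction of a volunteer's ratings on expert-validated images
    that match the expert classification."""
    own = [(img, r) for v, img, r in ratings
           if v == volunteer_id and img in expert]
    if not own:
        return None  # volunteer saw no expert-validated images
    return sum(1 for img, r in own if r == expert[img]) / len(own)
```

On the toy data, volunteer v1 agrees with both the majority and the experts on every image, while v3 agrees with neither, illustrating how the two metrics can be compared per volunteer.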
While many methods assume that the majority vote yields the correct classification of an image, comparison of expert and volunteer classifications shows that, at least for identification of cropland, this is frequently untrue. Agreement with other volunteers and self-agreement consistently overestimate user quality compared with the gold standard of expert validations. Examination of image-specific rates of agreement with expert validations reveals that this problem is due to certain images being extremely difficult for volunteers to classify correctly. Unfortunately, comparing volunteers on the basis of their expert-based error rates was not possible. These error rates had broad confidence intervals because most volunteers classified very few of the images that were also seen by experts. This situation is an inevitable result of the random assignment of images to volunteers and a total number of images several orders of magnitude greater than experts can feasibly validate.
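The width of those confidence intervals can be made concrete with a standard interval for a proportion. The sketch below uses the Wilson score interval (a common choice for binomial proportions; the paper itself does not specify which interval was used) to show how an agreement rate estimated from only a handful of expert-validated images is far less informative than one estimated from hundreds.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score 95% CI for a proportion, e.g. a volunteer's
    rate of agreement with expert classifications."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate is completely unconstrained
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# A volunteer who agreed with experts on 4 of 5 validated images:
lo5, hi5 = wilson_interval(4, 5)
# Versus the same 80% agreement rate observed over 200 validated images:
lo200, hi200 = wilson_interval(160, 200)
```

With n = 5 the interval spans more than half the unit range, so an observed 80% agreement rate is compatible with anything from a poor to a near-perfect volunteer; with n = 200 the interval shrinks to roughly ±6 percentage points. This is why, with expert overlap spread thinly across thousands of volunteers, per-volunteer error rates could not be meaningfully compared.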
These results suggest several lessons for game design for citizen science tasks. Most importantly, a core set of expert-validated tasks should be prepared before volunteer classifications are accepted. These tasks should be chosen with care to represent a range of task difficulties. They can then be used in both the training of volunteers and in the evaluation of their performance. A common set of tasks performed by all volunteers would allow a much more robust comparison of the quality of their work than was possible in the Cropland Capture game. These lessons are already being implemented in the next game for which design and testing are now under way.
Last edited: 17 February 2015
International Institute for Applied Systems Analysis (IIASA)