Designing a system for Triplet Verification and Extraction by Human Computation through a User-Centered Approach.
Triplets (subject -> predicate/relation -> object) provide a common, machine-readable representation of the meaning of information, which enables the sharing and reuse of data. Triplet extraction by human computation frequently requires domain experts and can be considered a tedious and repetitive task. To overcome these limitations, I developed a game with a purpose that makes the task of triplet verification and extraction entertaining.
In recent years, digitization has led to exponential growth of digital information, mostly made available through the world wide web. In the absence of structure, digital information is very heterogeneous, which hinders the sharing and reuse of data. Humans can interpret this information, but machines cannot capture the semantics of this heterogeneous information since those semantics are not explicitly represented. A common way to represent the semantics of information is by using triplets: a subject linked to an object through a predicate (relation).
The extraction of triplets from text by human computation often requires domain experts and can be considered a tedious and repetitive task. To overcome these limitations, I developed a game with a purpose along with a workflow that supports the extraction of high-quality triplets. The system is divided into two parts: an Information Extraction (IE) engine and a Human Computing (HC) engine. The IE engine handles the automatic extraction of nouns, verbs, and other parts of speech, as well as automatic triplet extraction. The HC engine was put in place to identify relationships between entities and to verify triplets by human computation.
Once the user has uploaded text to our system, the text is processed by a pipeline of Natural Language Processing (NLP) tools. Our method transforms the text into sentences, each provided with a list of tokens. Each token is tagged with a part of speech (POS) and its dependencies. The dependencies between tokens within a sentence give insight into the directed links between words, which can be used, for example, to separate the main clause from a possible sub-clause.
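As an illustration, the pipeline output can be modelled with simple data structures. The source does not name its NLP toolkit, so the class and field names below (`Token`, `Sentence`, `pos`, `head`, `dep`) are hypothetical, not part of the actual system:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Token:
    text: str   # the word itself
    pos: str    # part-of-speech tag, e.g. "NOUN" or "VERB"
    head: int   # index of this token's dependency head within the sentence
    dep: str    # dependency label, e.g. "nsubj" or "dobj"

@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)

    def words_with_pos(self, pos: str) -> List[str]:
        """Words carrying the given POS tag (used later for highlighting)."""
        return [t.text for t in self.tokens if t.pos == pos]

# Example: the sentence "Alice founded Acme."
sentence = Sentence([
    Token("Alice", "NOUN", 1, "nsubj"),
    Token("founded", "VERB", 1, "ROOT"),
    Token("Acme", "NOUN", 1, "dobj"),
])
```

The dependency heads here both point at the verb, mirroring how a main clause can be told apart from a sub-clause by following the directed links.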
Furthermore, I employed Stanford’s Open Information Extraction (Open IE) system in the pipeline to obtain the triplets that can be extracted automatically. Stanford’s Open IE extracts triplets from plain text, so the schema for these triplets does not need to be declared in advance. The extracted triplets are saved to a triplet store so that they are accessible by the HC engine for verification and ready for export for further usage outside our system.
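A minimal sketch of such a triplet store, assuming an in-memory backend (the source does not specify how triplets are persisted, so this class and its export format are illustrative only):

```python
class TripletStore:
    """In-memory store for (subject, predicate, object) triplets."""

    def __init__(self):
        self._triplets = set()  # a set avoids storing duplicate triplets

    def add(self, subject, predicate, obj):
        self._triplets.add((subject, predicate, obj))

    def remove(self, subject, predicate, obj):
        self._triplets.discard((subject, predicate, obj))

    def export(self):
        """Plain-text export for use outside the system."""
        return [f"{s} -> {p} -> {o}" for s, p, o in sorted(self._triplets)]

store = TripletStore()
store.add("Alice", "founded", "Acme")
store.add("Alice", "founded", "Acme")  # duplicate, stored only once
```

Keeping the store behind a small interface like this lets both the IE engine (automatic extraction) and the HC engine (verification) read and write the same triplets.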
The HC engine is responsible for two different tasks, namely triplet verification and triplet extraction. Each extracted relation results in a triplet, which is verified against the sentence from which it was extracted. A decision whether to include a triplet in our IE triplet store is made based on the distribution of votes and majority voting. Triplet extraction is done by selecting a subject, predicate and object from a presented sentence.
1. Pick sentence
To extract a relation, the user picks a document with sentences from the system. The user then focuses on one sentence at a time. By splitting documents up into sentences, I attempt to create a more organized and clear interface. For the task of manual triplet extraction, we do not put strict lexical constraints in place, in contrast to many automatic triplet extraction approaches. Since the imported text does not necessarily adhere to any specific structure, I am cautious about imposing such strict constraints. However, a sentence may contain many words, which leads to a large number of options for a possible relation. To reduce the number of options and to gently steer the focus of the user, we implement the following lexical guidelines:
- When the predicate in our system is active, all the verbs in the given sentence are highlighted;
- When the subject or object in our system is active, all the nouns in the given sentence are highlighted.
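These guidelines amount to a small lookup from the active triplet attribute to the POS tags worth highlighting. The sketch below assumes a tagged sentence as a list of (word, POS) pairs; the tag names and function name are assumptions:

```python
# Which POS tags to highlight for each active triplet attribute.
HIGHLIGHT_TAGS = {
    "predicate": {"VERB"},
    "subject": {"NOUN"},
    "object": {"NOUN"},
}

def words_to_highlight(active, tagged_sentence):
    """Return the words to highlight for the active triplet attribute.

    tagged_sentence: list of (word, pos) pairs for one sentence.
    """
    wanted = HIGHLIGHT_TAGS[active]
    return [word for word, pos in tagged_sentence if pos in wanted]

tagged = [("Alice", "NOUN"), ("founded", "VERB"), ("Acme", "NOUN")]
```

Because these are guidelines rather than constraints, the UI only highlights candidates; the user remains free to select any word.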
2. Return extracted triplet
Since one user has now extracted the triplet, one vote is counted for it (one vote for correct, zero votes for incorrect). The triplet is therefore saved to the IE triplet store and remains there until majority voting indicates that it is incorrect.
3. Receive points
After the triplet is saved, the user receives points for this effort. These points are added to the total number of points the user has collected.
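Steps 2 and 3 can be sketched together as follows. The point value and function name are hypothetical; the source does not state how many points an extraction is worth:

```python
EXTRACTION_REWARD = 10  # hypothetical point value per extracted triplet

def submit_extraction(triplet, user, votes, scores, ie_store):
    """Record the extractor's implicit 'correct' vote, save the triplet,
    and award points to the extracting user."""
    votes.setdefault(triplet, []).append(True)  # one vote for correct
    ie_store.add(triplet)                       # kept until majority voting says otherwise
    scores[user] = scores.get(user, 0) + EXTRACTION_REWARD

votes, scores, ie_store = {}, {}, set()
submit_extraction(("Alice", "founded", "Acme"), "user1", votes, scores, ie_store)
```

The extractor's own vote seeds the tally that later verification votes are counted against.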
4. Triplet verification
Triplet verification is conducted by either agreeing or disagreeing with a given triplet. The triplet verification task contributes to the following goals:
- Cheating detection: Since our system deals with crowdsourced users and tasks that cannot be verified automatically, it becomes difficult to detect cheating users. To overcome this issue, we implement triplet verification using a control group approach. By letting users verify triplets from other users, the system can use majority voting to distinguish between correctly and incorrectly extracted triplets and can take action against cheating users.
- Precision: An extracted triplet is verified by multiple users. Following the idea of wisdom of crowds, triplet verification by multiple users contributes to a high precision of extracted triplets.
- Cheap task: Compared to the task of triplet verification, the task of manual triplet extraction requires more interaction with the system and is therefore considered an expensive task. Triplet verification presents already extracted triplets to the user, minimizing the need for the expensive task of manual triplet extraction.
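The cheating-detection idea above can be sketched as flagging users whose verification votes disagree with the majority outcome unusually often. The 50% threshold and the function name are assumptions, not values from the source:

```python
def flag_cheaters(user_votes, outcomes, threshold=0.5):
    """Flag users who disagree with majority outcomes too often.

    user_votes: {user: [(triplet, vote)]} with vote True/False.
    outcomes:   {triplet: bool} as decided by majority voting.
    """
    flagged = []
    for user, votes in user_votes.items():
        # Only votes on triplets with a decided outcome can be checked.
        decided = [(t, v) for t, v in votes if t in outcomes]
        if not decided:
            continue
        disagreements = sum(1 for t, v in decided if v != outcomes[t])
        if disagreements / len(decided) > threshold:
            flagged.append(user)
    return flagged

outcomes = {"t1": True, "t2": False}
honest = [("t1", True), ("t2", False)]
cheater = [("t1", False), ("t2", True)]
```

This is the control-group principle: triplets whose status the crowd has already settled act as hidden test questions for each individual user.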
5. Crowd verification
After a user has verified a triplet, the vote is submitted to the HC datastore, after which majority voting determines whether the triplet is correct or incorrect.
6. Majority voting
When majority voting finds the triplet incorrect, it is removed from our IE datastore; when it is found correct, it is added to our IE datastore. If the outcome of majority voting is unchanged compared to the previous calculation, only the vote is saved.
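Steps 5 and 6 can be sketched as follows; the function and variable names are illustrative and do not come from the source:

```python
def recompute_status(triplet, votes, ie_store):
    """Re-run majority voting after a new vote and sync the IE datastore.

    votes: all boolean votes for this triplet (True = correct).
    Returns (correct, changed): the new status and whether it flipped.
    """
    correct = sum(votes) > len(votes) / 2
    was_correct = triplet in ie_store
    if correct and not was_correct:
        ie_store.add(triplet)        # outcome flipped to correct: add
    elif not correct and was_correct:
        ie_store.discard(triplet)    # outcome flipped to incorrect: remove
    # If the outcome is unchanged, only the vote itself is saved.
    return correct, correct != was_correct

ie_store = {("Alice", "founded", "Acme")}
status, changed = recompute_status(("Alice", "founded", "Acme"),
                                   [True, False, False], ie_store)
```

Here the extractor's initial vote is outvoted by two verifiers, so the triplet's status flips to incorrect and it leaves the datastore.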
7. Update points
When the status of the triplet has either changed from correct to incorrect or from incorrect to correct, points are withdrawn from or added to the user who initially extracted the triplet.
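Step 7 as a sketch; the reward value is hypothetical and mirrors the amount assumed to be given at extraction time:

```python
EXTRACTION_REWARD = 10  # hypothetical; matches the points awarded at extraction

def update_points(scores, extractor, was_correct, is_correct):
    """Withdraw or (re)award the extractor's points when the status flips."""
    if was_correct and not is_correct:
        scores[extractor] = scores.get(extractor, 0) - EXTRACTION_REWARD
    elif is_correct and not was_correct:
        scores[extractor] = scores.get(extractor, 0) + EXTRACTION_REWARD
    # No status change: points stay as they are.

scores = {"user1": 10}
update_points(scores, "user1", True, False)  # triplet flipped to incorrect
```

Tying points to the crowd's verdict rather than to the act of extraction alone discourages users from submitting arbitrary triplets just to earn points.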
The relation verification UI shows the user's current number of points, the user's position relative to the crowd based on points, a highlighted triplet, and the sentence that gives the triplet its context. The user can either accept or reject the presented triplet. When the user clicks an answer, feedback about the awarded points is shown and the next triplet is loaded. Once all triplets for the given sentence have been verified, the game switches to the task of manual triplet extraction.
To design the triplet extraction interface, we have to decide on the balance between precision and recall. By focusing on precision, we concentrate on relations about which there is little doubt whether they are correct. By focusing on recall, we attempt to extract as many triplets as possible for a given sentence. A focus on recall requires giving users a degree of freedom, so that as many triplets as possible can be extracted. A focus on precision means giving users less freedom and letting them play according to rules, to ensure that triplets are of high quality.
Since the uploaded text in our system does not necessarily follow a grammatical structure, we decided to focus our triplet extraction interface on recall. This focus is expressed by giving the user the freedom to assign any word from a sentence to any of the attributes of a triplet.