<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>Copy Cats</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15" />
<link rel="stylesheet" href="styles.css" />
</head>
<body>
<div id="conteneur">
<div id="header">Copy Cats: Plagiarism Detection</div>
<div id="haut">
<ul class="menuhaut">
<li><a href="index.html">Home</a></li>
<li><a href="project.html">Project</a></li>
<li><a href="about_us.html">About the Copy Cats</a></li>
<li><a href="code.html">Code</a></li>
</ul>
</div>
<div id="centre">
<h1>Plagiarism Detection: A Brief Overview</h1>
For a more detailed report on our project, please see our paper <a href="plagcomps.pdf">here</a>.
<br /><br />
A motivating example -- consider these three paragraphs from <em>Harry Potter and the Sorcerer's Stone</em>:
<br /><br />
<p>
"But on the edge of town, drills were driven out of his mind by something else. As he sat in the usual morning traffic jam, he couldn't help noticing that there seemed to be a lot of strangely dressed people about. People in cloaks. Mr. Dursley couldn't bear people who dressed in funny clothes -- the getups you saw on young people! He supposed this was some stupid new fashion. He drummed his fingers on the steering wheel and his eyes fell on a huddle of these weirdos standing quite close by. They were whispering excitedly together.
</p>
<p>
Mr. Dursley was enraged to see that a couple of them weren't young at all; why, that man had to be older than he was, and wearing an emerald-green cloak! The nerve of him! But then it struck Mr. Dursley that this was probably some silly stunt -- these people were obviously collecting for something...yes, that would be it. The traffic moved on and a few minutes later, Mr. Dursley arrived in the parking lot, his mind back on drills.
</p>
<p>
The evil of the actual disparity in their ages (and Mr. Woodhouse had not married early) was much increased by his constitution and habits; for having been a valetudinarian all his life, without activity of mind or body, he was a much older man in ways than in years; and though everywhere beloved for the friendliness of his heart and his amiable temper, his talents could not have recommended him at any time."
</p>
<h2>Intrinsic Detection</h2>
<p>
"Hmmm...that third paragraph seems much more sophisticated than the first two." <br />
<em>
Use stylometric features of the text to detect passages that are "different" from the rest of the document.
</em>
</p>
<h2>Extrinsic Detection</h2>
<p>
"Hmmm...that third paragraph is from <em>Emma</em>, not <em>Harry Potter and the Sorcerer's Stone</em>." <br />
<em>
Find passages that are similar to passages in an external corpus.
</em>
</p>
<h2>Intrinsic Plagiarism Detection</h2>
<img src="intrinsic_pipeline.jpg" alt="Intrinsic plagiarism detection pipeline" />
The goal of intrinsic plagiarism detection is to find passages within a document whose style differs significantly from the rest of the document. We break the process into three steps:
<ul>
<li>Atomization -- Deconstruct a document into passages.</li>
<li>Feature Extraction -- Quantify the style of each passage by extracting
stylometric features based on linguistic properties of the text. Each passage
is represented numerically as a vector of feature values.
</li>
<li>Classification -- Compare the feature vectors of the passages to one another.
Passages whose vectors differ significantly from the rest receive a higher confidence
of plagiarism. The output is a confidence, for each passage, that it was plagiarized.
</li>
</ul>
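<p>
The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's implementation: the function names <em>atomize</em>, <em>extract_features</em> and <em>classify</em> are hypothetical, the two features (average word length and average sentence length) are assumed stand-ins for a real stylometric feature set, and scoring by distance from the mean feature vector is one simple outlier measure among many.
</p>

```python
import math
import re


def atomize(document: str) -> list[str]:
    """Atomization: split a document into passages (here, one per paragraph)."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def extract_features(passage: str) -> tuple[float, float]:
    """Feature extraction: (average word length, average sentence length in words).

    These two features are assumptions chosen for illustration only.
    """
    words = re.findall(r"[A-Za-z']+", passage)
    sentences = [s for s in re.split(r"[.!?]+", passage) if s.strip()]
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    avg_sent_len = len(words) / max(len(sentences), 1)
    return (avg_word_len, avg_sent_len)


def classify(vectors: list[tuple[float, ...]]) -> list[float]:
    """Classification: return one confidence in [0, 1] per passage.

    Each feature is min-max scaled, then every passage is scored by its
    Euclidean distance from the mean vector, divided by the largest distance.
    """
    if not vectors:
        return []
    scaled_columns = []
    for column in zip(*vectors):
        low, high = min(column), max(column)
        span = high - low
        scaled_columns.append([(v - low) / span if span else 0.0 for v in column])
    scaled = list(zip(*scaled_columns))
    centroid = tuple(sum(col) / len(col) for col in scaled_columns)
    distances = [math.dist(vec, centroid) for vec in scaled]
    largest = max(distances)
    return [d / largest if largest else 0.0 for d in distances]


def detect(document: str) -> list[tuple[str, float]]:
    """Run the whole pipeline and pair each passage with its confidence."""
    passages = atomize(document)
    confidences = classify([extract_features(p) for p in passages])
    return list(zip(passages, confidences))
```

<p>
Calling <em>detect</em> on a document whose paragraphs are separated by blank lines returns each passage with its confidence. With only a handful of passages the mean is pulled toward the outlier, so the scores are best read as a ranking rather than as probabilities.
</p>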
<h2>Extrinsic Plagiarism Detection</h2>
<img src="extrinsic_pipeline.jpg" alt="Extrinsic plagiarism detection pipeline" />
Extrinsic plagiarism detection has more information to work with: in addition
to a suspicious document, we are given a set of <em>external</em>, or <em>source</em>,
documents to compare against it. The extrinsic detection process can be broken
into three steps:
<ul>
<li>Atomization -- Deconstruct a document into passages.</li>
<li>Fingerprinting -- Compress a passage of text into a <em>fingerprint</em>,
a set of integers that represent the passage. The integers come from applying
a <em>hash function</em> to some subset of <em>n</em>-grams of the passage.
</li>
<li>Fingerprint Matching -- Passages are now represented by fingerprints, which are
simply sets of integers. To compare a fingerprint from the suspicious document
with the source fingerprints, we can use set similarity measures. High similarity
between two fingerprints indicates a high confidence that plagiarism is present.
</li>
</ul>
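<p>
The fingerprinting and matching steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the function names are hypothetical, word trigrams and the CRC-32 checksum from Python's <em>zlib</em> module stand in for whichever <em>n</em>-grams and hash function are actually used, the subset is chosen by keeping hashes divisible by a parameter <em>k</em>, and Jaccard similarity is one possible set similarity measure.
</p>

```python
import re
import zlib


def ngrams(passage: str, n: int = 3) -> list[str]:
    """Return the word n-grams of a passage, lowercased, punctuation dropped."""
    words = re.findall(r"[a-z0-9']+", passage.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def fingerprint(passage: str, n: int = 3, k: int = 1) -> set[int]:
    """Hash every n-gram and keep the hashes divisible by k.

    k = 1 keeps every n-gram; a larger k keeps a smaller subset.
    CRC-32 is an assumed stand-in for the hash function.
    """
    hashes = (zlib.crc32(gram.encode("utf-8")) for gram in ngrams(passage, n))
    return {h for h in hashes if h % k == 0}


def jaccard(a: set[int], b: set[int]) -> float:
    """Set similarity: size of the intersection over size of the union."""
    union = a.union(b)
    if not union:
        return 0.0
    return len(a.intersection(b)) / len(union)


def best_match(suspicious: str, sources: list[str]) -> tuple[int, float]:
    """Return (index, similarity) of the source passage most similar to the
    suspicious passage. Assumes sources is non-empty."""
    suspect_print = fingerprint(suspicious)
    scores = [jaccard(suspect_print, fingerprint(s)) for s in sources]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

<p>
Given a suspicious passage and a list of source passages, <em>best_match</em> reports which source shares the most fingerprint values with it. Raising <em>k</em> shrinks the fingerprints at the cost of possibly missing short overlaps.
</p>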
</div>
</div>
</body>
</html>