\newpage
\phantomsection
\addcontentsline{toc}{chapter}{Vulgarising Summary}
\noindent \textbf{\huge Vulgarising Summary }
\vspace{1.5cm}
This thesis is situated in the field of the Semantic Web, where the
World Wide Web, \emph{the internet} as we know it, is extended
with standards and specifications that make the data on
the internet machine-readable. Much of these data come in different
formats and from different sources,
which might not be compatible with each other. Furthermore,
the data are also difficult for our computers to understand and use,
since they do not inherently carry semantics.
How do we make these data, which are already widely available,
compatible with each other? This is where a part of
the Semantic Web technology comes into play --- the
transformation of heterogeneous data into semantically linked data.
\section*{Mapping of heterogeneous data to linked data}
Presently, large amounts of data are generated at very high frequency.
Think of the millions of tweets that Twitter users produce at any given
moment, none of which are in a linked data format. In this work, we
are interested in transforming these data straight into a linked data format,
the very moment they are generated. Two major approaches currently exist
for this transformation: a query-based and a mapping-based approach. We shall denote
the engines responsible for the transformation as \emph{mapping engines}.
The query-based approach allows the user to pose \emph{questions}
to the mapping engines, which then interpret these \emph{questions}
to transform heterogeneous data into linked data. This removes the need
for the user to define exactly how the data should be transformed.
In contrast, the mapping-based approach requires the user to explicitly
state the transformation in the form of rules.
The mapping engine then follows these rules to transform the data.
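To make the contrast concrete, the sketch below shows what following such a
rule could look like, in a few lines of Python. It is only an illustration:
the record fields, the rule format, and the \texttt{apply\_rule} function are
assumptions made for this example, not the actual rule language of any
existing mapping engine.

\begin{verbatim}
# Illustrative sketch of the mapping-based approach (not a real engine).
# The rule says: build a subject from the "id" field and emit one
# (subject, predicate, object) triple per mapped field.
rule = {
    "subject": "http://example.org/tweet/{id}",   # hypothetical template
    "predicates": {
        "text": "http://example.org/ontology#content",
        "user": "http://example.org/ontology#author",
    },
}

def apply_rule(record, rule):
    """Transform one heterogeneous record into linked-data triples."""
    subject = rule["subject"].format(**record)
    for field, predicate in rule["predicates"].items():
        yield (subject, predicate, record[field])

# Example: one incoming record, e.g. a tweet in JSON form.
record = {"id": "42", "text": "Hello, Semantic Web!", "user": "alice"}
for triple in apply_rule(record, rule):
    print(triple)
\end{verbatim}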
Implementations of both approaches exist that support transforming
the data \emph{online} --- the very moment
the data are generated. Unlike large files stored on disk,
these online data sources are \emph{unbounded} --- theoretically, there is no
upper limit to the amount of data a source may generate. Therefore,
we cannot hope to hold and process everything in the limited memory of a
single machine. This is where \emph{window} processing comes in: it allows us
to process the \emph{unbounded} data one subset at a time.
\section*{Window processing in data streams}
Processing \emph{unbounded} data is usually done by splitting the
incoming data into subsets with \emph{windows}. You can picture a
window as a bus that collects passengers at a
bus stop (the processing step). All buses have a fixed size, and passengers
board the bus one by one. Now, imagine a scenario where the number of
passengers arriving at the stop changes over time. A bus of fixed size
cannot accommodate the passengers' rising arrival rate, and some of them
will have to wait for the next bus. This is undesirable: the passengers
have to wait longer (high latency),
and the rate at which passengers depart also drops (low throughput).
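As a rough sketch of this fixed-size case, the Python snippet below groups
an incoming stream into count-based windows of a fixed size. The window size
and the finite stream standing in for an unbounded one are assumptions made
for the example.

\begin{verbatim}
from itertools import islice

def fixed_windows(stream, size):
    """Split an (unbounded) stream into fixed-size, count-based windows.

    Like the fixed-size bus: a window is handed to the processing
    step only once it has collected `size` items, so at a rising
    arrival rate the backlog of waiting items grows (higher latency).
    """
    it = iter(stream)
    while True:
        window = list(islice(it, size))
        if not window:
            return
        yield window  # hand the window to the processing step

# Illustrative use on a finite stand-in for an unbounded stream.
for w in fixed_windows(range(10), size=4):
    print(w)
\end{verbatim}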
\section*{Solution}
The aforementioned problem leads us to propose a dynamic approach
to windowing, in which windows change their size depending on the
arrival rate of the data records.
To adjust the size of the \emph{window}, we calculate a metric
based on the arrival rate of the data and the
current size of the window. Returning to the bus analogy,
buses of different sizes now arrive at the bus stop to
pick up the passengers, lowering the waiting time (low latency) and
improving the departure rate of the passengers (high throughput).
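A minimal sketch of this idea is given below. The concrete update rule ---
scaling the window size by the ratio of the observed arrival rate to the
rate the current size was chosen for --- is an illustrative assumption, not
the exact metric developed in this thesis.

\begin{verbatim}
def next_window_size(current_size, arrivals, interval_s,
                     target_rate, min_size=1, max_size=10_000):
    """Pick the next window size from the observed arrival rate.

    Illustrative heuristic: grow the window when data arrive faster
    than the rate the current size was chosen for, shrink it when
    they arrive slower, and clamp the result to sane bounds.
    """
    observed_rate = arrivals / interval_s   # records per second
    scale = observed_rate / target_rate     # > 1 means we fell behind
    new_size = round(current_size * scale)
    return max(min_size, min(max_size, new_size))

# Example: a burst doubles the arrival rate, so the window doubles too.
size = 100
size = next_window_size(size, arrivals=2000, interval_s=10,
                        target_rate=100)
print(size)  # -> 200
\end{verbatim}

In the bus analogy, this corresponds to the dispatcher sending a larger bus
whenever the queue at the stop grows faster than the current bus can absorb.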