forked from simsong/bulk_extractor
-
Notifications
You must be signed in to change notification settings - Fork 0
/
NEWS
459 lines (327 loc) · 17 KB
/
NEWS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
BEViewer (with bulk_extractor v1.2):
Readable Feature entries:
Features in the scrollable Feature list will be formatted based on their type
(namely GPS, IP, TCP, and JSON)
and will print the result in the scrollable Feature list rather than the original Feature text.
================================================================
BEViewer (with bulk_extractor v1.1):
Feature view:
Features in the scrollable Feature list has been corrected to display actual values
rather than octal notation, which was incorrect.
Feature files that are Stoplist files, as identified by extension "_stopped.txt"
will not show in the list of feature files unless specifically requested
by new menu control "View | Show Stoplist Features"
================================================================
Subject: Announcing bulk_extractor Release version 0.7.0
From: Simson L. Garfinkel
Date: Dec 4, 2010
bulk_extractor 0.7.0 is released. Significant new features include:
* Automatic detection and decompression of Windows HIBERFIL.SYS means
that you'll get more URLs, email addresses, and other artifacts out
of system memory.
* Improved crash protection.
* POSIX threading is now enabled by default, and bulk_extractor
automatically determines the right number of threads to use based on
how many cores your CPU has.
* Improved extraction of ZIP files.
bulk_extractor is nearly feature complete. When it is we'll push the
revision number to 1.0.0 and and put the system in maintenance mode
while we work on the GUI and some post-processing scripts.
================================================================
Subject: Announcing bulk_extractor Release version 0.6.0
From: Simson L. Garfinkel
Date: Dec 4, 2010
bulk_extractor 0.6.0 is released. Signifiacnt new features include:
* Recovery of text from certain kinds of PDF files. (This will improve with time.)
* Improved recovery and reliability when recovering ZIP components.
* Improved recovery from GZIP streams.
* (EXPERIMENTAL) Posix threading via a thread poool on Linux and
Macos. Specify -P to test it. The thread pool uses 1 thread to read
the disk image and -jN threads to process each buffer as the thread
becomes available. This provides for dramatically more efficient I/O
and proper utilization of multiple threads. (ie: bulk_extractor will
no longer schedule threads that are not needed.)
>>> POSIX threading is recommended for MacOS but has not been thoroughly
tested under Linux. It will not work under Windows.
* -p option that allows you to view any forensic "path" that where a
feature is found this allows you to view the contents of compressed
files, for example.
* -f option allows you to search for arbitrary regular expressions.
BE CAREFUL! This can dramatically impact performance.
* -c "Crash Protection" option sets a signal handler that will cause
bulk_extractor to automatically restart from some crashes. By
default this is diabled, but it will be enabled by default in the
future.
* Directly handles multi-volume VMDK files.
* Memory allocation bug in reading raw images fixed.
* Margin can now be specified.
* bulk_extractor now supports the following scanners:
-x accts - disable scanner accts
-x base64 - disable scanner base64
-x email - disable scanner email
-x exif - disable scanner exif
-x zip - disable scanner zip
-x gzip - disable scanner gzip
-x pdf - disable scanner pdf
-e tcp - enable scanner tcp
-e bulk - enable scanner bulk
-e find - enable scanner find
-e wordlist - enable scanner wordlist
================================================================
Date: Oct. 11, 2010
Subject: Announcing bulk_extractor Release version 0.5.0
From: Simson L. Garfinkel
bulk_extractor 0.5.0 is released. Significant new features include:
* Automatic carving of gzip-compressed residual data. This is
particularly useful when encountering web browser caches that were
downloaded with GZIP compression.
* A "Repeat Crash" mode (-R) which causes bulk_extractor to
re-analyzing the last page analyzed by each thread after a
crash. This makes it easy to run the program under a debugger and
quickly replicate the crash, for the purpose of fixing the relevant
scanner. (A future "restart" mode will cause bulk_extractor to keep
going after a crash.)
* Post-processing of the wordlist to create a set of 100 megabyte
deduplicated terms suitable for directly providing to a
password-cracking program.
* Bug fixes in the PKZIP carving.
Date: Sept. 27, 2010
Subject: Announcing bulk_extractor Release version 0.4.2.
From: Simson L. Garfinkel
bulk_extractor 0.4.2 is released. Significant features include:
* Support for context-based stop lists
* Automatic carving of PKZIP files
* Improved support for EXIF carving
Context-based stop list
=======================
Many users of bulk_extractor report surprise at the large number of
email addresses, URLs, JPEGs, and other information that are contained
within the standard Microsoft Windows and Linux distributions. For
example, Microsoft Windows XPSP3 contains 306 distinct email
addresses, including not just addresses like [email protected] and
[email protected], but email addresses that look like they belonging
to individuals such as [email protected] and [email protected].
The initial way that we attempted to resolve this issue was by
creating a "stop list" of the distribution email addresses and
building that stoplist into the bulk_extractor binary. The problem
with this approach, we quickly learned, is that these problematic
email addresses might appear in a variety of contexts, but we only
want them suppressed when they are harvested as part of the operating
system files. For example, we don't want to be alerted to the
[email protected] email address when it appears as part of Microsoft
Windows, but we do want this email address reported if it is found
elsewhere.
To resolve this problem bulk_extractor now supports a context-based
stop list. Instead of simply a list of email addresses that should be
suppressed, the context-based stop list conatins both the email
address and the context in which that email address occures. Here we
define "context" to mean the 8 characters before the email address and
the 8 characters following the email address in the disk image.
The context-based stop list is distributed as a specially formatted
text file that contains the element to be suppressed, a tab, and the
element in context. Unprintable characters are reported as
underbars. For example, these two entries suppress the two occuresses
of the [email protected] email address in Windows XPSP3:
[email protected] [email protected]_priklad
[email protected] [email protected]__prikla
All items suppressed by the traditional regular-expression stop list
or the context-based stop list are now presented in separate feature
files --- for example, email_stop.txt. In no case is information
actually suppressed. Presenting the suppressed results is important in
for tool validation, both in testing and when the tool is actually
run. Stopped terms may also useful for performing a profile of the
hard drive.
bulk_extractor now comes with a Python program called
make_context_stop_list.py. This program will process the output of
bulk_extractor from multiple runs and create a single context-based
stop list. We are also distributing a sample context-based stop list
which is derrived from the following operating systems:
fedora12-64
redhat54-ent-64
w2k3-32bit
w2k3-64bit
win2008-r2-64
win7-ent-32
win7-utl-64
winXP-32bit-sp3
winXP-64bit
You can download version 1.0 of the stoplist.
Be sure to decompress the list first! We are distributing it in ZIP
form because is the 70 megabytes in length. A future version of
bulk_extarctor may read the compressed list directly.
Context-based stop lists correct the stop-list problem that surfaced with
bulk_extractor 0.4.0. That version simply suppressed terms that were
already present in the Windows and Linux distributions. Unfortunately
this created an attack vector in which attackers could register and
use these email addresses and in so doing escape detection.
PKZIP Carving
=============
Version 0.4.2 introduces carving of PKZIP components. Whenever
bulk_extractor finds a component of a ZIP file that includes a valid
header, it attempts to decompress the fragment and then recursively
reprocesses the decompressed data with all of the extractors.
Currently the results of ZIP carving are reported with standard
offsets. In the feature the offsets will be reported NNNNNN-ZIP where
NNNNN is the byte offset of the ZIP component.
Improved support for EXIF Carving
=================================
Version 0.4.2 finds and carves EXIF headers of JPEG files. All of the
results are stored in a feature file that consists of the MD5 hash of
the first 4K of the JPEG and an XML structure. bulk_extractor now also
comes with a program called post_process_exif.py which reads this file
and creates a tab-delimited file that can be imported into Microsoft
Excel that breaks each EXIF field into its own spreadsheet column.
Download Information
====================
As always, the current version of bulk_extractor can be downloaded
from the website at http://digitalcorpora.org/downloads/bulk_extractor.
================================================================
Date: Sept. 6, 2010
Subject: Announcing bulk_extractor Release version 0.4.0.
From: Simson L. Garfinkel
bulk_extractor 0.4.0 is released. This version has several new
features based on user-feedback, and a few bug fixes based on a
thorough code review.
New features:
exif.txt - this file contains information on all of the carved EXIFs
found on the subject media. The output file consists of the
offset of the JPEG or MPEG that contains the EXIF, the MD5
of the first 4096 bytes of the file, and an XML structure
containing all of the extractable fields less than 10,000
bytes in length.
The program post_process_exif.py, also included with the
distribution, will turn exif.txt into a comma-separated
value file that can be easily imported into Microsoft
Excel.
stoplist - bulk_extractor now includes a default stoplist that
suppresses all of the domain names, email addresses,
telephone numbers, etc., that appear in any of the
following operating systems:
Fedora 12 (64 bit)
Redhat 5.4 Enterprise (64 bit)
Windows 2003 (32 bit)
Windows 2003 (64 bit)
Windows 2008 R2 (64 bit)
Windows 7 Enterprise (32 bit)
Windows 7 Ultimate (64 bit)
Windows XP SP3 (32 bit)
Windows XP SP3 (64 bit)
Ubuntu 8.10 (32 bit)
Ubuntu 10.10 (32 bit)
The stoplist is enabled automatically, but you can disable
it with the "-s" flag.
You can dump the stoplist with the "-S" flag. The 0.4.0
release is 155,476 terms in the stoplist, from
(000)-000-0000 to zzzcomputing.com.
General Improvements:
- The domain and IP address finders now have lower false
positive rates.
Incompatabilities:
- bulk_extractor now *requires* the experimental, alpha-version
of libewf version 2. You are best off with libewf-alpha-20100907 or later.
bulk_extractor can be downloaded from http://digitalcorpora.org/downloads/bulk_extractor.
================================================================
Date: May 15, 2010
Subject: Announcing bulk_extractor Release version 0.3.1.
From: Simson L. Garfinkel
bulk_extractor 0.3.1 is released. This version has several new
features based on user-feedback, and a few bug fixes based on a
thorough code review.
New Features:
* url_services.txt - a histogram of all URLs by domain.
* url_searches.txt - a histogram of all search terms, including
Google, Yahoo, Bing, and any other search service
with "search" in the domain and "q=" or "p=" in
the URL.
* ccn.txt - this file now reports Federal Express account numbers,
SSNs (if properly formatted or prefixed), DOBs, and other info.
* tcp.txt - This experimental feature looks for IP and TCP packets in
PAGEFILE.SYS, memory dumps and hibernation files, and stores the results.
* the whitelist and redlist files may now contain globbed terms. For example, put
*@company.com in the redlist and any mention of [email protected] will be flagged
and also put into a special file called redlist_found.txt.
* CONTEXT: The ccn.txt now show the context from which the matched
information was taken. hosts.txt shows context for numeric IP addresses.
Bug Fixes:
* Improved handling of raw devices and files.
* bulk_extractor is now less likely to error on some input data sets.
================================================================
Date: May 2, 2010
Subject: Announcing bulk_extractor Release version 0.3.0.
From: Simson L. Garfinkel
I am pleased to announce the release of bulk_extractor 0.3.0,
available immediately for Windows, MacOS and Linux.
bulk_extractor is a fast feature extractor and histogram analysis
tool. This program operates below the file level, extracting email
addresses, URLs, Cookies, credit card numbers, and other information
from disk images. bulk_extractor has native support for raw, AFF and
E01 file formats, allowing you to start extracting immediately without
file conversion.
bulk_extractor 0.3.0 adds multi-threading support. This is important,
as most extractions are CPU-bound, not I/O bound. Multi-core support
allows you to turn any CPU-bound task into an I/O bound
task. Multi-core support is available on all supported platforms.
Performance with the multi_core support is quite exciting. A recent
extraction of a 17GB E01 file required approximately 70
minutes. Running the same extraction with the "-j4" flag (meaning, run
4 simultaneous extractions on each of 4 cores) lowered the the time to
just 18 minutes.
You can download bulk_extractor from http://digitalcorpora.org/downloads/bulk_extractor.
================================================================
Date: April 13, 2010
Subject: Announcing bulk_extractor Release version 0.1.0.
From: Simson L. Garfinkel
I am pleased to announce the release of bulk_extractor 0.1.0.
bulk_extractor 0.1.0 is a C++ program that will scan a disk image (or
any other file) and extract useful information without parsing the
file system. It is like a combination of "strings" and "grep" with a
whole bunch of useful patterns. bulk_extractor will also perform a
histogram analysis on the resulting extracted features.
Release version 0.1.0 brings two significant changes to the C++ bulk_extractor:
1 - New extractors - credit card numbers and "wordlist"
2 - New two-phase design to support big data sets and low-memory situations.
3 - Direct support for Encase-formatted E01 files
4 - ETA when doing large extractions.
The Java bulk_extractor (which runs 3 times faster) has not been
updated. This would make a nice term project for a student interested
in computer forensics. If we don't have forward progresson on the Java
version within a few months we will remove the Java bulk_extractor
from the C++ version and package it separately.
1. New Extractors.
bulk_extractor will now recognize credit card numbers.
bulk_extractor can now make a "word list" of every "word" that appears
on the hard drive. This can be useful for building a list of words
that can be used to crack password-based cryptography.
2. Two-phase design
bulk_extractor has been redesigned to have a two-pass process.
phase 1 - the hard drive image is analyzed and "extracted feature files" are created.
phase 1 is the time-intensive part which requires scanning the entire
input hard drive. This redeisgn allows phase 1 to run in constant
memory.
phase 2 - the extracted feature files are analyzed for a report.
phase 2 is the memory-intensive part. Now if the report generation
fails, you still have the files that were produced during phase 1. You
can then move these files to a computer with more memory and re-run
just phase 2.
bulk_extractor now creates an output directory that has the following layout:
logfile.txt - reports what was done
domains.txt - domain names intermediate file
emails.txt - email addresses intermediate file
ccns.txt - credit card numbers intermediate file
wordlist.txt - wordlist
3. Direct support for EnCase E01 files
bulk_extractor now has direct support for E01 files. It turns out that
this is much faster than going through the AFFLIB virtual file
system.
4. ETA when doing large extractions.
bulk_extractor now evaluates how fast it is running and reports and
Estimated Time of Completion (ETA).
bulk_extractor can be downloaded in source code form from
http://digitalcorpora.org/downloads/bulk_extractor. We should have a command-line windows version
released shortly.
================================================================
Release version 0.0.10.
- Removed -a switch for "all". Now the default is "all."
For low memory situations, use the "-m" flag (which uses a bloom filter to suppress items that only appear once.)
- Added support for ASCII Unicode extraction.
- Added support for URL extraction.