"""
Unofficial Python API for retrieving data from Delicious.com.
This module provides the following features plus some more:
* retrieving a URL's full public bookmarking history, including
    * users who bookmarked the URL, including the tags used for such
      bookmarks and the creation time of the bookmark (up to YYYY-MM-DD
      granularity)
    * top tags (up to a maximum of 10) including tag count
    * title as stored on Delicious.com
    * total number of bookmarks/users for this URL at Delicious.com
* retrieving a user's full bookmark collection, including any private bookmarks
if you know the corresponding password
* retrieving a user's full public tagging vocabulary, i.e. tags and tag counts
* retrieving a user's network information (network members and network fans)
* HTTP proxy support
* updated to support Delicious.com "version 2" (mini-relaunch as of August 2008)
The official Delicious.com API and the JSON/RSS feeds do not provide all
the functionality mentioned above, and in such cases this module will query
the Delicious.com *website* directly and extract the required information
by parsing the HTML code of the resulting Web pages (a kind of poor man's
web mining). The module is able to detect IP throttling, which is employed
by Delicious.com to temporarily block abusive HTTP request behavior, and
will raise a custom Python error to indicate that. Please be a nice netizen
and do not stress the Delicious.com service more than necessary.
It is strongly advised that you read the Delicious.com Terms of Use
before using this Python module. In particular, read section 5
'Intellectual Property'.
The code is licensed to you under version 2 of the GNU General Public
License.
More information about this module can be found at
http://www.michael-noll.com/wiki/Del.icio.us_Python_API
Changelog is available at
http://code.michael-noll.com/?p=deliciousapi;a=log
Copyright 2006-2010 Michael G. Noll <http://www.michael-noll.com/>
"""
__author__ = "Michael G. Noll"
__copyright__ = "(c) 2006-2010 Michael G. Noll"
__description__ = "Unofficial Python API for retrieving data from Delicious.com"
__email__ = "coding[AT]michael-REMOVEME-noll[DOT]com"
__license__ = "GPLv2"
__maintainer__ = "Michael G. Noll"
__status__ = "Development"
__url__ = "http://www.michael-noll.com/"
__version__ = "1.6.7"
import cgi
import datetime
import hashlib
from operator import itemgetter
import re
import socket
import time
import urllib2
try:
from BeautifulSoup import BeautifulSoup
except ImportError:
print "ERROR: could not import BeautifulSoup Python module"
print
print "You can download BeautifulSoup from the Python Cheese Shop at"
print "http://cheeseshop.python.org/pypi/BeautifulSoup/"
print "or directly from http://www.crummy.com/software/BeautifulSoup/"
print
raise
try:
import simplejson
except ImportError:
print "ERROR: could not import simplejson module"
print
print "Since version 1.5.0, DeliciousAPI requires the simplejson module."
print "You can download simplejson from the Python Cheese Shop at"
print "http://pypi.python.org/pypi/simplejson"
print
raise
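# A minimal usage sketch (the URL below and the printed values are
# illustrative only; network access to Delicious.com is required):
#
#   import deliciousapi
#   dapi = deliciousapi.DeliciousAPI()
#   doc = dapi.get_url("http://www.python.org/", max_bookmarks=50)
#   print doc.title
#   print doc.total_bookmarks
#   print doc.top_tags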
class DeliciousUser(object):
"""This class wraps all available information about a user into one object.
Variables:
bookmarks:
A list of (url, tags, title, comment, timestamp) tuples representing
a user's bookmark collection.
url is a 'unicode'
tags is a 'list' of 'unicode' ([] if no tags)
title is a 'unicode'
comment is a 'unicode' (u"" if no comment)
timestamp is a 'datetime.datetime'
tags (read-only property):
A dictionary mapping tags to their tag count, aggregated over all of
a user's retrieved bookmarks. The tags represent a user's tagging vocabulary.
username:
The Delicious.com account name of the user.
"""
def __init__(self, username, bookmarks=None):
assert username
self.username = username
self.bookmarks = bookmarks or []
def __str__(self):
total_tag_count = 0
total_tags = set()
for url, tags, title, comment, timestamp in self.bookmarks:
if tags:
total_tag_count += len(tags)
for tag in tags:
total_tags.add(tag)
return "[%s] %d bookmarks, %d tags (%d unique)" % \
(self.username, len(self.bookmarks), total_tag_count, len(total_tags))
def __repr__(self):
return self.username
def get_tags(self):
"""Returns a dictionary mapping tags to their tag count.
For example, if the tag count of tag 'foo' is 23, then
the user annotated 23 of his bookmarks with the tag 'foo'.
"""
total_tags = {}
for url, tags, title, comment, timestamp in self.bookmarks:
for tag in tags:
total_tags[tag] = total_tags.get(tag, 0) + 1
return total_tags
tags = property(fget=get_tags, doc="Returns a dictionary mapping tags to their tag count")
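# A small sketch of how DeliciousUser aggregates tags (the bookmark data
# below is made up for illustration):
#
#   user = DeliciousUser(u"someuser", bookmarks=[
#       (u"http://example.com/", [u"example", u"web"], u"Example", u"", None),
#       (u"http://example.org/", [u"example"], u"Example Org", u"", None),
#   ])
#   user.tags    # -> {u'example': 2, u'web': 1}
#   str(user)    # -> '[someuser] 2 bookmarks, 3 tags (2 unique)'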
class DeliciousURL(object):
"""This class wraps all available information about a web document into one object.
Variables:
bookmarks:
A list of (user, tags, comment, timestamp) tuples, representing a
document's bookmark history. Generally, this variable is populated
via get_url(), so the number of bookmarks available in this variable
depends on the parameters of get_url(). See get_url() for more
information.
user is a 'unicode'
tags is a 'list' of 'unicode's ([] if no tags)
comment is a 'unicode' (u"" if no comment)
timestamp is a 'datetime.datetime' (granularity: creation *day*,
i.e. the day but not the time of day)
tags (read-only property):
A dictionary mapping tags to their tag count, aggregated over all of
a document's retrieved bookmarks.
top_tags:
A list of (tag, tag_count) tuples, representing a document's so-called
"top tags", i.e. the up to 10 most popular tags for this document.
url:
The URL of the document.
hash (read-only property):
The MD5 hash of the URL.
title:
The document's title.
total_bookmarks:
The number of total bookmarks (posts) of the document.
Note that the value of total_bookmarks can be greater than the
length of "bookmarks" depending on how much (detailed) bookmark
data could be retrieved from Delicious.com.
Here's some more background information:
The value of total_bookmarks is the "real" number of bookmarks of
URL "url" stored at Delicious.com as reported by Delicious.com
itself (so it's the "ground truth"). On the other hand, the length
of "bookmarks" depends on iteratively scraped bookmarking data.
Since scraping Delicious.com's Web pages has its limits in practice,
this means that DeliciousAPI could most likely not retrieve all
available bookmarks. In such a case, the value reported by
total_bookmarks is greater than the length of "bookmarks".
"""
def __init__(self, url, top_tags=None, bookmarks=None, title=u"", total_bookmarks=0):
assert url
self.url = url
self.top_tags = top_tags or []
self.bookmarks = bookmarks or []
self.title = title
self.total_bookmarks = total_bookmarks
def __str__(self):
total_tag_count = 0
total_tags = set()
for user, tags, comment, timestamp in self.bookmarks:
if tags:
total_tag_count += len(tags)
for tag in tags:
total_tags.add(tag)
return "[%s] %d total bookmarks (= users), %d tags (%d unique), %d out of 10 max 'top' tags" % \
(self.url, self.total_bookmarks, total_tag_count, \
len(total_tags), len(self.top_tags))
def __repr__(self):
return self.url
def get_tags(self):
"""Returns a dictionary mapping tags to their tag count.
For example, if the tag count of tag 'foo' is 23, then
23 bookmarks were annotated with 'foo'. A different way
to put it is that 23 users used the tag 'foo' when
bookmarking the URL.
@return: Dictionary mapping tags to their tag count.
"""
total_tags = {}
for user, tags, comment, timestamp in self.bookmarks:
for tag in tags:
total_tags[tag] = total_tags.get(tag, 0) + 1
return total_tags
tags = property(fget=get_tags, doc="Returns a dictionary mapping tags to their tag count")
def get_hash(self):
m = hashlib.md5()
m.update(self.url)
return m.hexdigest()
hash = property(fget=get_hash, doc="Returns the MD5 hash of the URL of this document")
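# The hash property mirrors the MD5-based identifier that Delicious.com
# itself uses in its /url/<hash> pages (see get_bookmarks() below):
#
#   doc = DeliciousURL("http://www.python.org/")
#   doc.hash    # -> 32-character hex MD5 digest of the URL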
class DeliciousAPI(object):
"""
This class provides a custom, unofficial API to the Delicious.com service.
Instead of using just the functionality provided by the official
Delicious.com API (which has limited features), this class retrieves
information from the Delicious.com website directly and extracts data from
the Web pages.
Note that Delicious.com will block clients with too many queries in a
certain time frame (similar to their API throttling). So be a nice citizen
and don't stress their website.
"""
def __init__(self,
http_proxy="",
tries=3,
wait_seconds=3,
user_agent="DeliciousAPI/%s (+http://www.michael-noll.com/wiki/Del.icio.us_Python_API)" % __version__,
timeout=30,
):
"""Set up the API module.
@param http_proxy: Optional, default: "".
Use an HTTP proxy for HTTP connections. Proxy support for
HTTPS is not available yet.
Format: "hostname:port" (e.g., "localhost:8080")
@type http_proxy: str
@param tries: Optional, default: 3.
Try the specified number of times when downloading a monitored
document fails. tries must be >= 1. See also wait_seconds.
@type tries: int
@param wait_seconds: Optional, default: 3.
Wait the specified number of seconds before re-trying to
download a monitored document. wait_seconds must be >= 0.
See also tries.
@type wait_seconds: int
@param user_agent: Optional, default: "DeliciousAPI/<version>
(+http://www.michael-noll.com/wiki/Del.icio.us_Python_API)".
The User-Agent HTTP header to use when querying Delicious.com.
@type user_agent: str
@param timeout: Optional, default: 30.
Set network timeout. timeout must be >= 0.
@type timeout: int
"""
assert tries >= 1
assert wait_seconds >= 0
assert timeout >= 0
self.http_proxy = http_proxy
self.tries = tries
self.wait_seconds = wait_seconds
self.user_agent = user_agent
self.timeout = timeout
socket.setdefaulttimeout(self.timeout)
def _query(self, path, host="delicious.com", user=None, password=None, use_ssl=False):
"""Queries Delicious.com for information, specified by (query) path.
@param path: The HTTP query path.
@type path: str
@param host: The host to query, default: "delicious.com".
@type host: str
@param user: The Delicious.com username if any, default: None.
@type user: str
@param password: The Delicious.com password of user, default: None.
@type password: unicode/str
@param use_ssl: Whether to use SSL encryption or not, default: False.
@type use_ssl: bool
@return: None on errors (i.e. on all HTTP status other than 200).
On success, returns the content of the HTML response.
"""
opener = None
handlers = []
# add HTTP Basic authentication if available
if user and password:
pwd_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
pwd_mgr.add_password(None, host, user, password)
basic_auth_handler = urllib2.HTTPBasicAuthHandler(pwd_mgr)
handlers.append(basic_auth_handler)
# add proxy support if requested
if self.http_proxy:
proxy_handler = urllib2.ProxyHandler({'http': 'http://%s' % self.http_proxy})
handlers.append(proxy_handler)
if handlers:
opener = urllib2.build_opener(*handlers)
else:
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', self.user_agent)]
data = None
tries = self.tries
if use_ssl:
protocol = "https"
else:
protocol = "http"
url = "%s://%s%s" % (protocol, host, path)
while tries > 0:
try:
f = opener.open(url)
data = f.read()
f.close()
break
except urllib2.HTTPError, e:
if e.code == 301:
raise DeliciousMovedPermanentlyWarning, "Delicious.com status %s - url moved permanently" % e.code
elif e.code == 302:
raise DeliciousMovedTemporarilyWarning, "Delicious.com status %s - url moved temporarily" % e.code
elif e.code == 401:
raise DeliciousUnauthorizedError, "Delicious.com error %s - unauthorized (authentication failed?)" % e.code
elif e.code == 403:
raise DeliciousForbiddenError, "Delicious.com error %s - forbidden" % e.code
elif e.code == 404:
raise DeliciousNotFoundError, "Delicious.com error %s - url not found" % e.code
elif e.code == 500:
raise Delicious500Error, "Delicious.com error %s - server problem" % e.code
elif e.code == 503 or e.code == 999:
raise DeliciousThrottleError, "Delicious.com error %s - unable to process request (your IP address has been throttled/blocked)" % e.code
else:
raise DeliciousUnknownError, "Delicious.com error %s - unknown error" % e.code
except urllib2.URLError, e:
time.sleep(self.wait_seconds)
except socket.error, msg:
# sometimes we get a "Connection Refused" error
# wait a bit and then try again
time.sleep(self.wait_seconds)
#finally:
# f.close()
tries -= 1
return data
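# _query() is the single HTTP entry point used by all public methods below.
# For instance, get_url() fetches URL metadata via the JSON feeds host
# (the placeholder stands for the MD5 digest of the URL):
#
#   data = self._query("/v2/json/urlinfo/<md5-hash>", host="feeds.delicious.com")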
def get_url(self, url, max_bookmarks=50, sleep_seconds=1):
"""
Returns a DeliciousURL instance representing the Delicious.com history of url.
Generally, this method is what you want for getting title, bookmark, tag,
and user information about a URL.
Delicious.com only returns up to 50 bookmarks per result page. This means that
we have to do subsequent queries plus parsing if we want to retrieve
more than 50. Roughly speaking, the processing time of get_url()
increases linearly with the number of 50-bookmarks-chunks; i.e.
it will take 10 times longer to retrieve 500 bookmarks than 50.
@param url: The URL of the web document to be queried for.
@type url: str
@param max_bookmarks: Optional, default: 50.
See the documentation of get_bookmarks() for more information
as get_url() uses get_bookmarks() to retrieve a url's
bookmarking history.
@type max_bookmarks: int
@param sleep_seconds: Optional, default: 1.
See the documentation of get_bookmarks() for more information
as get_url() uses get_bookmarks() to retrieve a url's
bookmarking history. sleep_seconds must be >= 1 to comply with
Delicious.com's Terms of Use.
@type sleep_seconds: int
@return: DeliciousURL instance representing the Delicious.com history
of url.
"""
# we must wait at least 1 second between subsequent queries to
# comply with Delicious.com's Terms of Use
assert sleep_seconds >= 1
document = DeliciousURL(url)
m = hashlib.md5()
m.update(url)
hash = m.hexdigest()
path = "/v2/json/urlinfo/%s" % hash
data = self._query(path, host="feeds.delicious.com")
if data:
urlinfo = {}
try:
urlinfo = simplejson.loads(data)
if urlinfo:
urlinfo = urlinfo[0]
else:
urlinfo = {}
except TypeError:
pass
try:
document.title = urlinfo['title'] or u""
except KeyError:
pass
try:
top_tags = urlinfo['top_tags'] or {}
if top_tags:
document.top_tags = sorted(top_tags.iteritems(), key=itemgetter(1), reverse=True)
else:
document.top_tags = []
except KeyError:
pass
try:
document.total_bookmarks = int(urlinfo['total_posts'])
except (KeyError, ValueError):
pass
document.bookmarks = self.get_bookmarks(url=url, max_bookmarks=max_bookmarks, sleep_seconds=sleep_seconds)
return document
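# Example (illustrative; each bookmark is a (user, tags, comment,
# timestamp) tuple as described in DeliciousURL):
#
#   dapi = DeliciousAPI()
#   doc = dapi.get_url("http://www.python.org/", max_bookmarks=100)
#   for user, tags, comment, timestamp in doc.bookmarks:
#       print user, tags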
def get_network(self, username):
"""
Returns the user's list of followees and followers.
Followees are users in his Delicious "network", i.e. those users whose
bookmark streams he's subscribed to. Followers are his Delicious.com
"fans", i.e. those users who have subscribed to the given user's
bookmark stream).
Example:
    A -------->   --------> C
    D --------> B --------> E
    F -------->   --------> F
     followers     followees
       of B          of B
Arrows from user A to user B denote that A has subscribed to B's
bookmark stream, i.e. A is "following" or "tracking" B.
Note that user F is both a followee and a follower of B, i.e. F tracks
B and vice versa. In Delicious.com terms, F is called a "mutual fan"
of B.
Comparing this network concept to information retrieval, one could say
that followers are incoming links and followees outgoing links of B.
@param username: Delicious.com username for which network information is
retrieved.
@type username: unicode/str
@return: Tuple of two lists ([<followees>], [<followers>]), where each list
contains tuples of (username, tracking_since_timestamp).
If a network is set as private, i.e. hidden from public view,
(None, None) is returned.
If a network is public but empty, ([], []) is returned.
"""
assert username
followees = followers = None
# followees (network members)
path = "/v2/json/networkmembers/%s" % username
data = None
try:
data = self._query(path, host="feeds.delicious.com")
except DeliciousForbiddenError:
pass
if data:
followees = []
users = []
try:
users = simplejson.loads(data)
except TypeError:
pass
for user in users:
# reset for each user so that a missing key does not carry over
# values from the previous entry
uname = tracking_since = None
# followee's username
try:
uname = user['user']
except KeyError:
pass
# try to convert uname to Unicode
if uname:
try:
# we assume UTF-8 encoding
uname = uname.decode('utf-8')
except UnicodeDecodeError:
pass
# time when the given user started tracking this user
try:
tracking_since = datetime.datetime.strptime(user['dt'], "%Y-%m-%dT%H:%M:%SZ")
except KeyError:
pass
if uname:
followees.append( (uname, tracking_since) )
# followers (network fans)
path = "/v2/json/networkfans/%s" % username
data = None
try:
data = self._query(path, host="feeds.delicious.com")
except DeliciousForbiddenError:
pass
if data:
followers = []
users = []
try:
users = simplejson.loads(data)
except TypeError:
pass
for user in users:
# reset for each user so that a missing key does not carry over
# values from the previous entry
uname = tracking_since = None
# fan's username
try:
uname = user['user']
except KeyError:
pass
# try to convert uname to Unicode
if uname:
try:
# we assume UTF-8 encoding
uname = uname.decode('utf-8')
except UnicodeDecodeError:
pass
# time when fan started tracking the given user
try:
tracking_since = datetime.datetime.strptime(user['dt'], "%Y-%m-%dT%H:%M:%SZ")
except KeyError:
pass
if uname:
followers.append( (uname, tracking_since) )
return ( followees, followers )
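# Example (username is illustrative; 'dapi' is a DeliciousAPI instance):
#
#   followees, followers = dapi.get_network("someuser")
#   if followees is None and followers is None:
#       print "network is hidden from public view"
#   else:
#       for uname, tracking_since in followees:
#           print uname, tracking_since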
def get_bookmarks(self, url=None, username=None, max_bookmarks=50, sleep_seconds=1):
"""
Returns the bookmarks of url or user, respectively.
Delicious.com only returns up to 50 bookmarks per result page on its website.
This means that we have to do subsequent queries plus parsing if
we want to retrieve more than 50. Roughly speaking, the processing
time of get_bookmarks() increases linearly with the number of
50-bookmarks-chunks; i.e. it will take 10 times longer to retrieve
500 bookmarks than 50.
@param url: The URL of the web document to be queried for.
Cannot be used together with 'username'.
@type url: str
@param username: The Delicious.com username to be queried for.
Cannot be used together with 'url'.
@type username: str
@param max_bookmarks: Optional, default: 50.
Maximum number of bookmarks to retrieve. Set to 0 to disable
this limitation/the maximum and retrieve all available
bookmarks of the given url.
Bookmarks are sorted so that newer bookmarks are first.
Setting max_bookmarks to 50 means that get_bookmarks() will retrieve
the 50 most recent bookmarks of the given url.
In the case of getting bookmarks of a URL (url is set),
get_bookmarks() will take *considerably* longer to run
for pages with lots of bookmarks when setting max_bookmarks
to a high number or when you completely disable the limit.
Delicious returns only up to 50 bookmarks per result page,
so for example retrieving 250 bookmarks requires 5 HTTP
connections and parsing 5 HTML pages plus wait time between
queries (to comply with delicious' Terms of Use; see
also parameter 'sleep_seconds').
In the case of getting bookmarks of a user (username is set),
the same restrictions as for a URL apply with the exception
that we can retrieve up to 100 bookmarks per HTTP query
(instead of only up to 50 per HTTP query for a URL).
@type max_bookmarks: int
@param sleep_seconds: Optional, default: 1.
Wait the specified number of seconds between subsequent
queries in case that there are multiple pages of bookmarks
for the given url. sleep_seconds must be >= 1 to comply with
Delicious.com's Terms of Use.
See also parameter 'max_bookmarks'.
@type sleep_seconds: int
@return: Returns the bookmarks of url or user, respectively.
For urls, it returns a list of (user, tags, comment, timestamp)
tuples.
For users, it returns a list of (url, tags, title, comment,
timestamp) tuples.
Bookmarks are sorted "descendingly" by creation time, i.e. newer
bookmarks come first.
"""
# we must wait at least 1 second between subsequent queries to
# comply with delicious' Terms of Use
assert sleep_seconds >= 1
# url XOR username
assert bool(username) is not bool(url)
# maximum number of urls/posts Delicious.com will display
# per page on its website
max_html_count = 100
# maximum number of pages that Delicious.com will display;
# currently, the maximum number of pages is 20. Delicious.com
# allows to go beyond page 20 via pagination, but page N (for
# N > 20) will always display the same content as page 20.
max_html_pages = 20
path = None
if url:
m = hashlib.md5()
m.update(url)
hash = m.hexdigest()
# path will change later on if there are multiple pages of bookmarks
# for the given url
path = "/url/%s" % hash
elif username:
# path will change later on if there are multiple pages of bookmarks
# for the given username
path = "/%s?setcount=%d" % (username, max_html_count)
else:
raise Exception('You must specify either url or user.')
page_index = 1
bookmarks = []
while path and page_index <= max_html_pages:
data = self._query(path)
path = None
if data:
# extract bookmarks from current page
if url:
bookmarks.extend(self._extract_bookmarks_from_url_history(data))
else:
bookmarks.extend(self._extract_bookmarks_from_user_history(data))
# stop scraping if we already have as many bookmarks as we want
if (len(bookmarks) >= max_bookmarks) and max_bookmarks != 0:
break
else:
# check if there are multiple pages of bookmarks for this
# url on Delicious.com
soup = BeautifulSoup(data)
paginations = soup.findAll("div", id="pagination")
if paginations:
# find next path
nexts = paginations[0].findAll("a", attrs={ "class": "pn next" })
if nexts and (max_bookmarks == 0 or len(bookmarks) < max_bookmarks) and len(bookmarks) > 0:
# e.g. /url/2bb293d594a93e77d45c2caaf120e1b1?show=all&page=2
path = nexts[0]['href']
if username:
path += "&setcount=%d" % max_html_count
page_index += 1
# wait sleep_seconds between subsequent queries to be
# compliant with delicious' Terms of Use
time.sleep(sleep_seconds)
if max_bookmarks > 0:
return bookmarks[:max_bookmarks]
else:
return bookmarks
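# Example (illustrative; 'dapi' is a DeliciousAPI instance): retrieving up
# to 200 bookmarks of a URL means up to four result pages, with a pause
# between page requests:
#
#   bookmarks = dapi.get_bookmarks(url="http://www.python.org/",
#                                  max_bookmarks=200, sleep_seconds=1)
#   print len(bookmarks)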
def _extract_bookmarks_from_url_history(self, data):
"""
Extracts user bookmarks from a URL's history page on Delicious.com.
The Python library BeautifulSoup is used to parse the HTML page.
@param data: The HTML source of a URL history Web page on Delicious.com.
@type data: str
@return: list of user bookmarks of the corresponding URL
"""
bookmarks = []
soup = BeautifulSoup(data)
bookmark_elements = soup.findAll("div", attrs={"class": re.compile(r"^bookmark\s*")})
timestamp = None
for bookmark_element in bookmark_elements:
# extract bookmark creation time
#
# this timestamp has to "persist" until a new timestamp is
# found (delicious only provides the creation time data for the
# first bookmark in the list of bookmarks for a given day)
dategroups = bookmark_element.findAll("div", attrs={"class": "dateGroup"})
if dategroups:
spans = dategroups[0].findAll('span')
if spans:
date_str = spans[0].contents[0].strip()
timestamp = datetime.datetime.strptime(date_str, '%d %b %y')
# extract comments
comment = u""
datas = bookmark_element.findAll("div", attrs={"class": "data"})
if datas:
divs = datas[0].findAll("div", attrs={"class": "description"})
if divs:
comment = divs[0].contents[0].strip()
# extract tags
user_tags = []
tagdisplays = bookmark_element.findAll("div", attrs={"class": "tagdisplay"})
if tagdisplays:
aset = tagdisplays[0].findAll("a", attrs={"class": "tag noplay"})
for a in aset:
tag = a.contents[0]
user_tags.append(tag)
# extract user information
user = u""
metas = bookmark_element.findAll("div", attrs={"class": "meta"})
if metas:
links = metas[0].findAll("a", attrs={"class": "user user-tag"})
if links:
try:
user = links[0]['href'][1:]
except IndexError:
# WORKAROUND: it seems there is a bug on Delicious.com where
# sometimes a bookmark is shown in a URL history without any
# associated Delicious username (username is empty); this could
# be caused by special characters in the username or other things
#
# this problem of Delicious is very rare, so we just skip such
# entries until they find a fix
pass
# skip entries without a username (see workaround note above)
if user:
bookmarks.append( (user, user_tags, comment, timestamp) )
return bookmarks
def _extract_bookmarks_from_user_history(self, data):
"""
Extracts a user's bookmarks from his user page on Delicious.com.
The Python library BeautifulSoup is used to parse the HTML page.
@param data: The HTML source of a user page on Delicious.com.
@type data: str
@return: list of bookmarks of the corresponding user
"""
bookmarks = []
soup = BeautifulSoup(data)
ul = soup.find("ul", id="bookmarklist")
if ul:
bookmark_elements = ul.findAll("div", attrs={"class": re.compile(r"^bookmark\s*")})
timestamp = None
for bookmark_element in bookmark_elements:
# extract bookmark creation time
#
# this timestamp has to "persist" until a new timestamp is
# found (delicious only provides the creation time data for the
# first bookmark in the list of bookmarks for a given day)
dategroups = bookmark_element.findAll("div", attrs={"class": "dateGroup"})
if dategroups:
spans = dategroups[0].findAll('span')
if spans:
date_str = spans[0].contents[0].strip()
timestamp = datetime.datetime.strptime(date_str, '%d %b %y')
# extract url, title and comments
url = u""
title = u""
comment = u""
datas = bookmark_element.findAll("div", attrs={"class": "data"})
if datas:
links = datas[0].findAll("a", attrs={"class": re.compile(r"^taggedlink\s*")})
if links and links[0].contents:
title = links[0].contents[0].strip()
url = links[0]['href']
divs = datas[0].findAll("div", attrs={"class": "description"})
if divs:
comment = divs[0].contents[0].strip()
# extract tags
url_tags = []
tagdisplays = bookmark_element.findAll("div", attrs={"class": "tagdisplay"})
if tagdisplays:
aset = tagdisplays[0].findAll("a", attrs={"class": "tag noplay"})
for a in aset:
tag = a.contents[0]
url_tags.append(tag)
bookmarks.append( (url, url_tags, title, comment, timestamp) )
return bookmarks
def get_user(self, username, password=None, max_bookmarks=50, sleep_seconds=1):
"""Retrieves a user's bookmarks from Delicious.com.
If a correct username AND password are supplied, a user's *full*
bookmark collection (which also includes private bookmarks) is
retrieved. Data communication is encrypted using SSL in this case.
If no password is supplied, only the *public* bookmarks of the user
are retrieved. Here, the parameter 'max_bookmarks' specifies how
many public bookmarks will be retrieved (default: 50). Set the
parameter to 0 to retrieve all public bookmarks.
This function can be used to backup all of a user's bookmarks if
called with a username and password.
@param username: The Delicious.com username.
@type username: str
@param password: Optional, default: None.
The user's Delicious.com password. If password is set,
all communication with Delicious.com is SSL-encrypted.
@type password: unicode/str
@param max_bookmarks: Optional, default: 50.
See the documentation of get_bookmarks() for more
information, as get_user() uses get_bookmarks() to
retrieve a user's public bookmarks.
The parameter is NOT used when a password is specified
because in this case the *full* bookmark collection of
a user will be retrieved.
@type max_bookmarks: int
@param sleep_seconds: Optional, default: 1.
See the documentation of get_bookmarks() for more information, as
get_user() uses get_bookmarks() to retrieve a user's public
bookmarks. sleep_seconds must be >= 1 to comply with Delicious.com's
Terms of Use.
@type sleep_seconds: int
@return: DeliciousUser instance
"""
assert username
user = DeliciousUser(username)
bookmarks = []
if password:
# We have username AND password, so we call
# the official Delicious.com API.
path = "/v1/posts/all"
data = self._query(path, host="api.del.icio.us", use_ssl=True, user=username, password=password)
if data:
soup = BeautifulSoup(data)
elements = soup.findAll("post")
for element in elements:
url = element["href"]
title = element["description"] or u""
comment = element["extended"] or u""
tags = []
if element["tag"]:
tags = element["tag"].split()
timestamp = datetime.datetime.strptime(element["time"], "%Y-%m-%dT%H:%M:%SZ")
bookmarks.append( (url, tags, title, comment, timestamp) )
user.bookmarks = bookmarks
else:
# We have only the username, so we extract data from
# the user's JSON feed. However, the feed is restricted
# to the user's most recent public bookmarks (about 100
# at most). So if we need more than 100, we scrape the
# Delicious.com website directly.
if max_bookmarks > 0 and max_bookmarks <= 100:
path = "/v2/json/%s?count=100" % username
data = self._query(path, host="feeds.delicious.com", user=username)
if data:
posts = []
try:
posts = simplejson.loads(data)
except TypeError:
pass
for post in posts:
# reset for each post so that a missing key does not carry over
# values from the previous entry
url = timestamp = None
title = comment = u""
tags = []
# url
try:
url = post['u']
except KeyError:
pass
# title
try:
title = post['d']
except KeyError:
pass
# tags
try:
tags = post['t']
except KeyError:
pass
if not tags:
tags = [u"system:unfiled"]
# comment / notes
try:
comment = post['n']
except KeyError:
pass
# bookmark creation time
try:
timestamp = datetime.datetime.strptime(post['dt'], "%Y-%m-%dT%H:%M:%SZ")
except KeyError:
pass
bookmarks.append( (url, tags, title, comment, timestamp) )
user.bookmarks = bookmarks[:max_bookmarks]
else:
# TODO: retrieve the first 100 bookmarks via JSON before
# falling back to scraping the delicious.com website
user.bookmarks = self.get_bookmarks(username=username, max_bookmarks=max_bookmarks, sleep_seconds=sleep_seconds)
return user
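# Examples (username and password are illustrative; 'dapi' is a
# DeliciousAPI instance):
#
#   # public bookmarks only (JSON feed and/or scraping):
#   user = dapi.get_user("someuser", max_bookmarks=100)
#
#   # full collection including private bookmarks, via the official
#   # API over SSL:
#   user = dapi.get_user("someuser", password="secret")
#   print len(user.bookmarks)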
def get_urls(self, tag=None, popular=True, max_urls=100, sleep_seconds=1):
"""
Returns the list of recent URLs (of web documents) tagged with a given tag.
This is very similar to parsing Delicious' RSS/JSON feeds directly,
but this function will return up to 2,000 links compared to a maximum
of 100 links when using the official feeds (with query parameter
count=100).
The return list of links will be sorted by recency in descending order,
i.e. newest items first.
Note that even when setting max_urls, get_urls() cannot guarantee that
it can retrieve *at least* this many URLs. It is really just an upper
bound.
@param tag: Retrieve links which have been tagged with the given tag.
If tag is not set (default), links will be retrieved from the
Delicious.com front page (aka "delicious hotlist").
@type tag: unicode/str
@param popular: If true (default), retrieve only popular links (i.e.
/popular/<tag>). Otherwise, the most recent links tagged with
the given tag will be retrieved (i.e. /tag/<tag>).
As of January 2009, it seems that Delicious.com modified the list
of popular tags to contain only up to a maximum of 15 URLs.
This also means that setting max_urls to values larger than 15