Need to archive a site to static HTML? Here's a Bash script for converting your dynamic site (primarily Plone) into a functional, offline collection of files. Powered by Wget, it slices, dices and mixes your currently-hosted site into a set of files that you can either view locally or host on any web server.
Read the Disclaimer further down this page for a number of warnings.
./wget_plone.sh SITE_URL [username] [password]
When executed with a username and password, the script attempts to authenticate with the site and obtain a session cookie for access. When used without login credentials, the site is copied anonymously.
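The two modes can be pictured roughly as follows. This is a minimal sketch, not the contents of wget_plone.sh: the helper names, the cookies.txt file name, and the login_form path with its __ac_name and __ac_password fields are assumptions based on a stock Plone login form, and the Wget flags are only an illustrative selection. The functions print the commands rather than run them, so nothing is fetched.

```shell
#!/usr/bin/env bash
# Sketch only: prints the wget commands a run might use; nothing is executed.

# Hypothetical cookie jar name.
COOKIE_JAR="cookies.txt"

# Print the login request (assumes a stock Plone login form).
print_login_command() {
    local site=${1%/} user=$2 pass=$3
    printf 'wget --save-cookies %q --keep-session-cookies --post-data %q -O /dev/null %q\n' \
        "$COOKIE_JAR" \
        "__ac_name=${user}&__ac_password=${pass}&form.submitted=1" \
        "${site}/login_form"
}

# Print the recursive mirror request; reuse the cookie only when logged in.
print_mirror_command() {
    local site=$1 mode=$2
    local -a args=(--recursive --level=inf --page-requisites
                   --convert-links --adjust-extension --no-parent)
    if [ "$mode" = "authenticated" ]; then
        args+=(--load-cookies "$COOKIE_JAR")
    fi
    printf 'wget %s %q\n' "${args[*]}" "$site"
}

# Decide the mode from the number of arguments, as the usage line implies.
plan_run() {
    if [ "$#" -ge 3 ]; then
        print_login_command "$1" "$2" "$3"
        print_mirror_command "$1" authenticated
    else
        print_mirror_command "$1" anonymous
    fi
}
```

Calling plan_run with only a URL prints a single mirror command, while adding a username and password prints the login request first. Credentials containing characters such as & or = would need URL-encoding before being placed in the POST data.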
- Recent version of Wget (tested with Wget 1.15)
- Recent version of Bash (tested on 4.3.11 on Ubuntu <= 14.04)
- Plone site to archive (tested on Plone 2.5.x, Plone 3.x, Plone 4.x, Plone 5.x)
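A quick pre-flight check of the first two requirements might look like the sketch below. The function names are hypothetical, and the tested versions are treated as a baseline that triggers a warning rather than as hard minimums, since the list only says what the script was tested with.

```shell
#!/usr/bin/env bash
# Sketch: compare installed versions against the tested versions listed above.

# Return 0 when dotted version $1 is greater than or equal to $2.
version_ge() {
    local IFS=.
    local -a have=($1) want=($2)
    local i h w
    for ((i = 0; i < ${#have[@]} || i < ${#want[@]}; i++)); do
        h=${have[i]:-0}
        w=${want[i]:-0}
        if ((10#$h > 10#$w)); then return 0; fi
        if ((10#$h < 10#$w)); then return 1; fi
    done
    return 0
}

# Warn (do not abort) when a tool is older than the tested version.
check_requirements() {
    local bash_version wget_version
    bash_version="${BASH_VERSINFO[0]}.${BASH_VERSINFO[1]}.${BASH_VERSINFO[2]}"
    version_ge "$bash_version" 4.3.11 ||
        echo "warning: Bash $bash_version is older than the tested 4.3.11" >&2

    if ! command -v wget >/dev/null 2>&1; then
        echo "error: wget not found" >&2
        return 1
    fi
    # The first line of output looks like "GNU Wget 1.15 built on linux-gnu."
    wget_version=$(wget --version | head -n 1 | awk '{print $3}')
    version_ge "$wget_version" 1.15 ||
        echo "warning: Wget $wget_version is older than the tested 1.15" >&2
}
```

Running check_requirements before an archive run surfaces an outdated toolchain early instead of partway through a long crawl.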
- Plone 2.5.x, 3.x, 4.x, 5.x
- Joomla 1.5 (authentication not supported)
Don't forget to set your portal_css entries to render as Link. Without this, Wget won't be able to see your CSS. (This does not appear to be necessary for 5.x.)
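One way to confirm the change took effect is to count how a rendered page references its stylesheets. The sketch below assumes that entries not set to Link show up as `@import url(...)` rules inside a style tag; the function names are hypothetical and the regular expressions are deliberately simple.

```shell
#!/usr/bin/env bash
# Sketch: count stylesheet references of each kind in HTML read from stdin.

# Stylesheets referenced through <link rel="stylesheet" ...> tags.
count_css_links() {
    grep -Eio '<link[^>]*rel="stylesheet"' | wc -l | tr -d ' '
}

# Stylesheets pulled in through @import url(...) rules (assumed non-Link output).
count_css_imports() {
    grep -Eio '@import[[:space:]]+url\(' | wc -l | tr -d ' '
}
```

For example, `wget -q -O - "$SITE_URL" | count_css_imports` should print 0 once every entry has been switched over.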
Disable any portal actions or the like that you don't want shown on your offline site.
The script disables all dynamic content (e.g. login forms, search) but you may wish to remove parts of your pages before downloading. It'll save you time!
Seriously consider whether you need Calendar portlets. They add custom links to every page the portlet appears on, so it's wise to disable them where possible. If you don't, you'll be downloading for a long time (if not forever).
- To get all the content in a Plone 2.5 site, you'll need to enable the Contents tab for folders. Head to the ZMI -> portal_actions -> turn Contents visible. The tab will now appear like it does in Plone 3 and above.
If using with a username and password, create a special user account with the Reader role only or the static_backup role described below. Wget'ing your site with an Administrative user will have disastrous consequences!
Turn off automatic user folder creation to prevent issues with the special user's folder.
For Plone 5.x the best approach to get the whole site is to create a new role at the site root called something like static_backup. Then assign the following permissions to the role.
Make sure that "Acquire permission settings" is unchecked for the following permissions:
- Add portal member
- Content rules: Manage rules
- Copy or Move
- Delete own comments
- Reply to item
- Set own password
- Set own properties
- Use external editor
- plone.resource: Export ZIP file
- plone.resourceeditor: Manage Sources
Be aware that unchecking any of these may change the access people have on your site. Make sure that at least the Manager and Site Administrator roles have these permissions checked. Ideally, these changes should not cause problems. If they do, the permissions will have to be reset after using this script.
The following permissions should be checked for the static_backup role:
- ATContentTypes: View history
- Access contents information
- CMFEditions: Access previous versions
- List folder contents
- View
- View History
Once this role exists, create a user with this role only. This user will be able to see everything on the site, but not do anything to the site. This is the user to log in as when using the backup script.
If you have any custom workflows, make sure that the static_backup role has all of the permissions listed previously in all states.
If you've got user folders you want to grab, remove index_html from the Members folder by using the ZMI. This takes out the special view the Members folder has and lets you pick a normal layout for link spidering. (This does not appear necessary in 5.x.)
If you're archiving a subdirectory site (eg plone.example.org/foobar/) then you need to make sure that your home links end with a slash. You'll need to hack this file (or similar):

    --- plone.app.layout-2.0-py2.6.egg/plone/app/layout/globals/portal.py
    @memoize_contextless
    def navigation_root_url(self):
        rootPath = self.navigation_root_path()
        return self.request.physicalPathToURL(rootPath)+'/'

Specifically, make sure you add the +'/' to the return statement. This will force all root navigation links to end with a slash, resolving issues with Wget's traversal.
For Plone 2.5 or earlier, you'll need to modify Products/CMFPlone/browser/plone.py and find the navigationRootUrl method. Add the same +'/' to the end of the return statement. You'll also need to modify any other parts of Plone that don't use this method. These currently include:
- /portal_skins/plone_portlets/portlet_navigation/manage_main
  - Change href root/absolute_url; to href string:${root/absolute_url}/;
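The same trailing-slash concern applies to the URL handed to the script. If the crawl uses Wget's --no-parent option (an assumption about how the script is run), Wget treats http://plone.example.org/foobar as a file at the server root, whereas http://plone.example.org/foobar/ is treated as a directory to stay within. A minimal sketch of normalising the argument, using a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: make sure a site URL ends with exactly the slash Wget needs.
# Does not handle URLs carrying a query string or fragment.
ensure_trailing_slash() {
    case $1 in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}
```

For example, `SITE_URL=$(ensure_trailing_slash "$1")` near the top of a wrapper script would normalise whatever the user typed.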
This script can potentially be very damaging if used incorrectly. This script uses recursive wget, which means it will spider every link it finds. This will be fine for anonymous users and public views of sites. However, given Plone offers content and administrative controls for logged-in users, hitting every link will likely move/rename/delete content, change site settings, and, in general, be a very bad thing.
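To make the risk concrete, the sketch below flags URLs that would change state if a logged-in crawler followed them. The action names in the pattern are an illustrative, non-exhaustive list drawn from common Plone actions; the variable and function names are hypothetical, and nothing here is claimed to be part of wget_plone.sh. A filter like this is no substitute for a read-only account.

```shell
#!/usr/bin/env bash
# Sketch: recognise links that modify content when requested by a logged-in user.

# Illustrative, non-exhaustive list of state-changing URL segments.
DANGEROUS_URL_REGEX='/(delete_confirmation|object_rename|object_cut|object_copy|object_paste|folder_delete|content_status_modify|logout)([/?#]|$)'

# Return 0 when the URL ends in, or continues past, one of the segments above.
is_dangerous_url() {
    [[ $1 =~ $DANGEROUS_URL_REGEX ]]
}
```

From Wget 1.14 onward, a pattern like this could also be passed to the --reject-regex option so that such links are never requested.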
Recommendation to protect your site:
Rather than running the backup against the active site, make a complete duplicate of the Plone folder (usually called 'Plone' by default) under a different name at the same level in the directory structure (this keeps all relative links working). Edit the buildout.cfg to change the port to a local port that is not in use. Run ./bin/buildout to apply the new port. Start the copy of the site in foreground mode so that you can see any problems: ./bin/plonectl fg. Use a local web browser to access the site at localhost:XXXX/SiteName, where XXXX is the port number, and check that the user you are going to use for the backup can log in. Now you can run ./wget_plone.sh SITE_URL [username] [password] against the copy.
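The port change in this procedure can be scripted. The sketch below assumes the port is set by an `http-address = ...` line in buildout.cfg, which may not match every buildout; the function name, the Plone-archive directory, port 8099, and the backup_user account are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: point a duplicated buildout at a different port.

# Rewrite the "http-address = ..." line in a buildout config (assumed key name).
set_buildout_port() {
    local cfg=$1 port=$2 tmp
    tmp=$(mktemp) || return 1
    sed -E "s/^(http-address[[:space:]]*=[[:space:]]*).*/\\1${port}/" "$cfg" > "$tmp" &&
        cat "$tmp" > "$cfg"   # overwrite in place, keeping the file's permissions
    local status=$?
    rm -f "$tmp"
    return "$status"
}

# Intended order of operations (paths, port, and user are hypothetical):
#   cp -a Plone Plone-archive
#   set_buildout_port path/to/buildout.cfg 8099
#   ./bin/buildout && ./bin/plonectl fg     # from the copy's buildout directory
#   ./wget_plone.sh http://localhost:8099/SiteName/ backup_user 'password'
```

Only the copy's configuration is touched, so the live instance keeps running on its original port.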
If you do want an internal view of a Plone instance, then create a Reader or static_backup account and use this. If you use a Reader account you will want to check that someone with Reader access doesn't get some extra permissions if you've customised things like your workflow's security.
This tool is designed for Plone so it may or may not work with other types of sites.
In any case, absolutely no warranty is given for its suitability.