Guilty Spark

A very basic search engine for my static website. It consists of an indexer that runs over the pre-rendered markdown+frontmatter documents, and a small server that responds to queries with a list of matching documents, based on a combination of the corpus saved by the indexer and a bibliography saved by the static site generator.

This has been tested with my set of Hugo websites, but should work with any frontmatter-based static site generator with a little work (e.g., Jekyll, In-Context, etc.), as it relies on the site generator outputting a support file so that Guilty Spark doesn't have to guess how names get changed during generation.

Admin side usage

There are three stages here, two of which you'll most likely do offline, and a last stage that needs to run on a live server somewhere:

  1. Indexing the files for terms - this is done by the Guilty Spark indexer command.
  2. Generating a site bibliography JSON - this should be generated by your static site generator.
  3. The search engine service - this is another Guilty Spark command that you'll run on a server on the internet, and which needs access to the output generated by the previous two stages.

The Indexer

This tool will read through all your frontmatter and markdown files and generate an index that can be used to search for pages. You invoke this like so:

swift run indexer [PATH TO STATIC SITE CONTENT DIRECTORY] [FILENAME.json]

If you have multiple sites you can generate a corpus per site.

You can get the corpus to your website however you like, but the trick I do is to have the output file end up in the public directory of my static site generator so it gets uploaded at the same time I upload everything else.
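
As a rough sketch of the per-site corpus idea, the loop below runs the indexer once per site and drops each corpus into that site's public directory. The `<site>/content` and `<site>/public` layout, the `index.json` filename, and the `INDEXER` override (handy for a dry run) are all assumptions for illustration rather than anything Guilty Spark requires:

```sh
#!/bin/sh
# Sketch only: the <site>/content and <site>/public layout is an assumption,
# as is the INDEXER override, which is not part of Guilty Spark itself.
index_site() {
    site_dir=$1
    # Write the corpus into the generator's public directory so that it is
    # uploaded along with the rest of the site.
    ${INDEXER:-swift run indexer} "$site_dir/content" "$site_dir/public/index.json"
}

index_all() {
    # Index every site directory given as an argument, stopping on failure.
    for dir in "$@"; do
        index_site "$dir" || return 1
    done
}
```

Run from the Guilty Spark checkout with absolute site paths, this might be invoked as `index_all /path/to/site-one /path/to/site-two`; setting `INDEXER=echo` first prints the arguments the indexer would receive instead of running it.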

Bibliography

Guilty Spark assumes there is a helping hand from the static site builder, in the form of a bibliography JSON file that has an entry for every searchable page. This saves Guilty Spark from having to guess how your static site generator will mangle names etc. It just has to match the origin field in the bibliography to that in the index. An example bibliography file looks like:

{
	"pages": [
		{
  			"title": "Some notes",
  			"link": "/blog/some-notes/",
  			"date": "2022-04-21T09:13:56Z",
  			"synopsis": "A look at stuff.",
  			"thumbnail": {
				"1x": "120x120_fit_box.png",
				"2x": "240x240_fit_box.png"
  			},
  			"tags": [
				"stuff",
				"notes"
  			],
  			"origin": "blog/some-notes/index.md"
		},
		... // rest of pages here.
	]
}

The thumbnail and synopsis fields are optional; the other fields are mandatory. An example template for generating this in Hugo would be:

{
  "pages": [{{- range $index, $page := .Site.RegularPages }}
	{{- if $index -}}, {{ end -}}
	{
	  "title": "{{ $page.Title }}",
	  "link": "{{ $page.RelPermalink }}",
	  "date": "{{ $page.Date.Format "2006-01-02T15:04:05Z0700" }}",
	  {{ if .Params.Synopsis -}}
	  "synopsis": "{{ $page.Params.Synopsis }}",
	  {{- end }}
	  {{- if .Params.titleimage -}}
	  "thumbnail": {
		{{ $image := $page.Resources.GetMatch $page.Params.titleimage -}}
		{{- $image1x := $image.Fit "120x120 png" -}}
		{{- $image2x := $image.Fit "240x240 png" -}}
		"1x": "{{ $image1x.RelPermalink }}",
		"2x": "{{ $image2x.RelPermalink }}"
	  },
	  {{- end }}
	  "tags": [
	  {{ if $page.Params.tags -}}
		{{- range $i, $e := sort $page.Params.tags -}}
		{{- if $i -}}, {{ end -}}
		"{{ $e }}"
		{{- end -}}
	  {{- end }}
	  ],
	  "origin": "{{ $page.File.Path }}"
	}
  {{ end -}}
  ]
}

You'll note that this uses some custom frontmatter entries, notably tags, which is a list of tag keywords, synopsis, which is a summary bit of text, and titleimage which is the name of the image for a thumbnail - you should change this to suit your particular setup.

The bibliography JSON file that this generates will be needed alongside the index JSON you generated in the first stage.
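
Since the template is hand-written, it can be worth checking the generated file before uploading it. The following is a minimal sketch, assuming `jq` is installed; `check_bibliography` is a hypothetical helper, not part of Guilty Spark, and it only checks that the mandatory fields listed above are present on every page entry:

```sh
#!/bin/sh
# Hypothetical helper: exits with status 0 only if every entry in .pages
# has all of the mandatory fields (title, link, date, tags, origin).
check_bibliography() {
    jq -e '.pages | all(has("title") and has("link") and has("date")
                        and has("tags") and has("origin"))' "$1" >/dev/null
}
```

For example, `check_bibliography public/bibliography.json || echo "bibliography is incomplete"`, where the file path is whatever your generator writes to. A file that is not valid JSON also fails the check.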

The Librarian

The librarian is the name for the active service that you'll run on a web server somewhere. It doesn't need to be the same place you serve your static files from, but it must be somewhere that it can access the two JSON files you've already generated.

You'll need to write a configuration file that tells the librarian where the index and bibliography files are for each site, and gives each pair a name that identifies it (if you run just one site this seems a bit overkill, but I run the search for three static sites with one Guilty Spark instance this way).

An example config file might look like:

[
	{
		"corpusName": "mynameismwd",
		"corpusFilePath": "/var/www/mynameismwd.org/index.json",
		"bibliographyFilePath": "/var/www/mynameismwd.org/bibliography.json"
	},
	{
		"corpusName": "digitalflapjack",
		"corpusFilePath": "/var/www/digitalflapjack.com/index.json",
		"bibliographyFilePath": "/var/www/digitalflapjack.com/bibliography.json"
	},
	{
		"corpusName": "electricflapjack",
		"corpusFilePath": "/var/www/electricflapjack.com/index.json",
		"bibliographyFilePath": "/var/www/electricflapjack.com/bibliography.json"
	}
]

You can then run the librarian thus:

swift run librarian config.json

If you change any of the files, you will either need to restart the process, or you can send it a HUP signal and it will reload the data without interrupting existing requests.
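
Sending the HUP signal is just a `kill` call. A minimal sketch, where `reload_librarian` is a hypothetical wrapper and you are assumed to already know the librarian's process ID:

```sh
#!/bin/sh
# Hypothetical wrapper: ask a running librarian to reload its data.
# $1 = process ID of the librarian process.
reload_librarian() {
    kill -HUP "$1"
}
```

How you find the process ID depends on how you launch the service; something like `reload_librarian "$(pgrep -x librarian)"` may work, but that assumes the process shows up under the name `librarian` and that only one is running.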

Server side usage

Once you have the librarian running you can start to query it. The server responds on /search/ and expects the following query parameters:

  • q: this is the term you want to search for, and should be URL encoded for spaces etc.
  • corpus: this is the particular site definition you want to query.

The corpus parameter is optional: if you don't specify it, the librarian will instead look for an X-Corpus HTTP header. If the librarian finds neither, it'll respond with an error. If you run nginx or some other proxy in front of the librarian, you can use that to set the X-Corpus header:

	location /api/ {
		proxy_set_header X-Corpus mynameismwd;
		proxy_pass http://localhost:4242/;
	}

Currently the port 4242 is hardcoded.
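
To try the librarian directly, without a proxy in front of it, you can build the query URL yourself. The sketch below is one way to do that in plain shell; `urlencode` and `search_url` are hypothetical helpers, the encoder only handles ASCII input, and it assumes the librarian is listening on localhost:

```sh
#!/bin/sh
# Hypothetical helper: percent-encode everything except RFC 3986 unreserved
# characters. ASCII input only.
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        c=${s%"${s#?}"}   # first character of what is left
        s=${s#?}          # drop that character
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
    done
    printf '%s' "$out"
}

# Hypothetical helper: build a /search/ URL.
# $1 = query text, $2 = corpus name (optional).
search_url() {
    url="http://localhost:4242/search/?q=$(urlencode "$1")"
    if [ -n "${2:-}" ]; then
        url="$url&corpus=$(urlencode "$2")"
    fi
    printf '%s\n' "$url"
}
```

With those defined, `curl -s "$(search_url "server side swift" mynameismwd)"` passes the corpus as a query parameter, while `curl -s -H "X-Corpus: mynameismwd" "$(search_url swift)"` passes it as a header, which is what the nginx snippet above does on your behalf.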

The response from the server for a query will be a JSON list of responses:

curl "https://digitalflapjack.com/api/search/?q=swift" | jq
[
  {
	"origin": "blog/search_with_server_side_swift/index.md",
	"date": "2022-08-21T17:44:20Z",
	"tags": [
	  "IR",
	  "search",
	  "swift"
	],
	"synopsis": "Some notes from my first experiments in writing a light-weight server-side Swift project, as I build a search-engine for my various websites.",
	"title": "Building search with Server-Side Swift",
	"link": "/blog/search_with_server_side_swift/"
  },
  ...
]