
Add key transform functions for massive rate limit improvements #205

Open · wants to merge 24 commits into master
Conversation


@v-stickykeys commented Nov 12, 2021

Without sharding in S3, it is unreasonable from a performance and cost perspective to use it as an IPFS datastore.

This PR adds customizable sharding methods, matching the implementation of go-ds-flatfs and compatible with js-datastore-core (also js-datastore-s3).

This incorporates the suggestion made by @whyrusleeping #195 (comment) to allow developers to decide the layout of their data on S3.
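As a rough illustration of the idea (this sketch uses plain strings instead of the `ds.Key` type from go-datastore, and the "suffix/2" name is invented here for illustration, not taken from the PR), a key transform maps a flat datastore key to a sharded S3 object key:

```go
package main

import (
	"fmt"
	"strings"
)

// keyTransforms maps a transform name to a function that turns a
// datastore key such as "/abcd" into an S3 object key, optionally
// adding a prefix so blocks spread across many S3 prefixes.
var keyTransforms = map[string]func(string) string{
	// "default": store the key unchanged (no sharding).
	"default": func(k string) string { return k },
	// "suffix/2": shard by the last two characters of the key.
	"suffix/2": func(k string) string {
		name := strings.TrimPrefix(k, "/")
		return name[len(name)-2:] + "/" + name
	},
}

func main() {
	fmt.Println(keyTransforms["suffix/2"]("/abcd")) // cd/abcd
}
```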

obo20 and others added 12 commits August 16, 2021 17:40
This changes the way the go-ds-s3 plugin stores data in s3 in an effort to drastically improve speed. Instead of storing data all in the /ipfs directory (which counts as one prefix), we're storing things in a way so each block gets its own prefix.
comma fix
* feat: add key transformation method for sharding

* fix: use offset for next to last sharding

welcome bot commented Nov 12, 2021

Thank you for submitting this PR!
A maintainer will be here shortly to review it.
We are super grateful, but we are also overloaded! Help us by making sure that:

  • The context for this PR is clear, with relevant discussion, decisions
    and stakeholders linked/mentioned.

  • Your contribution itself is clear (code comments, self-review for the
    rest) and in its best form. Follow the code contribution guidelines
    if they apply.

Getting other community members to do a review would be great help too on complex PRs (you can ask in the chats/forums). If you are unsure about something, just leave us a comment.
Next steps:

  • A maintainer will triage and assign priority to this PR, commenting on
    any missing things and potentially assigning a reviewer for high
    priority items.

  • The PR gets reviewed, discussed, and approved as needed.

  • The PR is merged by maintainers when it has been approved and comments addressed.

We currently aim to provide initial feedback/triaging within two business days. Please keep an eye on any labelling actions, as these will indicate priorities and status of your contribution.
We are very grateful for your contribution!

@v-stickykeys
Author

I wasn't able to test this unless the module name matched my repo name, so that is one thing I will change (I may need some guidance to figure out how to do this properly, as I'm new to Go!)

@BigLep
Contributor

BigLep commented Nov 12, 2021

@MichaelMure : are you able to review this?

@obo20

obo20 commented Nov 12, 2021

From conversations with @whyrusleeping, it's my understanding that the next-to-last/2 method (and other filesystem sharding methods) was originally implemented to better scale Linux-based filesystems.

@achingbrain would you be able to share the reasons for sharding things out in the way you did in js-datastore-s3? Did you run into similar scalability issues or were you just trying to follow the existing filesystem sharding patterns of go-ipfs?

Also, @aschmahmann / @BigLep: as we move towards go-ipfs 0.12.0, does the plan to "Switch to raw multihashes in datastores (including datastore version upgrade)" require any special consideration here? If possible, I'd love to make sure any compatibility issues are addressed pre-emptively.

@Stebalien
Member

@obo20 see #105 and #199 (specifically the second one).

@Stebalien
Member

Oh, wait, sorry. You already know that.

For sharding blocks, "next to last" is probably fine. Although I'd personally just use a cheap hash function.

@achingbrain
Member

would you be able to share the reasons for sharding things out in the way you did in js-datastore-s3? Did you run into similar scalability issues or were you just trying to follow the existing filesystem sharding patterns of go-ipfs?

@jacobheun did the initial implementation so perhaps he can speak to the reason, but I get the feeling it was just following the principles set out by other datastore implementations.

@jacobheun

@jacobheun did the initial implementation so perhaps he can speak to the reason, but I get the feeling it was just following the principles set out by other datastore implementations.

This is correct: it was just following the patterns of the other JS datastore implementations at the time (2018); it's just a lightweight interface to S3. @achingbrain, I believe the sharding behavior changed in ipfs/js-datastore-s3#30, but it's not managed by the S3 datastore itself.

@v-stickykeys
Author

@MichaelMure I believe this is ready for review. PTAL when you get the chance!

@v-stickykeys
Author

@BigLep, given no word so far from @MichaelMure, is there someone else who might be available to review this?

@BigLep
Contributor

BigLep commented Jan 14, 2022

Similar response as in #195 (comment)

Core IPFS maintainers aren't planning to take this on, as we aren't using this ourselves. We need to figure out a way to set up and denote community membership for this and other datastore repos. We're game to set up others as the maintainers here.

Maybe pair up with @obo20 and review each other's code (@obo20 has #195 open) or ask for other volunteers in the ipfs-operators Slack channel?

@lidel lidel requested a review from guseggert February 18, 2022 16:57
}

var KeyTransforms = map[string]func(ds.Key) string{
	"default": func(k ds.Key) string {
Contributor

Calling this "default" isn't very descriptive, how about "identity" or something like that?

Comment on lines +69 to +74
"next-to-last/2": func(k ds.Key) string {
	s := k.String()
	offset := 1
	start := len(s) - 2 - offset
	return s[start:start+2] + "/" + s
},
Contributor

I think there are some problems here

datastore.NewKey("a") -> panic
datastore.NewKey("ab") -> /a//ab
datastore.NewKey("abc") -> ab//abc
datastore.NewKey("abcd") -> bc//abcd
  • Panics on keys w/ len == 1, should return an error instead
  • Keys w/ len == 2 have a slash in their prefix (should probably return an error instead?)
  • Keys w/ len > 2 have two slashes instead of one

A small unit test would be nice here.

Author

Thanks for this, @guseggert! In practice I am not seeing the double-slash issue, and just running through the logic I don't see how that would be possible, but I'm new to Go 🙈.

However for keys of len 2 or smaller I agree this should be handled differently.
For keys of length 3 or more it still seems correct given a param of 2.

Contributor

@guseggert commented Mar 2, 2022

All I did was add this test to print out the results, should help you reproduce:

func TestTransforms(t *testing.T) {
	transform := KeyTransforms["next-to-last/2"]
	keys := []string{"ab", "abc", "abcd"}
	for _, k := range keys {
		fmt.Printf("datastore.NewKey(\"%s\") -> %s\n", k, transform(datastore.NewKey(k)))
	}
}

Full test:

func TestTransforms(t *testing.T) {
	cases := []struct {
		transform     string
		key           string
		expectedS3Key string
	}{
		{
			transform:     "next-to-last/2",
			key:           "ab",
			expectedS3Key: "ab/ab",
		},
		{
			transform:     "next-to-last/2",
			key:           "abc",
			expectedS3Key: "ab/abc",
		},
		{
			transform:     "next-to-last/2",
			key:           "abcd",
			expectedS3Key: "bc/abcd",
		},
	}
	for _, c := range cases {
		t.Run(fmt.Sprintf("%s %s", c.transform, c.key), func(t *testing.T) {
			transform := KeyTransforms[c.transform]
			dsKey := datastore.NewKey(c.key)
			s3Key := transform(dsKey)
			if s3Key != c.expectedS3Key {
				t.Fatalf("expected %s, got %s", c.expectedS3Key, s3Key)
			}
		})
	}
}

output:

go test -v -run TestTransforms\$ . 
=== RUN   TestTransforms
=== RUN   TestTransforms/next-to-last/2_ab
    s3_test.go:105: expected ab/ab, got /a//ab
=== RUN   TestTransforms/next-to-last/2_abc
    s3_test.go:105: expected ab/abc, got ab//abc
=== RUN   TestTransforms/next-to-last/2_abcd
    s3_test.go:105: expected bc/abcd, got bc//abcd
--- FAIL: TestTransforms (0.00s)
    --- FAIL: TestTransforms/next-to-last/2_ab (0.00s)
    --- FAIL: TestTransforms/next-to-last/2_abc (0.00s)
    --- FAIL: TestTransforms/next-to-last/2_abcd (0.00s)
FAIL
FAIL	github.com/3box/go-ds-s3	0.004s
FAIL
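The failing output is explained by the leading "/" that `ds.Key` strings carry, which both shifts the prefix window and produces the double slash. A corrected sketch (the function name and error handling here are illustrative, not the PR's actual code) that matches the reviewer's expected values for keys of length 3 and up, and returns an error instead of panicking on shorter keys, could look like:

```go
package main

import (
	"fmt"
	"strings"
)

// nextToLast2 shards a key by the two characters preceding its last
// character. It strips the leading "/" that datastore key strings
// include (the cause of the double-slash bug above) and rejects keys
// too short to shard rather than panicking.
func nextToLast2(key string) (string, error) {
	name := strings.TrimPrefix(key, "/")
	if len(name) < 3 {
		return "", fmt.Errorf("key %q too short for next-to-last/2", key)
	}
	start := len(name) - 3 // two characters, skipping the last one
	return name[start:start+2] + "/" + name, nil
}

func main() {
	for _, k := range []string{"/abc", "/abcd"} {
		s, _ := nextToLast2(k)
		fmt.Println(s) // ab/abc, then bc/abcd
	}
}
```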

Comment on lines +22 to +25
localBucketName, localBucketNameSet := os.LookupEnv("LOCAL_BUCKET_NAME")
if !localBucketNameSet {
	localBucketName = fmt.Sprintf("localbucketname%d", time.Now().UnixNano())
}
Contributor

go fmt?

Region: "local",
AccessKey: "test",
SecretKey: "testdslocal",
KeyTransform: "default",
Contributor

Could you add tests for the other transforms?

@@ -1,7 +1,7 @@
 package main
 
 import (
-	plugin "github.com/ipfs/go-ds-s3/plugin"
+	plugin "github.com/3box/go-ds-s3/plugin"
Contributor

Suggested change:
-	plugin "github.com/3box/go-ds-s3/plugin"
+	plugin "github.com/ipfs/go-ds-s3/plugin"

@@ -3,7 +3,7 @@ package plugin
 import (
 	"fmt"
 
-	s3ds "github.com/ipfs/go-ds-s3"
+	s3ds "github.com/3box/go-ds-s3"
Contributor

Suggested change:
-	s3ds "github.com/3box/go-ds-s3"
+	s3ds "github.com/ipfs/go-ds-s3"

@@ -4,7 +4,7 @@ import (
 	"reflect"
 	"testing"
 
-	s3ds "github.com/ipfs/go-ds-s3"
+	s3ds "github.com/3box/go-ds-s3"
Contributor

Suggested change:
-	s3ds "github.com/3box/go-ds-s3"
+	s3ds "github.com/ipfs/go-ds-s3"

Comment on lines +94 to +95
`next-to-last/2`
- Shards by storing block data based on the second to last 2 characters of its key. E.g. `CX/CIQJ7IHPGOFUJT5UMXIW6CUDSNH6AVKMEOXI3UM3VLYJRZUISUMGCXQ`
Contributor

Suggested change:
 `next-to-last/2`
 - Shards by storing block data based on the second to last 2 characters of its key. E.g. `CX/CIQJ7IHPGOFUJT5UMXIW6CUDSNH6AVKMEOXI3UM3VLYJRZUISUMGCXQ`
+- This is for compatibility with [js-datastore-s3](https://github.com/ipfs/js-datastore-s3)

Author

It's for compatibility with js-datastore-s3 and also go-ds-flatfs, so I don't think the note is really worth adding. Most of the datastores use some form of sharding; next-to-last/2 just happens to be the default for some.

@BigLep
Contributor

BigLep commented Mar 25, 2022

@v-stickykeys : are you going to incorporate the feedback? This looks like a good change that is worth landing.

@lidel lidel added the need/author-input Needs input from the original author label Apr 15, 2022
@BigLep
Contributor

BigLep commented Apr 22, 2022

@v-stickykeys : are you going to be able to incorporate the feedback? We'd like to support getting this landed.

@v-stickykeys
Author

Hey @BigLep, the short answer is yes, but honestly it hasn't been at the top of my priority list and is now more of a weekend project. My PR as it stands meets our needs, but I do have another branch with improvements addressing the review, which I can try to get wrapped up and pushed in the next week.

@BigLep
Contributor

BigLep commented Jun 3, 2022

@v-stickykeys: just checking in. Maintainers can take a look if you post a new revision with the comments incorporated.

* wip

* wip

* wip

* go back to 3box org
@eth-limo

This would be great to have!

@koxon
Contributor

koxon commented Dec 3, 2022

Hi guys.

I wrote the IPFS on AWS Fargate blog for AWS and I'm currently looking at making this S3 plugin work with the serverless architecture mentioned in the blog.

It would be a shame not to merge this.

S3 Sharding is key to making this a viable option compared to traditional drives.
With IPFS we're looking for high throughput. Low latency data access is less important.

Referencing this post from @mikeal, if we can achieve 40GB/s throughput per bucket we could run large clusters at a fraction of the cost. We can leverage S3 infrequent access or intelligent tiering to save on cost even more.

I will try this branch and I'm willing to help move this forward.

@guseggert, you've submitted several rounds of feedback. What is your assessment of the situation?
It looks like there is only minor work left to finalize this PR.

@v-stickykeys are you still around?

@BigLep can you add me as contributor to this PR?

@hsanjuan
Contributor

@koxon sorry that no one answered this... are you still interested in picking this up?

Right now I don't see any upside in letting users configure the number of prefixes between next-to-last/2, /3, or whatever... we might as well set it to a number like 128, or whatever maximizes the usage of S3, and leave it there, unless there is a good reason to use a worse number.
