This repository has been archived by the owner on Aug 28, 2021. It is now read-only.

Reintroduce IPFS chunkstore #3799

Closed · wants to merge 5 commits
Conversation

ORBAT commented Mar 20, 2018

This PR reintroduces the IPFS chunkstore. It also makes spec's external protocol registration thread-safe.

Commit 217ba38 by @aboodman stated that

I think the right way to handle IPFS is to have it outside the tree, so I'm hoping to work on refactoring the code out of our repo entirely.

but as it currently stands, cmd/noms can't interact with IPFS stores unless the IPFS code is in the main noms repo.

Some notes regarding this PR:

  • I only noticed d/try.go after using github.com/pkg/errors. What might actually be worthwhile is refactoring try.go to use pkg/errors (see e.g. this blog post), as getting textual information about error context is extremely handy
  • ChunkStoreTestSuite should probably be refactored so that it can be used for testing the IPFS chunkstore as well. I actually tried doing that (including adding a Factory for the IPFS chunkstore), but tests kept freezing so I scrapped it
  • I went with the functional option pattern as I expect there'll be a lot more options for the IPFS chunkstore. The SetPortIdx option should probably be complemented with separate options for setting the API/gateway/swarm ports (although only the swarm port seemed to be open when e.g. running ipfs-chat); see the sketch after this list
  • passing options for external protocols is clunky at the moment. It might be worthwhile to have Spec support e.g. query strings for passing parameters
  • I've added a vendor.json and package.json to make dealing with the vendored dependencies easier
  • the PR doesn't touch the actual code much, so e.g. pre-existing concurrency problems are still there
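
A minimal sketch of what separate per-port options could look like under the functional option pattern (the package name, config fields, and SetAPIPort are hypothetical illustrations, not the PR's actual API):

```go
package ipfs // hypothetical package name

import "fmt"

type config struct {
	apiPort, gatewayPort, swarmPort int
}

// Option configures the IPFS ChunkStore.
type Option func(c *config) error

// SetAPIPort is a hypothetical per-port option that could complement SetPortIdx.
func SetAPIPort(port int) Option {
	return func(c *config) error {
		if port <= 0 || port > 65535 {
			return fmt.Errorf("invalid API port %d", port)
		}
		c.apiPort = port
		return nil
	}
}
```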

Tom Eklöf added 2 commits March 20, 2018 11:40
Don't use gx for just _some_ of the IPFS imports
aboodman (Contributor) commented:

@ORBAT - I'm really sorry about the slow reply here. I completely missed the notification about this PR due to email config. Should be fixed now.

as it currently stands, cmd/noms can't interact with IPFS stores unless the IPFS code is in the main noms repo.

eep. good point.

The main reason to pull out the IPFS code was to work around some dependency issues we have experienced with it (e.g., #3786).

I'm not sure what the right solution to that is, but on further reflection, if people are going to be actively working with the IPFS support (e.g., you), then we might as well have the code in this tree for easier maintainability rather than keeping it separate.

So I support moving this back in.

That said, I don't think that it should use the 'external protocol' concept if it is linked into the binary.

# The standard vet set is atomic,bool,buildtags,nilfunc,printf. printf causes
# a lot of spurious failures, so leave that out.
test:
	$(GOTEST) -vet=atomic,bool,buildtags,nilfunc ./...
aboodman (Contributor):
This is a good idea, thanks. Can you mention it in the README?

ORBAT (Author):
Sure thing!

	})
}

// TODO: figure out a less grody way of passing options to the external protocol. Maybe even
// extend Spec so that it supports query strings for passing parameters.
aboodman (Contributor):
One way that we have done it in the past is to put it in a config file (see go/config), which works so long as you don't want to set them per-connection.
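
For illustration only, a hypothetical .nomsconfig along those lines (go/config's real schema defines db aliases; the [ipfs] section and its keys are invented here):

```toml
[db.chat]
url = "ipfs:/path/to/ipfs/repo"

# Hypothetical section: options the IPFS chunkstore could read at startup.
[ipfs]
local = true
portIdx = 4
```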

ORBAT (Author):
At least personally I won't be needing per-connection settings, so the config file should work fine.

c := &config{maxConcurrent: 1}
for _, opt := range opts {
	if err := opt.apply(c); err != nil {
		return nil, errors.Wrap(err, "error in configuration")
aboodman (Contributor):
I agree that we should move to this. Our error handling has gotten messy and fragmented and now that there's a standard approach to what we were trying to do we should just do that.

ORBAT (Author), Mar 26, 2018:
I'll take a look at refactoring d/try.go to use github.com/pkg/errors, it seems like it should be fairly painless. I'll plonk it in a separate PR once I get around to it.
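
For concreteness, a minimal sketch of the pkg/errors style in question (loadConfig is a made-up example function, not code from the PR):

```go
package main

import (
	"fmt"
	"os"

	"github.com/pkg/errors"
)

// loadConfig wraps at each layer so the message carries context, while
// errors.Cause can still recover the original error at the top.
func loadConfig(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return errors.Wrap(err, "opening config")
	}
	defer f.Close()
	return nil
}

func main() {
	if err := loadConfig("/no/such/file"); err != nil {
		fmt.Println(err)               // "opening config: open /no/such/file: ..."
		fmt.Println(errors.Cause(err)) // the underlying *os.PathError
	}
}
```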

	return errors.Wrap(r.SetConfig(rc), "error setting IPFS repo config")
}

// An Option configures the IPFS ChunkStore
aboodman (Contributor):
This is cool, I was not familiar with this pattern.

}

// SetLocal makes the ChunkStore only use the local IPFS blockstore for both reads and writes.
func SetLocal() Option {
aboodman (Contributor):
Why the distinction between optFunc and Option? Why not just return optFunc?

aboodman (Contributor):
Also, why not:

func SetLocal(c *config) error {
...
}

...rather than returning a new function?

ORBAT (Author):
Option should actually be refactored into type Option func(*config) error, as I didn't end up needing anything but plain ol' functions for the options.

All the option-setting functions return Option to have them all appear under the Option type in the documentation; see e.g. zap's docs for an example
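
A sketch of that refactor (the config struct here is an approximation of the one in the diff):

```go
package ipfs // hypothetical package name

type config struct{ local bool } // approximation of the PR's config struct

// Option configures the IPFS ChunkStore.
type Option func(*config) error

// SetLocal makes the ChunkStore use only the local IPFS blockstore. It returns
// an Option rather than being one so that every option is grouped under the
// Option type in godoc, as zap does.
func SetLocal() Option {
	return func(c *config) error {
		c.local = true
		return nil
	}
}
```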

//
// The external protocol is implemented by the Protocol type.
func RegisterProtocols(opts ...Option) error {
	if err := register(NetworkProtoName, false, opts...); err != nil {
aboodman (Contributor):
I don't think that there is any value to using external protocols system if the code is going to be linked into the binary anyway. Might as well just put it back as a peer to the other protocols.

ORBAT (Author):
Yeah, I started out with the protocol before realizing I have no way of sanely registering it from the noms binary. Might as well yank it out.


ORBAT commented Mar 26, 2018

Hi, thanks for the reply and the feedback; no worries about how long it took.

I'll check the current IPFS dependencies and see if #3786 is still an issue. Getting rid of gx dependencies may or may not be an option, dunno yet.


ORBAT commented Apr 3, 2018

Hi @aboodman, I've got a couple of questions, with likely more incoming since I'm trying to learn the ropes of noms internals, so please bear with me.

Blockstore / Blocks

What was the reasoning behind having ipfs-local use the Blockstore, instead of creating an offline node with core.BuildCfg{Online: false}? Was the intention to have the node always be online so the pubsub subsystem will still work regardless of whether the user wants the actual blocks to be exposed to the wider IPFS network?

Do blocks stored directly via the Blockstore API never get exposed to the network? To me it seems like the BlockService always gets initialized with a reference to the Blockstore, and the Exchange is networked unless cfg.Offline is true. So if I've understood this correctly, adding blocks via Blockstore does expose them to the network; it's just that the current IPFS chunkstore code limits its ability to get blocks from the network by only using Blockstore in "local mode."

The reason I'm asking this is that it'd be neat to have just one protocol, ipfs, for both the local and networked use cases, with the configuration defining whether to enable networking or not. As far as I can tell, this'd remove the need for the different code paths for local vs networked chunk stores (if cs.c.local { ... } else { ... }). Of course, having a non-networked node would mean the pubsub system wouldn't work, so that'd require some refactoring of the IPFS example.
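
For reference, roughly what constructing a non-networked node looks like with the go-ipfs core API of that era (import paths and field names may differ between versions; the repo path is a placeholder):

```go
package main

import (
	"context"
	"log"

	"github.com/ipfs/go-ipfs/core"
	"github.com/ipfs/go-ipfs/repo/fsrepo"
)

func main() {
	r, err := fsrepo.Open("/path/to/.ipfs") // placeholder repo path
	if err != nil {
		log.Fatal(err)
	}
	node, err := core.NewNode(context.Background(), &core.BuildCfg{
		Online: false, // no networking: no bitswap exchange, and no pubsub either
		Repo:   r,
	})
	if err != nil {
		log.Fatal(err)
	}
	_ = node // the node's blockstore would then serve purely local reads/writes
}
```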

Concurrency

What was the reasoning behind limiting concurrency to 1 in the chat example? Is it mainly to limit the number of concurrent readers/writers of the root hash file? Or is the Blockstore not thread-safe? I haven't looked into that yet, and IPFS's docs didn't seem to mention anything about thread-safety at a quick glance.

Chunkstore roots / Rebase

I'm a bit unclear on the concept of root and Rebase.

I sort of understand root's meaning for datasets and databases, but what's its meaning in the chunkstore context? The database root is Map<String, Ref<Commit>> pointing to datasets, but if I've understood correctly, multiple databases can be backed by the same chunkstore. Does a chunkstore root point to 1+ database roots?

What is Rebase used for, in both the database and chunkstore contexts? As in, in what sort of situations would one need to do a rebase? I've seen it used in e.g. the memory stores with views, which is understandable (brings the view in line with the "top level" memory store), but I'm not clear on its use in other cases.


aboodman commented Jun 8, 2018

@ORBAT - I'm so sorry (again).

I'm not sure why I keep missing these mails. If I'm unresponsive, please just ping the thread again. I've been trying to spend more time on Noms these last few weeks, and I'd love to have contributions. I think right now the traffic is just low enough that it's easy to miss in my inbox.

You can also just shake my tree at any other channel (e.g., aboodman on Twitter). But hopefully now that I'm spending time on this again, it won't continue to happen.

Regarding your questions:

more incoming since I'm trying to learn the ropes of noms internals

Awesome.

What was the reasoning behind having ipfs-local use the Blockstore, instead of creating an offline node with core.BuildCfg{Online: false}? Was the intention to have the node always be online so the pubsub subsystem will still work regardless of whether the user wants the actual blocks to be exposed to the wider IPFS network?

It's hard to remember, and it's possible we didn't even explore the Online flag. But it's true that we would have wanted the pubsub system to work in both modes.

The ipfs/ipfs-local distinction was a bit of a quick hack, and with more time I hope we could do something better. IIRC (and my memory is very foggy) the issue was that Noms has a deep assumption that a DAG is always complete. Meaning that if a particular chunk is present, then all chunks reachable from it will also be present.

Because of the magic of the IPFS network, with ipfs we would frequently end up with chunks present locally but not their children. This was bad news for Noms.

I don't think we ever really got the ipfs scheme working well.

Do blocks stored directly via the Blockstore API never get exposed to the network? To me it seems like the BlockService always gets initialized with a reference to the Blockstore, and the Exchange is networked unless cfg.Offline is true. So if I've understood this correctly, adding blocks via Blockstore does expose them to the network; it's just that the current IPFS chunkstore code limits its ability to get blocks from the network by only using Blockstore in "local mode."

The latter was the goal IIRC. Perhaps the special case in the put path is not necessary!

The reason I'm asking this is that it'd be neat to have just one protocol, ipfs, for both the local and networked use cases, with the configuration defining whether to enable networking or not. As far as I can tell, this'd remove the need for the different code paths for local vs networked chunk stores (if cs.c.local { ... } else { ... }). Of course, having a non-networked node would mean the pubsub system wouldn't work, so that'd require some refactoring of the IPFS example.

I actually came to the conclusion over time that the ipfs path was not very useful. If I were doing this in a principled way I would just be using pubsub and libp2p directly, and then having them talk to a normal nbs database.

Concurrency

What was the reasoning behind limiting concurrency to 1 in the chat example? Is it mainly to limit the number of concurrent readers/writers of the root hash file? Or is the Blockstore not thread-safe? I haven't looked into that yet, and IPFS's docs didn't seem to mention anything about thread-safety at a quick glance.

No idea! Can you give me a pointer to the code in question?

Chunkstore roots / Rebase

I'm a bit unclear on the concept of root and Rebase.

I'm so sad I didn't catch this question faster. Cool to have someone digging in.

I sort of understand root's meaning for datasets and databases, but what's its meaning in the chunkstore context? The database root is Map<String, Ref<Commit>> pointing to datasets, but if I've understood correctly, multiple databases can be backed by the same chunkstore. Does a chunkstore root point to 1+ database roots?

It's not the case that multiple databases can be backed by one ChunkStore.

The difference between ChunkStore and Database is basically that Database knows about types.Value and ChunkStore doesn't.

So when you write something to Database the thing you write is a types.Value. Database does all kinds of validation to ensure the system at the end of the day still looks like a valid database. It makes sure that when you commit a value, all the values it depends on are present.

ChunkStore doesn't do any of that. You can, erm, put chunks in it, and get them out. The only other thing of significance that you can do with it is read/write to a single unnamed mutable register. Writes to this register have certain concurrency guarantees on which the entire system depends.

Basically if you think of a stack of components, with your code at the top, and the OS at the bottom, Noms is in the middle. The ChunkStore interface is the boundary between the OS and Noms. It represents the minimum requirements Noms has on the underlying storage system in order to function properly.

Database is the boundary between Noms and you. So whereas ChunkStore's design goal is to be absolutely minimal, Database's design goal is to be high-level, useful, ergonomic, etc.
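
To make that shape concrete, here is a trimmed, approximate paraphrase of the contract described above: content-addressed puts and gets, plus one mutable register with compare-and-swap semantics. The authoritative definition lives in the noms chunks package and has more methods; this is not it.

```go
package sketch

import (
	"github.com/attic-labs/noms/go/chunks"
	"github.com/attic-labs/noms/go/hash"
)

// ChunkStoreSketch paraphrases the contract; the real interface is
// chunks.ChunkStore.
type ChunkStoreSketch interface {
	Get(h hash.Hash) chunks.Chunk // fetch a chunk by its content hash
	Has(h hash.Hash) bool
	Put(c chunks.Chunk)

	Root() hash.Hash                     // read the single unnamed mutable register
	Commit(current, last hash.Hash) bool // CAS write: fails if the register has moved
	Rebase()                             // re-read the register from underlying storage
}
```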

What is Rebase used for, in both the database and chunkstore contexts? As in, in what sort of situations would one need to do a rebase? I've seen it used in e.g. the memory stores with views, which is understandable (brings the view in line with the "top level" memory store), but I'm not clear on its use in other cases.

Rebase() is useful occasionally when you know that the underlying storage may have changed and you want to bring yourself up to date with it.

Think of a webserver that is servicing requests. If this server caches a single database instance across all requests (which would be a reasonable thing to do), then the server would want to rebase the db before each request to ensure it is seeing the latest state of the world.

The reason it is called Rebase is because, like git rebase, it brings along its current state. So for example if you do Put a few times, then call Rebase, then call Commit, the chunks that you put will still be visible to commit. Whereas if you just dropped the db and built a new one, you'd lose that state. I think there is also some cached state that comes along for the ride too, which makes Rebase more efficient than simply building a new database instance, but I can't recall the details.
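
A sketch of that webserver pattern, assuming the attic-labs/noms datas API (the handler and the "app" dataset name are illustrative):

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/attic-labs/noms/go/datas"
)

// One long-lived Database cached across all requests.
var db datas.Database // assume initialized once at startup

func handler(w http.ResponseWriter, r *http.Request) {
	// Bring the cached instance up to date with any commits made by other
	// writers since the last request, without rebuilding the Database.
	db.Rebase()
	ds := db.GetDataset("app") // "app" is an illustrative dataset name
	if head, ok := ds.MaybeHeadValue(); ok {
		fmt.Fprintf(w, "%v\n", head)
	}
}
```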

===

HTH, and hope you're still interested.


ORBAT commented Mar 5, 2019

Closing this for now since I've moved on to other projects

@ORBAT ORBAT closed this Mar 5, 2019