This repository has been archived by the owner on Aug 28, 2021. It is now read-only.

Reintroduce IPFS chunkstore #3799

Closed · wants to merge 5 commits
Conversation

ORBAT commented Mar 20, 2018

This PR reintroduces the IPFS chunkstore. It also makes spec's external protocol registration thread-safe.

Commit 217ba38 by @aboodman stated that

I think the right way to handle IPFS is to have it outside the tree, so I'm hoping to work on refactoring the code out of our repo entirely.

but as it currently stands, cmd/noms can't interact with IPFS stores unless the IPFS code is in the main noms repo.

Some notes regarding this PR:

  • I only noticed d/try.go after using github.com/pkg/errors. What might actually be worthwhile is refactoring try.go to use pkg/errors (see e.g. this blog post), as getting textual information about error context is extremely handy
  • ChunkStoreTestSuite should probably be refactored so that it can be used for testing the IPFS chunkstore as well. I actually tried doing that (including adding a Factory for the IPFS chunkstore), but tests kept freezing so I scrapped it
  • I went with the functional option pattern as I expect there'll be a lot more options for the IPFS chunkstore. The SetPortIdx option should probably be complemented with separate options for setting the API/gateway/swarm ports (although only the swarm port seemed to be open when e.g. running ipfs-chat); see the sketch after this list
  • passing options for external protocols is clunky at the moment. It might be worthwhile to have Spec support e.g. query strings for passing parameters
  • I've added a vendor.json and package.json to make dealing with the vendored dependencies easier
  • the PR doesn't touch the actual code much, so e.g. pre-existing concurrency problems are still there
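
A minimal sketch of what separate per-port options could look like under the functional option pattern (the package name, config fields, and SetAPIPort are hypothetical illustrations, not the PR's actual API):

```go
package ipfs // hypothetical package name

import "fmt"

type config struct {
	apiPort, gatewayPort, swarmPort int
}

// Option configures the IPFS ChunkStore.
type Option func(c *config) error

// SetAPIPort is a hypothetical per-port option that could complement SetPortIdx.
func SetAPIPort(port int) Option {
	return func(c *config) error {
		if port <= 0 || port > 65535 {
			return fmt.Errorf("invalid API port %d", port)
		}
		c.apiPort = port
		return nil
	}
}
```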

Tom Eklöf added 2 commits March 20, 2018 11:40
Don't use gx for just _some_ of the IPFS imports
aboodman (Contributor) commented:

@ORBAT - I'm really sorry about the slow reply here. I completely missed the notification about this PR due to email config. Should be fixed now.

as it currently stands, cmd/noms can't interact with IPFS stores unless the IPFS code is in the main noms repo.

eep. good point.

The main reason to pull out the IPFS code was to work around some dependency issues we have experienced with it (e.g., #3786).

I'm not sure what the right solution to that is, but on further reflection, if people are going to be actively working with the IPFS support (e.g., you), then we might as well have the code in this tree for easier maintainability rather than keeping it separate.

So I support moving this back in.

That said, I don't think that it should use the 'external protocol' concept if it is linked into the binary.

# The standard vet set is atomic,bool,buildtags,nilfunc,printf. printf causes
# a lot of spurious failures, so leave that out.
test:
	$(GOTEST) -vet=atomic,bool,buildtags,nilfunc ./...
aboodman (Contributor):
This is a good idea, thanks. Can you mention it in the README?

ORBAT (Author):
Sure thing!

	})
}

// TODO: figure out a less grody way of passing options to the external protocol. Maybe even
// extend Spec so that it supports query strings for passing parameters.
aboodman (Contributor):
One way that we have done it in the past is to put it in a config file (see go/config), which works so long as you don't want to set them per-connection.
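
For illustration only, a hypothetical .nomsconfig along those lines (go/config's real schema defines db aliases; the [ipfs] section and its keys are invented here):

```toml
[db.chat]
url = "ipfs:/path/to/ipfs/repo"

# Hypothetical section: options the IPFS chunkstore could read at startup.
[ipfs]
local = true
portIdx = 4
```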

ORBAT (Author):
At least personally I won't be needing per-connection settings, so the config file should work fine.

c := &config{maxConcurrent: 1}
for _, opt := range opts {
	if err := opt.apply(c); err != nil {
		return nil, errors.Wrap(err, "error in configuration")
aboodman (Contributor):
I agree that we should move to this. Our error handling has gotten messy and fragmented and now that there's a standard approach to what we were trying to do we should just do that.

ORBAT (Author), Mar 26, 2018:
I'll take a look at refactoring d/try.go to use github.com/pkg/errors, it seems like it should be fairly painless. I'll plonk it in a separate PR once I get around to it.
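
For concreteness, a minimal sketch of the pkg/errors style in question (loadConfig is a made-up example function, not code from the PR):

```go
package main

import (
	"fmt"
	"os"

	"github.com/pkg/errors"
)

// loadConfig wraps at each layer so the message carries context, while
// errors.Cause can still recover the original error at the top.
func loadConfig(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return errors.Wrap(err, "opening config")
	}
	defer f.Close()
	return nil
}

func main() {
	if err := loadConfig("/no/such/file"); err != nil {
		fmt.Println(err)               // "opening config: open /no/such/file: ..."
		fmt.Println(errors.Cause(err)) // the underlying *os.PathError
	}
}
```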

	return errors.Wrap(r.SetConfig(rc), "error setting IPFS repo config")
}

// An Option configures the IPFS ChunkStore
aboodman (Contributor):
This is cool, I was not familiar with this pattern.

}

// SetLocal makes the ChunkStore only use the local IPFS blockstore for both reads and writes.
func SetLocal() Option {
aboodman (Contributor):
Why the distinction between optFunc and Option? Why not just return optFunc?

aboodman (Contributor):
Also, why not:

func SetLocal(c *config) error {
...
}

...rather than returning a new function?

ORBAT (Author):
Option should actually be refactored into type Option func(*config) error, as I didn't end up needing anything but plain ol' functions for the options.

All the option-setting functions return Option to have them all appear under the Option type in the documentation; see e.g. zap's docs for an example
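
A sketch of that refactor (the config struct here is an approximation of the one in the diff):

```go
package ipfs // hypothetical package name

type config struct{ local bool } // approximation of the PR's config struct

// Option configures the IPFS ChunkStore.
type Option func(*config) error

// SetLocal makes the ChunkStore use only the local IPFS blockstore. It returns
// an Option rather than being one so that every option is grouped under the
// Option type in godoc, as zap does.
func SetLocal() Option {
	return func(c *config) error {
		c.local = true
		return nil
	}
}
```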

//
// The external protocol is implemented by the Protocol type.
func RegisterProtocols(opts ...Option) error {
	if err := register(NetworkProtoName, false, opts...); err != nil {
aboodman (Contributor):
I don't think that there is any value to using external protocols system if the code is going to be linked into the binary anyway. Might as well just put it back as a peer to the other protocols.

ORBAT (Author):
Yeah, I started out with the protocol before realizing I have no way of sanely registering it from the noms binary. Might as well yank it out.


ORBAT commented Mar 26, 2018

Hi, thanks for the reply and the feedback; no worries about how long it took.

I'll check the current IPFS dependencies and see if #3786 is still an issue. Getting rid of gx dependencies may or may not be an option, dunno yet.


ORBAT commented Apr 3, 2018

Hi @aboodman, I've got a couple of questions, with likely more incoming since I'm trying to learn the ropes of noms internals, so please bear with me.

Blockstore / Blocks

What was the reasoning behind having ipfs-local use the Blockstore, instead of creating an offline node with core.BuildCfg{Online: false}? Was the intention to have the node always be online so the pubsub subsystem will still work regardless of whether the user wants the actual blocks to be exposed to the wider IPFS network?

Do blocks stored directly via the Blockstore API never get exposed to the network? To me it seems like the BlockService always gets initialized with a reference to the Blockstore, and the Exchange is networked unless cfg.Offline is true. So if I've understood this correctly, adding blocks via Blockstore does expose them to the network; it's just that the current IPFS chunkstore code limits its ability to get blocks from the network by only using Blockstore in "local mode."

The reason I'm asking this is that it'd be neat to have just one protocol, ipfs, for both the local and networked use cases, with the configuration defining whether to enable networking or not. As far as I can tell, this'd remove the need for the different code paths for local vs networked chunk stores (if cs.c.local { ... } else { ... }). Of course, having a non-networked node would mean the pubsub system wouldn't work, so that'd require some refactoring of the IPFS example.
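
For reference, roughly what constructing a non-networked node looks like with the go-ipfs core API of that era (import paths and field names may differ between versions; the repo path is a placeholder):

```go
package main

import (
	"context"
	"log"

	"github.com/ipfs/go-ipfs/core"
	"github.com/ipfs/go-ipfs/repo/fsrepo"
)

func main() {
	r, err := fsrepo.Open("/path/to/.ipfs") // placeholder repo path
	if err != nil {
		log.Fatal(err)
	}
	node, err := core.NewNode(context.Background(), &core.BuildCfg{
		Online: false, // no networking: no bitswap exchange, and no pubsub either
		Repo:   r,
	})
	if err != nil {
		log.Fatal(err)
	}
	_ = node // the node's blockstore would then serve purely local reads/writes
}
```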

Concurrency

What was the reasoning behind limiting concurrency to 1 in the chat example? Is it mainly to limit the number of concurrent readers/writers of the root hash file? Or is the Blockstore not thread-safe? I haven't looked into that yet, and IPFS's docs didn't seem to mention anything about thread-safety at a quick glance.

Chunkstore roots / Rebase

I'm a bit unclear on the concept of root and Rebase.

I sort of understand root's meaning for datasets and databases, but what's its meaning in the chunkstore context? The database root is Map<String, Ref<Commit>> pointing to datasets, but if I've understood correctly, multiple databases can be backed by the same chunkstore. Does a chunkstore root point to 1+ database roots?

What is Rebase used for, in both the database and chunkstore contexts? As in, in what sort of situations would one need to do a rebase? I've seen it used in e.g. the memory stores with views, which is understandable (brings the view in line with the "top level" memory store), but I'm not clear on its use in other cases.


aboodman commented Jun 8, 2018

@ORBAT - I'm so sorry (again).

I'm not sure why I keep missing these mails. If I'm unresponsive, please just ping the thread again. I've been trying to spend more time on Noms these last few weeks, and I'd love to have contributions. I think right now the traffic is just low enough that it's easy to miss in my inbox.

You can also just shake my tree at any other channel (e.g., aboodman on Twitter). But hopefully now that I'm spending time on this again, it won't continue to happen.

Regarding your questions:

more incoming since I'm trying to learn the ropes of noms internals

Awesome.

What was the reasoning behind having ipfs-local use the Blockstore, instead of creating an offline node with core.BuildCfg{Online: false}? Was the intention to have the node always be online so the pubsub subsystem will still work regardless of whether the user wants the actual blocks to be exposed to the wider IPFS network?

It's hard to remember, and it's possible we didn't even explore the Online flag. But it's true that we would have wanted the pubsub system to work in both modes.

The ipfs/ipfs-local distinction was a bit of a quick hack, and with more time I hope we could do something better. IIRC (and my memory is very foggy) the issue was that Noms has a deep assumption that a DAG is always complete. Meaning that if a particular chunk is present, then all chunks reachable from it will also be present.

Because of the magic of the IPFS network, with ipfs we would frequently end up with chunks present locally but not their children. This was bad news for Noms.

I don't think we ever really got the ipfs scheme working well.

Do blocks stored directly via the Blockstore API never get exposed to the network? To me it seems like the BlockService always gets initialized with a reference to the Blockstore, and the Exchange is networked unless cfg.Offline is true. So if I've understood this correctly, adding blocks via Blockstore does expose them to the network; it's just that the current IPFS chunkstore code limits its ability to get blocks from the network by only using Blockstore in "local mode."

The latter was the goal IIRC. Perhaps the special case in the put path is not necessary!

The reason I'm asking this is that it'd be neat to have just one protocol, ipfs, for both the local and networked use cases, with the configuration defining whether to enable networking or not. As far as I can tell, this'd remove the need for the different code paths for local vs networked chunk stores (if cs.c.local { ... } else { ... }). Of course, having a non-networked node would mean the pubsub system wouldn't work, so that'd require some refactoring of the IPFS example.

I actually came to the conclusion over time that the ipfs path was not very useful. If I were doing this in a principled way I would just be using pubsub and libp2p directly, and then having them talk to a normal nbs database.

Concurrency

What was the reasoning behind limiting concurrency to 1 in the chat example? Is it mainly to limit the number of concurrent readers/writers of the root hash file? Or is the Blockstore not thread-safe? I haven't looked into that yet, and IPFS's docs didn't seem to mention anything about thread-safety at a quick glance.

No idea! Can you give me a pointer to the code in question?

Chunkstore roots / Rebase

I'm a bit unclear on the concept of root and Rebase.

I'm so sad I didn't catch this question faster. Cool to have someone digging in.

I sort of understand root's meaning for datasets and databases, but what's its meaning in the chunkstore context? The database root is Map<String, Ref<Commit>> pointing to datasets, but if I've understood correctly, multiple databases can be backed by the same chunkstore. Does a chunkstore root point to 1+ database roots?

It's not the case that multiple databases can be backed by one ChunkStore.

The difference between ChunkStore and Database is basically that Database knows about types.Value and ChunkStore doesn't.

So when you write something to Database the thing you write is a types.Value. Database does all kinds of validation to ensure the system at the end of the day still looks like a valid database. It makes sure that when you commit a value, all the values it depends on are present.

ChunkStore doesn't do any of that. You can, erm, put chunks in it, and get them out. The only other thing of significance that you can do with it is read/write to a single unnamed mutable register. Writes to this register have certain concurrency guarantees on which the entire system depends.

Basically if you think of a stack of components, with your code at the top, and the OS at the bottom, Noms is in the middle. The ChunkStore interface is the boundary between the OS and Noms. It represents the minimum requirements Noms has on the underlying storage system in order to function properly.

Database is the boundary between Noms and you. So whereas ChunkStore's design goal is to be absolutely minimal, Database's design goal is to be high-level, useful, ergonomic, etc.
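
To make that shape concrete, here is a trimmed, approximate paraphrase of the contract described above: content-addressed puts and gets, plus one mutable register with compare-and-swap semantics. The authoritative definition lives in the noms chunks package and has more methods; this is not it.

```go
package sketch

import (
	"github.com/attic-labs/noms/go/chunks"
	"github.com/attic-labs/noms/go/hash"
)

// ChunkStoreSketch paraphrases the contract; the real interface is
// chunks.ChunkStore.
type ChunkStoreSketch interface {
	Get(h hash.Hash) chunks.Chunk // fetch a chunk by its content hash
	Has(h hash.Hash) bool
	Put(c chunks.Chunk)

	Root() hash.Hash                     // read the single unnamed mutable register
	Commit(current, last hash.Hash) bool // CAS write: fails if the register has moved
	Rebase()                             // re-read the register from underlying storage
}
```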

What is Rebase used for, in both the database and chunkstore contexts? As in, in what sort of situations would one need to do a rebase? I've seen it used in e.g. the memory stores with views, which is understandable (brings the view in line with the "top level" memory store), but I'm not clear on its use in other cases.

Rebase() is useful occasionally when you know that the underlying storage may have changed and you want to bring yourself up to date with it.

Think of a webserver that is servicing requests. If this server caches a single database instance across all requests (which would be a reasonable thing to do), then the server would want to rebase the db before each request to ensure it is seeing the latest state of the world.

The reason it is called Rebase is because, like git rebase, it brings along its current state. So for example if you do Put a few times, then call Rebase, then call Commit, the chunks that you put will still be visible to commit. Whereas if you just dropped the db and built a new one, you'd lose that state. I think there is also some cached state that comes along for the ride too, which makes Rebase more efficient than simply building a new database instance, but I can't recall the details.
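
A sketch of that webserver pattern, assuming the attic-labs/noms datas API (the handler and the "app" dataset name are illustrative):

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/attic-labs/noms/go/datas"
)

// One long-lived Database cached across all requests.
var db datas.Database // assume initialized once at startup

func handler(w http.ResponseWriter, r *http.Request) {
	// Bring the cached instance up to date with any commits made by other
	// writers since the last request, without rebuilding the Database.
	db.Rebase()
	ds := db.GetDataset("app") // "app" is an illustrative dataset name
	if head, ok := ds.MaybeHeadValue(); ok {
		fmt.Fprintf(w, "%v\n", head)
	}
}
```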

===

HTH, and hope you're still interested.


ORBAT commented Mar 5, 2019

Closing this for now since I've moved on to other projects

@ORBAT ORBAT closed this Mar 5, 2019