Feature Request - allow metadata in processing #14

Closed
dbinetti opened this issue Aug 21, 2014 · 6 comments

Comments

@dbinetti

Thanks for a great library!

I do have one feature request, however. It would be incredibly useful to allow for the addition of metadata that could be passed through during processing.

As a specific use case, I'd like to be able to input data along with an ID field, for instance as a list of tuples such as [(uid, data), (uid, data), ...]. The resulting clusters would then refer to the UID and not the data, such as [[uid, uid, uid], [uid, uid, uid, uid], ...], which would allow me to easily manipulate, store, and process the objects themselves. As it stands, I get the clusters I want, but I can no longer match them to the original data, and so I'm stuck... :-(

I've looked at the core code, and unfortunately making such a change and offering it as a pull request is beyond my abilities. I am hoping this might be a (somewhat) straightforward thing to do, and I'd be happy to help in any way I can.

@exhuma
Owner

exhuma commented Aug 22, 2014

I am not 100% certain that I understood your question. As I understand it, you want to be able to cluster objects that are more complex than simple numeric values.

This is already possible if you specify a distance function. The distance function takes two arguments and returns a float representing the relative distance between the two objects. The Euclidean distance is one example.

If you want to run the clustering on objects where you cannot simply do a-b, then you can pass in a distance function. For your example above this could be abs(a[1]-b[1]) (assuming you want to cluster on data, and not on uid). Of course you can pass in tuples with more elements as well.
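A minimal sketch of such a distance function for (uid, data) tuples. The name data_distance and the sample rows are hypothetical and only serve as illustration:

```python
def data_distance(a, b):
    """Distance between two (uid, data) tuples, ignoring the uid."""
    return float(abs(a[1] - b[1]))


# Hypothetical sample rows in the form (uid, data).
rows = [("a", 34), ("b", 78), ("c", 35)]

near = data_distance(rows[0], rows[2])  # compares 34 with 35
far = data_distance(rows[0], rows[1])   # compares 34 with 78
print(near, far)
```

The function would then be passed as the second argument to the clustering class, in the same position as the lambda in the example further down.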

As a more practical example, assume that we have an object holding a uid and the data to cluster on (stored as value in the code below). Further assume that we want to cluster on that data field by calculating the Euclidean distance. For the sake of this example, we use random integers as data and random strings as UIDs.

Then the code would become:

from cluster import HierarchicalClustering
from base64 import b64encode
from os import urandom
from pprint import pprint
from random import randint


class ObjectWithMetadata(object):
    """A value to cluster on, plus a random UID carried along as metadata."""

    def __init__(self, value):
        self.value = value
        # b64encode works on Python 2 and 3; str.encode('base64') is Python 2 only.
        self.uid = b64encode(urandom(10)).decode('ascii')

    def __repr__(self):
        return 'ObjectWithMetadata({!r}, {!r})'.format(self.value, self.uid)


data = [ObjectWithMetadata(randint(0, 1000))
        for _ in range(200)]

# The distance function compares only the value; the UID is ignored.
cl = HierarchicalClustering(data, lambda x, y: float(abs(x.value-y.value)))
clustered = cl.getlevel(10)
pprint(clustered)

Does this help you? Does it answer the question? If not, let me know.

@exhuma
Owner

exhuma commented Aug 22, 2014

While working on this, I have run into an issue with KMeansClustering (#15).

This seems to be a regression, because I think it worked in the past. So, if you are planning to use KMeans, you will still be blocked :(

I will have a look at this as quickly as possible.

@dbinetti
Author

Thank you so much for responding.

Unfortunately, I'm asking something different. Allow me to explain a bit more.

Assume I have a table of data that looks like this:

ID DATA
1 34
2 78
3 77
4 35
5 35
6 22

Of course, I want to cluster based on the data, and so I'd create a list of said data as:
data = [34, 78, 77, 35, 35, 22] and process it with python-cluster, resulting in a list of lists: [[34, 35, 35], [78, 77], [22]]. So far, great. But now what if I want to write a cluster identifier for each datum back to my data table? I no longer have the ID, and so I have to find another way to update the table.

The obvious answer is to apply the cluster based on the data itself (UPDATE row where DATA in [result]), but this can actually be more difficult than it seems if you're doing data transformations along the way and need to back everything out to the original raw value.

The ideal situation (for me, at least) would be to pass a list of tuples to the processor, as [(1, 34), (2, 78), (3, 77), (4, 35), (5, 35), (6, 22)], cluster based on the second item in each tuple (the data), and then have the output grouped by the first item in each tuple (the ID value), for example as [[1, 4, 5], [2, 3], [6]]. Then I can easily refer to the original value (or any other metadata related to that datum).
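As a rough sketch of that last step: if the clustering hands back the original (ID, data) tuples grouped into lists, the ID groups can be pulled out afterwards. The shape of clusters below is an assumption about the output, and both helper names are hypothetical:

```python
# Assumed result shape: each cluster is a list of the original (id, data) tuples.
clusters = [[(1, 34), (4, 35), (5, 35)], [(2, 78), (3, 77)], [(6, 22)]]


def ids_by_cluster(clusters):
    """Keep only the ID (first tuple item) of every clustered row."""
    return [[row[0] for row in cluster] for cluster in clusters]


def cluster_label_by_id(clusters):
    """Map each ID to the index of the cluster it landed in."""
    return {row[0]: label
            for label, cluster in enumerate(clusters)
            for row in cluster}


print(ids_by_cluster(clusters))
print(cluster_label_by_id(clusters))
```

The second helper gives one cluster label per ID, which is the form needed to write a cluster identifier back to each table row.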

OR perhaps I'm just looking at this all the wrong way... I am quite the newbie...

@dbinetti
Author

Wait, I think I need to re-read your answer more closely -- perhaps you are answering it for me? I don't intuitively understand the code. I'll try your example and see if that helps me understand...

@dbinetti
Author

OK, yes, this works. I do understand now how to do it. Sorry for the confusion, and thanks!

@exhuma
Owner

exhuma commented Aug 22, 2014

No problem. Always glad to help :)
