
Remarks/issues with current weighted exploitation-exploration strategy #1

Malnammi opened this issue Jan 6, 2019 · 3 comments


Malnammi commented Jan 6, 2019

The current strategy assigns exploitation and exploration weights to clusters in the following manner:

[Image: exploitation weight formula]
This weight favors clusters with high activity and a high density of labeled data.

[Image: exploration weight formula]
This weight favors clusters with low coverage and high uncertainty. We also have the option of selecting exploration clusters randomly, or selecting a set of mutually dissimilar clusters.

  1. Consider only exploitation clusters with weights >= the exploitation threshold. Count the total number of unlabeled molecules here, M1 (all are predicted to be highly active).
  2. Similarly, consider only exploration clusters with weights >= the exploration threshold. Count the total number of unlabeled molecules here, M2.
  3. Based on the ratio of M1 to M2, allocate percentages of the batch size to exploitation and exploration.
  4. Now we sample clusters. The first cluster selected is the highest-weighted one. Then we rank the remaining clusters by λ · dissim_i + (1 − λ) · W_i, where dissim_i denotes cluster i's average dissimilarity to the already-selected clusters. That is, each time we select a cluster, we pick one that is dissimilar to those already selected. Alternatively, we could select a cluster that is dissimilar to ALL other clusters in our data. The two options can result in very different space-coverage biases.
  5. For each sampled cluster, we then sample instances within that cluster, either randomly or by selecting a set of dissimilar instances within some vicinity.
  6. Note that we might not be able to sample from all qualifying clusters because of budget constraints. We compute an estimate of the per-cluster budget based on instance counts, and we assign clusters either an equal or a proportional budget.
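Steps 1–4 above could be sketched roughly as follows. This is a minimal sketch, not the actual implementation: all function names, parameter names, and default values (thresholds of 0.5, λ = 0.5, a batch size of 96) are illustrative assumptions.

```python
import numpy as np

def allocate_batch(w_exploit, w_explore, unlabeled_counts,
                   exploit_threshold=0.5, explore_threshold=0.5,
                   batch_size=96):
    """Steps 1-3: threshold the cluster weights, count unlabeled molecules
    in the qualifying clusters, and split the batch proportionally."""
    m1 = unlabeled_counts[w_exploit >= exploit_threshold].sum()  # step 1
    m2 = unlabeled_counts[w_explore >= explore_threshold].sum()  # step 2
    if m1 + m2 == 0:
        return 0, batch_size  # pending issue 1: no qualifying clusters
    n_exploit = int(round(batch_size * m1 / (m1 + m2)))          # step 3
    return n_exploit, batch_size - n_exploit

def select_clusters(weights, dissim, n_select, lam=0.5):
    """Step 4: greedy selection. dissim[i, j] is the dissimilarity
    between clusters i and j; each remaining cluster is scored by
    lam * (avg dissimilarity to the selected set) + (1 - lam) * weight."""
    selected = [int(np.argmax(weights))]  # highest-weighted cluster first
    while len(selected) < n_select:
        remaining = [i for i in range(len(weights)) if i not in selected]
        scores = [lam * dissim[i, selected].mean() + (1 - lam) * weights[i]
                  for i in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```

The alternative in step 4 (dissimilarity to ALL other clusters) would replace `dissim[i, selected].mean()` with a precomputed global average, making the ranking independent of what has already been picked.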

The current code for this method is here: link
Hyperparameter configs are here: link

Here are some pending issues with this:

  1. What do we do if we have no qualifying exploitation or exploration clusters (or both) in steps 1 and 2? Should we just select the top 50% of clusters based on weights?
  2. In step 2, should we exclude qualifying exploitation clusters from the exploration clusters? That is, once a cluster becomes a candidate exploitation cluster, it would no longer be considered an exploration cluster. The alternative would be to still allow a very uncovered cluster with a single highly predicted active molecule to be considered for both exploitation and exploration.
  3. How do we incorporate costs? See issue Incorporating cost #2
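Pending issue 2 comes down to a choice between two mask operations. A small sketch with hypothetical, made-up cluster weights (the 0.5 thresholds are also assumed):

```python
import numpy as np

# Hypothetical cluster weights; 0.5 used as both thresholds.
w_exploit = np.array([0.8, 0.3, 0.6, 0.7])
w_explore = np.array([0.2, 0.9, 0.7, 0.1])
exploit_mask = w_exploit >= 0.5
explore_mask = w_explore >= 0.5

# Option A: mutually exclusive -- qualifying exploitation clusters are
# removed from the exploration candidates.
explore_only = explore_mask & ~exploit_mask

# Option B: overlapping -- here cluster 2 qualifies for both roles.
both = explore_mask & exploit_mask
```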
This was referenced Jan 9, 2019

agitter commented Jan 18, 2019

@Malnammi I have a question about the exploitation weight. The weight increases as the cluster coverage increases. At some point, wouldn't we want there to be diminishing returns for an active cluster?


Malnammi commented Jan 19, 2019

@agitter my idea for the exploitation weight was:

  • Activity_i: the mean of the highly active predictions within that cluster. Highly active predictions are defined as those exceeding some threshold.
  • Coverage_i: the fraction of labeled (versus unlabeled) molecules in the cluster. With more coverage (more labeled molecules), the model might be more confident/robust in that part of the space.

During the computation of exploitation weights, if a cluster has no highly active predictions (none exceeding the threshold), its Activity_i defaults to zero. In other words, it will be weighted entirely by its coverage (i.e., W_i_exploit <= 0.5), and it will be outranked by any cluster with Activity_i > 0.
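That behavior can be sketched for a single cluster as follows; the equal 0.5/0.5 combination is an assumption (chosen to be consistent with W_i_exploit <= 0.5 whenever Activity_i = 0), and all names are illustrative.

```python
import numpy as np

def exploitation_weight(preds, coverage, activity_threshold=0.5):
    """Illustrative per-cluster exploitation weight: an equal-weight
    combination of Activity_i and Coverage_i. The 0.5/0.5 split is an
    assumption, not necessarily the actual implementation."""
    high = preds[preds >= activity_threshold]
    activity = high.mean() if high.size else 0.0  # defaults to zero
    return 0.5 * activity + 0.5 * coverage
```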

This raises another issue: what do we do if all the clusters have Activity_i = 0? Do we weight based on Coverage_i alone? Or do we stop exploiting and focus more on exploration until our model becomes more confident?

We discussed that activity prediction ranges are model dependent; e.g., small datasets typically give a low range of predictions ([0, 0.4] for random forest). The current implementation has a temporary remedy for this: we set the thresholding parameter using a quantile rather than an absolute value. Specifically, with a quantile of 0.5, the highly active unlabeled molecules are those whose predictions are >= the median unlabeled prediction.
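A minimal sketch of that quantile-based thresholding, with hypothetical prediction values from a model with a compressed output range:

```python
import numpy as np

# Hypothetical unlabeled predictions from a model with a low range.
preds = np.array([0.05, 0.1, 0.2, 0.3, 0.35])

# With a quantile of 0.5, the threshold is the median prediction, so the
# "highly active" unlabeled molecules are those at or above the median.
threshold = np.quantile(preds, 0.5)
highly_active = preds >= threshold
```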


agitter commented Jan 19, 2019

I see, so the coverage is used to estimate confidence, not diminishing returns.

what do we do if all the clusters have Activity_i = 0?

My initial thought is that it would make sense to focus on exploration, as you suggested.

For the activity prediction ranges, this temperature scaling method is the one Jay tested: https://arxiv.org/pdf/1706.04599.pdf. I'm not certain that it is relevant for us.
