Add details to README.md about how exactly this extension works #13

Open
Korb opened this issue Aug 4, 2024 · 1 comment
Comments

Korb commented Aug 4, 2024

At the moment, the README does not give enough information to understand what exactly is meant by "the important content", or what criteria are used to find this content in the text of web pages.

Owner

dstein64 commented Aug 9, 2024

I implemented this a long time ago (over 9 years ago), and don't recall the details.

I browsed the code to review the algorithm.

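A sentence's importance is calculated by assigning a score to each word in the sentence and summing the scores. A word's score is based on how many sentences in the document contain it (log2 of that count, plus 1), so words that appear in more sentences score higher. The score of long sentences is then reduced by dividing by log2 of the sentence's word count, plus 1, to offset the advantage they get from simply having more words. The highest-scoring sentences are highlighted until they cover a fixed share of the page's text (for example, 25% when there are two highlight states).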
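As a rough illustration, the scoring step can be sketched on its own, following the `cth` function from content.js quoted in this comment. The names `tokenCounts` and `scoreSentences` are hypothetical, and `tokenCounts` is only a simplified stand-in for `NLP.tokenormalize`, whose actual behavior is not shown in this thread.

```javascript
// Hypothetical stand-in for NLP.tokenormalize: lowercases, splits on
// non-letters, and returns a Map of token -> count. The real function
// is not shown in this thread and may do more (e.g. stemming).
function tokenCounts(text) {
    const counts = new Map();
    for (const token of text.toLowerCase().split(/[^a-z]+/)) {
        if (token) counts.set(token, (counts.get(token) || 0) + 1);
    }
    return counts;
}

// Returns one score per sentence, in input order.
function scoreSentences(sentences) {
    const perSentence = sentences.map(tokenCounts);
    // Number of sentences each token appears in.
    const sentenceFreq = new Map();
    for (const counts of perSentence) {
        for (const token of counts.keys()) {
            sentenceFreq.set(token, (sentenceFreq.get(token) || 0) + 1);
        }
    }
    return perSentence.map(function(counts) {
        let score = 0;
        let size = 0;
        for (const [token, count] of counts) {
            score += count * (Math.log2(sentenceFreq.get(token)) + 1);
            size += count;
        }
        // Assumption for this sketch: an empty sentence scores 0.
        if (size === 0) return 0;
        // Length penalty: divide by log2(word count) + 1.
        return score / (Math.log2(size) + 1);
    });
}

const scores = scoreSentences([
    'Cats chase mice.',
    'Cats sleep.',
    'Dogs bark loudly at night.'
]);
console.log(scores);
```

In this toy input, "cats" appears in two sentences, so it contributes more than the other words, and the five-word sentence has its total divided by a larger penalty than the shorter ones.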

highlight/src/content.js

Lines 1067 to 1186 in 3bf1319

// return the candidates to highlight
// cth = candidates to highlight
const cth = function(highlightState, numHighlightStates) {
    // a candidate may be a TextBlock or a Sentence.
    const candidates = getCandidates();
    const scores = [];
    let _tohighlight = [];
    const cstems = candidates.map(function(c){
        return NLP.tokenormalize(c.text);});
    // term sentence frequency
    // (how many times a term appears in a sentence)
    const tsf = new Map();
    for (let i = 0; i < cstems.length; i++) {
        const stems = cstems[i];
        const _set = new Set(stems.keys());
        _set.forEach(function(stem) {
            if (tsf.has(stem)) {
                tsf.set(stem, tsf.get(stem) + 1);
            } else {
                tsf.set(stem, 1);
            }
        });
    }
    for (let i = 0; i < candidates.length; i++) {
        const candidate = candidates[i];
        const stems = cstems[i];
        let score = 0;
        stems.forEach(function(count, stem) {
            const tsfScore = Math.log2(tsf.get(stem)) + 1;
            score += (count * tsfScore);
        });
        // reduce score of long sentences
        // (being long will give them more weight above)
        let size = 0;
        for (c of stems.values()) {
            size += c;
        }
        const factor = 1.0 / (Math.log2(size) + 1);
        score *= factor;
        scores.push(new ScoredCandidate(candidate, score, i, null));
    }
    // calculating percentile based on ratio, and filtering could be more
    // elegant than sorting... and also wouldn't require sorting by index
    // at the end
    scores.sort(function(a, b) {
        return b.score - a.score;
    });
    // Maps number of highlight states to a map of highlight states to coverage ratios
    const ratio_lookup = {
        2: {1: 0.25},
        3: {1: 0.15, 2: 0.30},
        4: {1: 0.10, 2: 0.20, 3: 0.40}
    };
    let ratio = ratio_lookup[numHighlightStates][highlightState];
    if (HIGHLIGHT_ALL)
        ratio = 1; // debugging
    let totalChars = 0;
    for (let i = 0; i < scores.length; i++) {
        const scored = scores[i];
        const candidate = scored.candidate;
        totalChars += candidate.textLength;
    }
    let highlightCharCounter = 0;
    for (let i = 0; i < scores.length; i++) {
        const scored = scores[i];
        if (HIGHLIGHT_ALL) {
            scored.importance = 1;
        } else {
            for (let j = 1; j < numHighlightStates; ++j) {
                if (highlightCharCounter <= ratio_lookup[numHighlightStates][j] * totalChars) {
                    scored.importance = j;
                    break;
                }
            }
        }
        _tohighlight.push(scored);
        highlightCharCounter += scored.candidate.textLength;
        if (!HIGHLIGHT_ALL && highlightCharCounter > ratio * totalChars)
            break;
    }
    // highlighting breaks if not in pre-order traversal order
    _tohighlight.sort(function(a, b) {
        return a.index - b.index;
    });
    // if we're highlighting sentences, make sure we got at least one
    // sentence. Otherwise, we're probably on a navigational page. You
    // were looping over _tohighlight, but changed to looping over
    // everything. since you may only have one _tohighlight candidate,
    // which may not have a period.
    let haveOne = false;
    for (let i = 0; i < scores.length; i++) {
        const scored = scores[i];
        const cand = scored.candidate;
        if (cand instanceof Sentence && cand.hasEnd) {
            haveOne = true;
            break;
        }
    }
    if (!haveOne)
        _tohighlight = [];
    return _tohighlight;
};
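
The selection step in the code above (rank by score, take candidates until a share of the page's characters is covered, then restore document order) can be sketched separately. This is a simplified illustration, not the extension's code: `pickByCoverage` and the plain `{score, textLength}` objects are hypothetical, and the per-candidate `importance` levels and the `HIGHLIGHT_ALL` debugging path are left out.

```javascript
// Simplified sketch: candidates are plain objects with a score and a
// textLength; ratio is the target share of total characters to cover.
function pickByCoverage(candidates, ratio) {
    const totalChars = candidates.reduce(function(sum, c) {
        return sum + c.textLength;
    }, 0);
    // Remember each candidate's original position, then rank by score.
    const ranked = candidates
        .map(function(c, index) { return { ...c, index }; })
        .sort(function(a, b) { return b.score - a.score; });
    const picked = [];
    let covered = 0;
    for (const c of ranked) {
        picked.push(c);
        covered += c.textLength;
        // Stop once coverage exceeds the target, so the last pick
        // may overshoot the ratio.
        if (covered > ratio * totalChars) break;
    }
    // Restore document order for highlighting.
    return picked.sort(function(a, b) { return a.index - b.index; });
}

const picked = pickByCoverage([
    { score: 1, textLength: 10 },
    { score: 5, textLength: 10 },
    { score: 3, textLength: 10 },
    { score: 2, textLength: 10 }
], 0.25);
console.log(picked.map(function(c) { return c.index; }));
```

Because the loop only stops after the running total exceeds `ratio * totalChars`, the highlighted share can end up somewhat above the nominal ratio, which matches the `>` comparison in the quoted code.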
