Merge pull request #259 from nickmorri/nickmorri/handle-invalid-json-parsing

fix: Add `OpenGraphScraperOptions.jsonLDOptions.throwOnJSONParseError` and change default behavior to not throw on JSON-LD string parse errors
jshemas authored Jan 19, 2025
2 parents 686fcc5 + a55a6e1 commit 8380c04
Showing 4 changed files with 119 additions and 11 deletions.
41 changes: 31 additions & 10 deletions README.md
@@ -54,16 +54,17 @@ Check the return for a ```success``` flag. If success is set to true, then the u

## Options

| Name | Info | Default Value | Required |
|----------------------|----------------------------------------------------------------------------|---------------|----------|
| url | URL of the site. | | x |
| html | You can pass in an HTML string to run ogs on it. (use without options.url) | | |
| fetchOptions | Options that are used by the Fetch API | {} | |
| timeout | Request timeout for Fetch (Default is 10 seconds) | 10 | |
| blacklist | Pass in an array of sites you don't want ogs to run on. | [] | |
| onlyGetOpenGraphInfo | Only fetch open graph info and don't fall back on anything else. Also accepts an array of properties for which no fallback should be used | false | |
| customMetaTags | Here you can define custom meta tags you want to scrape. | [] | |
| urlValidatorSettings | Sets the options used by validator.js for testing the URL | [Here](https://github.com/jshemas/openGraphScraper/blob/master/lib/utils.ts#L4-L17) | |
| Name | Info | Default Value | Required |
|----------------------|-------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|----------|
| url | URL of the site. | | x |
| html | You can pass in an HTML string to run ogs on it. (use without options.url) | | |
| fetchOptions | Options that are used by the Fetch API | {} | |
| timeout | Request timeout for Fetch (Default is 10 seconds) | 10 | |
| blacklist | Pass in an array of sites you don't want ogs to run on. | [] | |
| onlyGetOpenGraphInfo | Only fetch open graph info and don't fall back on anything else. Also accepts an array of properties for which no fallback should be used | false | |
| customMetaTags | Here you can define custom meta tags you want to scrape. | [] | |
| urlValidatorSettings | Sets the options used by validator.js for testing the URL | [Here](https://github.com/jshemas/openGraphScraper/blob/master/lib/utils.ts#L4-L17) | |
| jsonLDOptions | Sets the options used when parsing JSON-LD data | | |

Note: `open-graph-scraper` uses the [Fetch API](https://nodejs.org/dist/latest-v18.x/docs/api/globals.html#fetch) for requests and most of [Fetch's options](https://developer.mozilla.org/en-US/docs/Web/API/fetch#options) should work as `open-graph-scraper`'s `fetchOptions` options.

@@ -159,6 +160,26 @@ ogs({ url: 'https://www.wikipedia.org/', fetchOptions: { headers: { 'user-agent'
})
```

## JSON-LD Parsing Options Example

The `throwOnJSONParseError` and `logOnJSONParseError` properties control what happens if `JSON.parse`
throws an error while parsing JSON-LD data.
If `throwOnJSONParseError` is set to `true`, the error is rethrown.
If `logOnJSONParseError` is set to `true`, the error is logged to the console. By default neither is set, and a JSON-LD block that fails to parse is skipped.

```javascript
const ogs = require("open-graph-scraper");
const userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36';
ogs({ url: 'https://www.wikipedia.org/', jsonLDOptions: { throwOnJSONParseError: true } })
  .then((data) => {
    const { error, html, result, response } = data;
    console.log('error:', error); // This returns true or false. True if there was an error. The error itself is inside the result object.
    console.log('html:', html); // This contains the HTML of the page
    console.log('result:', result); // This contains all of the Open Graph results
    console.log('response:', response); // This contains response from the Fetch API
  })
```
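
The behaviour described above can also be illustrated without the library. The following is a minimal standalone sketch, not the library's source: `collectJsonLD` and its `log` parameter are hypothetical names introduced here for illustration, and only the two option names come from this change.

```typescript
// Standalone sketch of the flag semantics; NOT open-graph-scraper's actual code.
interface JSONLDOptions {
  throwOnJSONParseError?: boolean;
  logOnJSONParseError?: boolean;
}

// `collectJsonLD` and `log` are hypothetical; `log` defaults to console.error
// so the sketch mirrors the "logged to the console" behaviour.
function collectJsonLD(
  scriptTexts: string[],
  options: JSONLDOptions = {},
  log: (message: string, error: unknown) => void = (message, error) => console.error(message, error),
): unknown[] {
  const parsed: unknown[] = [];
  for (const text of scriptTexts) {
    try {
      parsed.push(JSON.parse(text));
    } catch (error: unknown) {
      if (options.logOnJSONParseError) {
        log('Error parsing JSON-LD script tag:', error);
      }
      if (options.throwOnJSONParseError) {
        throw error; // surfaces the original SyntaxError to the caller
      }
      // Neither flag set: the malformed block is skipped silently.
    }
  }
  return parsed;
}

// The second entry is missing a comma between array items, so JSON.parse rejects it.
const scripts = ['{"@type": "Organization"}', '{"sameAs": ["a" "b"]}'];
const lenient = collectJsonLD(scripts); // only the first block is kept
```

With both flags set, the error would be logged first and then thrown, since the two checks are independent.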

## Running the example app

The `example` folder contains a simple Express app that you can start with `npm ci && npm run start`. Once the app is running, open a web browser and go to `http://localhost:3000/scraper?url=http://ogp.me/` to test it out. There is also a `Dockerfile` if you want to run this example app in a Docker container.
11 changes: 10 additions & 1 deletion lib/extract.ts
@@ -99,7 +99,16 @@ export default function extractMetaTags(body: string, options: OpenGraphScraperO
if (scriptText) {
scriptText = scriptText.replace(/(\r\n|\n|\r)/gm, ''); // remove newlines
scriptText = unescapeScriptText(scriptText);
ogObject.jsonLD.push(JSON.parse(scriptText));
try {
ogObject.jsonLD.push(JSON.parse(scriptText));

} catch (error: unknown) {
if (options.jsonLDOptions?.logOnJSONParseError) {
console.error('Error parsing JSON-LD script tag:', error);

}
if (options.jsonLDOptions?.throwOnJSONParseError) {
throw error;
}
}
}
}
});
9 changes: 9 additions & 0 deletions lib/types.ts
@@ -38,6 +38,7 @@ export interface OpenGraphScraperOptions {
timeout?: number;
url?: string;
urlValidatorSettings?: ValidatorSettings;
jsonLDOptions?: JSONLDOptions;
}

/**
@@ -67,6 +68,14 @@ export interface ValidatorSettings {
validate_length: boolean;
}

/**
* Options for the JSON-LD parser
*/
export interface JSONLDOptions {
throwOnJSONParseError?: boolean;
logOnJSONParseError?: boolean;
}

/**
* The type for user defined custom meta tags you want to scrape.
*
69 changes: 69 additions & 0 deletions tests/unit/static.spec.ts
@@ -279,6 +279,75 @@ describe('static check meta tags', function () {
});
});

it('jsonLD - invalid JSON string that cannot be parsed does not throw error', function () {
const metaHTML = `<html><head>
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "Organization",
"name": "Blah ",
"sameAs": [
"https:\\\\/\\\\/twitter.com\\\\/blah?lang=en"
"https:\\\\/\\\\/www.facebook.com\\\\/blah\\\\/"
""
"https:\\\\/\\\\/www.instagram.com\\\\/blah\\\\/"
""
""
"https:\\\\/\\\\/www.youtube.com\\\\/@blah"
""
],
"url": "https:\\\\/\\\\/blah.com"
}
</script>
</head></html>`;

mockAgent.get('http://www.test.com')
.intercept({ path: '/' })
.reply(200, metaHTML);

return ogs({ url: 'www.test.com' })
.then(function (data) {
expect(data.result.success).to.be.eql(true);
expect(data.result.requestUrl).to.be.eql('http://www.test.com');
expect(data.result.jsonLD).to.be.eql([]);
expect(data.html).to.be.eql(metaHTML);
expect(data.response).to.be.a('response');
});
});

it('jsonLD - invalid JSON string that cannot be parsed throws error when options.jsonLDOptions.throwOnJSONParseError = true', function () {
const metaHTML = `<html><head>
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "Organization",
"name": "Blah ",
"sameAs": [
"https:\\\\/\\\\/twitter.com\\\\/blah?lang=en"
"https:\\\\/\\\\/www.facebook.com\\\\/blah\\\\/"
""
"https:\\\\/\\\\/www.instagram.com\\\\/blah\\\\/"
""
""
"https:\\\\/\\\\/www.youtube.com\\\\/@blah"
""
],
"url": "https:\\\\/\\\\/blah.com"
}
</script>
</head></html>`;

mockAgent.get('http://www.test.com')
.intercept({ path: '/' })
.reply(200, metaHTML);

return ogs({ url: 'www.test.com', jsonLDOptions: { throwOnJSONParseError: true } }).catch((data) => {
expect(data.result.success).to.be.eql(false);
});
});
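
As written, the test above attaches only a `.catch` handler, so if `ogs` were to resolve instead of reject, the assertion would never run and the test would still pass. One possible way to tighten it is sketched below. `expectRejection` is a hypothetical helper (not part of open-graph-scraper, mocha or chai), and `fakeFailure` is a stand-in for the rejected `ogs(...)` call.

```typescript
// Hypothetical helper: turns "resolved" into a failure and hands back the rejection reason.
function expectRejection<T>(promise: Promise<T>): Promise<unknown> {
  return promise.then(
    () => {
      throw new Error('Expected the promise to reject, but it resolved');
    },
    (reason: unknown) => reason,
  );
}

// Stand-in for the rejected `ogs(...)` call; the shape mirrors what the test above inspects.
const fakeFailure = Promise.reject({ result: { success: false } });

expectRejection(fakeFailure).then((reason) => {
  const data = reason as { result: { success: boolean } };
  console.log(data.result.success); // false
});
```

In the spec, the same pattern would wrap the real call, with the existing `expect(data.result.success).to.be.eql(false)` assertion moved into the `.then` callback.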

it('encoding - utf-8', function () {
/* eslint-disable max-len */
const metaHTML = `<html><head>