Merge pull request #3 from MisskeyIO/merge-upstream
Merge tag '5.1.0'
u1-liquid authored Mar 20, 2024
2 parents 6d77ddd + b479764 commit 790e000
Showing 9 changed files with 360 additions and 94 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,12 @@
5.1.0 / 2024-03-18
* Send a HEAD request before the GET request and use its result for validation (#22)
* Add the following parameters to the options of the `summaly` method
- userAgent
- responseTimeout
- operationTimeout
- contentLengthLimit
- contentLengthRequired

5.0.3 / 2023-12-30
------------------
* Fix .github/workflows/npm-publish.yml
49 changes: 27 additions & 22 deletions README.md
@@ -43,12 +43,17 @@ npm run serve

#### opts (SummalyOptions)

| Property | Type | Description | Default |
| :------------------ | :--------------------- | :------------------------------ | :------ |
| **lang** | *string* | Accept-Language for the request | `null` |
| **followRedirects** | *boolean* | Whether to follow redirects | `true` |
| **plugins** | *plugin[]* (see below) | Custom plugins | `null` |
| **agent** | *Got.Agents* | Custom HTTP agent (see below) | `null` |
| Property | Type | Description | Default |
|:--------------------------|:-----------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------|
| **lang** | *string* | Accept-Language for the request | `null` |
| **followRedirects** | *boolean* | Whether to follow redirects | `true` |
| **plugins** | *plugin[]* (see below) | Custom plugins | `null` |
| **agent** | *Got.Agents* | Custom HTTP agent (see below) | `null` |
| **userAgent** | *string* | User-Agent for the request | `SummalyBot/[version]` |
| **responseTimeout** | *number* | Timeout in milliseconds for each phase of the request, such as host name resolution and socket communication. | `20000` |
| **operationTimeout** | *number* | Timeout in milliseconds from the start to the end of the request. | `60000` |
| **contentLengthLimit** | *number* | Maximum size in bytes. An error will occur if the content-length value returned from the other server is larger than this parameter (or if the received body size exceeds this parameter). | `10485760` |
| **contentLengthRequired** | *boolean* | If set to true, an error will occur if the other server does not return content-length. | `false` |
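
The defaults in the table can be read as a simple fallback rule: any option the caller leaves out takes the listed default. The following is a minimal sketch of that rule only. `resolveOptions`, `SizeAndTimeoutOptions` and `ResolvedOptions` are hypothetical names used for illustration and are not part of summaly; `userAgent` is left out because its default depends on the package version.

```typescript
// Hypothetical illustration of the defaults listed in the table above.
// The option names come from the README; the helper itself is an assumption.
type SizeAndTimeoutOptions = {
  responseTimeout?: number;       // milliseconds
  operationTimeout?: number;      // milliseconds
  contentLengthLimit?: number;    // bytes
  contentLengthRequired?: boolean;
};

type ResolvedOptions = Required<SizeAndTimeoutOptions>;

const DEFAULTS: ResolvedOptions = {
  responseTimeout: 20 * 1000,           // 20000
  operationTimeout: 60 * 1000,          // 60000
  contentLengthLimit: 10 * 1024 * 1024, // 10485760
  contentLengthRequired: false,
};

// Fill in every omitted option with its documented default.
function resolveOptions(opts?: SizeAndTimeoutOptions): ResolvedOptions {
  return {
    responseTimeout: opts?.responseTimeout ?? DEFAULTS.responseTimeout,
    operationTimeout: opts?.operationTimeout ?? DEFAULTS.operationTimeout,
    contentLengthLimit: opts?.contentLengthLimit ?? DEFAULTS.contentLengthLimit,
    contentLengthRequired: opts?.contentLengthRequired ?? DEFAULTS.contentLengthRequired,
  };
}
```

Using `??` rather than `||` means an explicit `0` or `false` from the caller is respected instead of being replaced by the default.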

#### Plugin

@@ -78,30 +83,30 @@ A Promise of an Object that contains properties below:

#### SummalyResult

| Property | Type | Description |
| :-------------- | :------- | :------------------------------------------ |
| **title** | *string* \| *null* | The title of the web page |
| **icon** | *string* \| *null* | The url of the icon of the web page |
| **description** | *string* \| *null* | The description of the web page |
| **thumbnail** | *string* \| *null* | The url of the thumbnail of the web page |
| **sitename** | *string* \| *null* | The name of the web site |
| **player** | *Player* | The player of the web page |
| **sensitive** | *boolean* | Whether the url is sensitive |
| Property | Type | Description |
|:----------------|:-------------------|:-----------------------------------------------------------|
| **title** | *string* \| *null* | The title of the web page |
| **icon** | *string* \| *null* | The url of the icon of the web page |
| **description** | *string* \| *null* | The description of the web page |
| **thumbnail** | *string* \| *null* | The url of the thumbnail of the web page |
| **sitename** | *string* \| *null* | The name of the web site |
| **player** | *Player* | The player of the web page |
| **sensitive** | *boolean* | Whether the url is sensitive |
| **activityPub** | *string* \| *null* | The url of the ActivityPub representation of that web page |
| **url** | *string* | The url of the web page |
| **url** | *string* | The url of the web page |

#### Summary

`Omit<SummalyResult, "url">`

#### Player

| Property | Type | Description |
| :-------------- | :--------- | :---------------------------------------------- |
| **url** | *string* \| *null* | The url of the player |
| **width** | *number* \| *null* | The width of the player |
| **height** | *number* \| *null* | The height of the player |
| **allow** | *string[]* | The names of the allowed permissions for iframe |
| Property | Type | Description |
|:-----------|:-------------------|:------------------------------------------------|
| **url** | *string* \| *null* | The url of the player |
| **width** | *number* \| *null* | The width of the player |
| **height** | *number* \| *null* | The height of the player |
| **allow** | *string[]* | The names of the allowed permissions for iframe |

Currently the possible items in `allow` are:

2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "@misskey-dev/summaly",
"version": "5.0.4",
"version": "5.1.0",
"description": "Get web page's summary",
"author": "syuilo <[email protected]>",
"license": "MIT",
21 changes: 19 additions & 2 deletions src/general.ts
@@ -130,13 +130,30 @@ async function getOEmbedPlayer($: cheerio.CheerioAPI, pageUrl: string): Promise<
};
}

export default async (_url: URL | string, lang: string | null = null): Promise<Summary | null> => {
export type GeneralScrapingOptions = {
lang?: string | null;
userAgent?: string;
responseTimeout?: number;
operationTimeout?: number;
contentLengthLimit?: number;
contentLengthRequired?: boolean;
}

export default async (_url: URL | string, opts?: GeneralScrapingOptions): Promise<Summary | null> => {
let lang = opts?.lang;
// eslint-disable-next-line no-param-reassign
if (lang && !lang.match(/^[\w-]+(\s*,\s*[\w-]+)*$/)) lang = null;

const url = typeof _url === 'string' ? new URL(_url) : _url;

const res = await scpaping(url.href, { lang: lang || undefined });
const res = await scpaping(url.href, {
lang: lang || undefined,
userAgent: opts?.userAgent,
responseTimeout: opts?.responseTimeout,
operationTimeout: opts?.operationTimeout,
contentLengthLimit: opts?.contentLengthLimit,
contentLengthRequired: opts?.contentLengthRequired,
});
const $ = res.$;
const twitterCard =
$('meta[name="twitter:card"]').attr('content') ||
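
The `lang` handling in this hunk accepts only comma-separated tokens made of word characters and hyphens, and drops anything else before it reaches the `accept-language` header. A minimal sketch of that rule in isolation follows; `sanitizeLang` and `LANG_PATTERN` are hypothetical names, and only the regular expression is taken from the diff.

```typescript
// Same pattern as in src/general.ts: comma-separated tokens of word characters and hyphens.
const LANG_PATTERN = /^[\w-]+(\s*,\s*[\w-]+)*$/;

// Hypothetical helper: returns the value to forward as `lang`,
// or undefined when the header should be omitted.
function sanitizeLang(lang?: string | null): string | undefined {
  // A non-empty value that does not match the pattern is discarded.
  if (lang && !LANG_PATTERN.test(lang)) return undefined;
  // null and the empty string are also treated as "no language".
  return lang || undefined;
}
```

Under this rule a value with quality weights such as `en;q=0.8` is rejected, because `;`, `=` and `.` fall outside the allowed characters.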
42 changes: 40 additions & 2 deletions src/index.ts
@@ -9,7 +9,7 @@ import * as Got from 'got';
import { SummalyResult } from './summary.js';
import { SummalyPlugin } from './iplugin.js';
export * from './iplugin.js';
import general from './general.js';
import general, { GeneralScrapingOptions } from './general.js';
import { setAgent } from './utils/got.js';
import { plugins as builtinPlugins } from './plugins/index.js';
import type { FastifyInstance } from 'fastify';
@@ -34,6 +34,35 @@ export type SummalyOptions = {
* Custom HTTP agent
*/
agent?: Got.Agents;

/**
* User-Agent for the request
*/
userAgent?: string;

/**
* Response timeout.
* Set timeouts for each phase, such as host name resolution and socket communication.
*/
responseTimeout?: number;

/**
* Operation timeout.
* Set the timeout from the start to the end of the request.
*/
operationTimeout?: number;

/**
* Maximum content length.
* An error will occur if the content-length value returned from the other server is larger than this parameter (or if the received body size exceeds this parameter).
*/
contentLengthLimit?: number;

/**
* Content length required.
* If set to true, an error will occur if the other server does not return content-length.
*/
contentLengthRequired?: boolean;
};

export const summalyDefaultOptions = {
@@ -68,8 +97,17 @@ export const summaly = async (url: string, options?: SummalyOptions): Promise<Su
const match = plugins.filter(plugin => plugin.test(_url))[0];

// Get summary
const scrapingOptions: GeneralScrapingOptions = {
lang: opts.lang,
userAgent: opts.userAgent,
responseTimeout: opts.responseTimeout,
operationTimeout: opts.operationTimeout,
contentLengthLimit: opts.contentLengthLimit,
contentLengthRequired: opts.contentLengthRequired,
};

// eslint-disable-next-line @typescript-eslint/no-unnecessary-condition
const summary = await (match ? match.summarize : general)(_url, opts.lang || undefined);
const summary = await (match ? match.summarize : general)(_url, scrapingOptions);

if (summary == null) {
throw new Error('failed summarize');
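
The effect of this hunk is that the same options object now reaches whichever summarizer is selected, whether a matching plugin or the general scraper. A self-contained sketch of that dispatch follows, under the assumption that simplified stand-in types are acceptable; `summarizeWith`, `ScrapingOptions`, `Summary` and `Plugin` here are hypothetical and much smaller than the real types, and `find` is used in place of `filter(...)[0]`.

```typescript
// Simplified, hypothetical stand-ins for the real types.
type ScrapingOptions = { lang?: string | null; userAgent?: string };
type Summary = { title: string | null };
type Plugin = {
  test: (url: URL) => boolean;
  summarize: (url: URL, opts?: ScrapingOptions) => Promise<Summary | null>;
};

// Pick the first plugin whose test() accepts the URL, otherwise fall back
// to the general summarizer, and hand the same options to either one.
async function summarizeWith(
  url: URL,
  plugins: Plugin[],
  general: Plugin['summarize'],
  opts: ScrapingOptions,
): Promise<Summary> {
  const match = plugins.find(plugin => plugin.test(url));
  const summary = await (match ? match.summarize : general)(url, opts);
  if (summary == null) {
    throw new Error('failed summarize');
  }
  return summary;
}
```

Because plugins and the general scraper share one signature, a plugin that ignores the options still type-checks, while one that delegates to the general scraper can pass them straight through.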
3 changes: 2 additions & 1 deletion src/iplugin.ts
@@ -1,7 +1,8 @@
import Summary from './summary.js';
import type { URL } from 'node:url';
import type { GeneralScrapingOptions } from './general.js';

export interface SummalyPlugin {
test: (url: URL) => boolean;
summarize: (url: URL, lang?: string) => Promise<Summary | null>;
summarize: (url: URL, opts?: GeneralScrapingOptions) => Promise<Summary | null>;
}
6 changes: 3 additions & 3 deletions src/plugins/branchio-deeplinks.ts
@@ -1,5 +1,5 @@
import { URL } from 'node:url';
import general from '../general.js';
import general, { GeneralScrapingOptions } from '../general.js';
import Summary from '../summary.js';

export function test(url: URL): boolean {
@@ -8,10 +8,10 @@ export function test(url: URL): boolean {
url.hostname === 'spotify.link';
}

export async function summarize(url: URL, lang: string | null = null): Promise<Summary | null> {
export async function summarize(url: URL, opts?: GeneralScrapingOptions): Promise<Summary | null> {
// https://help.branch.io/using-branch/docs/creating-a-deep-link#redirections
// Force a redirect to the web version to prevent branch.io's own page from opening
url.searchParams.append('$web_only', 'true');

return await general(url, lang);
return await general(url, opts);
}
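
The plugin above follows a pattern that other plugins could reuse: adjust the URL, then delegate with the options untouched. A minimal sketch of that pattern follows; `withWebOnly`, `Summarizer` and `ScrapingOptions` are hypothetical names, and the delegate is injected so the example does not depend on the real `general` function.

```typescript
// Simplified, hypothetical stand-ins for the real types.
type ScrapingOptions = { lang?: string | null; userAgent?: string };
type Summarizer<T> = (url: URL, opts?: ScrapingOptions) => Promise<T>;

// Hypothetical wrapper: appends `$web_only=true` to the query string
// (mutating the URL, as the plugin does) and forwards opts unchanged.
function withWebOnly<T>(next: Summarizer<T>): Summarizer<T> {
  return async (url, opts) => {
    url.searchParams.append('$web_only', 'true');
    return await next(url, opts);
  };
}
```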
110 changes: 71 additions & 39 deletions src/utils/got.ts
@@ -11,6 +11,7 @@ const _filename = fileURLToPath(import.meta.url);
const _dirname = dirname(_filename);

export let agent: Got.Agents = {};

export function setAgent(_agent: Got.Agents) {
// eslint-disable-next-line @typescript-eslint/no-unnecessary-condition
agent = _agent || {};
@@ -22,34 +23,60 @@ export type GotOptions = {
body?: string;
headers: Record<string, string | undefined>;
typeFilter?: RegExp;
responseTimeout?: number;
operationTimeout?: number;
contentLengthLimit?: number;
contentLengthRequired?: boolean;
}

const repo = JSON.parse(readFileSync(`${_dirname}/../../package.json`, 'utf8'));

const RESPONSE_TIMEOUT = 20 * 1000;
const OPERATION_TIMEOUT = 60 * 1000;
const MAX_RESPONSE_SIZE = 10 * 1024 * 1024;
const BOT_UA = `SummalyBot/${repo.version}`;

export async function scpaping(url: string, opts?: { lang?: string; }) {
const response = await getResponse({
const DEFAULT_RESPONSE_TIMEOUT = 20 * 1000;
const DEFAULT_OPERATION_TIMEOUT = 60 * 1000;
const DEFAULT_MAX_RESPONSE_SIZE = 10 * 1024 * 1024;
const DEFAULT_BOT_UA = `SummalyBot/${repo.version}`;

export async function scpaping(
url: string,
opts?: {
lang?: string;
userAgent?: string;
responseTimeout?: number;
operationTimeout?: number;
contentLengthLimit?: number;
contentLengthRequired?: boolean;
},
) {
const args: Omit<GotOptions, 'method'> = {
url,
method: 'GET',
headers: {
'accept': 'text/html,application/xhtml+xml',
'user-agent': BOT_UA,
'user-agent': opts?.userAgent ?? DEFAULT_BOT_UA,
'accept-language': opts?.lang,
},
typeFilter: /^(text\/html|application\/xhtml\+xml)/,
responseTimeout: opts?.responseTimeout,
operationTimeout: opts?.operationTimeout,
contentLengthLimit: opts?.contentLengthLimit,
contentLengthRequired: opts?.contentLengthRequired,
};

const headResponse = await getResponse({
...args,
method: 'HEAD',
});

// SUMMALY_ALLOW_PRIVATE_IP is for testing
const allowPrivateIp = process.env.SUMMALY_ALLOW_PRIVATE_IP === 'true' || Object.keys(agent).length > 0;

if (!allowPrivateIp && response.ip && PrivateIp(response.ip)) {
throw new StatusError(`Private IP rejected ${response.ip}`, 400, 'Private IP Rejected');
if (!allowPrivateIp && headResponse.ip && PrivateIp(headResponse.ip)) {
throw new StatusError(`Private IP rejected ${headResponse.ip}`, 400, 'Private IP Rejected');
}

const response = await getResponse({
...args,
method: 'GET',
});

const encoding = detectEncoding(response.rawBody);
const body = toUtf8(response.rawBody, encoding);
const $ = cheerio.load(body);
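
The reordering in this hunk means the private-address check runs on the HEAD response, so a rejected target never receives the GET. A self-contained sketch of that ordering follows. It is an illustration under assumptions, not the library's code: `headThenGet`, `Requester` and `ProbeResponse` are hypothetical, the request function is injected, and `isPrivateIpv4` is a deliberately simplified IPv4-only stand-in for the `PrivateIp` check used in the diff.

```typescript
type ProbeResponse = { ip?: string; body?: string };
type Requester = (method: 'HEAD' | 'GET', url: string) => Promise<ProbeResponse>;

// Simplified stand-in for a private-address check (IPv4 only):
// 10.0.0.0/8, 127.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16.
function isPrivateIpv4(ip: string): boolean {
  const parts = ip.split('.').map(Number);
  if (parts.length !== 4 || parts.some(n => !Number.isInteger(n) || n < 0 || n > 255)) {
    return false;
  }
  const a = parts[0] ?? -1;
  const b = parts[1] ?? -1;
  return a === 10 || a === 127 || (a === 172 && b >= 16 && b <= 31) || (a === 192 && b === 168);
}

// Send HEAD first, validate the resolved address, and only then send GET.
async function headThenGet(url: string, request: Requester, allowPrivateIp = false): Promise<ProbeResponse> {
  const headResponse = await request('HEAD', url);
  if (!allowPrivateIp && headResponse.ip && isPrivateIpv4(headResponse.ip)) {
    throw new Error(`Private IP rejected ${headResponse.ip}`);
  }
  return await request('GET', url);
}
```

The cost of this design is one extra round trip per summary; the benefit is that validation happens before any response body is requested.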
@@ -70,24 +97,22 @@ export async function get(url: string) {
},
});

return await res.body;
return res.body;
}

export async function head(url: string) {
const res = await getResponse({
return await getResponse({
url,
method: 'HEAD',
headers: {
'accept': '*/*',
},
});

return await res;
}

async function getResponse(args: GotOptions) {
const timeout = RESPONSE_TIMEOUT;
const operationTimeout = OPERATION_TIMEOUT;
const timeout = args.responseTimeout ?? DEFAULT_RESPONSE_TIMEOUT;
const operationTimeout = args.operationTimeout ?? DEFAULT_OPERATION_TIMEOUT;

const req = got<string>(args.url, {
method: args.method,
@@ -109,30 +134,37 @@ async function getResponse(args: GotOptions) {
},
});

return await receiveResponse({ req, typeFilter: args.typeFilter });
}
const res = await receiveResponse({ req, opts: args });

async function receiveResponse<T>(args: { req: Got.CancelableRequest<Got.Response<T>>, typeFilter?: RegExp }) {
const req = args.req;
const maxSize = MAX_RESPONSE_SIZE;

req.on('response', (res: Got.Response) => {
// Check html
if (args.typeFilter && !res.headers['content-type']?.match(args.typeFilter)) {
// console.warn(res.headers['content-type']);
req.cancel(`Rejected by type filter ${res.headers['content-type']}`);
return;
}
// Check html
const contentType = res.headers['content-type'];
if (args.typeFilter && !contentType?.match(args.typeFilter)) {
throw new Error(`Rejected by type filter ${contentType}`);
}

// Check the size using the response headers
const contentLength = res.headers['content-length'];
if (contentLength != null) {
const size = Number(contentLength);
if (size > maxSize) {
req.cancel(`maxSize exceeded (${size} > ${maxSize}) on response`);
}
// Check the size using the response headers
const contentLength = res.headers['content-length'];
if (contentLength) {
const maxSize = args.contentLengthLimit ?? DEFAULT_MAX_RESPONSE_SIZE;
const size = Number(contentLength);
if (size > maxSize) {
throw new Error(`maxSize exceeded (${size} > ${maxSize}) on response`);
}
});
} else {
if (args.contentLengthRequired) {
throw new Error('content-length required');
}
}

return res;
}
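
The header checks added to `getResponse` can be read as three independent rules: the content type must match the filter, a declared content-length must not exceed the limit, and a missing content-length is an error only when it is required. A minimal sketch that isolates those rules as a pure function follows; `checkHeaders` and `HeaderCheckOptions` are hypothetical names, and plain `Error` is used throughout.

```typescript
type HeaderCheckOptions = {
  typeFilter?: RegExp;
  contentLengthLimit?: number;
  contentLengthRequired?: boolean;
};

const DEFAULT_MAX_RESPONSE_SIZE = 10 * 1024 * 1024;

// Hypothetical helper: throws when the headers violate one of the rules.
function checkHeaders(headers: Record<string, string | undefined>, opts: HeaderCheckOptions = {}): void {
  // Rule 1: content type must match the filter, if one is given.
  const contentType = headers['content-type'];
  if (opts.typeFilter && !contentType?.match(opts.typeFilter)) {
    throw new Error(`Rejected by type filter ${contentType}`);
  }

  const contentLength = headers['content-length'];
  if (contentLength) {
    // Rule 2: a declared size must not exceed the limit.
    const maxSize = opts.contentLengthLimit ?? DEFAULT_MAX_RESPONSE_SIZE;
    const size = Number(contentLength);
    if (size > maxSize) {
      throw new Error(`maxSize exceeded (${size} > ${maxSize}) on response`);
    }
  } else if (opts.contentLengthRequired) {
    // Rule 3: a missing size is an error only when required.
    throw new Error('content-length required');
  }
}
```

Since many servers use chunked transfer encoding and omit content-length, enabling `contentLengthRequired` would likely reject a number of otherwise valid pages, which may be why it defaults to `false`.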

async function receiveResponse<T>(args: {
req: Got.CancelableRequest<Got.Response<T>>,
opts: GotOptions,
}) {
const req = args.req;
const maxSize = args.opts.contentLengthLimit ?? DEFAULT_MAX_RESPONSE_SIZE;

// Check the size of the data as it is being received
req.on('downloadProgress', (progress: Got.Progress) => {
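
The body of this handler is not shown in the hunk. As a rough illustration of what a size check on received data can look like, the sketch below builds a progress callback that cancels once the transferred byte count passes a limit. `makeProgressGuard` and the injected `cancel` callback are hypothetical, and the message text is an assumption; only the `transferred` field mirrors the progress object that got reports.

```typescript
type Progress = { transferred: number };

// Hypothetical helper: returns a progress handler that calls `cancel`
// as soon as more than `maxSize` bytes have been received.
function makeProgressGuard(maxSize: number, cancel: (reason: string) => void) {
  return (progress: Progress): void => {
    if (progress.transferred > maxSize) {
      cancel(`maxSize exceeded (${progress.transferred} > ${maxSize}) on downloadProgress`);
    }
  };
}
```

A check of this kind complements the header check, since a server can omit content-length or declare a smaller value than it actually sends.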