thegruber/linkpeek
GitHub: thegruber/linkpeek
一个轻量高效的链接预览元数据提取库,支持 Node.js、Bun 和 Deno,通过流式解析在仅下载必要数据的同时获取网页的标题、描述、图片等 19 个字段。
Stars: 0 | Forks: 0
# linkpeek
**适用于 Node.js、Bun 和 Deno 的链接预览提取工具。仅一个依赖。**
[](https://www.npmjs.com/package/linkpeek)
[](https://bundlephobia.com/package/linkpeek)
[](https://github.com/thegruber/linkpeek/actions/workflows/ci.yml)
[](https://www.npmjs.com/package/linkpeek)
[](LICENSE)
```
import { preview } from "linkpeek";
const result = await preview("https://www.youtube.com/watch?v=dQw4w9WgXcQ");
result.title; // "Rick Astley - Never Gonna Give You Up"
result.image; // "https://i.ytimg.com/vi/dQw4w9WgXcQ/maxresdefault.jpg"
result.siteName; // "YouTube"
result.favicon; // "https://www.youtube.com/favicon.ico"
result.description; // "The official video for \"Never Gonna Give You Up\"..."
```
## 安装
```
npm install linkpeek
```
也支持通过 `bun add linkpeek` 和 `import { preview } from "npm:linkpeek"` (Deno) 使用。
## 为什么选择 linkpeek
- **1 个依赖** (htmlparser2) — 不是 4 个,也不是插件树
- **在 `` 处停止** — 仅下载 30 KB,而不是完整的 2 MB 页面
- **SAX 流式处理** — 无 DOM 构建,解析时间约 2 ms
- **SSRF 防护** — 默认拦截私有/内部 IP
- **支持 Node.js 20+、Bun 和 Deno** (已在 CI 中测试) — 仅使用 Web 标准 API
## 完整结果
从单个 URL 提取的所有 19 个字段:
```
{
url: "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
title: "Rick Astley - Never Gonna Give You Up (Official Video) (4K Remaster)",
description: "The official video for \"Never Gonna Give You Up\" by Rick Astley...",
image: "https://i.ytimg.com/vi/dQw4w9WgXcQ/maxresdefault.jpg",
imageWidth: 1280,
imageHeight: 720,
siteName: "YouTube",
favicon: "https://www.youtube.com/favicon.ico",
mediaType: "video.other",
author: null,
canonicalUrl: "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
locale: null,
publishedDate: null,
video: "https://www.youtube.com/embed/dQw4w9WgXcQ",
twitterCard: "player",
twitterSite: "@youtube",
themeColor: null,
keywords: ["rick astley", "Never Gonna Give You Up", "rick roll"],
oEmbedUrl: "https://www.youtube.com/oembed?format=json&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DdQw4w9WgXcQ"
}
```
## 预设
```
import { preview, presets } from "linkpeek";
// Default: fast (30 KB limit, head only, no meta-refresh)
const result = await preview(url);
// Quality: body JSON-LD + image fallback + meta-refresh
const result = await preview(url, presets.quality);
// Custom: spread a preset and override
const result = await preview(url, { ...presets.quality, timeout: 3000 });
```
| Preset | What it enables |
| ----------------- | ------------------------------------------- |
| `presets.fast` | Default behavior — explicit version of `{}` |
| `presets.quality` | Body JSON-LD, image fallback, meta-refresh |
## 错误处理
`preview()` 针对无效输入和被阻止的 URL 会抛出异常:
```
try {
const result = await preview(url);
} catch (err) {
// "Invalid URL"
// "Only http and https URLs are supported"
// "URLs pointing to private/internal networks are not allowed"
console.error(err.message);
}
```
## API
### `preview(url, options?)`
获取 URL 并提取链接预览元数据。返回 `Promise`。
#### 选项
| Option | Type | Default | Description |
| -------------------- | ------------------------ | ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `timeout` | `number` | `8000` | Request timeout in milliseconds |
| `maxBytes` | `number` | `30_000` | Max bytes to stream |
| `userAgent` | `string` | `"Twitterbot/1.0"` | User-Agent sent with requests. Twitterbot gets pre-rendered HTML from most platforms |
| `followRedirects` | `boolean` | `true` | Follow HTTP 3xx redirects |
| `headers` | `Record` | `{}` | Extra request headers (e.g. cookies, auth tokens) |
| `allowPrivateIPs` | `boolean` | `false` | Allow fetching private/internal IPs. Keep `false` in production to prevent SSRF attacks |
| `followMetaRefresh` | `boolean` | `false` | Follow `` redirects when no title is found. Enable to handle Cloudflare-challenged pages at the cost of an extra HTTP round-trip |
| `includeBodyContent` | `boolean` | `false` | Continue scanning `` for JSON-LD scripts and `
` fallbacks after ``. Enable together with a higher `maxBytes` for best quality |
#### 结果字段
| Field | Type | Description |
| --------------- | ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `url` | `string` | Final resolved URL |
| `title` | `string \| null` | Page title (`og:title` → `twitter:title` → JSON-LD → ``) |
| `description` | `string \| null` | Description (`og:description` → `twitter:description` → `meta[name=description]` → JSON-LD) |
| `image` | `string \| null` | Preview image (`og:image` → `twitter:image` → JSON-LD → `itemprop=image` → first `
`) |
| `imageWidth` | `number \| null` | Image width from `og:image:width` |
| `imageHeight` | `number \| null` | Image height from `og:image:height` |
| `siteName` | `string` | Site name (`og:site_name` → JSON-LD publisher → hostname fallback) |
| `favicon` | `string \| null` | Favicon URL (largest `apple-touch-icon` → `link[rel=icon]` → `/favicon.ico`) |
| `mediaType` | `string` | Content type from `og:type`, defaults to `"website"` |
| `author` | `string \| null` | Author name (JSON-LD author → `meta[name=author]` → Dublin Core) |
| `canonicalUrl` | `string` | Canonical URL (`link[rel=canonical]` → `og:url` → request URL) |
| `locale` | `string \| null` | Locale from `og:locale` |
| `publishedDate` | `string \| null` | Published date (`article:published_time` → JSON-LD `datePublished` → Dublin Core) |
| `video` | `string \| null` | Video URL from `og:video` |
| `twitterCard` | `string \| null` | Twitter card type (`summary`, `player`, `summary_large_image`) |
| `twitterSite` | `string \| null` | Twitter @handle from `twitter:site` |
| `themeColor` | `string \| null` | Theme color from `meta[name=theme-color]` |
| `keywords` | `string[] \| null` | Keywords from `meta[name=keywords]` |
| `oEmbedUrl` | `string \| null` | Discovered oEmbed endpoint URL from ``. Not fetched — returned for the caller to resolve if needed |
### `parseHTML(html, baseUrl, options?)`
直接解析 HTML 字符串。当你已经拥有 HTML 内容时使用此方法。
```
import { parseHTML } from "linkpeek";
const result = parseHTML(
"Hello ",
"https://example.com",
);
console.log(result.title); // "Hello"
```
**参数:**
- `html` (`string`) — 要解析的 HTML 内容
- `baseUrl` (`string`) — 用于解析相对 URL 的基础 URL
- `options?` (`{ includeBodyContent?: boolean }`) — 传入 `{ includeBodyContent: true }` 以扫描 ` ` 中的 JSON-LD 和图像回退
返回 `PreviewResult`。
## 工作原理
1. **Twitterbot User-Agent** — 从大多数平台获取预渲染的 HTML,完全跳过客户端渲染
2. **带字节限制的流式下载** — 30 KB 后中止(默认)。OG 标签位于前 10-30 KB;YouTube 页面超过 2 MB,但我们只下载所需的部分
3. **SAX 解析** — 将 HTML 作为字符流处理,无需构建 DOM。解析时间约 2 ms
4. **头部优先解析** — 所有标准元数据都在 `` 中。通过 `includeBodyContent: true` (或 `presets.quality`) 可选启用 Body 扫描以获取 JSON-LD 和 `
` 回退
5. **零额外 HTTP 调用** — 默认不从 Google API 获取 favicon,不进行 oEmbed 解析
## 框架示例
| Example | Runtime | Description |
| ------- | ------- | ----------- |
| [Next.js API Route](./examples/nextjs-app-router) | Node | App Router route handler |
| [Express](./examples/express-api) | Node | `/api/preview?url=` endpoint |
| [Cloudflare Worker](./examples/cloudflare-worker) | Edge | Deploy link previews to the edge |
| [React Component](./examples/react-preview-card) | Browser | `` card component |
| [Supabase Edge Function](./examples/supabase-edge-function) | Deno | Edge function for Supabase projects |
| [Bun Server](./examples/bun-server) | Bun | Minimal Bun.serve() example |
## 许可证
MIT
## 贡献
欢迎贡献。请在提交 Pull Request 前阅读 `CONTRIBUTING.md`。
## 安全
如果发现漏洞,请遵循 `SECURITY.md` 中的报告流程。
标签:BeEF, Bun, Deno, GNU通用公共许可证, MITM代理, Node.js, Open Graph, SAX流式解析, SEO, SSRF防护, Twitter Cards, TypeScript, Web Scraping, 元数据提取, 元标签, 单依赖, 后端开发, 安全插件, 爬虫, 网页摘要, 轻量级库, 进程保护, 链接预览