thegruber/linkpeek

GitHub: thegruber/linkpeek

一个轻量高效的链接预览元数据提取库,支持 Node.js、Bun 和 Deno,通过流式解析在仅下载必要数据的同时获取网页的标题、描述、图片等 19 个字段。

Stars: 0 | Forks: 0

# linkpeek **适用于 Node.js、Bun 和 Deno 的链接预览提取工具。仅一个依赖。** [![npm](https://img.shields.io/npm/v/linkpeek)](https://www.npmjs.com/package/linkpeek) [![bundle size](https://img.shields.io/bundlephobia/minzip/linkpeek)](https://bundlephobia.com/package/linkpeek) [![CI](https://static.pigsec.cn/wp-content/uploads/repos/2026/03/c5231eeb62154900.svg)](https://github.com/thegruber/linkpeek/actions/workflows/ci.yml) [![types](https://img.shields.io/npm/types/linkpeek)](https://www.npmjs.com/package/linkpeek) [![license](https://img.shields.io/npm/l/linkpeek)](LICENSE) ``` import { preview } from "linkpeek"; const result = await preview("https://www.youtube.com/watch?v=dQw4w9WgXcQ"); result.title; // "Rick Astley - Never Gonna Give You Up" result.image; // "https://i.ytimg.com/vi/dQw4w9WgXcQ/maxresdefault.jpg" result.siteName; // "YouTube" result.favicon; // "https://www.youtube.com/favicon.ico" result.description; // "The official video for \"Never Gonna Give You Up\"..." ``` ## 安装 ``` npm install linkpeek ``` 也支持通过 `bun add linkpeek` 和 `import { preview } from "npm:linkpeek"` (Deno) 使用。 ## 为什么选择 linkpeek - **1 个依赖** (htmlparser2) — 不是 4 个,也不是插件树 - **在 `` 处停止** — 仅下载 30 KB,而不是完整的 2 MB 页面 - **SAX 流式处理** — 无 DOM 构建,解析时间约 2 ms - **SSRF 防护** — 默认拦截私有/内部 IP - **支持 Node.js 20+、Bun 和 Deno** (已在 CI 中测试) — 仅使用 Web 标准 API ## 完整结果 从单个 URL 提取的所有 19 个字段: ``` { url: "https://www.youtube.com/watch?v=dQw4w9WgXcQ", title: "Rick Astley - Never Gonna Give You Up (Official Video) (4K Remaster)", description: "The official video for \"Never Gonna Give You Up\" by Rick Astley...", image: "https://i.ytimg.com/vi/dQw4w9WgXcQ/maxresdefault.jpg", imageWidth: 1280, imageHeight: 720, siteName: "YouTube", favicon: "https://www.youtube.com/favicon.ico", mediaType: "video.other", author: null, canonicalUrl: "https://www.youtube.com/watch?v=dQw4w9WgXcQ", locale: null, publishedDate: null, video: "https://www.youtube.com/embed/dQw4w9WgXcQ", twitterCard: "player", twitterSite: "@youtube", themeColor: null, keywords: ["rick astley", "Never Gonna Give You Up", "rick roll"], oEmbedUrl: "https://www.youtube.com/oembed?format=json&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DdQw4w9WgXcQ" } ``` ## 预设 ``` import { preview, presets } from "linkpeek"; // Default: fast (30 KB limit, head only, no meta-refresh) const result = await preview(url); // Quality: body JSON-LD + image fallback + meta-refresh const result = await preview(url, presets.quality); // Custom: spread a preset and override const result = await preview(url, { ...presets.quality, timeout: 3000 }); ``` | Preset | What it enables | | ----------------- | ------------------------------------------- | | `presets.fast` | Default behavior — explicit version of `{}` | | `presets.quality` | Body JSON-LD, image fallback, meta-refresh | ## 错误处理 `preview()` 针对无效输入和被阻止的 URL 会抛出异常: ``` try { const result = await preview(url); } catch (err) { // "Invalid URL" // "Only http and https URLs are supported" // "URLs pointing to private/internal networks are not allowed" console.error(err.message); } ``` ## API ### `preview(url, options?)` 获取 URL 并提取链接预览元数据。返回 `Promise`。 #### 选项 | Option | Type | Default | Description | | -------------------- | ------------------------ | ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | | `timeout` | `number` | `8000` | Request timeout in milliseconds | | `maxBytes` | `number` | `30_000` | Max bytes to stream | | `userAgent` | `string` | `"Twitterbot/1.0"` | User-Agent sent with requests. Twitterbot gets pre-rendered HTML from most platforms | | `followRedirects` | `boolean` | `true` | Follow HTTP 3xx redirects | | `headers` | `Record` | `{}` | Extra request headers (e.g. cookies, auth tokens) | | `allowPrivateIPs` | `boolean` | `false` | Allow fetching private/internal IPs. Keep `false` in production to prevent SSRF attacks | | `followMetaRefresh` | `boolean` | `false` | Follow `` redirects when no title is found. Enable to handle Cloudflare-challenged pages at the cost of an extra HTTP round-trip | | `includeBodyContent` | `boolean` | `false` | Continue scanning `` for JSON-LD scripts and `` fallbacks after ``. Enable together with a higher `maxBytes` for best quality | #### 结果字段 | Field | Type | Description | | --------------- | ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------- | | `url` | `string` | Final resolved URL | | `title` | `string \| null` | Page title (`og:title` → `twitter:title` → JSON-LD → ``) | | `description` | `string \| null` | Description (`og:description` → `twitter:description` → `meta[name=description]` → JSON-LD) | | `image` | `string \| null` | Preview image (`og:image` → `twitter:image` → JSON-LD → `itemprop=image` → first `<img>`) | | `imageWidth` | `number \| null` | Image width from `og:image:width` | | `imageHeight` | `number \| null` | Image height from `og:image:height` | | `siteName` | `string` | Site name (`og:site_name` → JSON-LD publisher → hostname fallback) | | `favicon` | `string \| null` | Favicon URL (largest `apple-touch-icon` → `link[rel=icon]` → `/favicon.ico`) | | `mediaType` | `string` | Content type from `og:type`, defaults to `"website"` | | `author` | `string \| null` | Author name (JSON-LD author → `meta[name=author]` → Dublin Core) | | `canonicalUrl` | `string` | Canonical URL (`link[rel=canonical]` → `og:url` → request URL) | | `locale` | `string \| null` | Locale from `og:locale` | | `publishedDate` | `string \| null` | Published date (`article:published_time` → JSON-LD `datePublished` → Dublin Core) | | `video` | `string \| null` | Video URL from `og:video` | | `twitterCard` | `string \| null` | Twitter card type (`summary`, `player`, `summary_large_image`) | | `twitterSite` | `string \| null` | Twitter @handle from `twitter:site` | | `themeColor` | `string \| null` | Theme color from `meta[name=theme-color]` | | `keywords` | `string[] \| null` | Keywords from `meta[name=keywords]` | | `oEmbedUrl` | `string \| null` | Discovered oEmbed endpoint URL from `<link rel="alternate" type="application/json+oembed">`. Not fetched — returned for the caller to resolve if needed | ### `parseHTML(html, baseUrl, options?)` 直接解析 HTML 字符串。当你已经拥有 HTML 内容时使用此方法。 ``` import { parseHTML } from "linkpeek"; const result = parseHTML( "<html><head><title>Hello", "https://example.com", ); console.log(result.title); // "Hello" ``` **参数:** - `html` (`string`) — 要解析的 HTML 内容 - `baseUrl` (`string`) — 用于解析相对 URL 的基础 URL - `options?` (`{ includeBodyContent?: boolean }`) — 传入 `{ includeBodyContent: true }` 以扫描 `` 中的 JSON-LD 和图像回退 返回 `PreviewResult`。 ## 工作原理 1. **Twitterbot User-Agent** — 从大多数平台获取预渲染的 HTML,完全跳过客户端渲染 2. **带字节限制的流式下载** — 30 KB 后中止(默认)。OG 标签位于前 10-30 KB;YouTube 页面超过 2 MB,但我们只下载所需的部分 3. **SAX 解析** — 将 HTML 作为字符流处理,无需构建 DOM。解析时间约 2 ms 4. **头部优先解析** — 所有标准元数据都在 `` 中。通过 `includeBodyContent: true` (或 `presets.quality`) 可选启用 Body 扫描以获取 JSON-LD 和 `` 回退 5. **零额外 HTTP 调用** — 默认不从 Google API 获取 favicon,不进行 oEmbed 解析 ## 框架示例 | Example | Runtime | Description | | ------- | ------- | ----------- | | [Next.js API Route](./examples/nextjs-app-router) | Node | App Router route handler | | [Express](./examples/express-api) | Node | `/api/preview?url=` endpoint | | [Cloudflare Worker](./examples/cloudflare-worker) | Edge | Deploy link previews to the edge | | [React Component](./examples/react-preview-card) | Browser | `` card component | | [Supabase Edge Function](./examples/supabase-edge-function) | Deno | Edge function for Supabase projects | | [Bun Server](./examples/bun-server) | Bun | Minimal Bun.serve() example | ## 许可证 MIT ## 贡献 欢迎贡献。请在提交 Pull Request 前阅读 `CONTRIBUTING.md`。 ## 安全 如果发现漏洞,请遵循 `SECURITY.md` 中的报告流程。
标签:BeEF, Bun, Deno, GNU通用公共许可证, MITM代理, Node.js, Open Graph, SAX流式解析, SEO, SSRF防护, Twitter Cards, TypeScript, Web Scraping, 元数据提取, 元标签, 单依赖, 后端开发, 安全插件, 爬虫, 网页摘要, 轻量级库, 进程保护, 链接预览