html-get
html-get returns reliable HTML for a URL, choosing fetch or prerender depending on page needs.
Quick Start
Install:
bash
npm install html-get browserless puppeteer
Minimal usage:
js
const createBrowserless = require('browserless')
const getHTML = require('html-get')
const browser = createBrowserless()
const context = browser.createContext()
const result = await getHTML('https://example.com', {
getBrowserless: () => context
})
console.log(result.html)
await context((browserless) => browserless.destroyContext())
await browser.close()
Recommended Workflow
- •Start with default
prerender: 'auto'. - •Set
prerender: falsefor static pages when speed is priority. - •Enable
rewriteUrls: truewhen downstream parsing needs absolute links. - •Enable
rewriteHtml: truewhen source pages have broken meta tags. - •Reuse one browser process and create/destroy contexts per request.
CLI
One-off usage:
bash
npx -y html-get https://example.com
Debug output with mode, timing, and headers:
bash
npx -y html-get https://example.com --debug
Core Options
- •
getBrowserless(function): required unlessprerender: false. - •
prerender('auto' | true | false): mode selector. - •
rewriteUrls(boolean): rewrite relative HTML/CSS URLs to absolute. - •
rewriteHtml(boolean): normalize common meta-tag mistakes. - •
headers(object): request headers for fetch/prerender. - •
gotOpts(object): extra options forgotin fetch mode. - •
puppeteerOpts(object): options passed to browserless evaluate flow. - •
serializeHtml(function): custom output serializer from Cheerio instance. - •
encoding(string): output encoding, defaultutf-8.
Output Shape
getHTML(url, opts) resolves to:
- •
html: serialized HTML (or custom serializer output fields). - •
url: final URL. - •
statusCode: HTTP status. - •
headers: response headers. - •
redirects: redirect chain. - •
stats:{ mode, timing }.
Common Patterns
Force fast fetch mode for known static targets:
js
const result = await getHTML(url, {
prerender: false,
rewriteUrls: true
})
Prepare HTML for metadata extraction:
js
const page = await getHTML(url, {
getBrowserless,
rewriteUrls: true,
rewriteHtml: true
})
const metadata = await metascraper({ url: page.url, html: page.html })
Custom serializer (avoid returning full HTML):
js
const result = await getHTML(url, {
getBrowserless,
serializeHtml: ($) => ({
html: $.html(),
title: $('title').first().text()
})
})
Reliability Notes
- •If
getBrowserlessis missing andprerenderis notfalse,html-getthrows. - •PDF URLs are fetched and can be converted via
mutoolwhen available. - •Media URLs are normalized to HTML wrappers (
img,video,audio) for consistent downstream parsing. - •For large batch jobs, control concurrency outside
html-getand always clean up browser contexts.