Merge pull request #305 from ndaidong/7.2.2
v7.2.2
ndaidong authored Sep 23, 2022
2 parents 64e308d + d96bddc commit 22f4dab
Showing 28 changed files with 584 additions and 38,617 deletions.
180 changes: 118 additions & 62 deletions README.md
@@ -13,7 +13,7 @@ Extract main article, main image and meta data from URL.
## Demo

- [Give it a try!](https://demos.pwshub.com/article-parser)
- [Example FaaS](https://extract-article.deta.dev/?url=https://www.freethink.com/technology/virtual-world)

## Install & Usage

@@ -30,87 +30,153 @@ yarn add article-parser
```

```js
// es6 module
import { extract } from 'article-parser'

// CommonJS
const { extract } = require('article-parser')

// or specify the exact path to the CommonJS variant
const { extract } = require('article-parser/dist/cjs/article-parser.js')
```

### Deno

```ts
import { extract } from 'https://esm.sh/article-parser'
```
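
For a quick start, here is a minimal Deno sketch (run it with network access, e.g. `deno run --allow-net`); the import and the sample article URL are the ones used elsewhere in this README:

```js
import { extract } from 'https://esm.sh/article-parser'

// Deno modules support top-level await
const article = await extract('https://www.freethink.com/technology/virtual-world')
console.log(article)
```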

### Browser

```js
import { extract } from 'https://unpkg.com/article-parser@latest/dist/article-parser.esm.js'
```

Please check [the examples](https://github.com/ndaidong/article-parser/tree/main/examples) for reference.


## APIs

- [.extract()](#extract)
- [Transformations](#transformations)
- [`transformation` object](#transformation-object)
- [.addTransformations](#addtransformationsobject-transformation--array-transformations)
- [.removeTransformations](#removetransformationsarray-patterns)
- [Priority order](#priority-order)
- [Configuration methods](#configuration-methods)
- [`sanitize-html`'s options](#sanitize-htmls-options)

---

### `extract()`

Loads and extracts article data. Returns a Promise.

#### Syntax

```ts
extract(String input)
extract(String input, Object parserOptions)
extract(String input, Object parserOptions, Object fetchOptions)
```

#### Parameters

##### `input` *required*

A URL string that links to the article, or the HTML content of that web page.

For example:

```js
import { extract } from 'article-parser'

const input = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'

extract(input)
  .then(article => console.log(article))
  .catch(err => console.error(err))
```

The result, `article`, can be `null` or an object with the following structure:

```ts
{
  url: String,
  title: String,
  description: String,
  image: String,
  author: String,
  content: String,
  published: Date String,
  source: String, // original publisher
  links: Array, // list of alternative links
  ttr: Number, // time to read in seconds, 0 = unknown
}
```

[Click here](https://extract-article.deta.dev/?url=https://www.freethink.com/technology/virtual-world) to see an actual result.
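
A small consumption sketch follows, assuming the structure above; since `article` can be `null`, guard before reading its fields (the URL is the sample one used throughout this README):

```js
import { extract } from 'article-parser'

const run = async () => {
  const article = await extract('https://www.freethink.com/technology/virtual-world')
  if (!article) {
    console.log('Nothing could be extracted')
    return
  }
  // Pick a few fields from the documented structure
  const { title, author, ttr, content } = article
  console.log(`${title} by ${author}, ~${ttr}s to read`)
  console.log(`${content.length} characters of HTML content`)
}

run().catch(console.error)
```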


##### `parserOptions` *optional*

An object with all or some of the following properties:

- `wordsPerMinute`: Number, used to estimate the time to read (see the sketch after the example below). Default `300`.
- `descriptionTruncateLen`: Number, maximum number of characters generated for the description. Default `210`.
- `descriptionLengthThreshold`: Number, minimum number of characters required for the description. Default `180`.
- `contentLengthThreshold`: Number, minimum number of characters required for the content. Default `200`.

For example:

```js
import { extract } from 'article-parser'

extract('https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html', {
  descriptionLengthThreshold: 120,
  contentLengthThreshold: 500
})
```
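
As a rough, assumed sketch of how `wordsPerMinute` relates to the `ttr` field (the library's actual formula is not shown here, so treat this only as an illustration of the documented meaning of the option):

```js
// Assumed relation: ttr (seconds) ≈ word count / wordsPerMinute * 60
const estimateTtr = (text, wordsPerMinute = 300) => {
  const wordCount = text.trim().split(/\s+/).length
  return Math.round((wordCount / wordsPerMinute) * 60)
}

// A 1500-word article at the default 300 wpm comes out to about 300 seconds
console.log(estimateTtr('word '.repeat(1500))) // 300
```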

##### `fetchOptions` *optional*

You can use this parameter to set the request headers passed to [fetch](https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch).

For example:

```js
import { extract } from 'article-parser'

const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'
extract(url, null, {
  headers: {
    'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'
  }
})
```

You can also specify a proxy endpoint to load remote content instead of fetching it directly.

For example:

```js
import { extract } from 'article-parser'

const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'

extract(url, null, {
  headers: {
    'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'
  },
  proxy: {
    target: 'https://your-secret-proxy.io/loadXml?url=',
    headers: {
      'Proxy-Authorization': 'Bearer YWxhZGRpbjpvcGVuc2VzYW1l...'
    }
  }
})
```

Passing requests through a proxy is useful when running `article-parser` in the browser. See `examples/browser-article-parser` for a reference example.
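
The proxy endpoint itself is not part of `article-parser`. The sketch below is a hypothetical minimal implementation (the route shape mirrors the `loadXml?url=` target above; the port and handler are made up) that fetches the target page server-side and returns its HTML, which is enough for the browser build to parse:

```js
// Hypothetical proxy for browser usage. Requires Node 18+ for the built-in fetch.
import { createServer } from 'node:http'

createServer(async (req, res) => {
  const { searchParams } = new URL(req.url, 'http://localhost')
  const target = searchParams.get('url')
  if (!target) {
    res.writeHead(400).end('Missing url parameter')
    return
  }
  try {
    const html = await (await fetch(target)).text()
    res.writeHead(200, {
      'content-type': 'text/html; charset=utf-8',
      'access-control-allow-origin': '*' // let the browser app read the response
    }).end(html)
  } catch (err) {
    res.writeHead(502).end(String(err))
  }
}).listen(8080)
```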

---

### Transformations
@@ -119,9 +185,7 @@ Sometimes the default extraction algorithm may not work well. That is the time when we need transformations.

By adding functions before and after the main extraction step, we aim to produce as good a result as possible.

There are 2 methods to play with transformations:

- `addTransformations(Object transformation | Array transformations)`
- `removeTransformations(Array patterns)`
@@ -134,7 +198,7 @@ In `article-parser`, `transformation` is an object with the following properties:

- `patterns`: required, a list of regexps to match the URLs
- `pre`: optional, a function to process raw HTML
- `post`: optional, a function to process the extracted article

Basically, the meaning of `transformation` can be interpreted like this:

@@ -264,7 +328,7 @@ While processing an article, more than one transformation can be applied.

Suppose that we have the following transformations:

```ts
[
  {
    patterns: [
```

@@ -293,26 +357,18 @@ In this scenario, `article-parser` will execute both transformations, one by one
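
For reference, here is a minimal sketch of registering and removing transformations with the two methods above. It assumes both functions are exported from the package root and that `pre`/`post` each receive and return a DOM-like document; the domain and selectors are made up for illustration:

```js
import { addTransformations, removeTransformations, extract } from 'article-parser'

addTransformations([
  {
    patterns: [/([\w]+\.)?example\.com\/*/], // hypothetical domain
    pre: (document) => {
      // strip elements that tend to confuse extraction
      document.querySelectorAll('.ads, .related-posts').forEach((el) => el.remove())
      return document
    },
    post: (document) => {
      // drop 1px tracking images from the extracted content
      document.querySelectorAll('img[width="1"]').forEach((el) => el.remove())
      return document
    }
  }
])

extract('https://example.com/path/to/article').then(console.log)

// Later, unregister by the same patterns
removeTransformations([/([\w]+\.)?example\.com\/*/])
```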

---

### `sanitize-html`'s options

`article-parser` uses [sanitize-html](https://www.npmjs.com/package/sanitize-html) to make a clean sweep of HTML content.

Here are the [default options](https://github.com/ndaidong/article-parser/blob/main/src/config.js#L5).

Depending on the needs of your content system, you might want to keep some HTML tags/attributes while ignoring others.

There are 2 methods to access and modify these options in `article-parser`:

- `getSanitizeHtmlOptions()`
- `setSanitizeHtmlOptions(Object sanitizeHtmlOptions)`

Read the [sanitize-html](https://www.npmjs.com/package/sanitize-html#what-are-the-default-options) docs for more info.
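
For illustration, a small sketch of extending the allowed tags and attributes via these two methods; `allowedTags` and `allowedAttributes` are standard `sanitize-html` option names, but the exact way `setSanitizeHtmlOptions()` merges with the defaults is assumed here, so check the linked config:

```js
import { getSanitizeHtmlOptions, setSanitizeHtmlOptions, extract } from 'article-parser'

// Start from the current options and allow a few extra tags/attributes
const options = getSanitizeHtmlOptions()
setSanitizeHtmlOptions({
  ...options,
  allowedTags: [...(options.allowedTags || []), 'figure', 'figcaption'],
  allowedAttributes: {
    ...options.allowedAttributes,
    img: ['src', 'alt', 'title']
  }
})

extract('https://www.freethink.com/technology/virtual-world').then(console.log)
```
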
2 changes: 1 addition & 1 deletion build.js
@@ -29,7 +29,7 @@ const baseOpt = {
  target: ['es2020', 'node14'],
  pure: ['console.log', 'debug', 'alert'],
  legalComments: 'none',
  minify: true,
  sourcemap: false,
  write: true
}
