Commit

Merge pull request #323 from extractus/7.2.6
v7.2.6 - Migrate to extractus org
ndaidong authored Nov 30, 2022
2 parents 8acc3d3 + edfcc1d commit f31c80f
Showing 21 changed files with 145 additions and 136 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/ci-test.yml
@@ -29,8 +29,8 @@ jobs:
npm run build --if-present
npm run test
- name: sync to coveralls
uses: coverallsapp/github-action@v1.1.2
- name: Coveralls GitHub Action
uses: coverallsapp/github-action@1.1.3
with:
github-token: ${{ secrets.GITHUB_TOKEN }}

34 changes: 19 additions & 15 deletions CONTRIBUTING.md
@@ -1,20 +1,24 @@
# Contributing to article-parser
# Contributing to `@extractus/article-extractor`

While `article-parser` is just a simple library with personal purpose, I'm happy if it can be useful for you too.
Glad to see you here.

Anyway, I hope it always gets better, so pull requests are welcome, though larger proposals should be discussed first.
Collaborations and pull requests are always welcomed, though larger proposals should be discussed first.

As an OSS, it should follow the Unix philosophy: "do one thing and do it well".
As an OSS, it's better to follow the Unix philosophy: "do one thing and do it well".

## Installation
## What can you contribute?

- Ensure you have `node` and `npm` installed.
- After cloning the repository, run `npm install` in the root of the repository.
- Run `npm test` to ensure that everything works correctly in your environment.
We are planning to rewrite this tool in TypeScript and make it a Deno-first library.
If you are interested, please join our team.

If it works well, you can start modifying your fork.
Besides that, you can also:

In this process, you can use [`npm run eval` command](https://github.com/ndaidong/article-parser#quick-evaluation) to evaluate your changes.
- Test and report bugs
- Fix unresolved issues
- Improve performance
- Write better documentation
- Create examples or build demos
- Feedback on software design and APIs


## Third-party libraries
@@ -32,7 +36,7 @@ Make sure your code lints before opening a pull request.


```bash
cd article-parser
cd article-extractor

# check coding convention issue
npm run lint
@@ -49,18 +53,18 @@ npm run lint:fix
Be sure to run the unit test suite before opening a pull request. An example test run is shown below.

```bash
cd article-parser
cd article-extractor
npm test
```

![feed-reader unit test](https://i.imgur.com/1ycj7Ks.png)
![article-extractor unit test](https://i.imgur.com/1ycj7Ks.png)

If test coverage decreased, please check test scripts and try to improve this number.


## Documentation

If you've changed APIs, please update README and [the examples](https://github.com/ndaidong/article-parser/tree/main/examples).
If you've changed APIs, please update README and [the examples](examples).


## Clean commit histories
@@ -79,6 +83,6 @@ For people new to git, please refer the following guides:

## License

By contributing to `article-parser`, you agree that your contributions will be licensed under its [MIT license](https://github.com/ndaidong/article-parser/blob/main/LICENSE).
By contributing to `@extractus/article-extractor`, you agree that your contributions will be licensed under its [MIT license](LICENSE).

---
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
The MIT License (MIT)

Copyright (c) 2016 Dong Nguyen
Copyright (c) 2016 Extractus

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
98 changes: 49 additions & 49 deletions README.md
@@ -1,35 +1,27 @@
# article-parser
# @extractus/article-extractor

Extract main article, main image and meta data from URL.

[![NPM](https://badge.fury.io/js/article-parser.svg)](https://badge.fury.io/js/article-parser)
![CI test](https://github.com/ndaidong/article-parser/workflows/ci-test/badge.svg)
[![Coverage Status](https://coveralls.io/repos/github/ndaidong/article-parser/badge.svg)](https://coveralls.io/github/ndaidong/article-parser)
![CodeQL](https://github.com/ndaidong/article-parser/workflows/CodeQL/badge.svg)
[![npm version](https://badge.fury.io/js/@extractus%2Farticle-extractor.svg)](https://badge.fury.io/js/@extractus%2Farticle-extractor)
![CI test](https://github.com/extractus/article-extractor/workflows/ci-test/badge.svg)
[![Coverage Status](https://img.shields.io/coveralls/github/extractus/article-extractor)](https://coveralls.io/github/extractus/article-extractor?branch=main)
![CodeQL](https://github.com/extractus/article-extractor/workflows/CodeQL/badge.svg)
[![JavaScript Style Guide](https://img.shields.io/badge/code_style-standard-brightgreen.svg)](https://standardjs.com)


## Intro

*article-parser* is a part of tool sets for content builder:
*article-extractor* is a part of tool sets for content builder:

- [feed-reader](https://github.com/ndaidong/feed-reader): extract & normalize RSS/ATOM/JSON feed
- [article-parser](https://github.com/ndaidong/article-parser): extract main article from given URL
- [oembed-parser](https://github.com/ndaidong/oembed-parser): extract oEmbed data from supported providers
- [feed-extractor](https://github.com/extractus/feed-extractor): extract & normalize RSS/ATOM/JSON feed
- [article-extractor](https://github.com/extractus/article-extractor): extract main article from given URL
- [oembed-extractor](https://github.com/extractus/oembed-extractor): extract oEmbed data from supported providers

You can use one or combination of these tools to build news sites, create automated content systems for marketing campaign or gather dataset for NLP projects...

```
┌────────────────┐
┌───────► article-parser ├──────────┐
│ └────────────────┘ │
┌─────────────┐ ┌─────────┴────┐ ┌────────▼─────────┐ ┌─────────────┐
│ feed-reader ├───► feed entries │ │ content database ├───► public APIs │
└─────────────┘ └─────────┬────┘ └────────▲─────────┘ └─────────────┘
│ ┌────────────────┐ │
└───────► oembed-parser ├──────────┘
└────────────────┘
```
### Attention

`article-parser` has been renamed to `@extractus/article-extractor` since v7.2.5

## Demo

@@ -42,39 +34,43 @@ You can use one or combination of these tools to build news sites, create automa
### Node.js

```bash
npm i article-parser
npm i @extractus/article-extractor

# pnpm
pnpm i article-parser
pnpm i @extractus/article-extractor

# yarn
yarn add article-parser
yarn add @extractus/article-extractor
```

```ts
// es6 module
import { extract } from 'article-parser'
import { extract } from '@extractus/article-extractor'

// CommonJS
const { extract } = require('article-parser')
const { extract } = require('@extractus/article-extractor')

// or specify exactly path to CommonJS variant
const { extract } = require('article-parser/dist/cjs/article-parser.js')
const { extract } = require('@extractus/article-extractor/dist/cjs/article-extractor.js')
```

### Deno

```ts
import { extract } from 'https://esm.sh/article-parser'
// deno >= 1.28
import { extract } from 'npm:@extractus/article-extractor'

// deno < 1.28
// import { extract } from 'https://esm.sh/@extractus/article-extractor'
```

### Browser

```ts
import { extract } from 'https://unpkg.com/article-parser@latest/dist/article-parser.esm.js'
import { extract } from 'https://unpkg.com/@extractus/article-extractor@latest/dist/article-extractor.esm.js'
```

Please check [the examples](https://github.com/ndaidong/article-parser/tree/main/examples) for reference.
Please check [the examples](examples) for reference.


### Deta cloud
@@ -117,7 +113,7 @@ URL string links to the article or HTML content of that web page.
For example:

```js
import { extract } from 'article-parser'
import { extract } from '@extractus/article-extractor'

const input = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'
extract(input)
@@ -157,12 +153,14 @@ Object with all or several of the following properties:
For example:

```js
import { extract } from 'article-parser'
import { extract } from '@extractus/article-extractor'

extract('https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html', {
const article = await extract('https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html', {
descriptionLengthThreshold: 120,
contentLengthThreshold: 500
})

console.log(article)
```

##### `fetchOptions` *optional*
@@ -172,26 +170,28 @@ You can use this param to set request headers to [fetch](https://developer.mozil
For example:

```js
import { extract } from 'article-parser'
import { extract } from '@extractus/article-extractor'

const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'
extract(url, null, {
const article = await extract(url, null, {
headers: {
'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'
}
})

console.log(article)
```

You can also specify a proxy endpoint to load remote content, instead of fetching directly.

For example:

```js
import { extract } from 'article-parser'
import { extract } from '@extractus/article-extractor'

const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'

extract(url, null, {
await extract(url, null, {
headers: {
'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'
},
@@ -204,7 +204,7 @@ extract(url, null, {
})
```

Passing requests to proxy is useful while running `article-parser` on browser. View [examples/browser-article-parser](https://github.com/ndaidong/article-parser/tree/main/examples/browser-article-parser) as reference example.
Passing requests to proxy is useful while running `@extractus/article-extractor` on browser. View [examples/browser-article-parser](examples/browser-article-parser) as reference example.

For more info about proxy authentication, please refer [HTTP authentication](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication)
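
As a rough illustration of the proxy idea (not the library's actual implementation), a proxy endpoint commonly receives the article address as an encoded query value. The sketch below assumes a hypothetical endpoint `https://proxy.example/load?url=` and a hypothetical helper named `buildProxiedUrl`; neither comes from the package.

```js
// Hypothetical helper: compose the request URL for a proxy that expects
// the target address as an encoded query-string value.
function buildProxiedUrl (proxyTarget, articleUrl) {
  // encodeURIComponent keeps "?", "&" and "/" in the article URL from
  // being read as part of the proxy's own URL
  return proxyTarget + encodeURIComponent(articleUrl)
}

const proxied = buildProxiedUrl(
  'https://proxy.example/load?url=', // assumed endpoint, for illustration only
  'https://example.com/news/story.html?id=1&lang=en'
)

console.log(proxied)
```

Encoding the whole article URL matters because its own query string would otherwise be split across the proxy's parameters.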

@@ -227,7 +227,7 @@ At first, let's talk about `transformation` object.

#### `transformation` object

In `article-parser`, `transformation` is an object with the following properties:
In `@extractus/article-extractor`, `transformation` is an object with the following properties:

- `patterns`: required, a list of regexps to match the URLs
- `pre`: optional, a function to process raw HTML
@@ -240,11 +240,11 @@ Basically, the meaning of `transformation` can be interpreted like this:
> then extract main article content with normalized HTML, and if success <br>
> let's run `post` function to normalize extracted article content
![article-parser extraction process](https://res.cloudinary.com/pwshub/image/upload/v1657336822/documentation/article-parser_extraction_process.png)
![article-extractor extraction process](https://res.cloudinary.com/pwshub/image/upload/v1657336822/documentation/article-parser_extraction_process.png)

Here is an example transformation:

```js
```ts
{
patterns: [
/([\w]+.)?domain.tld\/*/,
@@ -288,8 +288,8 @@ Here is an example transformation:

Add a single transformation or a list of transformations. For example:

```js
import { addTransformations } from 'article-parser'
```ts
import { addTransformations } from '@extractus/article-extractor'

addTransformations({
patterns: [
@@ -344,7 +344,7 @@ To remove transformations that match the specific patterns.
For example, we can remove all added transformations above:

```js
import { removeTransformations } from 'article-parser'
import { removeTransformations } from '@extractus/article-extractor'
removeTransformations([
/([\w]+.)?abc.tld\/*/,
@@ -384,21 +384,21 @@ Suppose that we have the following transformations:

As you can see, an article from `goo.gl` certainly matches both them.

In this scenario, `article-parser` will execute both transformations, one by one:
In this scenario, `@extractus/article-extractor` will execute both transformations, one by one:

`function_one` -> `function_three` -> extraction -> `function_two` -> `function_four`
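
The ordering above can be reproduced with a small self-contained sketch. It is a simplified model under stated assumptions, not the package's code: `runPipeline` and `fakeExtract` are hypothetical names, the patterns target `example.com` for illustration, and the extraction step is replaced by a stand-in that only trims the input.

```js
// Two hypothetical transformations whose patterns both match the same URL.
// The call log records the order in which the hooks run.
const calls = []

const transformations = [
  {
    patterns: [/example\.com\/.*/],
    pre: (html) => { calls.push('function_one'); return html },
    post: (html) => { calls.push('function_two'); return html }
  },
  {
    patterns: [/example\.com\/news\/.*/],
    pre: (html) => { calls.push('function_three'); return html },
    post: (html) => { calls.push('function_four'); return html }
  }
]

// Stand-in for the real extraction step (assumption: string in, string out)
const fakeExtract = (html) => { calls.push('extraction'); return html.trim() }

function runPipeline (url, html) {
  // keep every transformation with at least one pattern matching the URL
  const matched = transformations.filter((t) => t.patterns.some((re) => re.test(url)))
  // run all `pre` hooks first, in registration order
  const prepared = matched.reduce((acc, t) => (t.pre ? t.pre(acc) : acc), html)
  const extracted = fakeExtract(prepared)
  // then run all `post` hooks, in the same order
  return matched.reduce((acc, t) => (t.post ? t.post(acc) : acc), extracted)
}

const output = runPipeline('https://example.com/news/a.html', '  <p>Hello</p>  ')
console.log(calls.join(' -> '))
// prints "function_one -> function_three -> extraction -> function_two -> function_four"
```

Both `pre` and `post` are optional in this model, matching the property list given for the `transformation` object.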

---

### `sanitize-html`'s options

`article-parser` uses [sanitize-html](https://www.npmjs.com/package/sanitize-html) to make a clean sweep of HTML content.
`@extractus/article-extractor` uses [sanitize-html](https://www.npmjs.com/package/sanitize-html) to make a clean sweep of HTML content.

Here is the [default options](https://github.com/ndaidong/article-parser/blob/main/src/config.js#L5)
Here is the [default options](src/config.js#L5)

Depending on the needs of your content system, you might want to gather some HTML tags/attributes, while ignoring others.

There are 2 methods to access and modify these options in `article-parser`.
There are 2 methods to access and modify these options in `@extractus/article-extractor`.

- `getSanitizeHtmlOptions()`
- `setSanitizeHtmlOptions(Object sanitizeHtmlOptions)`
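
A minimal sketch of how the two methods might be combined to allow extra tags. The helper `withExtraTags` is hypothetical, and the `defaults` object is only a stand-in for what `getSanitizeHtmlOptions()` could return; `allowedTags` is a standard sanitize-html option, but whether `setSanitizeHtmlOptions` merges with or replaces the current options is not stated here, so the helper builds a complete options object.

```js
// Hypothetical pure helper: return a copy of a sanitize-html options object
// whose `allowedTags` list also contains the given tags (no duplicates).
function withExtraTags (options, extraTags) {
  const current = Array.isArray(options.allowedTags) ? options.allowedTags : []
  return {
    ...options,
    allowedTags: [...new Set([...current, ...extraTags])]
  }
}

// Stand-in for the value returned by getSanitizeHtmlOptions() (assumed shape)
const defaults = {
  allowedTags: ['p', 'a', 'img'],
  allowedAttributes: { a: ['href'] }
}

const updated = withExtraTags(defaults, ['figure', 'img'])
console.log(updated.allowedTags) // [ 'p', 'a', 'img', 'figure' ]

// With the package installed, the same idea would presumably read:
//   setSanitizeHtmlOptions(withExtraTags(getSanitizeHtmlOptions(), ['figure']))
```

Returning a new object instead of mutating the input keeps the original defaults available for comparison or rollback.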
@@ -410,8 +410,8 @@ Read [sanitize-html](https://www.npmjs.com/package/sanitize-html#what-are-the-de
## Quick evaluation

```bash
git clone https://github.com/ndaidong/article-parser.git
cd article-parser
git clone https://github.com/extractus/article-extractor.git
cd article-extractor
pnpm i
npm run eval {URL_TO_PARSE_ARTICLE}
2 changes: 1 addition & 1 deletion SECURITY.md
@@ -12,6 +12,6 @@ Description above is a general rule and may be altered on case by case basis.

You can report low severity vulnerabilities as GitHub issues.

More severe vulnerabilities should be reported to my email [email protected] or Twitter [@ndaidong](https://twitter.com/ndaidong).
More severe vulnerabilities should be reported to email extractus.security@skiff.com.

---
6 changes: 3 additions & 3 deletions build.js
@@ -38,7 +38,7 @@ const esmVersion = {
...baseOpt,
platform: 'browser',
format: 'esm',
outfile: `dist/${pkg.name}.esm.js`,
outfile: 'dist/article-extractor.esm.js',
banner: {
js: comment
}
@@ -50,7 +50,7 @@ const cjsVersion = {
platform: 'node',
format: 'cjs',
mainFields: ['main'],
outfile: `dist/cjs/${pkg.name}.js`,
outfile: 'dist/cjs/article-extractor.js',
banner: {
js: comment
}
@@ -60,7 +60,7 @@ buildSync(cjsVersion)
const cjspkg = {
name: pkg.name,
version: pkg.version,
main: `./${pkg.name}.js`
main: './article-extractor.js'
}

writeFileSync(
4 changes: 2 additions & 2 deletions build.test.js
@@ -9,8 +9,8 @@ import {

const pkg = JSON.parse(readFileSync('./package.json'))

const esmFile = `./dist/${pkg.name}.esm.js`
const cjsFile = `./dist/cjs/${pkg.name}.js`
const esmFile = './dist/article-extractor.esm.js'
const cjsFile = './dist/cjs/article-extractor.js'
const cjsPkg = JSON.parse(readFileSync('./dist/cjs/package.json'))
const cjsType = './dist/cjs/index.d.ts'
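
The hunks in `build.js` and `build.test.js` replace `${pkg.name}` with a literal file name. The commit does not say why, but one plausible reason is that a scoped package name contains a slash, so interpolating it into a path yields a nested directory. The sketch below shows that effect and one alternative to hard-coding; `unscopedName` is a hypothetical helper, not part of the repository.

```js
// Hypothetical helper: drop the "@scope/" prefix from an npm package name
function unscopedName (pkgName) {
  return pkgName.startsWith('@')
    ? pkgName.slice(pkgName.indexOf('/') + 1)
    : pkgName
}

const scoped = '@extractus/article-extractor'

// Interpolating the scoped name directly produces a nested path
const nestedPath = `dist/${scoped}.esm.js`
console.log(nestedPath) // dist/@extractus/article-extractor.esm.js

// Stripping the scope gives the flat file name used in the diff
const flatPath = `dist/${unscopedName(scoped)}.esm.js`
console.log(flatPath) // dist/article-extractor.esm.js
```

Hard-coding the name, as the commit does, is simpler; deriving it keeps the build script valid if the package is renamed again.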

4 changes: 2 additions & 2 deletions dist/article-parser.esm.js → dist/article-extractor.esm.js

Large diffs are not rendered by default.

64 changes: 32 additions & 32 deletions dist/cjs/article-parser.js → dist/cjs/article-extractor.js

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions dist/cjs/package.json
@@ -1,5 +1,5 @@
{
"name": "article-parser",
"version": "7.2.5",
"main": "./article-parser.js"
"name": "@extractus/article-extractor",
"version": "7.2.6",
"main": "./article-extractor.js"
}