Fix multibyte characters in some regexes of RegexFilter #306

Merged: 1 commit, Sep 20, 2023
8 changes: 4 additions & 4 deletions src/Transformers/RegexFilter.php
@@ -59,7 +59,7 @@ class RegexFilter implements Transformer
*
* @var literal-string
*/
- public const EXTRA_CHARACTERS = '/([^\w\s])(?=[^\w\s]*\1)/';
+ public const EXTRA_CHARACTERS = '/([^\w\s])(?=[^\w\s]*\1)/u';

/**
* Matches consecutively repeated words.
@@ -73,7 +73,7 @@ class RegexFilter implements Transformer
*
* @var literal-string
*/
- public const EXTRA_WHITESPACE = '/\s(?=\s+)/';
+ public const EXTRA_WHITESPACE = '/\s(?=\s+)/u';

/**
* A pattern to match unicode emojis.
@@ -87,14 +87,14 @@ class RegexFilter implements Transformer
*
* @var literal-string
*/
- public const MENTION = '/(@\w+)/';
+ public const MENTION = '/(@\w+)/u';

/**
* A pattern to match Twitter-style hashtags (ex. #MachineLearning).
*
* @var literal-string
*/
- public const HASHTAG = '/(#\w+)/';
+ public const HASHTAG = '/(#\w+)/u';

/**
* A list of regular expression patterns used to filter the text columns of the dataset.
4 changes: 2 additions & 2 deletions tests/Transformers/RegexFilterTest.php
@@ -33,7 +33,7 @@ protected function setUp() : void
['Too weird to live, [email protected] too rare to die https://rubixml.com'],
['A man who procrastinates in @his choosing will inevitably have his choice made for him by #circumstance'],
['The quick quick brown fox jumped over the lazy man sitting at a bus stop drinking a can of Cola cola'],
['Diese äpfel Äpfel schmecken sehr gut'],
['Diese äpfel Äpfel schmecken sehr gut'],
]);

$this->transformer = new RegexFilter([
@@ -68,7 +68,7 @@ public function transform() : void
['Too weird to live, too rare to die '],
['A man who procrastinates in choosing will inevitably have his choice made for him by '],
['The quick brown fox jumped over the lazy man sitting at a bus stop drinking a can of cola'],
['Diese Äpfel schmecken sehr gut'],
['Diese Äpfel schmecken sehr gut'],
];

$this->assertEquals($expected, $this->dataset->samples());
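
The diff only appends the `u` modifier, so a small standalone sketch may help show why that matters. It assumes PHP's default "C" locale and a PCRE build with Unicode support; the helper names `removeMatches` and `isValidUtf8` and the sample strings are hypothetical and are not part of Rubix ML. Without `u`, the patterns work on raw bytes, so `\w` can stop in the middle of a multibyte character and `[^\w\s]` can delete a single byte of one.

```php
<?php

declare(strict_types=1);

// Patterns as they appear in the diff: the old byte-oriented form and the new form with `u`.
const MENTION_BYTES = '/(@\w+)/';
const MENTION_UTF8 = '/(@\w+)/u';

const EXTRA_CHARACTERS_BYTES = '/([^\w\s])(?=[^\w\s]*\1)/';
const EXTRA_CHARACTERS_UTF8 = '/([^\w\s])(?=[^\w\s]*\1)/u';

/**
 * Hypothetical helper (not part of Rubix ML): remove every match of a pattern.
 */
function removeMatches(string $pattern, string $text) : string
{
    // preg_replace() returns null on error, so fall back to the untouched input.
    return preg_replace($pattern, '', $text) ?? $text;
}

/**
 * Hypothetical helper: true when the string is well-formed UTF-8.
 */
function isValidUtf8(string $text) : bool
{
    // With the `u` modifier, preg_match() does not return 1 for malformed UTF-8 subjects.
    return preg_match('//u', $text) === 1;
}

// A mention whose handle contains a non-ASCII letter.
$mention = 'Danke @müller für den Hinweis';

$mentionBytes = removeMatches(MENTION_BYTES, $mention);
$mentionUtf8 = removeMatches(MENTION_UTF8, $mention);

// Multibyte letters with no punctuation at all, so nothing should be removed.
$cjk = '日本語';

$cjkBytes = removeMatches(EXTRA_CHARACTERS_BYTES, $cjk);
$cjkUtf8 = removeMatches(EXTRA_CHARACTERS_UTF8, $cjk);

var_dump($mentionBytes);          // byte-oriented result, may keep part of the handle
var_dump($mentionUtf8);           // the whole handle is removed
var_dump(isValidUtf8($cjkBytes)); // the byte-oriented pattern can leave malformed UTF-8
var_dump($cjkUtf8 === $cjk);      // with `u`, the letters are left untouched
```

With `u`, PHP treats both pattern and subject as UTF-8 and `\w` also covers non-ASCII letters, which is what the `äpfel Äpfel` test case in this PR appears to exercise.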