data/original/terminology/facebook: replace non-standard language code suffix XX (fititnt/hxltm-action#5, #1)
fititnt committed Nov 11, 2021
1 parent 4c946e2 commit 65cb847
Showing 10 changed files with 31 additions and 27 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,14 @@
### Added
- TODO: Fix Facebook terminology usage of "_XX" as suffix

### Changed
- `data/original/terminology/facebook/{.*-XX.csv -> .*.csv}`: Removed the unknown
  language suffix `_XX` / `-XX` from filenames for Facebook terminology.
  _If the suffix is meant to signify "no specific region", it can simply be
  omitted when exchanging data._ Affected files:
  - `en_es-XX`, `en_fr-XX`, `en_ja-XX`, `en_nl-XX`, `en_no-XX`, `en_pt-XX`,
    `en_tl-XX`

## [1.0.0] - 2021-11-11
### Added
- **Fiat lux!**
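
As an illustration of the `### Changed` entry above, the following is a minimal sketch, not taken from the commit, of dropping the `-XX` suffix from the seven listed language pairs using POSIX parameter expansion. The function name `strip_xx` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of the commit): remove a trailing "-XX"
# from a language pair. "${1%-XX}" strips the shortest matching suffix and
# leaves the value untouched when the suffix is absent.
strip_xx() {
    printf '%s\n' "${1%-XX}"
}

# Language pairs copied from the changelog entry.
for pair in en_es-XX en_fr-XX en_ja-XX en_nl-XX en_no-XX en_pt-XX en_tl-XX; do
    printf '%s -> %s\n' "$pair" "$(strip_xx "$pair")"
done
```

Pairs that carry a real region code would pass through unchanged, since only the literal `-XX` suffix is matched.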
17 changes: 5 additions & 12 deletions scripts/data-original-download.sh
@@ -10,7 +10,8 @@
#
# OPTIONS: ---
#
# REQUIREMENTS: - git
# REQUIREMENTS: - POSIX Shell or better
# - git
# BUGS: ---
# NOTES: ---
# AUTHORS: Emerson Rocha <rocha[at]ieee.org>
@@ -24,26 +25,18 @@

set -e

# PWD_NOW=$(pwd)
TMP_DIR="tmp"
DATA_DIR="data"
DATA_ORIGINAL_DIR="data/original"
# DATA_DIR="data"
# DATA_ORIGINAL_DIR="data/original"
DATA_ORIGINAL_GIT_DIR="tmp/original-git"

if [ ! -d "$TMP_DIR" ]; then
    mkdir "$TMP_DIR"
fi

# wget https://github.com/tico-19/tico-19.github.io/archive/refs/heads/master.zip

# if [ ! -f ${TMP_DIR}/tico-19.github.io.zip ]; then
# curl -L "https://github.com/tico-19/tico-19.github.io/archive/refs/heads/master.zip" \
# --output "${TMP_DIR}/tico-19.github.io.zip"
# fi

if [ ! -d "$DATA_ORIGINAL_GIT_DIR" ]; then
    git clone --depth 1 https://github.com/tico-19/tico-19.github.io "$DATA_ORIGINAL_GIT_DIR"
else
    echo "Already cloned. Skipping..."
fi
# unzip "${TMP_DIR}/tico-19.github.io.zip" -d "$TMP_DIR"
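
The two directory guards in this script can be sketched as one helper. This is an illustrative rewrite under assumptions, not the author's code: `clone_once` is a hypothetical name, and `mkdir -p` stands in for the explicit `[ ! -d ]` check on the parent directory.

```shell
#!/bin/sh
set -e

# Hypothetical helper (not part of the commit): shallow-clone a repository
# only when the destination directory does not exist yet.
clone_once() {
    repo_url=$1
    dest_dir=$2
    if [ -d "$dest_dir" ]; then
        echo "Already cloned. Skipping..."
        return 0
    fi
    # mkdir -p creates missing parents and succeeds if they already exist,
    # so the parent directory needs no separate existence test.
    mkdir -p "$(dirname "$dest_dir")"
    git clone --depth 1 "$repo_url" "$dest_dir"
}

# Example call, left commented out because it needs network access:
# clone_once https://github.com/tico-19/tico-19.github.io tmp/original-git
```

Running the helper twice with the same destination only clones once; the second call prints the skip message and returns.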

33 changes: 18 additions & 15 deletions scripts/data-original-prepare.sh
@@ -9,8 +9,10 @@
#
# OPTIONS: ---
#
# REQUIREMENTS: - rename
# REQUIREMENTS: - POSIX Shell or better
# - rsync
# - miller
# - See <https://github.com/johnkerl/miller>
# BUGS: ---
# NOTES: ---
# AUTHORS: Emerson Rocha <rocha[at]ieee.org>
@@ -23,9 +25,9 @@
# ==============================================================================
set -e

PWD_NOW=$(pwd)
TMP_DIR="tmp"
DATA_DIR="data"
# PWD_NOW=$(pwd)
# TMP_DIR="tmp"
# DATA_DIR="data"
DATA_ORIGINAL_DIR="data/original"
DATA_ORIGINAL_GIT_DIR="tmp/original-git"

@@ -44,29 +46,28 @@ ls "${DATA_ORIGINAL_DIR}"/terminology/facebook/
### Rename
# data/original/terminology/facebook/f_en-pt_XX.csv -> data/original/terminology/facebook/f-en-pt-XX.csv
for i in "${DATA_ORIGINAL_DIR}"/terminology/facebook/f_en-*; do
    # echo "$i" "$(echo "$i" | sed "s/f_//")";
    # echo "$i" "$(echo "$i" | sed "s/f_//")";
    mv "$i" "$(echo "$i" | sed "s/_/-/g")";
done

ls "${DATA_ORIGINAL_DIR}"/terminology/facebook/

# data/original/terminology/facebook/f-en-pt-XX.csv -> data/original/terminology/facebook/f-en-pt-XX.csv
# data/original/terminology/facebook/f-en-pt-XX.csv -> data/original/terminology/facebook/en_pt-XX.csv
for i in "${DATA_ORIGINAL_DIR}"/terminology/facebook/f-en-*; do
    # echo "$i" "$(echo "$i" | sed "s/f_//")";
    # echo "$i" "$(echo "$i" | sed "s/f_//")";
    mv "$i" "$(echo "$i" | sed "s/f-en-/en_/g")";
done

# for i in "${DATA_ORIGINAL_DIR}"/terminology/facebook/f_*; do
# # echo "$i" "$(echo "$i" | sed "s/f_//")";
# mv "$i" "$(echo "$i" | sed "s/f_//")";
# done
# data/original/terminology/facebook/en_pt-XX.csv -> data/original/terminology/facebook/en_pt.csv
for i in "${DATA_ORIGINAL_DIR}"/terminology/facebook/*-XX.csv; do
    # echo "$i" "$(echo "$i" | sed "s/-XX.csv/.csv/")";
    mv "$i" "$(echo "$i" | sed "s/-XX.csv/.csv/")";
done

# exit 1

## ls
ls "${DATA_ORIGINAL_DIR}"/terminology/facebook/

echo "TODO: normalize _XX"
# echo "TODO: normalize _XX"

#### terminology/google ______________________________________________________

@@ -88,5 +89,7 @@ done
ls "${DATA_ORIGINAL_DIR}"/terminology/google/


#### terminology/google ______________________________________________________
#### validate __________________________________________________________________

# mlr --icsv check data/original/terminology/google/*.csv
# mlr --icsv check data/original/terminology/facebook/*.csv
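
The three Facebook rename loops above can be read as a single filename mapping. The following is a minimal sketch under assumptions, not the script's own code: `normalize_facebook_name` is a hypothetical name, and it works on the base name only, whereas the script applies `sed` to the full path (which works there because no directory in that path contains an underscore). The last pattern also escapes the dot and anchors the suffix, which the script's `s/-XX.csv/.csv/` does not.

```shell
#!/bin/sh
# Hypothetical helper (not part of the commit): map an upstream Facebook
# terminology base name to its normalized form in one sed invocation.
#   1. replace every "_" with "-"        f_en-pt_XX.csv -> f-en-pt-XX.csv
#   2. turn the "f-en-" prefix into "en_"               -> en_pt-XX.csv
#   3. drop a trailing "-XX" before ".csv"              -> en_pt.csv
normalize_facebook_name() {
    printf '%s\n' "$1" | sed \
        -e 's/_/-/g' \
        -e 's/^f-en-/en_/' \
        -e 's/-XX\.csv$/.csv/'
}

normalize_facebook_name "f_en-pt_XX.csv"
```

A name whose suffix is an actual region code rather than `XX` keeps that code, because only the literal `-XX.csv` ending is rewritten.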
