Commit

Merge branch 'huggingface:main' into main
nvoorhies authored Mar 28, 2023
2 parents bfb85a3 + 0bc7dc0 commit 510cb73
Showing 114 changed files with 25,121 additions and 450 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -18,7 +18,7 @@ This repo contains the content that's used to create the **[Hugging Face course]
| [Bahasa Indonesia](https://huggingface.co/course/id/chapter1/1) (WIP) | [`chapters/id`](https://github.com/huggingface/course/tree/main/chapters/id) | [@gstdl](https://github.com/gstdl) |
| [Italian](https://huggingface.co/course/it/chapter1/1) (WIP) | [`chapters/it`](https://github.com/huggingface/course/tree/main/chapters/it) | [@CaterinaBi](https://github.com/CaterinaBi), [@ClonedOne](https://github.com/ClonedOne), [@Nolanogenn](https://github.com/Nolanogenn), [@EdAbati](https://github.com/EdAbati), [@gdacciaro](https://github.com/gdacciaro) |
| [Japanese](https://huggingface.co/course/ja/chapter1/1) (WIP) | [`chapters/ja`](https://github.com/huggingface/course/tree/main/chapters/ja) | [@hiromu166](https://github.com/@hiromu166), [@younesbelkada](https://github.com/@younesbelkada), [@HiromuHota](https://github.com/@HiromuHota) |
| [Korean](https://huggingface.co/course/ko/chapter1/1) (WIP) | [`chapters/ko`](https://github.com/huggingface/course/tree/main/chapters/ko) | [@Doohae](https://github.com/Doohae), [@wonhyeongseo](https://github.com/wonhyeongseo), [@dlfrnaos19](https://github.com/dlfrnaos19) |
| [Korean](https://huggingface.co/course/ko/chapter1/1) (WIP) | [`chapters/ko`](https://github.com/huggingface/course/tree/main/chapters/ko) | [@Doohae](https://github.com/Doohae), [@wonhyeongseo](https://github.com/wonhyeongseo), [@dlfrnaos19](https://github.com/dlfrnaos19), [@nsbg](https://github.com/nsbg) |
| [Portuguese](https://huggingface.co/course/pt/chapter1/1) (WIP) | [`chapters/pt`](https://github.com/huggingface/course/tree/main/chapters/pt) | [@johnnv1](https://github.com/johnnv1), [@victorescosta](https://github.com/victorescosta), [@LincolnVS](https://github.com/LincolnVS) |
| [Russian](https://huggingface.co/course/ru/chapter1/1) (WIP) | [`chapters/ru`](https://github.com/huggingface/course/tree/main/chapters/ru) | [@pdumin](https://github.com/pdumin), [@svv73](https://github.com/svv73) |
| [Thai](https://huggingface.co/course/th/chapter1/1) (WIP) | [`chapters/th`](https://github.com/huggingface/course/tree/main/chapters/th) | [@peeraponw](https://github.com/peeraponw), [@a-krirk](https://github.com/a-krirk), [@jomariya23156](https://github.com/jomariya23156), [@ckingkan](https://github.com/ckingkan) |
2 changes: 1 addition & 1 deletion chapters/en/chapter4/3.mdx
@@ -83,7 +83,7 @@ training_args = TrainingArguments(

When you call `trainer.train()`, the `Trainer` will upload your model to the Hub, in a repository in your namespace, each time it is saved (here, every epoch). That repository will be named after the output directory you picked (here `bert-finetuned-mrpc`), but you can choose a different name with `hub_model_id = "a_different_name"`.

To upload you model to an organization you are a member of, just pass it with `hub_model_id = "my_organization/my_repo_name"`.
To upload your model to an organization you are a member of, just pass it with `hub_model_id = "my_organization/my_repo_name"`.
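
As a rough sketch of how these options fit together (the exact values here are illustrative assumptions, not the chapter's full example):

```py
from transformers import TrainingArguments

# Sketch: push_to_hub=True makes the Trainer upload the model at every save;
# with save_strategy="epoch", that means at the end of each epoch.
training_args = TrainingArguments(
    "bert-finetuned-mrpc",
    save_strategy="epoch",
    push_to_hub=True,
    # Optional: push to an organization repo instead of your personal namespace
    hub_model_id="my_organization/my_repo_name",
)
```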

Once your training is finished, you should do a final `trainer.push_to_hub()` to upload the last version of your model. It will also generate a model card with all the relevant metadata, reporting the hyperparameters used and the evaluation results! Here is an example of the content you might find in such a model card:

2 changes: 1 addition & 1 deletion chapters/en/chapter5/3.mdx
@@ -387,7 +387,7 @@ ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000

Oh no! That didn't work! Why not? Looking at the error message gives us a clue: there is a mismatch in the lengths of one of the columns, one being of length 1,463 and the other of length 1,000. If you've looked at the `Dataset.map()` [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map), you may recall that 1,000 is the number of samples in each batch passed to the function we are mapping; here those 1,000 examples produced 1,463 new rows, resulting in a shape error.

The problem is that we're trying to mix two different datasets of different sizes: the `drug_dataset` columns will have a certain number of examples (the 1,000 in our error), but the `tokenized_dataset` we are building will have more (the 1,463 in the error message). That doesn't work for a `Dataset`, so we need to either remove the columns from the old dataset or make them the same size as they are in the new dataset. We can do the former with the `remove_columns` argument:
The problem is that we're trying to mix two different datasets of different sizes: the `drug_dataset` columns will have a certain number of examples (the 1,000 in our error), but the `tokenized_dataset` we are building will have more (the 1,463 in the error message; it is more than 1,000 because we are tokenizing long reviews into more than one example by using `return_overflowing_tokens=True`). That doesn't work for a `Dataset`, so we need to either remove the columns from the old dataset or make them the same size as they are in the new dataset. We can do the former with the `remove_columns` argument:

```py
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)
```
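
For reference, a `tokenize_and_split` function with this overflowing behavior could look like the following sketch (the checkpoint, the `max_length` value, and the `review` column name are assumptions based on context):

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_and_split(examples):
    # With return_overflowing_tokens=True, a review longer than max_length is
    # split into several chunks, so 1,000 input examples can become 1,463 rows.
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
```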
2 changes: 1 addition & 1 deletion chapters/en/chapter6/3.mdx
@@ -109,7 +109,7 @@ We can see that the tokenizer's special tokens `[CLS]` and `[SEP]` are mapped to

<Tip>

The notion of what a word is is complicated. For instance, does "I'll" (a contraction of "I will") count as one or two words? It actually depends on the tokenizer and the pre-tokenization operation it applies. Some tokenizers just split on spaces, so they will consider this as one word. Others use punctuation on top of spaces, so will consider it two words.
The notion of what a word is, is complicated. For instance, does "I'll" (a contraction of "I will") count as one or two words? It actually depends on the tokenizer and the pre-tokenization operation it applies. Some tokenizers just split on spaces, so they will consider this one word. Others use punctuation on top of spaces, so they will consider it two words.

✏️ **Try it out!** Create a tokenizer from the `bert-base-cased` and `roberta-base` checkpoints and tokenize "81s" with them. What do you observe? What are the word IDs?
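
A sketch of that experiment (both checkpoints ship fast tokenizers, so `tokens()` and `word_ids()` are available on the encoding):

```py
from transformers import AutoTokenizer

# The two pre-tokenizers split "81s" differently, which shows up in the word IDs.
for checkpoint in ["bert-base-cased", "roberta-base"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    encoding = tokenizer("81s")
    print(checkpoint, encoding.tokens(), encoding.word_ids())
```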

2 changes: 1 addition & 1 deletion chapters/en/chapter7/3.mdx
@@ -1035,7 +1035,7 @@ Neat -- our model has clearly adapted its weights to predict words that are more

<Youtube id="0Oxphw4Q9fo"/>

This wraps up our first experiment with training a language model. In [section 6](/course/chapter7/section6) you'll learn how to train an auto-regressive model like GPT-2 from scratch; head over there if you'd like to see how you can pretrain your very own Transformer model!
This wraps up our first experiment with training a language model. In [section 6](/course/en/chapter7/section6) you'll learn how to train an auto-regressive model like GPT-2 from scratch; head over there if you'd like to see how you can pretrain your very own Transformer model!

<Tip>

2 changes: 1 addition & 1 deletion chapters/it/chapter2/2.mdx
@@ -257,7 +257,7 @@ outputs = model(inputs)
```
{/if}

Now, if we look at the shape of our inputs, the dimensionality will be much lower: the model head takes as input the high-dimensional vectors we saw earlier and produces vectors containing two values (one per label):
Now, if we look at the shape of our outputs, the dimensionality will be much lower: the model head takes as input the high-dimensional vectors we saw earlier and produces vectors containing two values (one per label):

```python
print(outputs.logits.shape)
```
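
A self-contained sketch of the pipeline leading up to that `print` (the checkpoint and example sentences are assumptions borrowed from the English version of this chapter):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

# The sequence classification head maps each high-dimensional hidden state
# down to two values, one per label:
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([2, 2])
```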
10 changes: 5 additions & 5 deletions chapters/ko/chapter2/8.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -88,16 +88,16 @@
<Question
choices={[
{
text: "A component of the base Transformer network that redirects tensors to their correct layers",
text: "기본 Transformer 네트워크의 요소로, 텐서를 적합한 레이어로 리디렉션합니다.",
explain: "오답입니다! 그런 요소는 없습니다."
},
{
text: "Also known as the self-attention mechanism, it adapts the representation of a token according to the other tokens of the sequence",
explain: "Incorrect! The self-attention layer does contain attention \"heads,\" but these are not adaptation heads."
text: "셀프 어텐션 메커니즘이라고도 부르며, 시퀀스 내 다른 토큰에 따라 토큰의 표현을 조정합니다.",
explain: "오답입니다! 셀프 어텐션 레이어는 어텐션 \"헤드,\"를 포함하고 있지만 어텐션 헤드가 적응 헤드는 아닙니다."
},
{
text: "An additional component, usually made up of one or a few layers, to convert the transformer predictions to a task-specific output",
explain: "That's right. Adaptation heads, also known simply as heads, come up in different forms: language modeling heads, question answering heads, sequence classification heads... ",
text: "하나 또는 여러 개의 레이어로 이루어진 추가적인 요소로 트랜스포머의 예측 결과를 task-specific한 출력으로 변환합니다.",
explain: "정답입니다. 헤드라고 알려진 적응 헤드는 언어 모델링 헤드, 질의 응답 헤드, 순차 분류 헤드 등과 같이 다양한 형태로 나타납니다.",
correct: true
}
]}
47 changes: 47 additions & 0 deletions chapters/ru/TRANSLATING.txt
@@ -0,0 +1,47 @@
1. We use the formal "you" (i.e. "вы" instead of "ты") to keep a neutral tone.
However, don't make the text too formal; it should stay engaging.

2. Don't translate industry-accepted acronyms, e.g. TPU or GPU.

3. The Russian language accepts English words (Anglicisms) more readily than many
other languages, especially in modern contexts. Check for the correct usage of
terms in computer science and for commonly used terms in other publications.

4. Russian word order is often different from English. If a sentence sounds
unnatural after translation, try changing the word or clause order to make it more natural.

5. Beware of "false friends" in Russian and English translations. Translators train
for years specifically to avoid false friends and anglicised translations;
e.g. "accuracy" is "точность", while "аккуратность" means "carefulness". For more examples refer to:
http://falsefriends.ru/ffslovar.htm

6. Keep the voice active and consistent. Don't overdo it, but try to avoid the passive voice.

7. Refer and contribute to the glossary frequently to stay on top of the latest
choices we make. This minimizes the amount of editing that is required.

8. Keep POV consistent.

9. Smaller sentences are better sentences. Apply with nuance.

10. If translating a technical word, keep the choice of Russian translation consistent.
This does not apply to non-technical choices, as in those cases variety actually
helps keep the text engaging.

11. This is merely a translation. Don't add any technical/contextual information
not present in the original text. Also don't leave stuff out. The creative
choices in composing this information were the original authors' to make.
Our creative choices are in doing a quality translation.

12. Be exact when choosing equivalents for technical words. Package is package.
Library is library. Don't mix and match. Also, since both "batch" and "package"
can be translated as "пакет", use "батч" for "batch" and "пакет" for "package" to
avoid ambiguity.

13. Library names are kept in their original form, e.g. "🤗 Datasets"; however,
the word "dataset" in running text is translated as "датасет".

14. As a style choice, prefer the imperative over constructions with auxiliary verbs,
to avoid unnecessary verbosity and direct address of the reader, which seems
unnatural in Russian: e.g. "см. главу X" ("see chapter X") instead of
"Вы можете найти это в главе X" ("you can find this in chapter X").
14 changes: 10 additions & 4 deletions chapters/ru/_toctree.yml
@@ -8,21 +8,23 @@
- local: chapter1/1
title: Введение
- local: chapter1/2
title: Обработка естесственного языка
title: Обработка естественного языка
- local: chapter1/3
title: Трансформеры, на что они способны?
title: "Трансформеры: на что они способны?"
- local: chapter1/4
title: Как работают трансформеры?
- local: chapter1/5
title: Модели энкодеров
title: Модели-кодировщики
- local: chapter1/6
title: Модели декодеров
title: Модели-декодировщики
- local: chapter1/7
title: Модели "seq2seq"
- local: chapter1/8
title: Предвзятости и ограничения
- local: chapter1/9
title: Итоги
- local: chapter1/10
title: Проверка знаний

- title: 2. Использование библиотеки 🤗 Transformers
sections:
@@ -88,3 +90,7 @@
title: Введение
- local: chapter6/2
title: Обучение токенизатора на основе существующего
- title: Глоссарий
sections:
- local: glossary/1
title: Глоссарий