
Generate 100 stories from a set of prompts #2

jilltxt opened this issue Jun 7, 2023 · 36 comments

jilltxt commented Jun 7, 2023

We need to generate a lot of short stories. Here are prompts to use. Generate 100 stories for each nationality, cultural group or language.

Basic structure

Write a 50 word plot summary for a potential [nationality or cultural group] children's novel.

Include one sample of the prompt with NO nationality or cultural group ("Write a 50 word plot summary for a potential children's novel.") so we can compare to this as a default.

Finally, compile all the sets of 100 stories into a combined file titled GPTstories.csv and upload it to the /data folder.
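For the compilation step, here is a minimal sketch in Python, assuming the per-group files share the same columns; the data/*_stories.csv naming pattern is a placeholder, not an agreed convention:

import glob
import pandas as pd

# Gather the per-prompt story files (hypothetical naming pattern)
# and concatenate them into one combined file in /data.
files = sorted(glob.glob("data/*_stories.csv"))
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
combined.to_csv("data/GPTstories.csv", index=False)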

Codebook (variables and explanations for GPTstories.csv; a sketch of writing one row with this schema follows the list):

  • prompt: The prompt used, taking the format "Write a 50 word plot summary for a potential [nationality or cultural group] children's novel."
  • reply: The model's generated response to the prompt.
  • date: The date that the response was generated. As the models are regularly updated, this can be important information.
  • model name: The model used to generate the response (e.g. GPT-3.5).
  • temperature: The temperature setting. We used 1 throughout, which is the default setting, as we wanted to test the "default" mode of the LLM. Other researchers may wish to use other temperature settings.
  • language: The language the prompt was written in, using the ISO 639-2 code. The more common 639-1 code was not used because it does not include Southern Sami and Lule Sami. (Note: ask a librarian whether the EU standard is better - identical codes?) The language codes used in the dataset are (if we find translators for the Sami languages):
    • eng = English
    • nob = Norwegian bokmål
    • nno = Norwegian nynorsk
    • sma = Southern Sami
    • sme = Northern Sami
    • smj = Lule Sami
    • fra = French
    • deu = German
    • aka = Akan
    • isl = Icelandic
  • country: If the prompt refers to a nationality, the ISO 3166 code for the country referred to is given here. For England, Northern Ireland, Scotland and Wales, extended codes are used (GB-ENG, GB-NIR, GB-SCT, GB-CYM, following UK guidelines). If the prompt refers to a cultural group (e.g. Norwegian-American) this field will be NA. See Decide sampling strategy for countries and languages #5 for which countries to include.
  • India - 1,393,409,038
  • United States - 332,915,073
  • Pakistan - 225,199,937
  • Nigeria - 211,400,708
  • Philippines - 111,046,913
  • United Kingdom - 68,207,116
  • Tanzania - 61,498,437
  • South Africa - 60,041,994
  • Kenya - 54,985,698
  • Canada - 38,067,903
  • Australia - 25,788,215
  • Liberia - 5,180,203
  • Ireland - 4,982,907
  • New Zealand - 4,860,643
  • Jamaica - 2,973,463
  • Trinidad and Tobago - 1,403,375
  • Guyana - 790,326
  • Scotland
  • Wales
  • England
  • Northern Ireland
  • culture: [e.g. Native American - NA if a country rather than a culture]. Use the following cultures (see Select cultures/ethnic groups for generated stories #6 for the rationale for this sampling strategy):
    • Native American
    • Asian-American
    • African-American
    • Native Hawaiian
    • White American
    • Hispanic
    • Roma
    • Afro-European
    • European Muslim
    • White European
    • Akan
    • Sámi
    • Indigenous Australian
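To make the schema concrete, here is a minimal sketch of writing one row with these variables in Python. The exact column header for "model name" (here just "model") and all sample values are illustrative assumptions:

import csv
from datetime import date

# Column names follow the codebook above.
FIELDS = ["prompt", "reply", "date", "model", "temperature",
          "language", "country", "culture"]

row = {
    "prompt": "Write a 50 word plot summary for a potential Norwegian children's novel.",
    "reply": "(generated story goes here)",
    "date": date.today().isoformat(),
    "model": "gpt-3.5-turbo",
    "temperature": 1,
    "language": "eng",
    "country": "NO",
    "culture": "NA",
}

with open("GPTstories.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow(row)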

Different language versions:

| prompt | language | country | culture |
| --- | --- | --- | --- |
| Skriv et sammendrag på 50 ord av en tenkt barnebok. | nob | NA | NA |
| Skriv et sammendrag på 50 ord av en tenkt norsk barnebok. | nob | NO | NA |
| Skriv eit samandrag på 50 ord av ei tenkt barnebok. | nno | NA | NA |
| Skriv eit samandrag på 50 ord av ei tenkt norsk barnebok. | nno | NO | NA |
| Écrivez une proposition de synopsis de 50 mots pour un livre pour enfants. | fra | NA | NA |
| Écrivez une proposition de synopsis de 50 mots pour un livre français pour enfants. | fra | FR | NA |
| Tjála tjoahkkájgæsos 50 báhko usjudit mánájromádna | smj | NA | NA |
| Tjála tjoahkkájgæsos 50 báhko usjudit sáme mánájromádna | smj | NA | Sami |
| Skrifaðu 50 orða samantekt af ímyndaðri skáldsögu fyrir börn. | isl | NA | NA |
| Skrifaðu 50 orða samantekt af ímyndaðri íslenskri skáldsögu fyrir börn. | isl | IS | NA |

(Note: the country code for Namibia is NA, which is also how we mark missing data.... We don't have Namibia in our dataset so it's OK (?) but yikes.)

(Edit 11.06.23: add "potential" to the prompts, since the API generates summaries of existing novels if you don't, even though the chat interface generates new plots. See discussion below. Also set temperature to 1.)

We know GPT is trained mostly on English-language text, so try English-language cultures first:

Write a 50 word plot summary for an American children's novel.
Write a 50 word plot summary for a British children's novel.
Write a 50 word plot summary for an English children's novel.
Write a 50 word plot summary for a Scottish children's novel.
Write a 50 word plot summary for a Welsh children's novel.
Write a 50 word plot summary for a Northern Irish children's novel.
Write a 50 word plot summary for an Irish children's novel.
Write a 50 word plot summary for a Canadian children's novel.
Write a 50 word plot summary for an Australian children's novel.
Write a 50 word plot summary for a New Zealand children's novel.
(Note: there are actually 88 countries where English is an official, administrative or cultural language, so we'll need to think about sampling here - but let's try some prompts first.)

Try the prompt in Norwegian, French and German (and a Ghanaian language?)

Écrivez une proposition de synopsis de 50 mots pour un livre pour enfants.
Skriv et sammendrag på 50 ord av en tenkt barnebok.
Skriv eit samandrag på 50 ord av ei tenkt barnebok.

  • Jill tries to find people to translate to Sámi languages
@jilltxt jilltxt added the prompts label Jun 9, 2023

hermannwi commented Jun 10, 2023

How many stories do we want for each prompt?
Edit: I also need push access in order to upload the code.


jilltxt commented Jun 10, 2023

100 stories for each prompt, please. I'll add that info to the first post in this issue, thanks for asking, @hermannwi
I think I have changed the team's access to Write - I thought it already was but I guess it was set to Read only? Let me know if it didn't work!

@hermannwi

It seems to work now! I can start by generating the stories for the English-language prompts. How should I think about structuring it? Do you want one CSV file for just the English-language prompts, or do you want everything in the same file? Let me know if you have preferences for how to structure it.


jilltxt commented Jun 10, 2023

Good question. One big CSV file with the following column names (variable names) would be good!

Prompt - Story

I think it's best NOT to separate the different languages. Although if it's easier, you can make separate CSV files and we can merge them later; that's easy.

Then we could add a variable for Country (e.g. Norway, USA, Australia) and maybe Culture (African-American, etc.) later - the information is actually in the prompt, so that's easy to do in R or Python (see the sketch below).
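For example, deriving those two variables from the prompt text could look roughly like this in Python; the lookup table is purely illustrative and would need to be extended to the full sampling lists:

import pandas as pd

df = pd.read_csv("GPTstories.csv")

# Illustrative mapping from the group named in the prompt to the
# country and culture variables.
LOOKUP = {
    "African-American": ("NA", "African-American"),
    "American": ("US", "NA"),
    "Norwegian": ("NO", "NA"),
    "Australian": ("AU", "NA"),
}

def classify(prompt):
    # Check longer names first so "African-American" is not
    # swallowed by a premature match on "American".
    for key in sorted(LOOKUP, key=len, reverse=True):
        if key in prompt:
            return pd.Series(LOOKUP[key], index=["country", "culture"])
    return pd.Series(["NA", "NA"], index=["country", "culture"])

df[["country", "culture"]] = df["prompt"].apply(classify)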


jilltxt commented Jun 10, 2023

Actually it would be great to add two variables to help with documentation: the date the story was generated and the version of GPT that was used. (e.g. 3.5). I guess these could be added later but we'd have to remember to do it pretty soon after creating the data file or we'll forget. So:

Prompt - Story - Date - GPTversion

@hermannwi

I seem to have run into some problems with the API key. Could you generate a new one and send it via mail?


jilltxt commented Jun 10, 2023

Yes, I just did. The old one was disabled because it was in code uploaded to GitHub - it's great that they do that, really, and now we know :)

@hermannwi

My bad! I'll keep it in a separate file and import it into the program.
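For reference, a common pattern for this (the OPENAI_API_KEY variable name and the api_key.txt filename are conventions/assumptions, not something we've agreed on):

import os
import openai

# Option 1: read the key from an environment variable.
openai.api_key = os.environ["OPENAI_API_KEY"]

# Option 2: read it from a small local file that is listed in
# .gitignore, so it can never be committed by accident.
with open("api_key.txt") as f:
    openai.api_key = f.read().strip()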


hermannwi commented Jun 10, 2023

American_stories.csv
Here are 100 American stories. Does the file look alright?

Edit: I had to rewrite some of the code so it doesn't waste money if it runs into an error. I also changed the formatting a bit, because I wasn't sure whether using the standard "," as a delimiter would be annoying during analysis.
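A sketch of that pattern with the openai Python package (the v0.27-era API): each reply is written to the file as it arrives and the loop stops cleanly on an API error, so already-billed completions aren't lost. The filename and semicolon delimiter follow the uploaded file; everything else is an assumption:

import csv
import openai

PROMPT = "Write a 50 word plot summary for a potential American children's novel."

with open("American_stories.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["prompt", "story"])
    for i in range(100):
        try:
            resp = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": PROMPT}],
            )
        except openai.error.OpenAIError as e:
            print(f"stopped at story {i}: {e}")
            break  # rows written so far are kept on disk
        writer.writerow([PROMPT, resp["choices"][0]["message"]["content"]])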


jilltxt commented Jun 10, 2023

Thank you! This is great! I notice that many (all?) of these describe actual books, mostly American books (Anne of Green Gables is Canadian, The Secret Garden is British). I actually want plots for POTENTIAL books, so I might have to play around with the prompts a bit to see.

But this means that you've found a basic method for getting these! Hooray! Thank you!!!


hermannwi commented Jun 10, 2023

A lot of them are real books. Actually at some point during the testing I had written the prompt slightly wrong, and then all of them were for famous novels. I think just a small tweak is needed to avoid it. Maybe adding the word potential is enough.

@hermannwi

Here are 25 stories where I included the word "potential" in the prompt
American_stories.csv


jilltxt commented Jun 11, 2023

That seems to work, and the results are closer to what I was getting with the original prompt in the chat interface.

I asked ChatGPT whether any of those plots were published books, and it says not to its knowledge (link to that chat - scroll down for my reformulated question).

I wonder if changing the temperature would also help? I saw an article saying that the temperature is 0.7 in the chat interface, but that the default is 0.3 in the API. Lower temperature means it is less "creative", so that might also make it lean towards summarising existing books. Your script doesn't mention temperature, so I assume it is using the default. From a site about temperature:

> For transformation tasks (extraction, standardization, format conversion, grammar fixes) prefer a temperature of 0 or up to 0.3.
> For writing tasks, you should juice the temperature higher, closer to 0.5. If you want GPT to be highly creative (for marketing or advertising copy for instance), consider values between 0.7 and 1.

So probably we do want a higher temperature for it to suggest new plot summaries?

Could you please try changing the temperature settings? Maybe set it explicitly to 0.3 (to check whether that's what the first batch used), then 0.5, 0.7, 0.8, 0.9 and 1.0?
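For what it's worth, the sweep could look something like this; the model name and prompt are examples, and temperature is passed explicitly so we know exactly what was used:

import openai

PROMPT = "Write a 50 word plot summary for a potential American children's novel."

for temp in (0.3, 0.5, 0.7, 0.8, 0.9, 1.0):
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=temp,  # explicit rather than relying on the default
    )
    print(temp, resp["choices"][0]["message"]["content"])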

@hermannwi

Definitely! I was already going to ask you about the temperature, and if the API has a different default temperature it makes sense to play with it. Do you want me to use the original prompt?


hermannwi commented Jun 11, 2023

So I tried the different temperatures with both the original prompt and the prompt that contains "potential". The results are a bit interesting.

[potenital_american_temp_05.csv](https://github.com/MachineVisionUiB/GPT-stories/files/11714224/potenital_american_temp_05.csv)
american_temp_1.csv
american_temp_3.csv
american_temp_5.csv
american_temp_7.csv
american_temp_9.csv
potential_american_temp_1.csv
potential_american_temp_03.csv
potential_american_temp_07.csv
potential_american_temp_09.csv


jilltxt commented Jun 11, 2023

Thank you! This is kind of weird: the temperature doesn't seem to change the results much, and the higher temperatures even seem to normalise things more - in american_temp_9.csv it looks like 50% of the generated stories are The Secret Garden. I wonder why? Is it building on our previous requests and normalising based on that? Maybe it has to be reset or something?


jilltxt commented Jun 11, 2023

Maybe we should be defining how it is supposed to act. Like this (but not setting it to be a pirate):

[
  {
    "role": "system",
    "content": "You are a 1700s pirate with an exaggerated UK West Country accent"
  },
  {
    "role": "user",
    "content": "Introduce yourself"
  }
]

Ted Underwood does really interesting work in digital humanities and describes using the OpenAI API for literary analysis of short bits of text. (Here is his GitHub repo for that project, and here is the exact code he uses with the initial prompts.) It doesn't look like he sets an initial prompt ("you are a pirate"); instead he puts in examples of how he wants the model to respond to particular user input.

We are trying to figure out what ChatGPT/GPT does "natively", so we don't really want to tell it to act like a pirate or give it model examples, so I'm not sure whether to use this.

I guess we could try telling it "You are a writer for a publisher of children's books." I don't know if that would make a difference or even be very useful methodologically since we want to test out its default.

[
  {
    "role": "system",
    "content": "You are a writer for a publisher of children's books."
  }
]
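If we did go that route, passing the system message through the openai package would look roughly like this (a sketch, not our agreed method; the user prompt is our existing default prompt):

import openai

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "You are a writer for a publisher of children's books."},
        {"role": "user",
         "content": "Write a 50 word plot summary for a potential children's novel."},
    ],
)
print(resp["choices"][0]["message"]["content"])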


hermannwi commented Jun 11, 2023

I've made the code such that it doesn't save the previous messages, so it shouldn't have any context. Maybe there is some underlying memory? Maybe I'm basically training it by asking the same question over and over? It's definitely getting more random from 0.3 to 1, though.
Edit: did you look at the ones with the prompt with "potential"?

@hermannwi

We could define how it's supposed to act, but we might start to get too specific. Also, we wouldn't know the consequences for how the model acts and why, which might make it difficult to say anything concise about the results?


jilltxt commented Jun 11, 2023

> We could define how it's supposed to act, but we might start to get too specific. Also, we wouldn't know the consequences for how the model acts and why, which might make it difficult to say anything concise about the results?

Yes, it's probably best to just keep going with the current prompts. It looks as though including the word "potential" helps.

I'd like to discuss this with a couple of colleagues who might have ideas, but my feeling right now is to insert "potential" into the prompts and use temperature 0.7, since that's apparently close to the chat interface - although I can't find any authoritative-looking statement about that.

@hermannwi
Copy link
Contributor

hermannwi commented Jun 11, 2023

According to the OpenAI website, the temperature variable defaults to 1:
https://platform.openai.com/docs/api-reference/chat/create#chat/create-temperature

@hermannwi

Here is also a thread discussing the differences between ChatGPT and the GPT API:
https://community.openai.com/t/openai-api-vs-chatgpt/49943


jilltxt commented Jun 11, 2023

> According to the OpenAI website, the temperature variable defaults to 1:
> https://platform.openai.com/docs/api-reference/chat/create#chat/create-temperature

I saw another website linking to that and saying the information is on that page, but I can't actually see it there - am I just not looking in the right place? If the default is actually 1 in the chat interface too, let's use the same temperature.

@hermannwi

[screenshot of the OpenAI API reference showing that temperature "Defaults to 1"]


jilltxt commented Jun 11, 2023

Thanks! Another thing: are we using the actual ChatGPT API or just GPT? I found this paper where they generated 1008 jokes and found there were basically only 25 distinct jokes among them. Their code looks like this: https://github.com/joke_prompt_1.py

@hermannwi

In my understanding, ChatGPT refers to the web app that uses the GPT models, while the GPT API gives us direct access to the models. You could say that ChatGPT is OpenAI's own implementation of the models.


jilltxt commented Jun 13, 2023

@hermannwi and @Tm-ui I updated the top section of this discussion now that we know pretty much what we want the final dataset to look like. Please use the variables as described above. I added country and culture as separate variables because I think this will make the data analysis easier. We can also add these after the initial files are generated - that may be easier.

@hermannwi

[screenshot of the list of culture prompts]
I'm guessing all these will be in English?


jilltxt commented Jun 15, 2023

Yes - all prompts should be in English except the ones specifically listed in another language.

Also, we are doing “Native American”, not American Indian.


hermannwi commented Jun 15, 2023

Some letters in the French and Norwegian alphabets seem to be assigned weird symbols within the CSV file. Is this an issue?
EDIT: Also, a lot of the replies to the nynorsk prompt are in bokmål.


hermannwi commented Jun 16, 2023

I just uploaded the (almost finished) dataset to the data folder. A couple of notes:

  • We are still missing the Akan language
  • In Excel, the language-specific letters get converted to different symbols. I don't know if this needs to be corrected somehow
  • A lot of the Sami replies and some of the Icelandic replies contain a lot of newlines. I don't know if this needs to be cleaned (a possible cleanup sketch follows below)
  • Lastly, I wasn't sure whether to include the US country code for the different American cultures. The country variable for these is currently set to NA
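On the newline and Excel points, a possible cleanup sketch in Python/pandas; the filename, semicolon separator and reply column name are assumptions from this thread, and "utf-8-sig" writes a byte-order mark so Excel autodetects UTF-8:

import pandas as pd

df = pd.read_csv("data/GPT_stories.csv", sep=";", encoding="utf-8")

# Collapse the stray newlines in the Sami and Icelandic replies.
df["reply"] = df["reply"].str.replace(r"\s*\n+\s*", " ", regex=True)

# Write with a BOM so Excel picks UTF-8 automatically instead of
# mangling å, æ, ø, ð and friends.
df.to_csv("data/GPT_stories.csv", sep=";", index=False, encoding="utf-8-sig")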

@SamInMotion

Prompt for Akan: Twerɛ nsɛmfua 50 asɛmti mu nsɛm tiawa ma Ghana mmofra abasɛm.
Language code: ak\aka
Culture: NA

Prompt for Chinese: 为中国儿童小说写一篇50字的情节提要
Language code: zh
Culture: NA

@SamInMotion

> I just uploaded the (almost finished) dataset to the data folder. A couple of notes:
>
>   • We are still missing the Akan language
>   • In Excel, the language-specific letters get converted to different symbols. I don't know if this needs to be corrected somehow
>   • A lot of the Sami replies and some of the Icelandic replies contain a lot of newlines. I don't know if this needs to be cleaned
>   • Lastly, I wasn't sure whether to include the US country code for the different American cultures. The country variable for these is currently set to NA

I think we should verify the encoding used by the API, or else escape the prompts that are in languages other than English and French as strings.

@hermannwi

I just uploaded the updated dataset containing Akan and Chinese.

  • I wasn't sure whether the country was included in the prompts, so I set both to NA.
  • The non-English characters display incorrectly in Excel, but not in VS Code or Google Sheets. There seem to be some settings in Excel causing this.


jilltxt commented Jun 19, 2023

Thanks! The encoding seems to be correct UTF-8, so that's fine, but you have to specify this when importing into Excel or Google Sheets. Since it's semicolon-separated, it opens directly in my (Norwegian) Excel, but with the wrong encoding. When I import it as CSV and specify semicolon-separated and UTF-8, it's correct.

I've just skimmed through it but it looks great! There were a few small issues:

  • The "default" prompt doesn't have a language (language is NA) - should be set to "eng"
  • Akan language is set to "ak/aka" and should be just "aka"

If you have time to fix this today, @hermannwi, great - otherwise I can do it later.

I also renamed the file to just GPT_stories :)


hermannwi commented Jun 19, 2023

Okay that makes sense! I will fix the issues and upload again.

EDIT: Just uploaded the updated file.
