[PhotoBooth] mode. Also, streaming capability added with setup script, readme updates, and narrator prompt update #37

Closed
wants to merge 8 commits into from
3 changes: 2 additions & 1 deletion .gitignore
@@ -2,4 +2,5 @@
/venv
/narration
/frames/*
!/frames/.gitkeep
.trunk
32 changes: 29 additions & 3 deletions README.md
@@ -1,8 +1,9 @@
# David Attenborough narrates your life.

https://twitter.com/charliebholtz/status/1724815159590293764

## Want to make your own AI app?

Check out [Replicate](https://replicate.com). We make it easy to run machine learning models with an API.

## Setup
@@ -20,26 +21,51 @@ Then, install the dependencies:

Make a [Replicate](https://replicate.com), [OpenAI](https://beta.openai.com/), and [ElevenLabs](https://elevenlabs.io) account and set your tokens:

```
```bash
export OPENAI_API_KEY=<token>
export ELEVENLABS_API_KEY=<eleven-token>
```

Make a new voice in Eleven and get the voice id of that voice using their [get voices](https://elevenlabs.io/docs/api-reference/voices) API, or by clicking the flask icon next to the voice in the VoiceLab tab.

```
```bash
export ELEVENLABS_VOICE_ID=<voice-id>
```
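If you'd rather fetch the voice id programmatically, here is a minimal sketch using only the standard library. The endpoint and `xi-api-key` header follow the ElevenLabs get-voices API reference; the voice name `"Narrator"` is a placeholder for whatever you named your voice in VoiceLab.

```python
import json
import urllib.request

VOICES_URL = "https://api.elevenlabs.io/v1/voices"


def fetch_voices(api_key):
    """Call the ElevenLabs get-voices endpoint and return the parsed JSON."""
    req = urllib.request.Request(VOICES_URL, headers={"xi-api-key": api_key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def find_voice_id(payload, name):
    """Return the voice_id of the first voice whose name matches, else None."""
    for voice in payload.get("voices", []):
        if voice.get("name") == name:
            return voice.get("voice_id")
    return None
```

For example, `find_voice_id(fetch_voices(api_key), "Narrator")` gives you the value to export as `ELEVENLABS_VOICE_ID`.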

### Script

As an alternative to running the commands above individually, you can use the `setup.sh` script: update the environment variable values in `setup.sh`, then execute it to get the required shell environment ready.

_Note: you may have to manually run `source venv/bin/activate` afterwards, depending on your shell environment._

## Run it!

In one terminal, run the webcam capture:

```bash
python capture.py
```

In another terminal, run the narrator:

```bash
python narrator.py
```

## Options

### Streaming

If you would like the speech to start sooner, enable streaming by setting the environment variable below. The trade-off is that the audio snippet is not saved to the `/narration` directory.

```bash
export ELEVENLABS_STREAMING=true
```

### PhotoBooth

By default, the app continually analyzes images. If you would prefer a mode closer to a photo booth, set the environment variable below; the image will then only be analyzed when the spacebar is pressed.

```bash
export PHOTOBOOTH_MODE=true
```
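Both options are plain string comparisons in `narrator.py`: only the literal value `true` enables a feature. A minimal sketch of the same pattern (the helper name `env_flag` is illustrative, not from the repo):

```python
import os


def env_flag(name, default="false"):
    """Return True only when the variable is the exact string "true"."""
    return os.environ.get(name, default) == "true"


# Mirrors the two flags read by narrator.py
is_streaming = env_flag("ELEVENLABS_STREAMING")
is_photo_booth = env_flag("PHOTOBOOTH_MODE")
```

Note the comparison is case-sensitive, so `TRUE` or `1` will not enable a feature.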
9 changes: 5 additions & 4 deletions capture.py
@@ -1,8 +1,9 @@
import cv2
import os
import time
from PIL import Image

import cv2
import numpy as np
import os
from PIL import Image

# Folder
folder = "frames"
@@ -30,7 +31,7 @@
# Resize the image
max_size = 250
ratio = max_size / max(pil_img.size)
new_size = tuple([int(x*ratio) for x in pil_img.size])
new_size = tuple([int(x * ratio) for x in pil_img.size])
resized_img = pil_img.resize(new_size, Image.LANCZOS)

# Convert the PIL image back to an OpenCV image
105 changes: 79 additions & 26 deletions narrator.py
@@ -1,16 +1,45 @@
import os
from openai import OpenAI
import base64
import json
import time
import simpleaudio as sa
import errno
from elevenlabs import generate, play, set_api_key, voices
import os
import time

from elevenlabs import generate, play, set_api_key, stream
from openai import OpenAI
from pynput import (  # Using pynput to listen for keypresses instead of the native keyboard module, which required admin privileges
keyboard,
)

client = OpenAI()

set_api_key(os.environ.get("ELEVENLABS_API_KEY"))

# Initialize the variables based on their respective environment variable values, defaulting to false
isStreaming = os.environ.get("ELEVENLABS_STREAMING", "false") == "true"
isPhotoBooth = os.environ.get("PHOTOBOOTH_MODE", "false") == "true"

script = []
narrator = "Sir David Attenborough"


def on_press(key):
if key == keyboard.Key.space:
# When space bar is pressed, run the main function which analyzes the image and generates the audio
_main()


def on_release(key):
if key == keyboard.Key.esc:
# Stop listener
return False


# Create a listener
listener = keyboard.Listener(on_press=on_press, on_release=on_release)

# Start the listener
listener.start()


def encode_image(image_path):
while True:
try:
@@ -25,8 +54,19 @@ def encode_image(image_path):


def play_audio(text):
audio = generate(text, voice=os.environ.get("ELEVENLABS_VOICE_ID"))
audio = generate(
text,
voice=os.environ.get("ELEVENLABS_VOICE_ID"),
model="eleven_turbo_v2",
stream=isStreaming,
)

if isStreaming:
# Stream the audio for more real-time responsiveness
stream(audio)
return

# Save the audio to a file and play it
unique_id = base64.urlsafe_b64encode(os.urandom(30)).decode("utf-8").rstrip("=")
dir_path = os.path.join("narration", unique_id)
os.makedirs(dir_path, exist_ok=True)
@@ -43,7 +83,10 @@ def generate_new_line(base64_image):
{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image"},
{
"type": "text",
"text": f"Describe this image as if you are {narrator}",
},
{
"type": "image_url",
"image_url": f"data:image/jpeg;base64,{base64_image}",
@@ -59,8 +102,8 @@ def analyze_image(base64_image, script):
messages=[
{
"role": "system",
"content": """
You are Sir David Attenborough. Narrate the picture of the human as if it is a nature documentary.
"content": f"""
You are {narrator}. Narrate the picture of the human as if it is a nature documentary.
Make it snarky and funny. Don't repeat yourself. Make it short. If I do anything remotely interesting, make a big deal about it!
""",
},
@@ -73,30 +116,40 @@ def analyze_image(base64_image, script):
return response_text


def main():
script = []
def _main():
global script

while True:
# path to your image
image_path = os.path.join(os.getcwd(), "./frames/frame.jpg")
# path to your image
image_path = os.path.join(os.getcwd(), "./frames/frame.jpg")

# getting the base64 encoding
base64_image = encode_image(image_path)

# analyze posture
print(f"👀 {narrator} is watching...")
analysis = analyze_image(base64_image, script=script)

# getting the base64 encoding
base64_image = encode_image(image_path)
print("🎙️ David says:")
print(analysis)

# analyze posture
print("👀 David is watching...")
analysis = analyze_image(base64_image, script=script)
play_audio(analysis)

print("🎙️ David says:")
print(analysis)
script = script + [{"role": "assistant", "content": analysis}]

play_audio(analysis)

script = script + [{"role": "assistant", "content": analysis}]
def main():
while True:
if isPhotoBooth:
pass
else:
_main()

# wait for 5 seconds
time.sleep(5)

# wait for 5 seconds
time.sleep(5)

if isPhotoBooth:
print(f"Press the spacebar to trigger {narrator}")

if __name__ == "__main__":
main()
3 changes: 2 additions & 1 deletion requirements.txt
@@ -28,6 +28,7 @@ pure-eval==0.2.2
pydantic==2.4.2
pydantic_core==2.10.1
Pygments==2.16.1
pynput==1.7.6
requests==2.31.0
simpleaudio==1.0.4
six==1.16.0
@@ -38,4 +39,4 @@ traitlets==5.13.0
typing_extensions==4.8.0
urllib3==2.0.7
wcwidth==0.2.10
websockets==12.0
18 changes: 18 additions & 0 deletions setup.sh
@@ -0,0 +1,18 @@
#!/bin/bash

# create a virtual environment
python3 -m pip install virtualenv
python3 -m virtualenv venv

# source the virtual environment
source venv/bin/activate

# install the dependencies
pip install -r requirements.txt

# set the environment variables
export ELEVENLABS_VOICE_ID=
export OPENAI_API_KEY=
export ELEVENLABS_API_KEY=

export ELEVENLABS_STREAMING=false