The Code: Call Gemini with Text and Image

In this section, we’ll use Google’s official google-generativeai library to call Gemini Pro Vision with both text and an image.

Setup

First, get a free API key:

1. Go to ai.google.dev
2. Click “Get API Key”
3. Choose or create a Google Cloud project
4. Copy your API key

Free tier: 60 requests per minute, enough for learning and testing.
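
The code in this section passes the key as a string literal, which is easy to leak when a notebook is shared. Here is a minimal sketch of one alternative, assuming the key is stored in an environment variable. The variable name GOOGLE_API_KEY and the helper load_api_key are illustrative choices for this tutorial, not something the library requires.

```python
import os


def load_api_key(env=None, var_name="GOOGLE_API_KEY"):
    """Return the API key stored in an environment variable.

    `var_name` is an assumed convention for this tutorial; use whatever
    name you exported the key under.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Usage (uncomment once the variable is set):
# import google.generativeai as genai
# genai.configure(api_key=load_api_key())
```

The `env` parameter exists only so the helper can be exercised with a plain dictionary; in normal use you would call it with no arguments.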

Code: Multimodal Input

import google.generativeai as genai
from PIL import Image
import requests
from io import BytesIO

# Configure API (replace with your key)
genai.configure(api_key="YOUR_API_KEY_HERE")

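# Load a sample image from the web (public image)
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1280px-Cat03.jpg"
# Some hosts reject requests without a User-Agent; this value is only a placeholder
headers = {"User-Agent": "gemini-tutorial-example/0.1"}
http_response = requests.get(url, headers=headers, timeout=30)
http_response.raise_for_status()  # fail early on 4xx/5xx instead of handing an error page to PIL
img = Image.open(BytesIO(http_response.content))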

# Create Gemini model instance (Gemini Pro Vision)
model = genai.GenerativeModel("gemini-pro-vision")

# Send image + text prompt together (native multimodality)
response = model.generate_content([
    "What is in this image? Describe the scene in 2 sentences.",
    img
])

print("Gemini's response:")
print(response.text)

Code: Text-Only Input (for comparison)

import google.generativeai as genai

# Configure API
genai.configure(api_key="YOUR_API_KEY_HERE")

# Create model instance
model = genai.GenerativeModel("gemini-pro")

# Send text-only prompt
response = model.generate_content(
    "Write a poem about a cat in 3 lines."
)

print("Gemini's response:")
print(response.text)

Expected Output

For the image prompt, Gemini might respond:

This image shows an orange tabby cat looking directly at the camera 
with a focused expression. The cat's fur is fluffy and well-groomed, 
and the background is blurred green foliage, emphasizing the cat's face.

For the poem prompt:

Whiskers twitch in the morning light,
A pounce, a leap, a tail held tight,
The cat rules all, from dawn to night.

Why This Matters

Notice that in the first code block, we pass both the image object and the text string in a list:

response = model.generate_content([
    "What is in this image? Describe the scene in 2 sentences.",
    img
])

This is native multimodality: Gemini processes the text and image together in a single sequence, not one after the other. Google has not published every architectural detail, but conceptually the pipeline looks like this:

1. The image is split into fixed-size patches (for example, 14×14 pixels)
2. The text is tokenised with SentencePiece
3. Both are embedded to the same dimension
4. The Transformer attends to all tokens (image patches + text) together
5. The response is generated
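
To make the "one sequence" idea concrete, here is a toy sketch of the first steps above. It is not Gemini's implementation: whitespace splitting stands in for SentencePiece, the embedding step is skipped, and the helper names patch_grid and build_sequence are made up for illustration. It only shows how image patches and text tokens end up in a single list that a Transformer could attend over.

```python
def patch_grid(width, height, patch_size=14):
    """Return (columns, rows) of whole patches that fit in the image."""
    return width // patch_size, height // patch_size


def build_sequence(width, height, prompt, patch_size=14):
    """Build one combined sequence of image-patch and text tokens.

    Whitespace splitting is a stand-in for a real subword tokeniser such as
    SentencePiece; it keeps the example dependency-free.
    """
    cols, rows = patch_grid(width, height, patch_size)
    image_tokens = [("image", (row, col)) for row in range(rows) for col in range(cols)]
    text_tokens = [("text", word) for word in prompt.split()]
    # One sequence: attention would run across image and text tokens together.
    return image_tokens + text_tokens


sequence = build_sequence(224, 224, "What is in this image?")
print(len(sequence))  # 16 * 16 image patches + 5 words = 261
```

Even a small 224×224 image contributes far more tokens than a short prompt, which is one reason image inputs cost more to process than text.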

Limitations of the Free API

- Rate limit: 60 requests per minute (plenty for learning)
- Model: Gemini Pro only; Ultra is not available via the free API
- Inputs: text and images only; audio and video are not supported via this endpoint
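
If you call the API in a loop, it is easy to exceed the per-minute limit. Below is a minimal sketch of a client-side throttle that spaces calls evenly. The Throttle class is a hypothetical helper, not part of google-generativeai, and the default of 60 per minute is taken from the free-tier figure quoted above.

```python
import time


class Throttle:
    """Space calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute=60, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until the next call is allowed, then record the call time."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Usage: call throttle.wait() before each request.
# throttle = Throttle(per_minute=60)
# for prompt in prompts:
#     throttle.wait()
#     response = model.generate_content(prompt)
```

The `clock` and `sleep` parameters are injectable so the class can be tested without real waiting. Even spacing is the simplest policy; it does not allow short bursts the way a token bucket would.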

Running on Google Colab

On Colab, install the libraries in the first cell, then paste the code above into the next cell and run it:

# Install library if needed
!pip install google-generativeai pillow requests

# Then paste the code above

Next: Limitations: What Gemini Can’t Do