Lecture 27 – Data 100, Spring 2025¶


Getting Set Up¶

You can run this notebook on the JupyterHub machines, but you will need to set up an OpenAI account. Alternatively, if you are running on your own computer, you can also try to run a model locally.
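If you do want to try a local model, one option is to serve it with Ollama and call it through LangChain. This is a minimal sketch, not something this notebook depends on; it assumes you have installed Ollama, pulled a model (llama3.2 here is just an example), and run pip install langchain-ollama:

# Minimal local-model sketch (assumes the Ollama server is running and
# that `ollama pull llama3.2` has been done; the model name is an example).
from langchain_ollama import OllamaLLM

local_llm = OllamaLLM(model="llama3.2")
print(local_llm.invoke("What is the capital of California? Provide a short answer."))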

Step 1. Create an OpenAI account¶

You can create a free account, which comes with some initial free credits, by going here:

https://platform.openai.com

You will then need to get an API key. Save that key to a local file called openai.key:

In [1]:
# with open("openai.key", "w") as f:
#     f.write("YOUR KEY")

Step 2. Install Python Tools¶

Run the following cell to install the required packages.

In [2]:
!pip install -U openai langchain langchain-openai
Collecting openai
  Downloading openai-1.76.2-py3-none-any.whl.metadata (25 kB)
Collecting langchain
  Downloading langchain-0.3.24-py3-none-any.whl.metadata (7.8 kB)
Collecting langchain-openai
  Downloading langchain_openai-0.3.15-py3-none-any.whl.metadata (2.3 kB)
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch 2.5.1 requires sympy==1.13.1; python_version >= "3.9", but you have sympy 1.13.3 which is incompatible.
Successfully installed distro-1.9.0 jiter-0.9.0 jsonpatch-1.33 langchain-0.3.24 langchain-core-0.3.56 langchain-openai-0.3.15 langchain-text-splitters-0.3.8 langsmith-0.3.39 openai-1.76.2 orjson-3.10.18 requests-toolbelt-1.0.0 tiktoken-0.9.0 typing-extensions-4.13.2 zstandard-0.23.0

Using OpenAI with LangChain¶

In [3]:
from langchain_openai import OpenAI
import pandas as pd
In [4]:
openai_key = open("openai.key", "r").readline().strip()
llm = OpenAI(openai_api_key=openai_key,
             model_name="gpt-3.5-turbo-instruct")
In [5]:
llm.invoke("What is the capital of California? Provide a short answer.")
Out[5]:
'\n\nSacramento. '
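For short factual or classification-style prompts, you may want more deterministic output. The LangChain OpenAI wrapper accepts a temperature argument; a sketch (temperature=0 is a common choice, not something this lecture prescribes):

# Lower temperature -> more repeatable completions (sketch).
deterministic_llm = OpenAI(openai_api_key=openai_key,
                           model_name="gpt-3.5-turbo-instruct",
                           temperature=0)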
In [6]:
for chunk in llm.stream("Write a short song about data science and large language models."):
    print(chunk, end="", flush=True)

Verse 1:
Data science, the magic of numbers
Uncovering secrets, we all wonder
From mountains of data, we find the truth
And with each discovery, we gain new proof

Chorus:
Data science, it's the key
To unlock mysteries, we can't see
With large language models, we can explore
The depths of knowledge, never seen before

Verse 2:
Language models, a powerful tool
Processing words, making sense of it all
From speech to text, they can understand
And help us communicate, in a whole new land

Chorus:
Data science, it's the key
To unlock mysteries, we can't see
With large language models, we can explore
The depths of knowledge, never seen before

Bridge:
 of data, we learn more
And with each model, we open new doors
The possibilities, they seem endless
With large language models, we are fearless

Chorus:
Data science, it's the key
To unlock mysteries, we can't see
With large language models, we can explore
The depths of knowledge, never seen before

Outro:
Let's embrace this world of data and code
And together, let's unlock

Data Analytics¶

We can use LLMs to help analyze data.

In [7]:
tweets = pd.read_json("AOC_recent_tweets.txt")
list(tweets['full_text'][0:10])
Out[7]:
['RT @RepEscobar: Our country has the moral obligation and responsibility to reunite every single family separated at the southern border.\n\nT…',
 'RT @RoKhanna: What happens when we guarantee $15/hour?\n\n💰 31% of Black workers and 26% of Latinx workers get raises.\n😷 A majority of essent…',
 '(Source: https://t.co/3o5JEr6zpd)',
 'Joe Cunningham pledged to never take corporate PAC money, and he never did. Mace said she’ll cash every check she gets. Yet another way this is a downgrade. https://t.co/DytsQXKXgU',
 'What’s even more gross is that Mace takes corporate PAC money.\n\nShe’s already funded by corporations. Now she’s choosing to swindle working people on top of it.\n\nPeak scam artistry. Caps for cash 💰 https://t.co/CcVxgDF6id',
 'Joe Cunningham already proving to be leagues more decent + honest than Mace seems capable of.\n\nThe House was far better off w/ Cunningham. It’s sad to see Mace diminish the representation of her community by launching a reputation of craven dishonesty right off the bat.',
 'Pretty horrible.\n\nWell, it’s good to know what kind of person she is early. Also good to know that Mace is cut from the same Trump cloth of dishonesty and opportunism.\n\nSad to see a colleague intentionally hurt other women and survivors to make a buck. Thought she’d be better. https://t.co/CcVxgDF6id',
 'RT @jaketapper: .@RepNancyMace fundraising off the false smear that @AOC misrepresented her experience during the insurrection. She didn’t.…',
 'RT @RepMcGovern: One reason Washington can’t “come together” is because of people like her sending out emails like this.\n\nShe should apolog…',
 'RT @JoeNeguse: Just to be clear, “targeting” stimulus checks means denying them to some working families who would otherwise receive them.']




Suppose we wanted to evaluate whether a tweet is making a statement about minimum wage.

In [8]:
prompt = """
Is the following text making a statement about minimum wage? You should answer either Yes or No.

{text}

Answer:
"""
questions = [prompt.format_map(dict(text=t)) for t in tweets['full_text'].head(20)]
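Equivalently, we could build the same strings with LangChain's PromptTemplate instead of format_map; a sketch that should produce identical prompts:

from langchain_core.prompts import PromptTemplate

template = PromptTemplate.from_template(prompt)
questions_alt = [template.format(text=t) for t in tweets['full_text'].head(20)]
assert questions_alt == questions  # same strings either way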

Ask the LLM to answer each of the questions:

In [9]:
open_ai_answers = llm.batch(questions)
open_ai_answers
Out[9]:
['No ',
 '\nYes',
 '\nNo',
 '\nNo',
 '\nNo',
 '\nNo',
 '\nNo',
 '\nNo',
 '\nNo',
 '\nYes',
 '\nYes',
 '\nNo',
 '\nNo',
 '\nYes',
 '\nNo',
 '\nYes',
 '\nNo',
 '\nNo',
 '\nYes',
 '\nNo']
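Note the answers come back with inconsistent leading whitespace, so the parsing below just checks for a "Y" anywhere in the string. That works for clean Yes/No answers but is brittle; a slightly safer sketch:

# Normalize whitespace and compare the first word instead of matching any "Y".
answers_bool = [a.strip().startswith("Yes") for a in open_ai_answers]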
In [10]:
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame({"OpenAI": open_ai_answers, 
                   "Text": tweets['full_text'].head(20)})
df["OpenAI"] = df["OpenAI"].str.contains("Y")
df
Out[10]:
OpenAI Text
0 False RT @RepEscobar: Our country has the moral obligation and responsibility to reunite every single family separated at the southern border.\n\nT…
1 True RT @RoKhanna: What happens when we guarantee $15/hour?\n\n💰 31% of Black workers and 26% of Latinx workers get raises.\n😷 A majority of essent…
2 False (Source: https://t.co/3o5JEr6zpd)
3 False Joe Cunningham pledged to never take corporate PAC money, and he never did. Mace said she’ll cash every check she gets. Yet another way this is a downgrade. https://t.co/DytsQXKXgU
4 False What’s even more gross is that Mace takes corporate PAC money.\n\nShe’s already funded by corporations. Now she’s choosing to swindle working people on top of it.\n\nPeak scam artistry. Caps for cash 💰 https://t.co/CcVxgDF6id
5 False Joe Cunningham already proving to be leagues more decent + honest than Mace seems capable of.\n\nThe House was far better off w/ Cunningham. It’s sad to see Mace diminish the representation of her community by launching a reputation of craven dishonesty right off the bat.
6 False Pretty horrible.\n\nWell, it’s good to know what kind of person she is early. Also good to know that Mace is cut from the same Trump cloth of dishonesty and opportunism.\n\nSad to see a colleague intentionally hurt other women and survivors to make a buck. Thought she’d be better. https://t.co/CcVxgDF6id
7 False RT @jaketapper: .@RepNancyMace fundraising off the false smear that @AOC misrepresented her experience during the insurrection. She didn’t.…
8 False RT @RepMcGovern: One reason Washington can’t “come together” is because of people like her sending out emails like this.\n\nShe should apolog…
9 True RT @JoeNeguse: Just to be clear, “targeting” stimulus checks means denying them to some working families who would otherwise receive them.
10 True Amazon workers have the right to form a union.\n\nAnti-union tactics like these, especially from a trillion-dollar company trying to disrupt essential workers from organizing for better wages and dignified working conditions in a pandemic, are wrong. https://t.co/nTDqMUapYs
11 False RT @WorkingFamilies: Voters elected Democrats to deliver more relief, not less.
12 False We should preserve what was there and not peg it to outdated 2019 income. People need help!
13 True If conservative Senate Dems institute a lower income threshold in the next round of checks, that could potentially mean the first round of checks under Trump help more people than the first round under Biden.\n\nDo we want to do that? No? Then let’s stop playing &amp; just help people.
14 False @iamjoshfitz 😂 call your member of Congress, they can help track it down
15 True All Dems need for the slam dunk is to do what people elected us to do: help as many people as possible.\n\nIt’s not hard. Let’s not screw it up with austerity nonsense that squeezes the working class yet never makes a peep when tax cuts for yachts and private jets are proposed.
16 False It should be $2000 to begin w/ anyway. Brutally means-testing a $1400 round is going to hurt so many people. THAT is the risk we can’t afford.\n\nIncome thresholds already work in reverse &amp; lag behind reality. Conservative Dems can ask to tax $ back later if they’re so concerned.
17 False We cannot cut off relief at $50k. It is shockingly out of touch to assert that $50k is “too wealthy” to receive relief.\n\nMillions are on the brink of eviction. Give too little and they’re devastated. Give “too much” and a single mom might save for a rainy day. This isn’t hard. https://t.co/o14r3phJeH
18 True Imagine being a policymaker in Washington, having witnessed the massive economic, social, and health destruction over the last year, and think that the greatest policy risk we face is providing *too much* relief.\n\nSounds silly, right?\n\n$1.9T should be a floor, not a ceiling.
19 False @AndrewYang @TweetBenMax @RitchieTorres Thanks @AndrewYang! Happy to chat about the plan details and the community effort that’s gone into this legislation. 🌃🌎

Working with Google Gemini Models¶

You will need to install the Gemini API client to use the code below. You can install it by running the following command:

In [11]:
!pip install -q -U google-generativeai

You will need to obtain an API key. Unfortunately, UC Berkeley has not yet enabled access to the Gemini API for Berkeley accounts, but you can use any free Google account instead. You can obtain an API key by following the instructions here.

Once you get an API Key you can put it here:

In [14]:
# with open("gemini_key.txt", "w") as f:
#     f.write("YOUR KEY")
In [15]:
GEMINI_API_KEY = None
if not GEMINI_API_KEY:
    with open("gemini_key.txt", "r") as f:
        GEMINI_API_KEY = f.read().strip()
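If you would rather not keep the key in a file, you could fall back to an environment variable; a small sketch (assumes you have exported GEMINI_API_KEY in your shell):

import os

# Prefer the environment variable if it is set (assumption: exported in your shell).
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", GEMINI_API_KEY)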

We can then connect to the Gemini API using the following code:

In [16]:
import google.generativeai as genai
genai.configure(api_key=GEMINI_API_KEY)

models_df = pd.DataFrame(genai.list_models())
models_df
Out[16]:
name base_model_id version display_name description input_token_limit output_token_limit supported_generation_methods temperature max_temperature top_p top_k
0 models/chat-bison-001 001 PaLM 2 Chat (Legacy) A legacy text-only model optimized for chat conversations 4096 1024 [generateMessage, countMessageTokens] 0.25 NaN 0.95 40.0
1 models/text-bison-001 001 PaLM 2 (Legacy) A legacy model that understands text and generates text as an output 8196 1024 [generateText, countTextTokens, createTunedTextModel] 0.70 NaN 0.95 40.0
2 models/embedding-gecko-001 001 Embedding Gecko Obtain a distributed representation of a text. 1024 1 [embedText, countTextTokens] NaN NaN NaN NaN
3 models/gemini-1.0-pro-vision-latest 001 Gemini 1.0 Pro Vision The original Gemini 1.0 Pro Vision model version which was optimized for image understanding. Gemini 1.0 Pro Vision was deprecated on July 12, 2024. Move to a newer Gemini version. 12288 4096 [generateContent, countTokens] 0.40 NaN 1.00 32.0
4 models/gemini-pro-vision 001 Gemini 1.0 Pro Vision The original Gemini 1.0 Pro Vision model version which was optimized for image understanding. Gemini 1.0 Pro Vision was deprecated on July 12, 2024. Move to a newer Gemini version. 12288 4096 [generateContent, countTokens] 0.40 NaN 1.00 32.0
5 models/gemini-1.5-pro-latest 001 Gemini 1.5 Pro Latest Alias that points to the most recent production (non-experimental) release of Gemini 1.5 Pro, our mid-size multimodal model that supports up to 2 million tokens. 2000000 8192 [generateContent, countTokens] 1.00 2.0 0.95 40.0
6 models/gemini-1.5-pro-001 001 Gemini 1.5 Pro 001 Stable version of Gemini 1.5 Pro, our mid-size multimodal model that supports up to 2 million tokens, released in May of 2024. 2000000 8192 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 64.0
7 models/gemini-1.5-pro-002 002 Gemini 1.5 Pro 002 Stable version of Gemini 1.5 Pro, our mid-size multimodal model that supports up to 2 million tokens, released in September of 2024. 2000000 8192 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 40.0
8 models/gemini-1.5-pro 001 Gemini 1.5 Pro Stable version of Gemini 1.5 Pro, our mid-size multimodal model that supports up to 2 million tokens, released in May of 2024. 2000000 8192 [generateContent, countTokens] 1.00 2.0 0.95 40.0
9 models/gemini-1.5-flash-latest 001 Gemini 1.5 Flash Latest Alias that points to the most recent production (non-experimental) release of Gemini 1.5 Flash, our fast and versatile multimodal model for scaling across diverse tasks. 1000000 8192 [generateContent, countTokens] 1.00 2.0 0.95 40.0
10 models/gemini-1.5-flash-001 001 Gemini 1.5 Flash 001 Stable version of Gemini 1.5 Flash, our fast and versatile multimodal model for scaling across diverse tasks, released in May of 2024. 1000000 8192 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 64.0
11 models/gemini-1.5-flash-001-tuning 001 Gemini 1.5 Flash 001 Tuning Version of Gemini 1.5 Flash that supports tuning, our fast and versatile multimodal model for scaling across diverse tasks, released in May of 2024. 16384 8192 [generateContent, countTokens, createTunedModel] 1.00 2.0 0.95 64.0
12 models/gemini-1.5-flash 001 Gemini 1.5 Flash Alias that points to the most recent stable version of Gemini 1.5 Flash, our fast and versatile multimodal model for scaling across diverse tasks. 1000000 8192 [generateContent, countTokens] 1.00 2.0 0.95 40.0
13 models/gemini-1.5-flash-002 002 Gemini 1.5 Flash 002 Stable version of Gemini 1.5 Flash, our fast and versatile multimodal model for scaling across diverse tasks, released in September of 2024. 1000000 8192 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 40.0
14 models/gemini-1.5-flash-8b 001 Gemini 1.5 Flash-8B Stable version of Gemini 1.5 Flash-8B, our smallest and most cost effective Flash model, released in October of 2024. 1000000 8192 [createCachedContent, generateContent, countTokens] 1.00 2.0 0.95 40.0
15 models/gemini-1.5-flash-8b-001 001 Gemini 1.5 Flash-8B 001 Stable version of Gemini 1.5 Flash-8B, our smallest and most cost effective Flash model, released in October of 2024. 1000000 8192 [createCachedContent, generateContent, countTokens] 1.00 2.0 0.95 40.0
16 models/gemini-1.5-flash-8b-latest 001 Gemini 1.5 Flash-8B Latest Alias that points to the most recent production (non-experimental) release of Gemini 1.5 Flash-8B, our smallest and most cost effective Flash model, released in October of 2024. 1000000 8192 [createCachedContent, generateContent, countTokens] 1.00 2.0 0.95 40.0
17 models/gemini-1.5-flash-8b-exp-0827 001 Gemini 1.5 Flash 8B Experimental 0827 Experimental release (August 27th, 2024) of Gemini 1.5 Flash-8B, our smallest and most cost effective Flash model. Replaced by Gemini-1.5-flash-8b-001 (stable). 1000000 8192 [generateContent, countTokens] 1.00 2.0 0.95 40.0
18 models/gemini-1.5-flash-8b-exp-0924 001 Gemini 1.5 Flash 8B Experimental 0924 Experimental release (September 24th, 2024) of Gemini 1.5 Flash-8B, our smallest and most cost effective Flash model. Replaced by Gemini-1.5-flash-8b-001 (stable). 1000000 8192 [generateContent, countTokens] 1.00 2.0 0.95 40.0
19 models/gemini-2.5-pro-exp-03-25 2.5-exp-03-25 Gemini 2.5 Pro Experimental 03-25 Experimental release (March 25th, 2025) of Gemini 2.5 Pro 1048576 65536 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 64.0
20 models/gemini-2.5-pro-preview-03-25 2.5-preview-03-25 Gemini 2.5 Pro Preview 03-25 Gemini 2.5 Pro Preview 03-25 1048576 65536 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 64.0
21 models/gemini-2.5-flash-preview-04-17 2.5-preview-04-17 Gemini 2.5 Flash Preview 04-17 Preview release (April 17th, 2025) of Gemini 2.5 Flash 1048576 65536 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 64.0
22 models/gemini-2.0-flash-exp 2.0 Gemini 2.0 Flash Experimental Gemini 2.0 Flash Experimental 1048576 8192 [generateContent, countTokens, bidiGenerateContent] 1.00 2.0 0.95 40.0
23 models/gemini-2.0-flash 2.0 Gemini 2.0 Flash Gemini 2.0 Flash 1048576 8192 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 40.0
24 models/gemini-2.0-flash-001 2.0 Gemini 2.0 Flash 001 Stable version of Gemini 2.0 Flash, our fast and versatile multimodal model for scaling across diverse tasks, released in January of 2025. 1048576 8192 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 40.0
25 models/gemini-2.0-flash-exp-image-generation 2.0 Gemini 2.0 Flash (Image Generation) Experimental Gemini 2.0 Flash (Image Generation) Experimental 1048576 8192 [generateContent, countTokens, bidiGenerateContent] 1.00 2.0 0.95 40.0
26 models/gemini-2.0-flash-lite-001 2.0 Gemini 2.0 Flash-Lite 001 Stable version of Gemini 2.0 Flash Lite 1048576 8192 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 40.0
27 models/gemini-2.0-flash-lite 2.0 Gemini 2.0 Flash-Lite Gemini 2.0 Flash-Lite 1048576 8192 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 40.0
28 models/gemini-2.0-flash-lite-preview-02-05 preview-02-05 Gemini 2.0 Flash-Lite Preview 02-05 Preview release (February 5th, 2025) of Gemini 2.0 Flash Lite 1048576 8192 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 40.0
29 models/gemini-2.0-flash-lite-preview preview-02-05 Gemini 2.0 Flash-Lite Preview Preview release (February 5th, 2025) of Gemini 2.0 Flash Lite 1048576 8192 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 40.0
30 models/gemini-2.0-pro-exp 2.5-exp-03-25 Gemini 2.0 Pro Experimental Experimental release (March 25th, 2025) of Gemini 2.5 Pro 1048576 65536 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 64.0
31 models/gemini-2.0-pro-exp-02-05 2.5-exp-03-25 Gemini 2.0 Pro Experimental 02-05 Experimental release (March 25th, 2025) of Gemini 2.5 Pro 1048576 65536 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 64.0
32 models/gemini-exp-1206 2.5-exp-03-25 Gemini Experimental 1206 Experimental release (March 25th, 2025) of Gemini 2.5 Pro 1048576 65536 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 64.0
33 models/gemini-2.0-flash-thinking-exp-01-21 2.5-preview-04-17 Gemini 2.5 Flash Preview 04-17 Preview release (April 17th, 2025) of Gemini 2.5 Flash 1048576 65536 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 64.0
34 models/gemini-2.0-flash-thinking-exp 2.5-preview-04-17 Gemini 2.5 Flash Preview 04-17 Preview release (April 17th, 2025) of Gemini 2.5 Flash 1048576 65536 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 64.0
35 models/gemini-2.0-flash-thinking-exp-1219 2.5-preview-04-17 Gemini 2.5 Flash Preview 04-17 Preview release (April 17th, 2025) of Gemini 2.5 Flash 1048576 65536 [generateContent, countTokens, createCachedContent] 1.00 2.0 0.95 64.0
36 models/learnlm-1.5-pro-experimental 001 LearnLM 1.5 Pro Experimental Alias that points to the most recent stable version of Gemini 1.5 Pro, our mid-size multimodal model that supports up to 2 million tokens. 32767 8192 [generateContent, countTokens] 1.00 2.0 0.95 64.0
37 models/learnlm-2.0-flash-experimental 2.0 LearnLM 2.0 Flash Experimental LearnLM 2.0 Flash Experimental 1048576 32768 [generateContent, countTokens] 1.00 2.0 0.95 64.0
38 models/gemma-3-1b-it 001 Gemma 3 1B 32768 8192 [generateContent, countTokens] 1.00 NaN 0.95 64.0
39 models/gemma-3-4b-it 001 Gemma 3 4B 32768 8192 [generateContent, countTokens] 1.00 NaN 0.95 64.0
40 models/gemma-3-12b-it 001 Gemma 3 12B 32768 8192 [generateContent, countTokens] 1.00 NaN 0.95 64.0
41 models/gemma-3-27b-it 001 Gemma 3 27B 131072 8192 [generateContent, countTokens] 1.00 NaN 0.95 64.0
42 models/embedding-001 001 Embedding 001 Obtain a distributed representation of a text. 2048 1 [embedContent] NaN NaN NaN NaN
43 models/text-embedding-004 004 Text Embedding 004 Obtain a distributed representation of a text. 2048 1 [embedContent] NaN NaN NaN NaN
44 models/gemini-embedding-exp-03-07 exp-03-07 Gemini Embedding Experimental 03-07 Obtain a distributed representation of a text. 8192 1 [embedContent, countTextTokens] NaN NaN NaN NaN
45 models/gemini-embedding-exp exp-03-07 Gemini Embedding Experimental Obtain a distributed representation of a text. 8192 1 [embedContent, countTextTokens] NaN NaN NaN NaN
46 models/aqa 001 Model that performs Attributed Question Answering. Model trained to return answers to questions that are grounded in provided sources, along with estimating answerable probability. 7168 1024 [generateAnswer] 0.20 NaN 1.00 40.0
47 models/imagen-3.0-generate-002 002 Imagen 3.0 002 model Vertex served Imagen 3.0 002 model 480 8192 [predict] NaN NaN NaN NaN
48 models/gemini-2.0-flash-live-001 001 Gemini 2.0 Flash 001 Gemini 2.0 Flash 001 131072 8192 [bidiGenerateContent, countTokens] 1.00 2.0 0.95 64.0
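Not every entry in this table can generate text (some are embedding or image models), so it can be handy to filter on supported_generation_methods; a sketch using the columns above:

# Keep only models that support the generateContent method.
gen_models = models_df[models_df["supported_generation_methods"]
                       .apply(lambda methods: "generateContent" in methods)]
gen_models["name"]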

We can obtain a model and use it to make a prediction. Here we will use the "gemini-2.5-flash-preview-04-17" model, which is generally pretty good for a wide range of tasks.

In [17]:
from IPython.display import Markdown
display(Markdown(models_df[models_df["name"] == "models/gemini-2.5-flash-preview-04-17"]['description'].values[0]))

Preview release (April 17th, 2025) of Gemini 2.5 Flash

In [18]:
model = genai.GenerativeModel("gemini-2.5-flash-preview-04-17")

Use the model to generate text:

In [19]:
response = model.generate_content("Why is Data 100 great?")
Markdown(response.text)
Out[19]:

Okay, let's talk about why Data 100 (specifically the UC Berkeley version, which is arguably the most famous and influential one) is widely considered a great course.

Here are the key reasons:

  1. Comprehensive and Integrated Curriculum: Data 100 doesn't just teach isolated concepts. It brilliantly integrates programming, data manipulation, visualization, statistical thinking (inference), and fundamental machine learning algorithms into a cohesive workflow. It shows students how these pieces fit together to solve real data problems.
  2. Builds a Strong Foundation: Building upon introductory concepts (like those from Data 8 or a stats/programming prerequisite), Data 100 provides a solid, technical base for more advanced data science topics. It teaches why things work, not just how to use a library function.
  3. Hands-on and Practical: The course heavily emphasizes practical application through labs and assignments using industry-standard tools like Python, Pandas, NumPy, Matplotlib, and scikit-learn. Students spend a significant amount of time coding and manipulating real-world(ish) datasets.
  4. Rigorous and Challenging (in a good way): Data 100 is known for being demanding. It requires students to think critically, debug complex code, and understand the underlying principles of the algorithms they use. This rigor leads to deep learning and prepares students for the challenges of real data science work.
  5. Project-Based Learning: A significant portion of the course is dedicated to larger projects where students apply everything they've learned – from data cleaning and visualization to model building and evaluation – to a substantial problem. This mimics real-world data science workflows and helps solidify understanding.
  6. Focus on the Entire Data Lifecycle: It doesn't just focus on modeling. It covers essential skills like data cleaning ("data wrangling"), exploratory data analysis (EDA), and communicating results, which are crucial but often overlooked in more algorithm-focused courses.
  7. Emphasis on Understanding Principles: While it teaches how to use powerful libraries, Data 100 spends time explaining the mechanics behind algorithms like linear regression, logistic regression, and k-nearest neighbors. This conceptual understanding makes students more adaptable when facing new problems or technologies.
  8. Real-World Tools: Students gain proficiency in tools and libraries (like Pandas for data manipulation, Matplotlib/Seaborn for visualization, Scikit-learn for ML) that are standard in the data science industry and research.
  9. Prepares for Future Opportunities: The skills and knowledge gained in Data 100 are highly valuable for internships, research positions, and entry-level data science roles, as well as for pursuing more specialized upper-division courses.
  10. Strong Community and Resources (typically): As a popular, large course, it usually has extensive resources, including helpful TAs, detailed documentation, and a large student community for support.

In short, Data 100 is great because it's a comprehensive, challenging, and practical course that effectively bridges the gap between introductory statistics/programming and more advanced machine learning/data science topics, equipping students with the skills and understanding needed to tackle real-world data problems.
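generate_content also accepts a generation_config if you want to control sampling; a sketch (the particular values are arbitrary, not course recommendations):

response = model.generate_content(
    "Why is Data 100 great? Answer in two sentences.",
    generation_config={
        "temperature": 0.2,        # lower -> more deterministic
        "max_output_tokens": 256,  # cap the response length
    },
)
Markdown(response.text)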

Working with images¶

In [20]:
from IPython.display import Image
from IPython.core.display import HTML
img = Image("data100_logo.png", width=200, height=200)
img
Out[20]:
(The Data 100 logo image is displayed here.)
In [21]:
response = model.generate_content([
    """What is going on in this picture I downloaded from 
    the Berkeley Data100 Course Website? 
    How does it relate to Data Science?""", img])
Markdown(response.text)
Out[21]:

Okay, let's break down the image and its relation to Data Science, especially in the context of Berkeley's Data100 course.

  1. What is going on in the picture? The image is a logo for the Berkeley Data100 course.

    • It clearly displays the text "DATA 100", which is the name of the course.
    • There are curved white lines that could represent the flow of data, statistical curves, or perhaps the process of data manipulation and analysis.
    • Most importantly, there is a cartoon panda bear resting comfortably on these lines.
  2. How does it relate to Data Science? This logo is a visual representation of a key tool used in Data Science, particularly in introductory courses like Data100: the Pandas library in Python.

    • Pandas: Pandas is a fundamental and widely-used open-source Python library for data manipulation and analysis. It provides data structures (like DataFrames) and functions needed to efficiently work with structured data (like tables).
    • The Panda Mascot: The panda bear is the unofficial (but very common) mascot of the Pandas library. Using the panda in the Data100 logo is a direct and clear visual reference to this essential tool that students will learn and use extensively in the course.
    • The Data & Curves: The "DATA" text and the curved lines represent the subject matter itself – data and potentially the patterns, transformations, or analysis performed on it.

In summary, the image is the logo for UC Berkeley's Data100 course. It prominently features the course name ("DATA 100") and a panda bear, which is the widely recognized mascot for the Pandas library. Pandas is a core tool for data manipulation and analysis taught and used in Data100, making the panda a very relevant and symbolic representation of the skills and tools learned in the course. The curved lines represent the data itself that students will be working with using tools like Pandas.
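Here we passed an IPython Image object, which the SDK understands; generate_content also accepts PIL images, which is convenient if you want to crop or resize first. A sketch, assuming Pillow is installed:

from PIL import Image as PILImage

pil_img = PILImage.open("data100_logo.png")
response = model.generate_content(["Describe this logo briefly.", pil_img])
Markdown(response.text)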

You can stream content back, which could be useful for interacting with the model.

In [22]:
from IPython.display import clear_output

response = model.generate_content("Write a poem about Data Science.", stream=True)

output = ""
for chunk in response:
    output += chunk.text
    clear_output(wait=True)
    display(Markdown(output))

From server racks to digital streams, A restless tide, a sea of dreams. Where every click and scroll and trace, Leaves whispers floating time and space.

A wild expanse, unshaped, untamed, Raw numbers waiting to be claimed. A chaos vast, a silent roar, Data piles upon the shore.

Then comes the mind, with patient grace, To tame the mess, prepare the space. With code and tool, a steady hand, To clean the noise, to understand.

They filter out the dust and blur, Make scattered data now cohere. Like sculpting clay or polishing stone, A structured beauty now is shown.

With charts that bloom and graphs that gleam, They paint a visual, vivid dream. Exploring paths, both wide and deep, Unlocking secrets numbers keep.

Then algorithms take their flight, Mathematical engines, burning bright. To seek the patterns, weave the thread, Connect the living and the dead.

They build their models, sharp and keen, To learn from what the past has been. To find the links, the hidden art, The beating pulse, the data's heart.

From insights won, a path is shown, Predictions whisper, softly blown. Guiding decisions, large and small, Preventing failure, standing tall.

It's more than math, beyond the code, It's knowledge rising, lifting load. To see the future, clearer sight, And flood the world with data's light.

So hail the science, sharp and new, That finds the meaning, fresh and true. In bytes and bits, a story lies, Reflected in intelligent eyes.

Using Gen AI for EDA¶

We could use the model to help analyze our data.

In [23]:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_California")[1]
df
Out[23]:
Name City County Enrollment[1] Fall 2022 Founded Athletics
0 University of California, Berkeley Berkeley Alameda 45307 1869 NCAA Div. I (ACC, MPSF, America East)
1 University of California, Davis Davis Yolo 39679 1905 NCAA Div. I (Big Sky, MPSF, Big West, America East)
2 University of California, Irvine Irvine Orange 35937 1965 NCAA Div. I (Big West, MPSF, GCC)
3 University of California, Los Angeles Los Angeles Los Angeles 46430 1882* NCAA Div. I (Big Ten, MPSF)
4 University of California, Merced Merced Merced 9103 2005 NAIA (Cal Pac)
5 University of California, Riverside Riverside Riverside 26809 1954 NCAA Div. I (Big West)
6 University of California, San Diego San Diego San Diego 42006 1960 NCAA Div. I (Big West, MPSF)
7 University of California, Santa Barbara Santa Barbara Santa Barbara 26420 1891** NCAA Div. I (Big West, MPSF, GCC)
8 University of California, Santa Cruz Santa Cruz Santa Cruz 19478 1965 NCAA Div. III (C2C, ASC)
In [24]:
fast_model = genai.GenerativeModel("gemini-1.5-flash-8b")
In [25]:
prompt = "What is the mascot of {school}? Answer by only providing the mascot."
df['mascot'] = df['Name'].apply(
    lambda x: fast_model.generate_content(prompt.format(school=x)).text)
df
Out[25]:
Name City County Enrollment[1] Fall 2022 Founded Athletics mascot
0 University of California, Berkeley Berkeley Alameda 45307 1869 NCAA Div. I (ACC, MPSF, America East) Grizzly Bear\n
1 University of California, Davis Davis Yolo 39679 1905 NCAA Div. I (Big Sky, MPSF, Big West, America East) Aggie\n
2 University of California, Irvine Irvine Orange 35937 1965 NCAA Div. I (Big West, MPSF, GCC) Anteater\n
3 University of California, Los Angeles Los Angeles Los Angeles 46430 1882* NCAA Div. I (Big Ten, MPSF) Bruin\n
4 University of California, Merced Merced Merced 9103 2005 NAIA (Cal Pac) Merced Miner\n
5 University of California, Riverside Riverside Riverside 26809 1954 NCAA Div. I (Big West) Big Red\n
6 University of California, San Diego San Diego San Diego 42006 1960 NCAA Div. I (Big West, MPSF) Triton\n
7 University of California, Santa Barbara Santa Barbara Santa Barbara 26420 1891** NCAA Div. I (Big West, MPSF, GCC) Gaucho\n
8 University of California, Santa Cruz Santa Cruz Santa Cruz 19478 1965 NCAA Div. III (C2C, ASC) Banana Slug\n
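Calling an API inside .apply issues one request per row, so on free tiers you may hit rate limits for larger tables. A sketch that adds a pause between calls (the one-second delay is arbitrary):

import time

def ask_mascot(school):
    response = fast_model.generate_content(prompt.format(school=school))
    time.sleep(1)  # crude rate limiting between requests
    return response.text.strip()

# df['mascot'] = df['Name'].apply(ask_mascot)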

More EDA with OpenAI¶

In [26]:
from langchain_openai import OpenAI
openai_key = open("openai.key", "r").readline().strip()
client = OpenAI(openai_api_key=openai_key,
                model_name="gpt-3.5-turbo-instruct")
In [27]:
# Simulating student feedback data
feedback_data = {
    'StudentID': [1, 2, 3, 4, 5],
    'Feedback': [
        'Great class, learned a lot! But I really did not like PCA.',
        'The course was very informative and well-structured. Would prefer if lectures went faster. ',
        'I found the assignments challenging but rewarding. But the midterm was brutal.',
        'The lectures were engaging and the instructor was very knowledgeable.',
        'I struggled with the linear algebra. I would recommend this class to anyone interested in data science.'
    ],
    'Rating': [5, 4, 4, 5, 5]
}
feedback_df = pd.DataFrame(feedback_data)
feedback_df
Out[27]:
StudentID Feedback Rating
0 1 Great class, learned a lot! But I really did not like PCA. 5
1 2 The course was very informative and well-structured. Would prefer if lectures went faster. 4
2 3 I found the assignments challenging but rewarding. But the midterm was brutal. 4
3 4 The lectures were engaging and the instructor was very knowledgeable. 5
4 5 I struggled with the linear algebra. I would recommend this class to anyone interested in data science. 5
In [28]:
output_schema = {
        "type": "json_schema",
        "json_schema": {
            "name": "issue_schema",
            "schema": {
                "type": "object",
                "properties": {
                    "Issue": {
                        "description": "Any issues or concerns the user raised about the class.",
                        "type": "string"
                    },
                    "Liked": {
                        "description": "Any things the user liked about the class.",
                        "type": "string"
                    }
                },
                "additionalProperties": False
            }
        }
    }
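Note that output_schema is written in the OpenAI structured-output (response_format) style, but it is never actually passed to the model below; the completions-style gpt-3.5-turbo-instruct endpoint does not accept it, which is why process_feedback falls back to regex parsing. With a chat model you could use the schema directly; a sketch, assuming access to a chat-completions model (gpt-4o-mini here is just an example):

from openai import OpenAI as OpenAIClient

chat_client = OpenAIClient(api_key=openai_key)
completion = chat_client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; any chat model supporting structured outputs
    messages=[{"role": "user",
               "content": 'Extract the issues and likes. Feedback: "Great class, but PCA was rough."'}],
    response_format=output_schema,
)
print(completion.choices[0].message.content)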

import re, json

def process_feedback(feedback):
    prompt = f"""Extract the following information in JSON format:
    {{
  "Issue": "Any issues or concerns the user raised about the class.",
  "Liked": "Any things the user liked about the class."
  }}

  Feedback: "{feedback}"
"""
    response = client.invoke(prompt)
    # Pull the first JSON object out of the raw completion text.
    json_match = re.search(r"\{.*\}", response, re.DOTALL)
    try:
        return json.loads(json_match.group(0)) if json_match else {"Issue": "", "Liked": ""}
    except json.JSONDecodeError:
        return {"Issue": "", "Liked": ""}
In [29]:
responses = feedback_df["Feedback"].apply(process_feedback)
responses
Out[29]:
0                                                     {'Issue': 'I really did not like PCA.', 'Liked': 'Great class, learned a lot!'}
1                 {'Issue': 'Would prefer if lectures went faster.', 'Liked': 'The course was very informative and well-structured.'}
2                                 {'Issue': 'The midterm was brutal.', 'Liked': 'I found the assignments challenging but rewarding.'}
3                                   {'Issue': None, 'Liked': 'The lectures were engaging and the instructor was very knowledgeable.'}
4    {'Issue': 'I struggled with the linear algebra.', 'Liked': 'I would recommend this class to anyone interested in data science.'}
Name: Feedback, dtype: object
In [30]:
pd.set_option('display.max_colwidth', None)
feedback_df.join(pd.DataFrame(responses.to_list()))
Out[30]:
StudentID Feedback Rating Issue Liked
0 1 Great class, learned a lot! But I really did not like PCA. 5 I really did not like PCA. Great class, learned a lot!
1 2 The course was very informative and well-structured. Would prefer if lectures went faster. 4 Would prefer if lectures went faster. The course was very informative and well-structured.
2 3 I found the assignments challenging but rewarding. But the midterm was brutal. 4 The midterm was brutal. I found the assignments challenging but rewarding.
3 4 The lectures were engaging and the instructor was very knowledgeable. 5 None The lectures were engaging and the instructor was very knowledgeable.
4 5 I struggled with the linear algebra. I would recommend this class to anyone interested in data science. 5 I struggled with the linear algebra. I would recommend this class to anyone interested in data science.