Getting Set Up¶
You can run this notebook on the Jupyter Hub machines, but you will need to set up an OpenAI account. Alternatively, if you are running on your own computer, you can also try running a model locally.
Step 1. Create an OpenAI account¶
You can create a free account which has some initial free credits by going here:
You will then need to get an API key. Save that key to a local file called openai.key:
# with open("openai.key", "w") as f:
# f.write("YOUR KEY")
Step 2. Install Python Tools¶
Uncomment the following line.
!pip install -U openai langchain langchain-openai
Collecting openai, langchain, langchain-openai, and their dependencies… (pip output trimmed)
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. torch 2.5.1 requires sympy==1.13.1; python_version >= "3.9", but you have sympy 1.13.3 which is incompatible.
Successfully installed distro-1.9.0 jiter-0.9.0 jsonpatch-1.33 langchain-0.3.24 langchain-core-0.3.56 langchain-openai-0.3.15 langchain-text-splitters-0.3.8 langsmith-0.3.39 openai-1.76.2 orjson-3.10.18 requests-toolbelt-1.0.0 tiktoken-0.9.0 typing-extensions-4.13.2 zstandard-0.23.0
Using OpenAI with LangChain¶
from langchain_openai import OpenAI
import pandas as pd
openai_key = open("openai.key", "r").readline().strip()  # strip any trailing newline from the key
llm = OpenAI(openai_api_key=openai_key,
model_name="gpt-3.5-turbo-instruct")
llm.invoke("What is the capital of California? Provide a short answer.")
'\n\nSacramento. '
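Note that raw completions often come back with leading newlines and trailing whitespace, as in the output above. A minimal cleanup sketch (plain string handling, no API call needed):

```python
# Raw completion as returned above, with leading newlines and a trailing space.
raw = "\n\nSacramento. "

# Strip surrounding whitespace and the trailing period to get a bare answer.
answer = raw.strip().rstrip(".")
print(answer)  # → Sacramento
```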
for chunk in llm.stream("Write a short song about data science and large language models."):
print(chunk, end="", flush=True)
Verse 1:
Data science, the magic of numbers
Uncovering secrets, we all wonder
From mountains of data, we find the truth
And with each discovery, we gain new proof

Chorus:
Data science, it's the key
To unlock mysteries, we can't see
With large language models, we can explore
The depths of knowledge, never seen before

Verse 2:
Language models, a powerful tool
Processing words, making sense of it all
From speech to text, they can understand
And help us communicate, in a whole new land

Chorus:
Data science, it's the key
To unlock mysteries, we can't see
With large language models, we can explore
The depths of knowledge, never seen before

Bridge:
of data, we learn more
And with each model, we open new doors
The possibilities, they seem endless
large language models, we are fearless

Chorus:
Data science, it's the key
To unlock mysteries, we can't see
With large language models, we can explore
The depths of knowledge, never seen before

Outro:
's embrace this world of data and code
And together, let's unlock
tweets = pd.read_json("AOC_recent_tweets.txt")
list(tweets['full_text'][0:10])
['RT @RepEscobar: Our country has the moral obligation and responsibility to reunite every single family separated at the southern border.\n\nT…', 'RT @RoKhanna: What happens when we guarantee $15/hour?\n\n💰 31% of Black workers and 26% of Latinx workers get raises.\n😷 A majority of essent…', '(Source: https://t.co/3o5JEr6zpd)', 'Joe Cunningham pledged to never take corporate PAC money, and he never did. Mace said she’ll cash every check she gets. Yet another way this is a downgrade. https://t.co/DytsQXKXgU', 'What’s even more gross is that Mace takes corporate PAC money.\n\nShe’s already funded by corporations. Now she’s choosing to swindle working people on top of it.\n\nPeak scam artistry. Caps for cash 💰 https://t.co/CcVxgDF6id', 'Joe Cunningham already proving to be leagues more decent + honest than Mace seems capable of.\n\nThe House was far better off w/ Cunningham. It’s sad to see Mace diminish the representation of her community by launching a reputation of craven dishonesty right off the bat.', 'Pretty horrible.\n\nWell, it’s good to know what kind of person she is early. Also good to know that Mace is cut from the same Trump cloth of dishonesty and opportunism.\n\nSad to see a colleague intentionally hurt other women and survivors to make a buck. Thought she’d be better. https://t.co/CcVxgDF6id', 'RT @jaketapper: .@RepNancyMace fundraising off the false smear that @AOC misrepresented her experience during the insurrection. She didn’t.…', 'RT @RepMcGovern: One reason Washington can’t “come together” is because of people like her sending out emails like this.\n\nShe should apolog…', 'RT @JoeNeguse: Just to be clear, “targeting” stimulus checks means denying them to some working families who would otherwise receive them.']
Suppose I wanted to evaluate whether a tweet is making a statement about minimum wage:
prompt = """
Is the following text making a statement about minimum wage? You should answer either Yes or No.
{text}
Answer:
"""
questions = [prompt.format_map(dict(text=t)) for t in tweets['full_text'].head(20)]
Ask the LLM to answer each of the questions:
open_ai_answers = llm.batch(questions)
open_ai_answers
['No ', '\nYes', '\nNo', '\nNo', '\nNo', '\nNo', '\nNo', '\nNo', '\nNo', '\nYes', '\nYes', '\nNo', '\nNo', '\nYes', '\nNo', '\nYes', '\nNo', '\nNo', '\nYes', '\nNo']
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame({"OpenAI": open_ai_answers,
"Text": tweets['full_text'].head(20)})
df["OpenAI"] = df["OpenAI"].str.contains("Y")
df
 | OpenAI | Text |
---|---|---|
0 | False | RT @RepEscobar: Our country has the moral obligation and responsibility to reunite every single family separated at the southern border.\n\nT… |
1 | True | RT @RoKhanna: What happens when we guarantee $15/hour?\n\n💰 31% of Black workers and 26% of Latinx workers get raises.\n😷 A majority of essent… |
2 | False | (Source: https://t.co/3o5JEr6zpd) |
3 | False | Joe Cunningham pledged to never take corporate PAC money, and he never did. Mace said she’ll cash every check she gets. Yet another way this is a downgrade. https://t.co/DytsQXKXgU |
4 | False | What’s even more gross is that Mace takes corporate PAC money.\n\nShe’s already funded by corporations. Now she’s choosing to swindle working people on top of it.\n\nPeak scam artistry. Caps for cash 💰 https://t.co/CcVxgDF6id |
5 | False | Joe Cunningham already proving to be leagues more decent + honest than Mace seems capable of.\n\nThe House was far better off w/ Cunningham. It’s sad to see Mace diminish the representation of her community by launching a reputation of craven dishonesty right off the bat. |
6 | False | Pretty horrible.\n\nWell, it’s good to know what kind of person she is early. Also good to know that Mace is cut from the same Trump cloth of dishonesty and opportunism.\n\nSad to see a colleague intentionally hurt other women and survivors to make a buck. Thought she’d be better. https://t.co/CcVxgDF6id |
7 | False | RT @jaketapper: .@RepNancyMace fundraising off the false smear that @AOC misrepresented her experience during the insurrection. She didn’t.… |
8 | False | RT @RepMcGovern: One reason Washington can’t “come together” is because of people like her sending out emails like this.\n\nShe should apolog… |
9 | True | RT @JoeNeguse: Just to be clear, “targeting” stimulus checks means denying them to some working families who would otherwise receive them. |
10 | True | Amazon workers have the right to form a union.\n\nAnti-union tactics like these, especially from a trillion-dollar company trying to disrupt essential workers from organizing for better wages and dignified working conditions in a pandemic, are wrong. https://t.co/nTDqMUapYs |
11 | False | RT @WorkingFamilies: Voters elected Democrats to deliver more relief, not less. |
12 | False | We should preserve what was there and not peg it to outdated 2019 income. People need help! |
13 | True | If conservative Senate Dems institute a lower income threshold in the next round of checks, that could potentially mean the first round of checks under Trump help more people than the first round under Biden.\n\nDo we want to do that? No? Then let’s stop playing & just help people. |
14 | False | @iamjoshfitz 😂 call your member of Congress, they can help track it down |
15 | True | All Dems need for the slam dunk is to do what people elected us to do: help as many people as possible.\n\nIt’s not hard. Let’s not screw it up with austerity nonsense that squeezes the working class yet never makes a peep when tax cuts for yachts and private jets are proposed. |
16 | False | It should be $2000 to begin w/ anyway. Brutally means-testing a $1400 round is going to hurt so many people. THAT is the risk we can’t afford.\n\nIncome thresholds already work in reverse & lag behind reality. Conservative Dems can ask to tax $ back later if they’re so concerned. |
17 | False | We cannot cut off relief at $50k. It is shockingly out of touch to assert that $50k is “too wealthy” to receive relief.\n\nMillions are on the brink of eviction. Give too little and they’re devastated. Give “too much” and a single mom might save for a rainy day. This isn’t hard. https://t.co/o14r3phJeH |
18 | True | Imagine being a policymaker in Washington, having witnessed the massive economic, social, and health destruction over the last year, and think that the greatest policy risk we face is providing *too much* relief.\n\nSounds silly, right?\n\n$1.9T should be a floor, not a ceiling. |
19 | False | @AndrewYang @TweetBenMax @RitchieTorres Thanks @AndrewYang! Happy to chat about the plan details and the community effort that’s gone into this legislation. 🌃🌎 |
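The `str.contains("Y")` conversion above is a bit brittle: it would also fire on a stray capital "Y" anywhere in a longer completion. A slightly stricter normalization, sketched here using raw answer strings shaped like the ones returned above:

```python
def to_bool(ans: str) -> bool:
    """Map a raw Yes/No completion (possibly with surrounding whitespace) to a boolean."""
    return ans.strip().lower().startswith("yes")

raw_answers = ["No ", "\nYes", "\nNo", "Yes."]
print([to_bool(a) for a in raw_answers])  # → [False, True, False, True]
```

With a helper like this, `df["OpenAI"].map(to_bool)` could replace the `str.contains` call.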
Working with Google Gemini Models¶
You will need to install the Gemini API client to use the code below. You can install it by uncommenting and running the following command:
!pip install -q -U google-generativeai
You will need to obtain an API key. Unfortunately, UC Berkeley has not yet enabled access to the Gemini API for Berkeley accounts, but you can use any free Google account to obtain an API key by following the instructions here.
Once you get an API Key you can put it here:
# with open("gemini_key.txt", "w") as f:
# f.write("YOUR KEY")
GEMINI_API_KEY = None
if not GEMINI_API_KEY:
with open("gemini_key.txt", "r") as f:
GEMINI_API_KEY = f.read().strip()
We can then connect to the Gemini API using the following code:
import google.generativeai as genai
genai.configure(api_key=GEMINI_API_KEY)
models_df = pd.DataFrame(genai.list_models())
models_df
 | name | version | display_name | description | input_token_limit | output_token_limit | supported_generation_methods | temperature | max_temperature | top_p | top_k |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | models/chat-bison-001 | 001 | PaLM 2 Chat (Legacy) | A legacy text-only model optimized for chat conversations | 4096 | 1024 | [generateMessage, countMessageTokens] | 0.25 | NaN | 0.95 | 40.0 | |
1 | models/text-bison-001 | 001 | PaLM 2 (Legacy) | A legacy model that understands text and generates text as an output | 8196 | 1024 | [generateText, countTextTokens, createTunedTextModel] | 0.70 | NaN | 0.95 | 40.0 | |
2 | models/embedding-gecko-001 | 001 | Embedding Gecko | Obtain a distributed representation of a text. | 1024 | 1 | [embedText, countTextTokens] | NaN | NaN | NaN | NaN | |
3 | models/gemini-1.0-pro-vision-latest | 001 | Gemini 1.0 Pro Vision | The original Gemini 1.0 Pro Vision model version which was optimized for image understanding. Gemini 1.0 Pro Vision was deprecated on July 12, 2024. Move to a newer Gemini version. | 12288 | 4096 | [generateContent, countTokens] | 0.40 | NaN | 1.00 | 32.0 | |
4 | models/gemini-pro-vision | 001 | Gemini 1.0 Pro Vision | The original Gemini 1.0 Pro Vision model version which was optimized for image understanding. Gemini 1.0 Pro Vision was deprecated on July 12, 2024. Move to a newer Gemini version. | 12288 | 4096 | [generateContent, countTokens] | 0.40 | NaN | 1.00 | 32.0 | |
5 | models/gemini-1.5-pro-latest | 001 | Gemini 1.5 Pro Latest | Alias that points to the most recent production (non-experimental) release of Gemini 1.5 Pro, our mid-size multimodal model that supports up to 2 million tokens. | 2000000 | 8192 | [generateContent, countTokens] | 1.00 | 2.0 | 0.95 | 40.0 | |
6 | models/gemini-1.5-pro-001 | 001 | Gemini 1.5 Pro 001 | Stable version of Gemini 1.5 Pro, our mid-size multimodal model that supports up to 2 million tokens, released in May of 2024. | 2000000 | 8192 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 64.0 | |
7 | models/gemini-1.5-pro-002 | 002 | Gemini 1.5 Pro 002 | Stable version of Gemini 1.5 Pro, our mid-size multimodal model that supports up to 2 million tokens, released in September of 2024. | 2000000 | 8192 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 40.0 | |
8 | models/gemini-1.5-pro | 001 | Gemini 1.5 Pro | Stable version of Gemini 1.5 Pro, our mid-size multimodal model that supports up to 2 million tokens, released in May of 2024. | 2000000 | 8192 | [generateContent, countTokens] | 1.00 | 2.0 | 0.95 | 40.0 | |
9 | models/gemini-1.5-flash-latest | 001 | Gemini 1.5 Flash Latest | Alias that points to the most recent production (non-experimental) release of Gemini 1.5 Flash, our fast and versatile multimodal model for scaling across diverse tasks. | 1000000 | 8192 | [generateContent, countTokens] | 1.00 | 2.0 | 0.95 | 40.0 | |
10 | models/gemini-1.5-flash-001 | 001 | Gemini 1.5 Flash 001 | Stable version of Gemini 1.5 Flash, our fast and versatile multimodal model for scaling across diverse tasks, released in May of 2024. | 1000000 | 8192 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 64.0 | |
11 | models/gemini-1.5-flash-001-tuning | 001 | Gemini 1.5 Flash 001 Tuning | Version of Gemini 1.5 Flash that supports tuning, our fast and versatile multimodal model for scaling across diverse tasks, released in May of 2024. | 16384 | 8192 | [generateContent, countTokens, createTunedModel] | 1.00 | 2.0 | 0.95 | 64.0 | |
12 | models/gemini-1.5-flash | 001 | Gemini 1.5 Flash | Alias that points to the most recent stable version of Gemini 1.5 Flash, our fast and versatile multimodal model for scaling across diverse tasks. | 1000000 | 8192 | [generateContent, countTokens] | 1.00 | 2.0 | 0.95 | 40.0 | |
13 | models/gemini-1.5-flash-002 | 002 | Gemini 1.5 Flash 002 | Stable version of Gemini 1.5 Flash, our fast and versatile multimodal model for scaling across diverse tasks, released in September of 2024. | 1000000 | 8192 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 40.0 | |
14 | models/gemini-1.5-flash-8b | 001 | Gemini 1.5 Flash-8B | Stable version of Gemini 1.5 Flash-8B, our smallest and most cost effective Flash model, released in October of 2024. | 1000000 | 8192 | [createCachedContent, generateContent, countTokens] | 1.00 | 2.0 | 0.95 | 40.0 | |
15 | models/gemini-1.5-flash-8b-001 | 001 | Gemini 1.5 Flash-8B 001 | Stable version of Gemini 1.5 Flash-8B, our smallest and most cost effective Flash model, released in October of 2024. | 1000000 | 8192 | [createCachedContent, generateContent, countTokens] | 1.00 | 2.0 | 0.95 | 40.0 | |
16 | models/gemini-1.5-flash-8b-latest | 001 | Gemini 1.5 Flash-8B Latest | Alias that points to the most recent production (non-experimental) release of Gemini 1.5 Flash-8B, our smallest and most cost effective Flash model, released in October of 2024. | 1000000 | 8192 | [createCachedContent, generateContent, countTokens] | 1.00 | 2.0 | 0.95 | 40.0 | |
17 | models/gemini-1.5-flash-8b-exp-0827 | 001 | Gemini 1.5 Flash 8B Experimental 0827 | Experimental release (August 27th, 2024) of Gemini 1.5 Flash-8B, our smallest and most cost effective Flash model. Replaced by Gemini-1.5-flash-8b-001 (stable). | 1000000 | 8192 | [generateContent, countTokens] | 1.00 | 2.0 | 0.95 | 40.0 | |
18 | models/gemini-1.5-flash-8b-exp-0924 | 001 | Gemini 1.5 Flash 8B Experimental 0924 | Experimental release (September 24th, 2024) of Gemini 1.5 Flash-8B, our smallest and most cost effective Flash model. Replaced by Gemini-1.5-flash-8b-001 (stable). | 1000000 | 8192 | [generateContent, countTokens] | 1.00 | 2.0 | 0.95 | 40.0 | |
19 | models/gemini-2.5-pro-exp-03-25 | 2.5-exp-03-25 | Gemini 2.5 Pro Experimental 03-25 | Experimental release (March 25th, 2025) of Gemini 2.5 Pro | 1048576 | 65536 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 64.0 | |
20 | models/gemini-2.5-pro-preview-03-25 | 2.5-preview-03-25 | Gemini 2.5 Pro Preview 03-25 | Gemini 2.5 Pro Preview 03-25 | 1048576 | 65536 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 64.0 | |
21 | models/gemini-2.5-flash-preview-04-17 | 2.5-preview-04-17 | Gemini 2.5 Flash Preview 04-17 | Preview release (April 17th, 2025) of Gemini 2.5 Flash | 1048576 | 65536 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 64.0 | |
22 | models/gemini-2.0-flash-exp | 2.0 | Gemini 2.0 Flash Experimental | Gemini 2.0 Flash Experimental | 1048576 | 8192 | [generateContent, countTokens, bidiGenerateContent] | 1.00 | 2.0 | 0.95 | 40.0 | |
23 | models/gemini-2.0-flash | 2.0 | Gemini 2.0 Flash | Gemini 2.0 Flash | 1048576 | 8192 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 40.0 | |
24 | models/gemini-2.0-flash-001 | 2.0 | Gemini 2.0 Flash 001 | Stable version of Gemini 2.0 Flash, our fast and versatile multimodal model for scaling across diverse tasks, released in January of 2025. | 1048576 | 8192 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 40.0 | |
25 | models/gemini-2.0-flash-exp-image-generation | 2.0 | Gemini 2.0 Flash (Image Generation) Experimental | Gemini 2.0 Flash (Image Generation) Experimental | 1048576 | 8192 | [generateContent, countTokens, bidiGenerateContent] | 1.00 | 2.0 | 0.95 | 40.0 | |
26 | models/gemini-2.0-flash-lite-001 | 2.0 | Gemini 2.0 Flash-Lite 001 | Stable version of Gemini 2.0 Flash Lite | 1048576 | 8192 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 40.0 | |
27 | models/gemini-2.0-flash-lite | 2.0 | Gemini 2.0 Flash-Lite | Gemini 2.0 Flash-Lite | 1048576 | 8192 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 40.0 | |
28 | models/gemini-2.0-flash-lite-preview-02-05 | preview-02-05 | Gemini 2.0 Flash-Lite Preview 02-05 | Preview release (February 5th, 2025) of Gemini 2.0 Flash Lite | 1048576 | 8192 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 40.0 | |
29 | models/gemini-2.0-flash-lite-preview | preview-02-05 | Gemini 2.0 Flash-Lite Preview | Preview release (February 5th, 2025) of Gemini 2.0 Flash Lite | 1048576 | 8192 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 40.0 | |
30 | models/gemini-2.0-pro-exp | 2.5-exp-03-25 | Gemini 2.0 Pro Experimental | Experimental release (March 25th, 2025) of Gemini 2.5 Pro | 1048576 | 65536 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 64.0 | |
31 | models/gemini-2.0-pro-exp-02-05 | 2.5-exp-03-25 | Gemini 2.0 Pro Experimental 02-05 | Experimental release (March 25th, 2025) of Gemini 2.5 Pro | 1048576 | 65536 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 64.0 | |
32 | models/gemini-exp-1206 | 2.5-exp-03-25 | Gemini Experimental 1206 | Experimental release (March 25th, 2025) of Gemini 2.5 Pro | 1048576 | 65536 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 64.0 | |
33 | models/gemini-2.0-flash-thinking-exp-01-21 | 2.5-preview-04-17 | Gemini 2.5 Flash Preview 04-17 | Preview release (April 17th, 2025) of Gemini 2.5 Flash | 1048576 | 65536 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 64.0 | |
34 | models/gemini-2.0-flash-thinking-exp | 2.5-preview-04-17 | Gemini 2.5 Flash Preview 04-17 | Preview release (April 17th, 2025) of Gemini 2.5 Flash | 1048576 | 65536 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 64.0 | |
35 | models/gemini-2.0-flash-thinking-exp-1219 | 2.5-preview-04-17 | Gemini 2.5 Flash Preview 04-17 | Preview release (April 17th, 2025) of Gemini 2.5 Flash | 1048576 | 65536 | [generateContent, countTokens, createCachedContent] | 1.00 | 2.0 | 0.95 | 64.0 | |
36 | models/learnlm-1.5-pro-experimental | 001 | LearnLM 1.5 Pro Experimental | Alias that points to the most recent stable version of Gemini 1.5 Pro, our mid-size multimodal model that supports up to 2 million tokens. | 32767 | 8192 | [generateContent, countTokens] | 1.00 | 2.0 | 0.95 | 64.0 | |
37 | models/learnlm-2.0-flash-experimental | 2.0 | LearnLM 2.0 Flash Experimental | LearnLM 2.0 Flash Experimental | 1048576 | 32768 | [generateContent, countTokens] | 1.00 | 2.0 | 0.95 | 64.0 | |
38 | models/gemma-3-1b-it | 001 | Gemma 3 1B | 32768 | 8192 | [generateContent, countTokens] | 1.00 | NaN | 0.95 | 64.0 | ||
39 | models/gemma-3-4b-it | 001 | Gemma 3 4B | 32768 | 8192 | [generateContent, countTokens] | 1.00 | NaN | 0.95 | 64.0 | ||
40 | models/gemma-3-12b-it | 001 | Gemma 3 12B | 32768 | 8192 | [generateContent, countTokens] | 1.00 | NaN | 0.95 | 64.0 | ||
41 | models/gemma-3-27b-it | 001 | Gemma 3 27B | 131072 | 8192 | [generateContent, countTokens] | 1.00 | NaN | 0.95 | 64.0 | ||
42 | models/embedding-001 | 001 | Embedding 001 | Obtain a distributed representation of a text. | 2048 | 1 | [embedContent] | NaN | NaN | NaN | NaN | |
43 | models/text-embedding-004 | 004 | Text Embedding 004 | Obtain a distributed representation of a text. | 2048 | 1 | [embedContent] | NaN | NaN | NaN | NaN | |
44 | models/gemini-embedding-exp-03-07 | exp-03-07 | Gemini Embedding Experimental 03-07 | Obtain a distributed representation of a text. | 8192 | 1 | [embedContent, countTextTokens] | NaN | NaN | NaN | NaN | |
45 | models/gemini-embedding-exp | exp-03-07 | Gemini Embedding Experimental | Obtain a distributed representation of a text. | 8192 | 1 | [embedContent, countTextTokens] | NaN | NaN | NaN | NaN | |
46 | models/aqa | 001 | Model that performs Attributed Question Answering. | Model trained to return answers to questions that are grounded in provided sources, along with estimating answerable probability. | 7168 | 1024 | [generateAnswer] | 0.20 | NaN | 1.00 | 40.0 | |
47 | models/imagen-3.0-generate-002 | 002 | Imagen 3.0 002 model | Vertex served Imagen 3.0 002 model | 480 | 8192 | [predict] | NaN | NaN | NaN | NaN | |
48 | models/gemini-2.0-flash-live-001 | 001 | Gemini 2.0 Flash 001 | Gemini 2.0 Flash 001 | 131072 | 8192 | [bidiGenerateContent, countTokens] | 1.00 | 2.0 | 0.95 | 64.0 |
We can obtain a model and use it to make a prediction. Here we will use the "gemini-2.5-flash-preview-04-17" model, which is generally pretty good for a wide range of tasks.
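Not every entry in `models_df` supports text generation (some are embedding or image models). One way to narrow the list to models usable with `generate_content` is to filter on `supported_generation_methods`; the sketch below uses a small stand-in frame with two illustrative rows so it runs without an API key:

```python
import pandas as pd

# Stand-in for a couple of rows of the models_df built above.
models_df_demo = pd.DataFrame({
    "name": ["models/gemini-2.5-flash-preview-04-17", "models/embedding-001"],
    "supported_generation_methods": [
        ["generateContent", "countTokens"],
        ["embedContent"],
    ],
})

# Keep only models that advertise the generateContent method.
chat_models = models_df_demo[
    models_df_demo["supported_generation_methods"]
    .apply(lambda methods: "generateContent" in methods)
]
print(list(chat_models["name"]))  # → ['models/gemini-2.5-flash-preview-04-17']
```

The same filter applied to the real `models_df` would drop the embedding and Imagen rows.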
from IPython.display import Markdown
display(Markdown(models_df[models_df["name"] == "models/gemini-2.5-flash-preview-04-17"]['description'].values[0]))
Preview release (April 17th, 2025) of Gemini 2.5 Flash
model = genai.GenerativeModel("gemini-2.5-flash-preview-04-17")
Use the model to generate text:
response = model.generate_content("Why is Data 100 great?")
Markdown(response.text)
Okay, let's talk about why Data 100 (specifically the UC Berkeley version, which is arguably the most famous and influential one) is widely considered a great course.
Here are the key reasons:
- Comprehensive and Integrated Curriculum: Data 100 doesn't just teach isolated concepts. It brilliantly integrates programming, data manipulation, visualization, statistical thinking (inference), and fundamental machine learning algorithms into a cohesive workflow. It shows students how these pieces fit together to solve real data problems.
- Builds a Strong Foundation: Building upon introductory concepts (like those from Data 8 or a stats/programming prerequisite), Data 100 provides a solid, technical base for more advanced data science topics. It teaches why things work, not just how to use a library function.
- Hands-on and Practical: The course heavily emphasizes practical application through labs and assignments using industry-standard tools like Python, Pandas, NumPy, Matplotlib, and scikit-learn. Students spend a significant amount of time coding and manipulating real-world(ish) datasets.
- Rigorous and Challenging (in a good way): Data 100 is known for being demanding. It requires students to think critically, debug complex code, and understand the underlying principles of the algorithms they use. This rigor leads to deep learning and prepares students for the challenges of real data science work.
- Project-Based Learning: A significant portion of the course is dedicated to larger projects where students apply everything they've learned – from data cleaning and visualization to model building and evaluation – to a substantial problem. This mimics real-world data science workflows and helps solidify understanding.
- Focus on the Entire Data Lifecycle: It doesn't just focus on modeling. It covers essential skills like data cleaning ("data wrangling"), exploratory data analysis (EDA), and communicating results, which are crucial but often overlooked in more algorithm-focused courses.
- Emphasis on Understanding Principles: While it teaches how to use powerful libraries, Data 100 spends time explaining the mechanics behind algorithms like linear regression, logistic regression, and k-nearest neighbors. This conceptual understanding makes students more adaptable when facing new problems or technologies.
- Real-World Tools: Students gain proficiency in tools and libraries (like Pandas for data manipulation, Matplotlib/Seaborn for visualization, Scikit-learn for ML) that are standard in the data science industry and research.
- Prepares for Future Opportunities: The skills and knowledge gained in Data 100 are highly valuable for internships, research positions, and entry-level data science roles, as well as for pursuing more specialized upper-division courses.
- Strong Community and Resources (typically): As a popular, large course, it usually has extensive resources, including helpful TAs, detailed documentation, and a large student community for support.
In short, Data 100 is great because it's a comprehensive, challenging, and practical course that effectively bridges the gap between introductory statistics/programming and more advanced machine learning/data science topics, equipping students with the skills and understanding needed to tackle real-world data problems.
Working with images¶
from IPython.display import Image
from IPython.core.display import HTML
img = Image("data100_logo.png", width=200, height=200)
img
response = model.generate_content([
    """What is going on in this picture I downloaded from
    the Berkeley Data100 Course Website?
    How does it relate to Data Science?""", img])
Markdown(response.text)
Okay, let's break down the image and its relation to Data Science, especially in the context of Berkeley's Data100 course.
What is going on in the picture? The image is a logo for the Berkeley Data100 course.
- It clearly displays the text "DATA 100", which is the name of the course.
- There are curved white lines that could represent the flow of data, statistical curves, or perhaps the process of data manipulation and analysis.
- Most importantly, there is a cartoon panda bear resting comfortably on these lines.
How does it relate to Data Science? This logo is a visual representation of a key tool used in Data Science, particularly in introductory courses like Data100: the Pandas library in Python.
- Pandas: Pandas is a fundamental and widely-used open-source Python library for data manipulation and analysis. It provides data structures (like DataFrames) and functions needed to efficiently work with structured data (like tables).
- The Panda Mascot: The panda bear is the unofficial (but very common) mascot of the Pandas library. Using the panda in the Data100 logo is a direct and clear visual reference to this essential tool that students will learn and use extensively in the course.
- The Data & Curves: The "DATA" text and the curved lines represent the subject matter itself – data and potentially the patterns, transformations, or analysis performed on it.
In summary, the image is the logo for UC Berkeley's Data100 course. It prominently features the course name ("DATA 100") and a panda bear, which is the widely recognized mascot for the Pandas library. Pandas is a core tool for data manipulation and analysis taught and used in Data100, making the panda a very relevant and symbolic representation of the skills and tools learned in the course. The curved lines represent the data itself that students will be working with using tools like Pandas.
You can also stream content back, which is useful for interactive applications where you want to show partial results as they are generated.
from IPython.display import clear_output, display, Markdown

response = model.generate_content("Write a poem about Data Science.", stream=True)
output = ""
for chunk in response:
    # Accumulate the streamed text and redraw the cell output as each chunk arrives.
    output += chunk.text
    clear_output(wait=True)
    display(Markdown(output))
From server racks to digital streams, A restless tide, a sea of dreams. Where every click and scroll and trace, Leaves whispers floating time and space.
A wild expanse, unshaped, untamed, Raw numbers waiting to be claimed. A chaos vast, a silent roar, Data piles upon the shore.
Then comes the mind, with patient grace, To tame the mess, prepare the space. With code and tool, a steady hand, To clean the noise, to understand.
They filter out the dust and blur, Make scattered data now cohere. Like sculpting clay or polishing stone, A structured beauty now is shown.
With charts that bloom and graphs that gleam, They paint a visual, vivid dream. Exploring paths, both wide and deep, Unlocking secrets numbers keep.
Then algorithms take their flight, Mathematical engines, burning bright. To seek the patterns, weave the thread, Connect the living and the dead.
They build their models, sharp and keen, To learn from what the past has been. To find the links, the hidden art, The beating pulse, the data's heart.
From insights won, a path is shown, Predictions whisper, softly blown. Guiding decisions, large and small, Preventing failure, standing tall.
It's more than math, beyond the code, It's knowledge rising, lifting load. To see the future, clearer sight, And flood the world with data's light.
So hail the science, sharp and new, That finds the meaning, fresh and true. In bytes and bits, a story lies, Reflected in intelligent eyes.
Using Gen AI for EDA¶
We could use the model to help analyze our data.
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_California")[1]
df
Name | City | County | Enrollment[1] Fall 2022 | Founded | Athletics | |
---|---|---|---|---|---|---|
0 | University of California, Berkeley | Berkeley | Alameda | 45307 | 1869 | NCAA Div. I (ACC, MPSF, America East) |
1 | University of California, Davis | Davis | Yolo | 39679 | 1905 | NCAA Div. I (Big Sky, MPSF, Big West, America East) |
2 | University of California, Irvine | Irvine | Orange | 35937 | 1965 | NCAA Div. I (Big West, MPSF, GCC) |
3 | University of California, Los Angeles | Los Angeles | Los Angeles | 46430 | 1882* | NCAA Div. I (Big Ten, MPSF) |
4 | University of California, Merced | Merced | Merced | 9103 | 2005 | NAIA (Cal Pac) |
5 | University of California, Riverside | Riverside | Riverside | 26809 | 1954 | NCAA Div. I (Big West) |
6 | University of California, San Diego | San Diego | San Diego | 42006 | 1960 | NCAA Div. I (Big West, MPSF) |
7 | University of California, Santa Barbara | Santa Barbara | Santa Barbara | 26420 | 1891** | NCAA Div. I (Big West, MPSF, GCC) |
8 | University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19478 | 1965 | NCAA Div. III (C2C, ASC) |
fast_model = genai.GenerativeModel("gemini-1.5-flash-8b")
prompt = "What is the mascot of {school}? Answer by only providing the mascot."
# One API call per row; fine for nine schools, but slow for large tables.
df['mascot'] = df['Name'].apply(
    lambda x: fast_model.generate_content(prompt.format(school=x)).text)
df
Name | City | County | Enrollment[1] Fall 2022 | Founded | Athletics | mascot | |
---|---|---|---|---|---|---|---|
0 | University of California, Berkeley | Berkeley | Alameda | 45307 | 1869 | NCAA Div. I (ACC, MPSF, America East) | Grizzly Bear\n |
1 | University of California, Davis | Davis | Yolo | 39679 | 1905 | NCAA Div. I (Big Sky, MPSF, Big West, America East) | Aggie\n |
2 | University of California, Irvine | Irvine | Orange | 35937 | 1965 | NCAA Div. I (Big West, MPSF, GCC) | Anteater\n |
3 | University of California, Los Angeles | Los Angeles | Los Angeles | 46430 | 1882* | NCAA Div. I (Big Ten, MPSF) | Bruin\n |
4 | University of California, Merced | Merced | Merced | 9103 | 2005 | NAIA (Cal Pac) | Merced Miner\n |
5 | University of California, Riverside | Riverside | Riverside | 26809 | 1954 | NCAA Div. I (Big West) | Big Red\n |
6 | University of California, San Diego | San Diego | San Diego | 42006 | 1960 | NCAA Div. I (Big West, MPSF) | Triton\n |
7 | University of California, Santa Barbara | Santa Barbara | Santa Barbara | 26420 | 1891** | NCAA Div. I (Big West, MPSF, GCC) | Gaucho\n |
8 | University of California, Santa Cruz | Santa Cruz | Santa Cruz | 19478 | 1965 | NCAA Div. III (C2C, ASC) | Banana Slug\n |
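Notice the `\n` trailing each mascot in the table above: the model's responses end with a newline. Pandas string methods can normalize this; a small sketch on stand-in values copied from the output:

```python
import pandas as pd

# Stand-in for the mascot column, including the trailing newlines the model returned.
mascots = pd.Series(["Grizzly Bear\n", "Anteater\n", "Banana Slug\n"])
cleaned = mascots.str.strip()
print(cleaned.tolist())  # → ['Grizzly Bear', 'Anteater', 'Banana Slug']
```

In the notebook this would be `df['mascot'] = df['mascot'].str.strip()`.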
More EDA with OpenAI¶
from langchain_openai import OpenAI
# Read the saved API key; strip the trailing newline so the key is sent correctly.
openai_key = open("openai.key", "r").readline().strip()
client = OpenAI(openai_api_key=openai_key,
                model_name="gpt-3.5-turbo-instruct")
# Simulating student feedback data
feedback_data = {
'StudentID': [1, 2, 3, 4, 5],
'Feedback': [
'Great class, learned a lot! But I really did not like PCA.',
'The course was very informative and well-structured. Would prefer if lectures went faster. ',
'I found the assignments challenging but rewarding. But the midterm was brutal.',
'The lectures were engaging and the instructor was very knowledgeable.',
'I struggled with the linear algebra. I would recommend this class to anyone interested in data science.'
],
'Rating': [5, 4, 4, 5, 5]
}
feedback_df = pd.DataFrame(feedback_data)
feedback_df
StudentID | Feedback | Rating | |
---|---|---|---|
0 | 1 | Great class, learned a lot! But I really did not like PCA. | 5 |
1 | 2 | The course was very informative and well-structured. Would prefer if lectures went faster. | 4 |
2 | 3 | I found the assignments challenging but rewarding. But the midterm was brutal. | 4 |
3 | 4 | The lectures were engaging and the instructor was very knowledgeable. | 5 |
4 | 5 | I struggled with the linear algebra. I would recommend this class to anyone interested in data science. | 5 |
output_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "issue_schema",
        "schema": {
            "type": "object",
            "properties": {
                "Issue": {
                    "description": "Any issues or concerns the user raised about the class.",
                    "type": "string"
                },
                "Liked": {
                    "description": "Any things the user liked about the class.",
                    "type": "string"
                }
            },
            "additionalProperties": False
        }
    }
}
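Note that this schema is not actually passed to the instruct model used below (which has no structured-output support); with a chat model that honors JSON schemas, responses would be constrained to these two string fields. A minimal illustration of what a conforming response looks like (the sample string is invented for illustration):

```python
import json

sample_response = '{"Issue": "The midterm was brutal.", "Liked": "Engaging lectures."}'
parsed = json.loads(sample_response)

# A conforming object contains only the two declared string properties.
assert set(parsed) <= {"Issue", "Liked"}
assert all(isinstance(v, str) for v in parsed.values())
```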
import re
import json

def process_feedback(feedback):
    prompt = f"""Extract the following information in JSON format:
    {{
        "Issue": "Any issues or concerns the user raised about the class.",
        "Liked": "Any things the user liked about the class."
    }}
    Feedback: "{feedback}"
    """
    response = client.invoke(prompt)
    try:
        # The instruct model returns free-form text, so pull out the first JSON object.
        json_match = re.search(r"\{.*\}", response, re.DOTALL)
        return json.loads(json_match.group(0)) if json_match else {"Issue": "", "Liked": ""}
    except (json.JSONDecodeError, AttributeError):
        return {"Issue": "", "Liked": ""}
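Because `gpt-3.5-turbo-instruct` returns plain text, the function above has to dig the JSON object out of whatever surrounding prose the model emits. That extraction step can be exercised on its own with a made-up response string:

```python
import re
import json

# A hypothetical model response that wraps the JSON in conversational text.
sample = ('Sure! Here is the extracted information:\n'
          '{"Issue": "The midterm was brutal.", "Liked": "The assignments."}')
match = re.search(r"\{.*\}", sample, re.DOTALL)
parsed = json.loads(match.group(0))
print(parsed["Issue"])  # → The midterm was brutal.
```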
responses = feedback_df["Feedback"].apply(process_feedback)
responses
0    {'Issue': 'I really did not like PCA.', 'Liked': 'Great class, learned a lot!'}
1    {'Issue': 'Would prefer if lectures went faster.', 'Liked': 'The course was very informative and well-structured.'}
2    {'Issue': 'The midterm was brutal.', 'Liked': 'I found the assignments challenging but rewarding.'}
3    {'Issue': None, 'Liked': 'The lectures were engaging and the instructor was very knowledgeable.'}
4    {'Issue': 'I struggled with the linear algebra.', 'Liked': 'I would recommend this class to anyone interested in data science.'}
Name: Feedback, dtype: object
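Both `apply` calls in this notebook make one API call per row, and in practice such calls can fail transiently (rate limits, timeouts). A generic retry wrapper is one way to harden them; this is a sketch, and `call_with_retry` is a hypothetical helper, not part of the notebook:

```python
import time

def call_with_retry(fn, retries=3, delay=1.0):
    """Invoke a zero-argument callable, retrying with linear backoff on failure."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # Give up after the final attempt.
            time.sleep(delay * (attempt + 1))

# Usage (hypothetical): wrap the per-row model call.
# responses = feedback_df["Feedback"].apply(
#     lambda fb: call_with_retry(lambda: process_feedback(fb)))
```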
pd.set_option('display.max_colwidth', None)
feedback_df.join(pd.DataFrame(responses.to_list()))
StudentID | Feedback | Rating | Issue | Liked | |
---|---|---|---|---|---|
0 | 1 | Great class, learned a lot! But I really did not like PCA. | 5 | I really did not like PCA. | Great class, learned a lot! |
1 | 2 | The course was very informative and well-structured. Would prefer if lectures went faster. | 4 | Would prefer if lectures went faster. | The course was very informative and well-structured. |
2 | 3 | I found the assignments challenging but rewarding. But the midterm was brutal. | 4 | The midterm was brutal. | I found the assignments challenging but rewarding. |
3 | 4 | The lectures were engaging and the instructor was very knowledgeable. | 5 | None | The lectures were engaging and the instructor was very knowledgeable. |
4 | 5 | I struggled with the linear algebra. I would recommend this class to anyone interested in data science. | 5 | I struggled with the linear algebra. | I would recommend this class to anyone interested in data science. |
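With the issues and likes split into their own columns, ordinary pandas EDA applies. A sketch on stand-in data mirroring the table above (values copied from the output; the aggregation itself is illustrative):

```python
import pandas as pd

extracted = pd.DataFrame({
    "Issue": ["I really did not like PCA.",
              "Would prefer if lectures went faster.",
              "The midterm was brutal.",
              None,
              "I struggled with the linear algebra."],
    "Rating": [5, 4, 4, 5, 5],
})

# How many students raised an issue, and their average rating.
has_issue = extracted["Issue"].notna()
print(has_issue.sum())                            # → 4
print(extracted.loc[has_issue, "Rating"].mean())  # → 4.5
```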