Why this matters
As a Data Analyst, you will spend a lot of time loading, cleaning, and exploring data before any modeling or visualization. Pandas provides two essential building blocks for this: Series (one-dimensional labeled data) and DataFrame (two-dimensional labeled tabular data). Mastering their basics lets you quickly inspect data quality, select subsets, compute new columns, and prepare reports.
- Real tasks you will do: quickly preview incoming CSVs, pick specific rows/columns, compute summary metrics, and fix simple data issues (missing values, types).
- Outcome: you can confidently read data, understand its shape and labels, and perform first-pass checks in seconds.
Who this is for
- New or aspiring Data Analysts starting with pandas.
- Analysts coming from Excel who want to understand pandas tables.
- Anyone who needs reliable basics for further analysis and visualization.
Prerequisites
- Python basics: variables, lists/dicts, functions, and importing libraries.
- Ability to run code in a notebook (e.g., Jupyter) or a Python script.
Concept explained simply
Think of a DataFrame as a spreadsheet with row labels (index) and column labels (columns). Each column is a Series. A Series is like a labeled column of values.
Key terms, quickly
- Series: 1D labeled array (values + index).
- DataFrame: 2D table (columns + index). Each column is a Series.
- Index: row labels. Default is 0..N-1; you can set your own.
- Columns: column labels; typically strings.
- Shape: (rows, columns).
- Dtypes: data types of each column.
Core objects: Series and DataFrame
import pandas as pd
# Series examples
s1 = pd.Series([10, 20, 30]) # default index: 0,1,2
s2 = pd.Series([10, 20, 30], index=['a','b','c'])
# DataFrame examples
df1 = pd.DataFrame({'city': ['NY', 'LA', 'SF'], 'sales': [100, 120, 90]})
# Each column is a Series
df1['sales'] # Series
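To make the "each column is a Series" idea concrete, here is a minimal sketch that reuses the objects defined above. The variable names (by_label, by_position, sales, sales_frame) are illustrative only and not part of the lesson.

```python
import pandas as pd

# A Series carries its labels (the index) alongside its values.
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
by_label = s2['b']          # look up by label
by_position = s2.iloc[1]    # look up by integer position

df1 = pd.DataFrame({'city': ['NY', 'LA', 'SF'], 'sales': [100, 120, 90]})

# Single brackets return one column as a Series that shares the row index.
sales = df1['sales']
print(type(sales).__name__)            # Series
print(sales.index.equals(df1.index))   # True

# Double brackets keep the result two-dimensional (a one-column DataFrame).
sales_frame = df1[['sales']]
print(type(sales_frame).__name__)      # DataFrame
```

Single versus double brackets is worth noticing early, because it decides whether you get a Series or a DataFrame back.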
Inspecting structure
df1.shape # (3, 2)
df1.index # RangeIndex(start=0, stop=3, step=1)
df1.columns # Index(['city', 'sales'], dtype='object')
df1.dtypes # city: object, sales: int64
# Quick look
df1.head(2)
df1.tail(2)
df1.info() # non-null counts and dtypes
Creating Series and DataFrames
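# From a list: a Series with the default index
pd.Series([1,2,3])
# From a dict of lists: keys become column names
pd.DataFrame({'A':[1,2], 'B':[3,4]}) # lists must be equal length
# From a list of dicts: one dict per row
rows = [
    {'city':'NY', 'sales':100},
    {'city':'LA', 'sales':120}
]
pd.DataFrame(rows)
# Use an existing column as the row labels
df = pd.DataFrame({'id':[101,102], 'name':['A','B']})
df = df.set_index('id') # now rows labeled 101, 102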
Selecting data (loc vs iloc)
- loc: label-based selection (uses index/column names).
- iloc: position-based selection (uses integer positions).
df = pd.DataFrame({
'city':['NY','LA','SF','SEA'],
'sales':[100,120,90,110]
}, index=['n1','l2','s3','s4'])
# loc - by labels
df.loc['s3', 'sales'] # 90
df.loc[['n1','s4'], ['city']] # rows n1,s4 and the city column
# iloc - by positions
df.iloc[2, 1] # 90 (3rd row, 2nd column)
df.iloc[0:2, 0:1] # first 2 rows, first column
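One detail the selections above do not show is that slices behave differently: a loc slice includes its end label, while an iloc slice excludes its end position, as in ordinary Python slicing. A small sketch on the same DataFrame (by_label and by_position are illustrative names):

```python
import pandas as pd

df = pd.DataFrame(
    {'city': ['NY', 'LA', 'SF', 'SEA'], 'sales': [100, 120, 90, 110]},
    index=['n1', 'l2', 's3', 's4'],
)

# Label slice: both endpoints are included.
by_label = df.loc['n1':'s3', 'sales']
print(list(by_label.index))      # ['n1', 'l2', 's3']

# Position slice: the stop position is excluded.
by_position = df.iloc[0:2, 1]
print(list(by_position.index))   # ['n1', 'l2']
```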
Worked examples
Example 1: Preview new data quickly
import pandas as pd
orders = pd.DataFrame({
'order_id':[1,2,3,4],
'country':['US','US','UK','DE'],
'price':[12.5, 8.0, 15.0, 7.5]
})
print(orders.shape) # (4, 3)
print(orders.columns) # Index(['order_id','country','price'], dtype='object')
print(orders.head(2))
Why it helps: in one glance, you know size, fields, and example rows for sanity checks.
Example 2: Select a column and compute a derived one
orders['price_with_tax'] = orders['price'] * 1.2
avg_price = orders['price'].mean()
Result: a new column (Series arithmetic is vectorized) and a quick metric.
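For reference, a self-contained version of this example could look like the sketch below. It rebuilds the orders table from Example 1 so it runs on its own; the 1.2 multiplier is the one used above.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'country': ['US', 'US', 'UK', 'DE'],
    'price': [12.5, 8.0, 15.0, 7.5],
})

# The multiplication is applied to every row at once; no loop is needed.
orders['price_with_tax'] = orders['price'] * 1.2

# Aggregations such as mean() reduce a Series to a single number.
avg_price = orders['price'].mean()
print(avg_price)   # 10.75
print(orders[['price', 'price_with_tax']])
```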
Example 3: Label vs position selection
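# With order_id as the index, loc looks rows up by label
orders_idx = orders.set_index('order_id')
price_3 = orders_idx.loc[3, 'price'] # row labeled 3: 15.0
# Position-based: counts rows and columns from 0, ignoring labels
third_row_second_col = orders_idx.iloc[2, 1] # 15.0 (3rd row, 2nd column: price)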
Use loc when labels matter (safer, clearer), iloc for positional slicing.
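The difference is easiest to see when the labels are integers that do not start at 0. In the sketch below, the same number 3 points at different rows depending on the accessor; by_label, by_position, and out_of_bounds are illustrative names.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'country': ['US', 'US', 'UK', 'DE'],
    'price': [12.5, 8.0, 15.0, 7.5],
})
orders_idx = orders.set_index('order_id')
price_pos = orders_idx.columns.get_loc('price')

by_label = orders_idx.loc[3, 'price']          # order_id 3
by_position = orders_idx.iloc[3, price_pos]    # 4th row, which is order_id 4
print(by_label)      # 15.0
print(by_position)   # 7.5

# order_id 4 exists as a label, but position 4 does not (positions run 0..3).
out_of_bounds = False
try:
    orders_idx.iloc[4]
except IndexError:
    out_of_bounds = True
print(out_of_bounds)  # True
```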
Practice: Your turn
Complete the exercises below.
- Exercise 1: Create Series/DataFrame and inspect shape, index, columns, and dtypes.
- Exercise 2: Practice loc/iloc selection and simple filtering.
Exercise 1 — instructions
- Create a Series for daily visitors: [120, 135, 90] with index ['Mon','Tue','Wed'].
- Create a DataFrame with columns 'city' and 'sales': cities ['NY','LA','SF','SEA'] and sales [100,120,90,110].
- Print shape, index, columns, and dtypes for the DataFrame. Print the Series' index and the value for 'Tue'.
Expected: shape (4,2); correct index/columns; 'Tue' value is 135.
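If you want to compare your attempt afterwards, one possible approach is sketched here. The variable names visitors and df are my own choice; the exercise does not prescribe them.

```python
import pandas as pd

# Series with custom day labels
visitors = pd.Series([120, 135, 90], index=['Mon', 'Tue', 'Wed'])

# DataFrame from a dict of equal-length lists
df = pd.DataFrame({
    'city': ['NY', 'LA', 'SF', 'SEA'],
    'sales': [100, 120, 90, 110],
})

print(df.shape)         # (4, 2)
print(df.index)
print(df.columns)
print(df.dtypes)

print(visitors.index)
print(visitors['Tue'])  # 135
```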
Exercise 2 — instructions
- Using the DataFrame from Exercise 1, set the index to 'city'.
- With loc, select the 'sales' value for 'SF'.
- With iloc, select the same value by position.
- Filter rows where sales >= 110 and show only the 'sales' column.
Expected: 'SF' is 90; filtered rows should include 'LA' (120) and 'SEA' (110).
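The filtering step relies on a boolean mask, which has not been shown yet in this lesson. The sketch below is one possible way to approach the exercise, not the only one; mask and high are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'LA', 'SF', 'SEA'],
    'sales': [100, 120, 90, 110],
}).set_index('city')

# Label-based: row 'SF', column 'sales'
sf_by_label = df.loc['SF', 'sales']

# Position-based: 'SF' is the 3rd row; 'sales' is now the only column (position 0)
sf_by_position = df.iloc[2, 0]

# A comparison on a Series yields a True/False Series (a boolean mask).
mask = df['sales'] >= 110
# loc accepts the mask for rows; the list ['sales'] keeps the result a DataFrame.
high = df.loc[mask, ['sales']]
print(high)
```

Note that after set_index('city') the city values are row labels rather than a column, which is why 'sales' sits at column position 0.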
Common mistakes and self-check
- Confusing loc and iloc: loc uses labels; iloc uses positions. If your index is not 0..N-1, passing label values to iloc returns the wrong rows or raises an IndexError.
- Expecting unequal list lengths to work: DataFrame from dict-of-lists requires equal lengths; otherwise pandas raises a ValueError.
- Forgetting the index after set_index: After df.set_index('col'), rows are labeled by that column; refer to labels, not old row numbers.
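- Misreading dtypes: an object dtype usually means text. Numbers stored as text are not added arithmetically (summing strings typically concatenates them), so convert them first, for example with pd.to_numeric. Self-check with df.dtypes and df.info().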
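Two of the mistakes above are easy to reproduce on purpose. The following sketch uses made-up data; the names raw, qty_num, and unequal_ok are hypothetical.

```python
import pandas as pd

# Unequal list lengths: the dict-of-lists constructor raises ValueError.
unequal_ok = True
try:
    pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5]})
except ValueError as exc:
    unequal_ok = False
    print('ValueError:', exc)

# Numbers stored as text: convert before doing arithmetic.
raw = pd.DataFrame({'qty': ['3', '1', '5']})
print(raw.dtypes)                            # qty is a text dtype, not a number
raw['qty_num'] = pd.to_numeric(raw['qty'])   # parse the strings into numbers
print(raw['qty_num'].sum())                  # 9
```

pd.to_numeric raises an error by default if a value cannot be parsed, which is usually what you want during a first-pass check.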
Self-check tips
- Print df.shape before/after operations to ensure row/column counts are as expected.
- Print df.index and df.columns to verify labels align with your selection method.
- Use df.head() after creating new columns to confirm values look correct.
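These checks can be folded into a short routine, as sketched below. The derived column sales_k is a hypothetical example chosen only to have something to verify.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'LA', 'SF', 'SEA'],
    'sales': [100, 120, 90, 110],
})

shape_before = df.shape
df['sales_k'] = df['sales'] / 1000   # hypothetical derived column
shape_after = df.shape

# Adding a column should keep the row count and add exactly one column.
assert shape_after == (shape_before[0], shape_before[1] + 1)

print(df.index)     # confirm the row labels are what you expect
print(df.columns)   # confirm the new column name is present
print(df.head())    # eyeball a few values
```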
Practical projects (small)
- Retail snapshot: Build a DataFrame from a small dict (standing in for a CSV), compute revenue per row (price * qty) and its total, and show the top 3 rows.
- Temperature log: Create a Series with day labels and temperatures; compute mean, min, and day of max temperature (idxmax()).
- Mini catalog: Build a DataFrame with id, name, category; set index to id; practice loc selections by id.
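As a starting point for the temperature log, here is a minimal sketch. The temperature values are invented for illustration.

```python
import pandas as pd

# Made-up readings, labeled by day
temps = pd.Series([18.5, 21.0, 19.5], index=['Mon', 'Tue', 'Wed'])

print(temps.mean())     # average temperature
print(temps.min())      # 18.5
print(temps.idxmax())   # Tue (the label of the maximum, not the value)
print(temps.max())      # 21.0
```

idxmax() returns the index label, so with day labels it answers "which day was hottest" directly.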
Learning path
- This subskill: DataFrame and Series basics (you are here).
- Data loading and saving: read_csv, read_excel, to_csv.
- Selection and filtering deep dive: boolean masks, query, isin.
- Data cleaning: handling missing values, type conversion, renaming.
- Aggregation: groupby, pivot tables, descriptive stats.
Mini challenge
Create a DataFrame with columns: product ['A','B','C','D'], price [10,15,7,12], qty [3,1,5,2].
- Add a new column 'revenue' = price * qty.
- Set index to 'product'.
- Using loc, get the revenue for 'C'. Using iloc, select the 'price' column for the first two rows.
- What's the shape and dtypes?
Solution
import pandas as pd
df = pd.DataFrame({
'product':['A','B','C','D'],
'price':[10,15,7,12],
'qty':[3,1,5,2]
})
df['revenue'] = df['price'] * df['qty']
df = df.set_index('product')
rev_c = df.loc['C','revenue']
first_two_prices = df.iloc[0:2, df.columns.get_loc('price')]
print(df.shape)
print(df.dtypes)
Next steps
- Repeat the exercises with a different small dataset to build fluency.
- Move on to selection/filtering patterns with boolean masks and conditions.
- Make head(), info(), and dtypes a habit whenever you load new data.