1.2. Getting Started#
Before diving into data science concepts, we need to set up a few essential tools. These tools will be used throughout this resource, in both labs and projects. We’ll go through the basics of each tool and explain how to set them up locally on your system.
1.2.1. Python#
This course uses Python as the primary programming language. The easiest way to install Python is from its official website. Download the version compatible with your operating system.
Once installed, verify your setup by running a simple “Hello World” program.
On Windows, open Command Prompt or PowerShell; on macOS/Linux, open a terminal. Launch the Python interpreter:

```shell
python
```

(If `python` is not found on macOS/Linux, try `python3`.) Then, inside the interpreter, run:

```python
print("Hello World!")
```
If you see the output, your Python installation is successful.
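You can also confirm the installation without opening the interpreter by running Python directly from the shell. On many macOS/Linux systems the command is `python3` rather than `python`:

```shell
# Print the installed Python version
python --version

# Run a one-line program without opening the interactive interpreter
python -c 'print("Hello World!")'
```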
1.2.2. UV#
Python packages are reusable libraries that extend the language’s built-in functionality, letting developers build powerful applications without writing everything from scratch.
UV is a modern, high-performance Python package manager. It lets you easily install and manage packages within your projects, giving access to a wide range of pre-built libraries. UV is substantially faster than older tools such as pip and conda and has been widely adopted across industry and open-source communities. Because it accepts pip-style commands, it serves as a near drop-in replacement.
Install UV following the official documentation.
Verify your installation by running:

```shell
uv --version
```
Note
If you prefer using another package manager such as pip, you can continue doing so. We’ll use pip-style installation commands that are compatible with UV.
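After installing a package (with UV or pip), a quick sanity check is to ask Python whether the package is visible in the active environment. The helper below is a small illustration using only the standard library; the function name `is_installed` and the package names are ours, not part of UV:

```python
import importlib.util

def is_installed(package_name):
    """Return True if the package can be found on the current import path."""
    return importlib.util.find_spec(package_name) is not None

print(is_installed("json"))         # stdlib module: prints True
print(is_installed("no_such_pkg"))  # not installed: prints False
```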
1.2.3. Jupyter Notebook#
Jupyter Notebooks are an interactive environment for running Python code in small chunks. They are extremely popular in data science due to their support for visualizations, inline documentation, and iterative experimentation.
To install Jupyter, run:

```shell
uv pip install notebook
```

Then start Jupyter with:

```shell
uv run -m jupyter notebook
```
The Jupyter interface should open in a browser window. For detailed instructions, visit the official Jupyter installation guide.
Since we’ll be using Jupyter extensively, it’s recommended to review the Jupyter Notebook basics to get comfortable with its interface.
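To get a feel for the notebook workflow, try pasting a small cell like the one below and running it. The data values are made up for illustration, and the code uses only the standard library:

```python
# A typical first notebook cell: quick, iterative exploration of a small dataset.
import statistics

temperatures = [21.5, 22.0, 19.8, 23.1, 20.4]  # sample readings, invented for this demo
mean = statistics.mean(temperatures)
spread = statistics.stdev(temperatures)
print(f"mean={mean:.2f}, stdev={spread:.2f}")
```

In a notebook you would typically tweak the data or the computation and re-run the cell, which is exactly the iterative loop Jupyter is designed for.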
1.2.4. VS Code (IDE)#
We recommend Visual Studio Code (VS Code) as your Integrated Development Environment (IDE). It’s lightweight, widely used in the industry, and provides excellent support for Python and Jupyter notebooks.
Download VS Code from the official website. After installation, open VS Code and open a new folder to serve as your workspace, verifying that everything works.
Next, install the Jupyter extension for VS Code.
Create a new file named `test.ipynb`, and try running a cell with the following code:

```python
print("Hello World!")
```
You should see the output inline, confirming that your setup works.
1.2.5. Git#
Git is a distributed version control system used to track code changes and collaborate on projects. It enables developers to maintain a full history of their work, revert to earlier versions, and manage parallel development through branches.
We will use Git through its integration in VS Code. For a clear overview of how this works, see the VS Code source control guide.
Note
It is good practice to keep a special file named `.gitignore` at the root of the repository. It lists exact names or patterns of files and folders that should not be added to the Git repo. This keeps unwanted files out and keeps the repository lean.
Note
Generally, large datasets are also not committed to Git repos; instead, they are downloaded from an external source (via a script or a documented link), which keeps the repository small while preserving reproducibility.
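As an illustration, a minimal `.gitignore` for a Python data science project might look like the following; the entries are examples, so adapt them to your own project:

```
# Virtual environments and caches
.venv/
__pycache__/
.ipynb_checkpoints/

# Secrets
.env

# Large datasets kept out of version control
data/
```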
1.2.6. GitHub#
GitHub is an online platform for hosting and sharing Git repositories. It supports collaboration, code review, and centralized project management.
Create an account at github.com and explore its interface.
We will use GitHub through VS Code’s built-in support for Git remotes. The VS Code source control guide also covers this integration.
1.2.7. Secrets Management#
In the data science lifecycle, we often rely on external services such as APIs for datasets, language models, or web hosting. These services typically require authentication, commonly provided through an API key that verifies your identity with each request.
Because API keys are often tied to billing accounts or privileged access, they must be handled securely. Never expose or store secrets directly in your codebase or notebooks, as this can lead to unauthorized access or financial loss.
Here’s a safe and organized approach to manage secrets:
1. **Identify required secrets.** List all external services your project depends on. For example, if you are using OpenAI’s API, you will need an API key such as `OPENAI_API_KEY`.

2. **Create a `.env.sample` file.** This file serves as a template listing all required environment variables, but without actual values:

   ```
   OPENAI_API_KEY=
   ```

3. **Create your personal `.env` file.** Each user copies `.env.sample` to `.env` and fills in their own values:

   ```
   OPENAI_API_KEY=sk-your-key-here
   ```

4. **Load environment variables in code.** Install the `python-dotenv` package to automatically load secrets from `.env` into your environment:

   ```shell
   uv pip install python-dotenv
   ```

   Then, at the top of your Python script or notebook:

   ```python
   from dotenv import load_dotenv

   load_dotenv()
   ```

5. **Access secrets in your code.** Use the `os` module to retrieve secrets when needed:

   ```python
   import os

   api_key = os.environ.get("OPENAI_API_KEY", "MISSING")
   ```

6. **Ignore secrets in version control.** Add the `.env` file to `.gitignore` to prevent it from being tracked in Git:

   ```
   .env
   ```

   Commit your source code and `.env.sample`, but never the actual `.env` file.
This workflow ensures security, reproducibility, and clean collaboration when working with sensitive credentials.
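To see concretely what `load_dotenv()` does, here is a stdlib-only sketch that parses `KEY=VALUE` lines from an env file into `os.environ`. In real projects, prefer `python-dotenv` itself; the file name `demo.env` and the placeholder key below are purely illustrative:

```python
import os
from pathlib import Path

def load_env_file(path=".env"):
    """Naive illustration of what python-dotenv's load_dotenv() does."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that isn't KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault: real environment variables take precedence over the file
        os.environ.setdefault(key.strip(), value.strip())

# Demo with a throwaway file; a real project would use .env (and never commit it).
Path("demo.env").write_text("OPENAI_API_KEY=sk-demo-not-a-real-key\n")
load_env_file("demo.env")
print(os.environ.get("OPENAI_API_KEY", "MISSING"))
```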