Data Acquisition

3. Data Acquisition#

Data Science Flower

The first and most essential step in any data science workflow is acquiring the data. Whether it’s downloading files from the internet, fetching information from APIs, scraping data from the web, accessing sensors in real-time, or connecting to structured databases hosted in the cloud or on local servers, the ability to gather data effectively sets the stage for every subsequent analysis.

In this chapter, we will explore the foundations of data acquisition. We’ll begin by working with static sources of data stored in files like Excel, CSV, etc and then we’ll move to unstructured and dynamic sources of data such as live sensors, web APIs, or streaming inputs. These sources present exciting opportunities and unique challenges, especially when it comes to cleaning and structuring the data for analysis.

After acquiring the data, we need to store it somewhere. Where we store the data forms the foundation for everything that follows, so it must be chosen with future use cases in mind. We will shift our focus to conventional storage solutions, including relational databases and cloud-based repositories, which remain at the core of many enterprise-level data science pipelines. You’ll learn how to connect to these systems, query data efficiently, and prepare it for downstream tasks.

By the end of this chapter, you’ll have a practical understanding of how to locate, access, and retrieve data from a wide range of environments.

This chapter is structured as follows: