Where Do We Get Our Data? A Tour of Data Sources (with Examples)


Image by Author | Ideogram

 

Data is the lifeblood of many data professionals, such as data scientists, data engineers, and AI specialists. Without it, we cannot do our work properly or bring value to the business.

However, the data we process must also be relevant to the business problem we are trying to solve. As the saying “garbage in, garbage out” goes, poor input data produces poor output. That’s why the quality and origin of our data determine the quality of our work.

As data professionals, we need to pay attention to where our data comes from, because sources differ in coverage, format, level of detail, and bias, and not every source contains the information needed to solve a given problem. This article explores the main data sources you should know for your data work.

 

Public and Open Data Sources

 
The most easily obtained data comes from datasets that are already public and free for everyone to access. These sources are often maintained by governments or publicly funded organizations, which have an interest in offering reliable datasets to the public.

Open data sources are valuable to many data practitioners because they are often well documented and large in scale. They can provide insight or training data with few licensing barriers. Moreover, open data supports research worldwide, including the development of LLMs.

There are several types of open data sources, which we will explore below.
 

Government Open Data

National and local governments often publish statistical data to promote transparency and drive innovation. To make these data publicly accessible, governments usually aggregate them in a single portal, such as Data.gov in the United States or the European Union’s open data portal.

For example, the Data.gov portal provides access to the open data published by the U.S. government.
 
 
These portals provide easy access to government-maintained data; you only need to search for the datasets that are useful for your work. For instance, you can start by browsing the most viewed datasets.
 
 
All the available datasets are listed for us to acquire and use. Selecting one of the dataset links opens its detail page.
 
 
All the information we need about the dataset and its source is compiled on one page. Given how informative the portals are and how easy it is to acquire the data, government open data is a source we shouldn’t overlook.
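
Portal searches like the one described above can often be scripted as well. The following is a minimal sketch, assuming the Data.gov catalog exposes the standard CKAN `package_search` action at the URL shown; the URL and the helper names (`build_search_url`, `summarize_results`, `search_datasets`) are illustrative assumptions, so check the portal's current API documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Data.gov catalog runs on CKAN and exposes the standard
# `package_search` action at this address. Verify before use.
CATALOG_URL = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5):
    """Return the search URL for a free-text query, limited to `rows` results."""
    return CATALOG_URL + "?" + urlencode({"q": query, "rows": rows})


def summarize_results(payload):
    """Extract each dataset's title and its distinct resource formats."""
    if not payload.get("success"):
        raise ValueError("catalog search did not succeed")
    summaries = []
    for dataset in payload["result"]["results"]:
        formats = sorted(
            {r["format"] for r in dataset.get("resources", []) if r.get("format")}
        )
        summaries.append({"title": dataset.get("title", ""), "formats": formats})
    return summaries


def search_datasets(query, rows=5):
    """Run the search over HTTP (needs network access) and summarize it."""
    with urlopen(build_search_url(query, rows), timeout=30) as response:
        return summarize_results(json.load(response))
```

Calling `search_datasets("air quality")` would return a short list of titles and file formats, which is usually enough to decide which dataset page to open next.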
 

Research and Community Data Sources

Governments are not the only ones maintaining open data sources; many research groups and communities do as well. These sources are often free to access and offer more variety than government data. However, since they are maintained by the community, we must still validate their quality and usage licenses.

Examples of research and community data sources include Kaggle, the UCI Machine Learning Repository, Hugging Face Datasets, and many more.

For example, the UCI Machine Learning Repository lists all of its open public datasets on its website.
 
 
You can select one of the datasets to see all the necessary information about it and download it.
 
 
Kaggle also hosts open datasets; however, most of its data comes from the community, and anyone can upload their own. Visit its datasets page to browse the community’s datasets or add your own.
 
 
Open research and community data sources are a good place to find datasets across various domains that are hard to find otherwise.
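
Because community-maintained datasets still need a quality check before use, a quick profile of a downloaded file is a sensible first step. The sketch below uses only the Python standard library; the function name `profile_csv` and the choice of checks (row count, exact duplicate rows, empty cells per column) are illustrative assumptions, not a complete validation procedure.

```python
import csv
import io


def profile_csv(text):
    """Count rows, exact duplicate rows, and empty cells per column."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    seen = set()
    rows = 0
    duplicates = 0
    for record in reader:
        rows += 1
        # A row is a duplicate when every column matches an earlier row.
        key = tuple(record.get(name) for name in columns)
        if key in seen:
            duplicates += 1
        seen.add(key)
        for name in columns:
            value = record.get(name)
            if value is None or value.strip() == "":
                missing[name] += 1
    return {"rows": rows, "duplicates": duplicates, "missing": missing}
```

For a file on disk, read its contents first, for example `profile_csv(open("data.csv", encoding="utf-8").read())`, where `data.csv` is a placeholder path. A high duplicate or missing count is a signal to inspect the dataset's documentation and license more closely before building on it.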
 

International Organizations

Many international organizations maintain data sources for various use cases, such as economics, health, and population. Examples of global organizations with open data sources include the World Bank (through World Bank Open Data) and the World Health Organization (WHO).

The World Bank Open Data allows us to search and download various data related to global development.
 
 
The data here is similar to that of government data sources, but it is curated and maintained by an international organization rather than an individual country.
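
World Bank data can also be retrieved programmatically. The following is a minimal sketch, assuming the World Bank Indicators API (v2) URL pattern and its two-element JSON response (metadata first, records second); the helper names are hypothetical, and the parser assumes annual data where each `date` is a four-digit year.

```python
import json
from urllib.request import urlopen

# Assumption: base address of the World Bank Indicators API, version 2.
BASE_URL = "https://api.worldbank.org/v2"


def build_indicator_url(country, indicator, per_page=100):
    """Build the URL for one indicator in one country, returned as JSON."""
    return (
        f"{BASE_URL}/country/{country}/indicator/{indicator}"
        f"?format=json&per_page={per_page}"
    )


def parse_indicator(payload):
    """Convert a [metadata, records] response into sorted (year, value) pairs.

    Records whose value is null are skipped.
    """
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return []
    points = [
        (int(record["date"]), record["value"])
        for record in payload[1]
        if record["value"] is not None
    ]
    return sorted(points)


def fetch_indicator(country, indicator):
    """Download and parse one indicator series (needs network access)."""
    with urlopen(build_indicator_url(country, indicator), timeout=30) as response:
        return parse_indicator(json.load(response))
```

For example, `fetch_indicator("ID", "SP.POP.TOTL")` would request the total population series for Indonesia, assuming that indicator code is still current.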
 

APIs for Data Access

 
APIs have played a significant role as a data source in the current data era. Many companies and platforms expose their APIs, which allow the public to retrieve data on demand. This approach enables real-time data integration and is much more manageable than downloading static files.
 

Social Media API

Many popular social media platforms provide APIs for developers to access the public content shared on their platforms. For example, X and Reddit both provide APIs that we can use to get that data, subject to their access tiers and terms of use.

For example, the X developer API documentation helps us navigate and acquire the data we need.
 
 
With the X API, you can get data on public posts, users, engagement, and more. Use it responsibly: even though the content is public, it can still contain personal data.
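
One way to handle such data more carefully is to pseudonymize user identifiers and drop fields you do not need before storing anything. The sketch below is only an illustration: the field names (`author_id`, `username`, `email`, `location`) are hypothetical and do not correspond to any specific platform's response schema, and keyed hashing alone does not make a dataset anonymous.

```python
import hashlib
import hmac


def pseudonymize(user_id, secret_key):
    """Return a stable keyed hash (HMAC-SHA256) of a user identifier."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_personal_fields(post, secret_key, drop=("username", "email", "location")):
    """Return a copy of `post` without the `drop` fields and with a hashed author.

    The field names here are hypothetical placeholders; adapt them to the
    structure of the data you actually collect.
    """
    cleaned = {key: value for key, value in post.items() if key not in drop}
    if "author_id" in cleaned:
        cleaned["author_id"] = pseudonymize(str(cleaned["author_id"]), secret_key)
    return cleaned
```

Because the same key always yields the same hash, posts by one author can still be grouped for analysis without keeping the original identifier. Keep the secret key outside the dataset itself.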
 

Financial Data API

Even without buying commercial data, we can get financial data through public APIs. Data such as stock prices and company financial information are often already displayed on public platforms, but acquiring them programmatically or in real time usually requires an API.

Prominent financial data APIs include the Yahoo Finance API and Alpha Vantage. Here is the Alpha Vantage platform for acquiring financial data.
 
 
You can request a free API key, which you can use to access the financial data your business application needs, within the limits of the free tier.
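
Once you have a key, a request for daily prices might look like the sketch below. It assumes Alpha Vantage's `TIME_SERIES_DAILY` function and the response layout documented for it (a `"Time Series (Daily)"` object whose bars contain a `"4. close"` field); the helper names are illustrative, and you should confirm the field names against the current documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Alpha Vantage's query endpoint.
QUERY_URL = "https://www.alphavantage.co/query"


def build_daily_url(symbol, api_key):
    """Build the request URL for a symbol's daily time series."""
    params = {"function": "TIME_SERIES_DAILY", "symbol": symbol, "apikey": api_key}
    return QUERY_URL + "?" + urlencode(params)


def parse_daily_closes(payload):
    """Return (date, closing price) pairs sorted by date."""
    series = payload.get("Time Series (Daily)")
    if series is None:
        # Error and rate-limit responses arrive as JSON without the series key.
        raise ValueError(f"no daily series in response: {list(payload)}")
    return sorted((day, float(bar["4. close"])) for day, bar in series.items())


def fetch_daily_closes(symbol, api_key):
    """Download and parse daily closes (needs network access and a valid key)."""
    with urlopen(build_daily_url(symbol, api_key), timeout=30) as response:
        return parse_daily_closes(json.load(response))
```

Raising an error when the series is missing makes rate-limit or invalid-key responses visible early instead of silently producing an empty dataset.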
 

Geospatial API

Another data source we can use is geospatial APIs. Geospatial data is data related to location, such as coordinates, addresses, traffic, and many other things. These data are helpful for many business use cases, especially those that depend on location.

We can access geospatial data through a few platforms, including the Google Maps API and OpenStreetMap. The respective platforms maintain these data and have their own access requirements.

For example, we can obtain API keys for the Google Maps API via the Google Cloud Platform.
 
 
Try experimenting with the APIs to see whether the data you need is available.
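
As one way to experiment without an API key, the sketch below geocodes an address with Nominatim, the search service built on OpenStreetMap data. It assumes the public `search` endpoint and its JSON output, in which `lat` and `lon` are returned as strings; the helper names and the `user_agent` value are illustrative. The public instance has a usage policy (including identifying your application and limiting request rates), so read it before sending requests.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the public Nominatim search endpoint.
SEARCH_URL = "https://nominatim.openstreetmap.org/search"


def build_geocode_request(address, user_agent):
    """Build a request for the best match, identifying the calling application."""
    query = urlencode({"q": address, "format": "json", "limit": 1})
    return Request(SEARCH_URL + "?" + query, headers={"User-Agent": user_agent})


def parse_first_match(results):
    """Return the name and numeric coordinates of the first result, or None."""
    if not results:
        return None
    top = results[0]
    return {
        "name": top["display_name"],
        "lat": float(top["lat"]),
        "lon": float(top["lon"]),
    }


def geocode(address, user_agent):
    """Look up an address over HTTP (needs network access)."""
    request = build_geocode_request(address, user_agent)
    with urlopen(request, timeout=30) as response:
        return parse_first_match(json.load(response))
```

A call such as `geocode("Jakarta", "my-data-project")` would return a dictionary with the matched place name and its latitude and longitude, or `None` when nothing matches.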
 

Synthetic Data

 
Sometimes, the data you need doesn’t exist or can’t be used due to privacy concerns; this is where synthetic data comes in. Synthetic data is artificially generated data that mimics the real thing (statistically or structurally) and can be used more freely.

We use synthetic data in many scenarios, including cases where real data for a specific business problem is scarce or imbalanced. In the era of generative AI, it has become even more popular because obtaining sufficient training data for models is challenging.

There are many ways to acquire synthetic data, such as using an LLM, open-source algorithms, or a commercial service. Each has its own advantages.

For example, we could use the free Synthetic Data Generator from Argilla, which uses an LLM and is hosted on Hugging Face Spaces.
 
 
Using this generator, we can create a synthetic dataset that mimics real-world data and is useful for downstream tasks.
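
To illustrate what mimicking the real data statistically can mean at its simplest, here is a minimal standard-library sketch. It is not how the Argilla generator works; it is a basic sampling approach that assumes a numeric column is roughly normally distributed and that categorical labels can be resampled by frequency. The function names are hypothetical.

```python
import random
import statistics


def synthesize_numeric(real_values, n, seed=0):
    """Draw n values from a normal distribution fitted to real_values.

    Assumes the real column is roughly bell-shaped; skewed or bounded
    columns need a different model.
    """
    mean = statistics.mean(real_values)
    stdev = statistics.stdev(real_values)
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(n)]


def synthesize_categorical(real_labels, n, seed=0):
    """Sample n labels with replacement, following the observed frequencies."""
    rng = random.Random(seed)
    return rng.choices(real_labels, k=n)
```

Fixing the seed makes the output reproducible. This approach treats each column independently, so relationships between columns are lost; preserving them is one reason to use dedicated synthetic data tools instead.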
 

Conclusion

 
Data is the lifeblood of any data professional, as we cannot do our work without it. Acquiring relevant, high-quality data is essential before any preprocessing begins.

In this article, we have explored various places where we can get our data, including:

  • Public and Open Data Sources
  • APIs for Data Access
  • Synthetic Data

I hope this has helped!
 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
