Data science is one of the most promising and flourishing domains offering a plethora of job opportunities. We all know that the field is all about data and making the most of it. It requires individuals to possess a rich set of skills. Among the data science skills, SQL is a must-have.
SQL (Structured Query Language) is a query language used to communicate or interact with a relational database . It helps you store and manage data in a relational database management system (RDBMS). IBM Computer Scientists in the 1970s created this language and became the American National Standards Institute (ANSI) and International Organization for Standardization (ISO) standards in 1986 and 1987, respectively.
Today, Structured Query Language is the standard and de-facto language to store, manipulate, retrieve, and delete data from relational databases. Let us discuss how SQL is used in data science.
What is Data Science?
The simple definition is studying huge amounts of data to extract meaningful insights.
A typical data scientist has to collect data from various sources, then filter the data, manipulate it to suit a business problem or transform it to analyze it further, apply algorithms and statistical measures to it, and finally process and visualize data that can be then interpreted to put to use or perform further analysis on it.
The data that a data scientist receives are just chunks of raw or unstructured data. It can be in any form – text, image, sensor data from devices, etc. Through filtering, sorting, and other mechanisms, data is transformed into a more usable form that can lead to solutions for complex business problems.
Some typical examples of data science applications are recommendation systems for movies and online purchases, inventory tracking and update for manufacturing units, and speech and image recognition. No wonder data science has set its foot in almost all domains, whether e-commerce, healthcare, finance, logistics, manufacturing, or banking.
Check out in detail: What is Data Science?
Need for SQL in Data Science
As discussed above, Structured Query Language enables you to perform various operations on data stored in relational databases. You can create and manipulate tables, views, and procedures, add new records and update existing ones, and delete records. SQL commands help you perform all these operations. And data science is all about data.
Today, data-driven companies leverage databases to store collected data. SQL comes in handy to extract data from databases and work with it. Though many companies have adopted NoSQL technologies for data management, SQL still remains the first choice for numerous big data tools like Hadoop, business intelligence tools like Power BI, and in-office operations.
As a result, we can say that the Structured Query Language is useful in data science for:
- Managing structured data.
- Performing analytics operations on data stored in RDBMS, like MySQL, Oracle, SQL Server, etc.
- Data preparation and data wrangling .
Further, Structured Query Language plays a vital role in the different phases of data science, such as data collection, cleaning, modeling, and mining.
1. Data Cleaning and Conditioning
When we receive input data from various sources, it can contain missing values, redundant values, duplicates, null values, and so on. Also, the data is highly unstructured and raw. Using DBMS normalization , it is possible to design a proper database schema to structure the data by splitting it into tables and adding constraints, dependencies, and keys to indicate the relationships between the different tables.
Next, you can use SQL queries and run schema automatic batch scripts on this organized data to remove irrelevant or corrupt data without manual intervention. This makes the cleaning process faster and more reliable. We can also use R or Python for data cleaning.
2. Data Modeling and Mining
With a proper schema in place, model planning and building become easy. Having big amounts of unstructured data transformed into a structured form saves time and effort and makes querying the database easier and faster. By creating queries to get information, you can easily mine data to get patterns, trends, and relevant information in the least amount of time.
An added advantage is that SQL has specific features to aggregate, filter, sort, index, and display data based on many conditions. Also, you can join data from multiple tables and display only the required columns. Moreover, triggers help automate certain tasks and indexes to fetch data faster.
Basics of SQL for Data Science
1. SQL Commands
Commands are instructions to communicate with a database , i.e., perform various operations or run queries on a database. The following basic commands that every data professional should be well-versed at:
- DML (Data Manipulation Language) – DDL commands alter the structure of a table either by creating one and updating or deleting the existing one. These commands include CREATE, DROP, ALTER, and TRUNCATE. They are auto-committed, meaning they save changes permanently to the database.
- DDL (Data Definition Language) – DML commands modify databases and are not auto-committed.The commands in DML are INSERT, UPDATE, and DELETE.
- DCL (Data Control Language) – These commands control access to the data by granting and revoking permissions to different users. GRANT and REVOKE are DCL commands.
- TCL (Transaction Control Language) – These control the transactions and are used with DML commands, as they are not auto-committed. GRANT and REVOKE are two TCL commands.
- DQL (Data Query Language) – It has only one command called SELECT to fetch data from databases.
2. Aggregate Functions
An aggregate function performs operations on multiple rows/tuples or a single column/attribute of a table and returns a single value. Its primary function is to summarize data.
The following are the aggregate functions in Structured Query Language:
- SUM(): It returns the sum of a selected numeric column.
- COUNT(): It returns the number of rows in a database table.
- AVG(): It calculates the average of a set of values.
- MIN(): It returns the minimum value from a set of non-NULL values.
- MAX(): It returns the maximum value from a set of non-NULL values.
Learn about aggregate functions in detail here .
3. Grouping and Sorting
GROUP BY and ORDER BY are two clauses that help you group and sort your data. They help you view data in different groups or sort rows in a specific order.
- GROUP BY: Organizes similar data in one group.
- ORDER BY: Sorts data in ascending order by default. You need to specifiy DESC to sort data in descending order.
Check out : SQL ORDER BY and SQL Group By
4. Joins
Joins help you combine rows or tuples from two or more tables, depending upon a common field between them.
There are 4 types of SQL joins:
- INNER JOIN
- LEFT JOIN
- RIGHT JOIN
- FULL JOIN
Learn more about joins here .
While retrieving data from multiple tables using joins, some values may be missing, represented as NULL.
Not e: Please note that NULL is different from a zero value. Rather than being zero, NULL means the value does not exist.
5. Indexes
Indexes help to get results faster and optimize performance, especially when we join two or more tables. Particularly, during a full join, which does a cartesian product, each row of the left table is compared with each row of the right table (m x n). When there are millions of records, imagine the time it would take to execute a query. This is where the role of indexes comes into play. Think of them as the index you have at the back of books.
Indexing in SQL serves a similar purpose, i.e., making querying faster. If there are no indexes, a database has to look through each row to match a column. This results in making querying slower. If we create an index on the column, the lookup becomes highly efficient and fast. You can also think of an index as a pointer to the row with the value your query is searching for.
Learn more about indexes here .
6. Subqueries or Nested Queries
A nested query is a query inside a query. It is used to filter down further the criteria to search and select data more quickly. The output or data you get after executing the inner query serves as the input of the outer query.
Get to know more about nested queries here .
Data Analysis and SQL
Now, let’s get started with some basic examples to understand how much SQL one needs for data science. To keep things simple, we will take the example of a simple student database that stores information about students of an engineering college.
Consider the following small set of data:
student_id | student_name | date_of_birth | department | hobby | total_marks |
1 | John | 04-03-1984 | CSE | singing | 94 |
2 | Mark | 17-02-1985 | CSE | reading | 92 |
3 | Joe | 04-04-1985 | ECE | reading | 91 |
4 | Richard | 03-11-1984 | EEE | photography | 93.5 |
5 | Samantha | 08-08-1983 | IT | dancing | 86 |
6 | Janice | 14-05-1984 | IT | watching movies | 84 |
-
Choice of Data
Let us say we want to see the details of all the students. To do so, we need to use the following query:
select * from student;
This will give all the records present in the student database. However, if we want to display only certain records, use the following query:
select student_name, department
from student;
This will display the student's name and the corresponding department.
We can use different types of expressions to present data in different ways.
For example:
select student_id + ‘—’ + department as roll_number
from student;
This will fetch the following result:
roll_number
-----------
1-CSE
2-CSE
3-ECE
4-EEE
5-IT
6-IT
Note that roll_number is just a temporary variable that stores the result. It is not a column.
-
Filtering the Most Relevant Data and Applying Aggregate Functions
How about finding students whose hobby is reading? Or those who belong to the ECE department? Or just the number of students who scored marks between 85 and 90? This is nothing but filtering data or retrieving only the desired one.
Filtering the required data with SQL is amazingly simple. Consider the following statement:
select * from student
where hobby = ‘reading’;
This fetches all the columns of the table where the hobby is reading.
Now, look at the following query.
select count(student_id)
from student
where marks between 85 and 90;
This fetches a single value, the count of students who secured marks between 85 and 90.
Next, we have a big query.
select student_name, department, marks
from student
group by department having marks > 90
order by marks desc;
Okay, this requires a detailed explanation, but first, let us see the result:
John CSE 94
Mark CSE 92
Richard EEE 93.5
Joe ECE 91
We see only those records with marks > 90 in descending order. Also, the group by clause ensures that the rows are grouped department-wise. That’s why we see 92 before 93.5.
Let us say we want to get the list of departments, but we don’t want the values to repeat. Since we have more than one value for some departments, this situation is likely. To avoid that, we use the keyword distinct .
select distinct(department)
from student;
How about we get the overall average marks of students in each department?
For this, we use the following query.
select avg(marks), department
from student
group by department;
We receive the following output.
93.0|CSE
91.0|ECE
93.5|EEE
85.0|IT
Isn't SQL powerful yet simple?
We can get any combination of data that we need. You can also use aggregate functions to retrieve useful trends, patterns, and other insights.
-
Identifying Missing Values
The data won’t be as nice and readable in real-world scenarios as in our examples above. It will have lots of missing values, incorrect values, etc. Databases have a unique way of identifying missing values. It is called NULL.
NULL values can cause differences in the measurements and final results. As we saw above, we filter data using where conditions.
For example:
select * from student
where department = ‘IT’;
or
select * from student
where department != ‘IT’;
With null values, these comparisons (= and !=) don’t work. To get a NULL value, we have to use IS and IS NOT, as shown below:
select * from student
where department IS NULL;
or
select * from student
where department IS NOT NULL;
That’s why NULL values will never appear in any result where we make regular comparisons leading to errors in data.
Imagine having a dataset with millions of records and thousands of NULL values. This will significantly impact data analysis. Thus, null values should be corrected to include default or zero values so that no columns (or data) are missed out.
-
Date Time Functions
Date time functions are useful to compare dates and analyze data based on different quarters, months, or years. There are many functions available that can convert data into any format. The timestamp data type includes the time in hours, minutes, and seconds.
For example, we can get only those records where the year of birth of the student is 1984 using the following query:
select year(date_of_birth), student_name
from student
where year(date_of_birth) = 1984;
Check out all the date-related functions here .
-
Combining Data from Multiple Tables
Real data is huge, complex, and split into multiple tables. That means getting the required collective information will need joining one or more tables and displaying the relevant columns from those tables. For this, SQL uses different types of joins, as we discussed above:
- Inner join – Gets only matching values in both tables.
- Left join – Returns all the records of the left table and matching records of the right table.
- Right join – Returns all the records of the right table and matching records of the left table.
- Full join – Returns all the records with a match on either the left or the right table.
- Self-join – Join of a table with itself.
Suppose we want to list the students who have the same marks. We do not have any such values in our above table right now.
Let us update Joe’s marks to 93.5:
update student set marks = 93.5
where student_name = ‘Joe’;
and,
select s1.student_name, s2.student_name
from student s1, student s2
where s1.marks = s2.marks and s1.student_name != s2.student_name;
This will now fetch – Joe|Richard. We used the same table to join itself in the above query.
Let us say there is another table, books_library, that stores details of students borrowing books from the library.
book_id | book_name | author | student_id | date_of_issue | date_of_return |
23 | Let us C++ | Yashwant K. | 1 | 23-01-2020 | 23-02-2020 |
24 | Head First Java | Kathy S. | 2 | 20-01-2020 | 20-02-2020 |
25 | Data Science from Scratch | Joel Grus | 3 | 19-12-2019 | 19-01-2020 |
26 | Headfirst Python | Paul B. | 4 | 12-12-2019 | 12-01-2020 |
If we want to get the details of which student bought which book, we can join the tables and display the desired columns as shown below:
select student.student_name, books_library.book_name, books_library.date_of_issue
from books_library inner join student on s.student_id = b.student_id;
The above query will yield the following output:
John |
Let us C++ |
23-01-2020 |
Mark |
Head First Java |
20-01-2020 |
Joe |
Data Science from scratch |
19-12-2019 |
Richard |
Headfirst Python |
12-12-2019 |
Joins are a powerful way to consolidate data and represent it differently.
Wrapping it Up!
These were the basics of SQL the data scientists need to know and use in their process. The language is helpful during data pre-processing, data analysis , and machine learning processes. Only learning these concepts is not enough. You need to practice them a lot to use them in the real world. A few popular websites that help you practice SQL are HackerRank, DataLemur, and SQLPad. Solve the practice questions from these websites, and take your skills to the next level.
People are also reading:
Leave a Comment on this Post