SQL for Data Science - A Complete Guide

Posted in /   /  

SQL for Data Science - A Complete Guide
ramyashankar

Ramya Shankar
Last updated on December 21, 2024

    Data science is one of the most promising and flourishing domains offering a plethora of job opportunities. We all know that the field is all about data and making the most of it. It requires individuals to possess a rich set of skills. Among the data science skills, SQL is a must-have.

    SQL (Structured Query Language) is a query language used to communicate or interact with a relational database . It helps you store and manage data in a relational database management system (RDBMS). IBM Computer Scientists in the 1970s created this language and became the American National Standards Institute (ANSI) and International Organization for Standardization (ISO) standards in 1986 and 1987, respectively.

    Today, Structured Query Language is the standard and de-facto language to store, manipulate, retrieve, and delete data from relational databases. Let us discuss how SQL is used in data science.

    What is Data Science?

    The simple definition is studying huge amounts of data to extract meaningful insights.

    A typical data scientist has to collect data from various sources, then filter the data, manipulate it to suit a business problem or transform it to analyze it further, apply algorithms and statistical measures to it, and finally process and visualize data that can be then interpreted to put to use or perform further analysis on it.

    The data that a data scientist receives are just chunks of raw or unstructured data. It can be in any form – text, image, sensor data from devices, etc. Through filtering, sorting, and other mechanisms, data is transformed into a more usable form that can lead to solutions for complex business problems.

    Some typical examples of data science applications are recommendation systems for movies and online purchases, inventory tracking and update for manufacturing units, and speech and image recognition. No wonder data science has set its foot in almost all domains, whether e-commerce, healthcare, finance, logistics, manufacturing, or banking.

    Check out in detail: What is Data Science?

    Need for SQL in Data Science

    As discussed above, Structured Query Language enables you to perform various operations on data stored in relational databases. You can create and manipulate tables, views, and procedures, add new records and update existing ones, and delete records. SQL commands help you perform all these operations. And data science is all about data.

    Today, data-driven companies leverage databases to store collected data. SQL comes in handy to extract data from databases and work with it. Though many companies have adopted NoSQL technologies for data management, SQL still remains the first choice for numerous big data tools like Hadoop, business intelligence tools like Power BI, and in-office operations.

    As a result, we can say that the Structured Query Language is useful in data science for:

    • Managing structured data.
    • Performing analytics operations on data stored in RDBMS, like MySQL, Oracle, SQL Server, etc.
    • Data preparation and data wrangling .

    Further, Structured Query Language plays a vital role in the different phases of data science, such as data collection, cleaning, modeling, and mining.

    1. Data Cleaning and Conditioning

    When we receive input data from various sources, it can contain missing values, redundant values, duplicates, null values, and so on. Also, the data is highly unstructured and raw. Using DBMS normalization , it is possible to design a proper database schema to structure the data by splitting it into tables and adding constraints, dependencies, and keys to indicate the relationships between the different tables.

    Next, you can use SQL queries and run schema automatic batch scripts on this organized data to remove irrelevant or corrupt data without manual intervention. This makes the cleaning process faster and more reliable. We can also use R or Python for data cleaning.

    2. Data Modeling and Mining

    With a proper schema in place, model planning and building become easy. Having big amounts of unstructured data transformed into a structured form saves time and effort and makes querying the database easier and faster. By creating queries to get information, you can easily mine data to get patterns, trends, and relevant information in the least amount of time.

    An added advantage is that SQL has specific features to aggregate, filter, sort, index, and display data based on many conditions. Also, you can join data from multiple tables and display only the required columns. Moreover, triggers help automate certain tasks and indexes to fetch data faster.

    Basics of SQL for Data Science

    1. SQL Commands

    Commands are instructions to communicate with a database , i.e., perform various operations or run queries on a database. The following basic commands that every data professional should be well-versed at:

    • DML (Data Manipulation Language) – DDL commands alter the structure of a table either by creating one and updating or deleting the existing one. These commands include CREATE, DROP, ALTER, and TRUNCATE. They are auto-committed, meaning they save changes permanently to the database.
    • DDL (Data Definition Language) – DML commands modify databases and are not auto-committed.The commands in DML are INSERT, UPDATE, and DELETE.
    • DCL (Data Control Language) – These commands control access to the data by granting and revoking permissions to different users. GRANT and REVOKE are DCL commands.
    • TCL (Transaction Control Language) – These control the transactions and are used with DML commands, as they are not auto-committed. GRANT and REVOKE are two TCL commands.
    • DQL (Data Query Language) – It has only one command called SELECT to fetch data from databases.

    2. Aggregate Functions

    An aggregate function performs operations on multiple rows/tuples or a single column/attribute of a table and returns a single value. Its primary function is to summarize data.

    The following are the aggregate functions in Structured Query Language:

    • SUM(): It returns the sum of a selected numeric column.
    • COUNT(): It returns the number of rows in a database table.
    • AVG(): It calculates the average of a set of values.
    • MIN(): It returns the minimum value from a set of non-NULL values.
    • MAX(): It returns the maximum value from a set of non-NULL values.

    Learn about aggregate functions in detail here .

    3. Grouping and Sorting

    GROUP BY and ORDER BY are two clauses that help you group and sort your data. They help you view data in different groups or sort rows in a specific order.

    • GROUP BY: Organizes similar data in one group.
    • ORDER BY: Sorts data in ascending order by default. You need to specifiy DESC to sort data in descending order.

    Check out : SQL ORDER BY and SQL Group By

    4. Joins

    Joins help you combine rows or tuples from two or more tables, depending upon a common field between them.

    There are 4 types of SQL joins:

    1. INNER JOIN
    2. LEFT JOIN
    3. RIGHT JOIN
    4. FULL JOIN

    Learn more about joins here .

    While retrieving data from multiple tables using joins, some values may be missing, represented as NULL.

    Not e: Please note that NULL is different from a zero value. Rather than being zero, NULL means the value does not exist.

    5. Indexes

    Indexes help to get results faster and optimize performance, especially when we join two or more tables. Particularly, during a full join, which does a cartesian product, each row of the left table is compared with each row of the right table (m x n). When there are millions of records, imagine the time it would take to execute a query. This is where the role of indexes comes into play. Think of them as the index you have at the back of books.

    Indexing in SQL serves a similar purpose, i.e., making querying faster. If there are no indexes, a database has to look through each row to match a column. This results in making querying slower. If we create an index on the column, the lookup becomes highly efficient and fast. You can also think of an index as a pointer to the row with the value your query is searching for.

    Learn more about indexes here .

    6. Subqueries or Nested Queries

    A nested query is a query inside a query. It is used to filter down further the criteria to search and select data more quickly. The output or data you get after executing the inner query serves as the input of the outer query.

    Get to know more about nested queries here .

    Data Analysis and SQL

    Now, let’s get started with some basic examples to understand how much SQL one needs for data science. To keep things simple, we will take the example of a simple student database that stores information about students of an engineering college.

    Consider the following small set of data:

    student_id student_name date_of_birth department hobby total_marks
    1 John 04-03-1984 CSE singing 94
    2 Mark 17-02-1985 CSE reading 92
    3 Joe 04-04-1985 ECE reading 91
    4 Richard 03-11-1984 EEE photography 93.5
    5 Samantha 08-08-1983 IT dancing 86
    6 Janice 14-05-1984 IT watching movies 84
    • Choice of Data

    Let us say we want to see the details of all the students. To do so, we need to use the following query:

    select * from student;

    This will give all the records present in the student database. However, if we want to display only certain records, use the following query:

    select student_name, department 
    from student;

    This will display the student's name and the corresponding department.

    We can use different types of expressions to present data in different ways.

    For example:

    select student_id + ‘—’ + department as roll_number 
    from student;

    This will fetch the following result:

    roll_number
    
    -----------
    
    1-CSE
    2-CSE
    3-ECE
    4-EEE
    5-IT
    6-IT

    Note that roll_number is just a temporary variable that stores the result. It is not a column.

    • Filtering the Most Relevant Data and Applying Aggregate Functions

    How about finding students whose hobby is reading? Or those who belong to the ECE department? Or just the number of students who scored marks between 85 and 90? This is nothing but filtering data or retrieving only the desired one.

    Filtering the required data with SQL is amazingly simple. Consider the following statement:

    select * from student 
    where hobby = ‘reading’;

    This fetches all the columns of the table where the hobby is reading.

    Now, look at the following query.

    select count(student_id) 
    from student 
    where marks between 85 and 90;

    This fetches a single value, the count of students who secured marks between 85 and 90.

    Next, we have a big query.

    select student_name, department, marks 
    from student 
    group by department having marks > 90 
    order by marks desc;

    Okay, this requires a detailed explanation, but first, let us see the result:

    John         CSE      94
    Mark         CSE      92
    Richard      EEE      93.5
    Joe          ECE      91

    We see only those records with marks > 90 in descending order. Also, the group by clause ensures that the rows are grouped department-wise. That’s why we see 92 before 93.5.

    Let us say we want to get the list of departments, but we don’t want the values to repeat. Since we have more than one value for some departments, this situation is likely. To avoid that, we use the keyword distinct .

    select distinct(department) 
    from student;

    How about we get the overall average marks of students in each department?

    For this, we use the following query.

    select avg(marks), department 
    from student 
    group by department;

    We receive the following output.

    93.0|CSE
    91.0|ECE
    93.5|EEE
    85.0|IT

    Isn't SQL powerful yet simple?

    We can get any combination of data that we need. You can also use aggregate functions to retrieve useful trends, patterns, and other insights.

    • Identifying Missing Values

    The data won’t be as nice and readable in real-world scenarios as in our examples above. It will have lots of missing values, incorrect values, etc. Databases have a unique way of identifying missing values. It is called NULL.

    NULL values can cause differences in the measurements and final results. As we saw above, we filter data using where conditions.

    For example:

    select * from student 
    where department = ‘IT’;

    or

    select * from student 
    where department != ‘IT’;

    With null values, these comparisons (= and !=) don’t work. To get a NULL value, we have to use IS and IS NOT, as shown below:

    select * from student 
    where department IS NULL;
    

    or

    select * from student 
    where department IS NOT NULL;

    That’s why NULL values will never appear in any result where we make regular comparisons leading to errors in data.

    Imagine having a dataset with millions of records and thousands of NULL values. This will significantly impact data analysis. Thus, null values should be corrected to include default or zero values so that no columns (or data) are missed out.

    • Date Time Functions

    Date time functions are useful to compare dates and analyze data based on different quarters, months, or years. There are many functions available that can convert data into any format. The timestamp data type includes the time in hours, minutes, and seconds.

    For example, we can get only those records where the year of birth of the student is 1984 using the following query:

    select year(date_of_birth), student_name 
    from student 
    where year(date_of_birth) = 1984;

    Check out all the date-related functions here .

    • Combining Data from Multiple Tables

    Real data is huge, complex, and split into multiple tables. That means getting the required collective information will need joining one or more tables and displaying the relevant columns from those tables. For this, SQL uses different types of joins, as we discussed above:

    • Inner join – Gets only matching values in both tables.
    • Left join – Returns all the records of the left table and matching records of the right table.
    • Right join – Returns all the records of the right table and matching records of the left table.
    • Full join – Returns all the records with a match on either the left or the right table.
    • Self-join – Join of a table with itself.

    Suppose we want to list the students who have the same marks. We do not have any such values in our above table right now.

    Let us update Joe’s marks to 93.5:

    update student set marks = 93.5 
    where student_name = ‘Joe’;

    and,

    select s1.student_name, s2.student_name 
    from student s1, student s2 
    where s1.marks = s2.marks and s1.student_name != s2.student_name;

    This will now fetch – Joe|Richard. We used the same table to join itself in the above query.

    Let us say there is another table, books_library, that stores details of students borrowing books from the library.

    book_id book_name author student_id date_of_issue date_of_return
    23 Let us C++ Yashwant K. 1 23-01-2020 23-02-2020
    24 Head First Java Kathy S. 2 20-01-2020 20-02-2020
    25 Data Science from Scratch Joel Grus 3 19-12-2019 19-01-2020
    26 Headfirst Python Paul B. 4 12-12-2019 12-01-2020

    If we want to get the details of which student bought which book, we can join the tables and display the desired columns as shown below:

    select student.student_name, books_library.book_name, books_library.date_of_issue 
    from books_library inner join student on s.student_id = b.student_id;

    The above query will yield the following output:

    John
    Let us C++
    23-01-2020
    Mark
    Head First Java
    20-01-2020
    Joe
    Data Science from scratch
    19-12-2019
    Richard
    Headfirst Python
    12-12-2019

    Joins are a powerful way to consolidate data and represent it differently.

    Wrapping it Up!

    These were the basics of SQL the data scientists need to know and use in their process. The language is helpful during data pre-processing, data analysis , and machine learning processes. Only learning these concepts is not enough. You need to practice them a lot to use them in the real world. A few popular websites that help you practice SQL are HackerRank, DataLemur, and SQLPad. Solve the practice questions from these websites, and take your skills to the next level.

    People are also reading:

    FAQs


    SQL, or Structured Query Language, is a language to manipulate and manage data in relational databases. It is used to retrieve and modify data from large databases and perform various data analysis tasks.

    Some common SQL commands used in data science are SELECT, WHERE, GROUP BY, HAVING, JOIN, UNION, and ORDER BY to query and filter data, aggregate and summarize data, and join data from multiple tables.

    Yes, SQL is very easy to learn, even for non-IT individuals. The syntax is similar to the English language and easy to understand.

    Structured Query Language is a powerful language that helps data scientists to retrieve and analyze data from large datasets, join data from multiple tables, and perform complex queries.

    SQL is a language primarily designed to work with structured data and relational databases. Hence, it is not capable of analyzing unstructured and semi-structured data.

    Leave a Comment on this Post

    0 Comments