Top SQL Interview Questions
Here are a dozen of the most common SQL interview questions you will see during an interview for a data professional role. I would highly suggest you attempt to answer the questions and write prospective queries for each question before you compare your answers to mine at the end of the article. I created a fake dataset and schema to run queries. Here is the link to the database schema that is outlined below the questions. That will get you started.
- Determine the number of duplicates in a table. [Solution]
- Find all unique combinations of two columns. [Solution]
- Count the number of non-null entries in a column. [Solution]
- When to use group-by vs. distinct? Show an example. [Solution]
- Why use coalesce vs. case statements? [Solution]
- When would you choose left join over an inner join? Show an example. [Solution]
- In what cases is a subquery a bad idea? [Solution]
- Why is a temp table a good or bad idea? [Solution]
- When are unions useful? Show an example. [Solution]
- How to filter using a join? Show an example. [Solution]
- When to use having vs. where? Show an example. [Solution]
- How do common table expressions help an analyst? Show an example. [Solution]
Other Topics Not Covered in this Article:
Sum. Date-time manipulation. String formatting. Substring. Window functions like rank and row. LAG and LEAD. Understanding indexing. Running totals. MAX and MIN.
Database Schema Set-Up
You can set up the schema at SQLFiddle using the following SQL code or going to this direct link to run the code in MySQL or this direct link for PostgreSQL. In addition to the code below, there are two tables that visually represent the 2019_Sales and Store_Info tables. Below the tables, you will find the Solutions section.
Note: The schema worked in MySQL, but may not work in other SQL databases. Other SQL database management systems might have stricter rules on table names and double vs. single quote usage.
CREATE TABLE 2019_Sales (
sales_id INT NOT NULL,
year INT NOT NULL,
country VARCHAR(40) NOT NULL,
store_id INT NOT NULL,
sales INT NOT NULL,
PRIMARY KEY (sales_id)
);
INSERT INTO 2019_Sales
(sales_id, year, country, store_id, sales)
VALUES
(1,2019,"Canada",1111,1000),
(2,2019,"Canada",1112,5000),
(3,2019,"Canada",1113,700),
(4,2019,"Mexico",1114,8000),
(5,2019,"Mexico",1115,100),
(6,2019,"United States",1116,2000),
(7,2019,"United States",1117,150),
(8,2019,"Canada",1111,1000),
(9,2019,"United States",1116,500),
(10,2019,"Mexico",1114,8000),
(11,2019,"Mexico",1115,500),
(12,2019,"Mexico",1115,500);
CREATE TABLE Store_Info (
store_id INT NOT NULL,
store_name VARCHAR(40) NOT NULL,
size INT,
PRIMARY KEY (store_id)
);
INSERT INTO Store_Info
(store_id, store_name, size)
VALUES
(1111,"Red Store",1000),
(1112,"Blue Store",3000),
(1113,"Green Store",2000),
(1114,"Pink Store",NULL),
(1115,"Yellow Store",NULL),
(1116,"Brown Store",500),
(1117,"Black Store",1000);
2019_Sales Table
sales_id | year | country | store_id | sales |
---|---|---|---|---|
1 | 2019 | Canada | 1111 | 1000 |
2 | 2019 | Canada | 1112 | 5000 |
3 | 2019 | Canada | 1113 | 700 |
4 | 2019 | Mexico | 1114 | 8000 |
5 | 2019 | Mexico | 1115 | 100 |
6 | 2019 | United States | 1116 | 2000 |
7 | 2019 | United States | 1117 | 150 |
8 | 2019 | Canada | 1111 | 1000 |
9 | 2019 | United States | 1116 | 500 |
10 | 2019 | Mexico | 1114 | 8000 |
11 | 2019 | Mexico | 1115 | 500 |
12 | 2019 | Mexico | 1115 | 500 |
Store_Info Table
store_id | country | size |
---|---|---|
1111 | Red Store | 1000 |
1112 | Blue Store | 3000 |
1113 | Green Store | 2000 |
1114 | Pink Store | |
1115 | Yellow Store | |
1116 | Brown Store | 500 |
1117 | Black Store | 1000 |
Click here to return to questions
***SOLUTIONS ARE BELOW THIS***
Solutions
Question 1: Determine the number of duplicates in a table
This can be interpreted in two different ways. Finding the number of duplicates including all columns in the comparison, or finding the number of duplicates for a subset of columns. If there is a primary key, it should not be possible to have an entire row that is a duplicate, but they still happen. Below is my code to find the number of duplicates in the subset of this table that only includes {“year”, “country”}.
/* Year and Country combination duplicates */
SELECT year, country, COUNT(*)
FROM 2019_Sales
GROUP BY year, country
HAVING COUNT(*)>1
Question 2: Find all unique combinations of two columns
/* Find all unique combinations of two columns*/
SELECT DISTINCT country, store_id
FROM 2019_Sales
Question 3: Count the number of non-null entries in a column
/* Count the number of non-null entries in a column*/
SELECT COUNT(size)
FROM Store_Info
Question 4: When to use group-by vs. distinct?
Group-by is a similar technique to pivot tables in Excel. It is useful for summarizing or aggregating information when values appear multiple times in a column. For example, we can aggregate sales by country using group-by on the 2019_Sales table.
/* Aggregate sales by country using group-by*/
SELECT country, sum(sales)
FROM 2019_Sales
GROUP BY country
Query Output Table
Country | Sum(Sales) |
---|---|
Canada | 7700 |
Mexico | 17100 |
United States | 2650 |
The distinct field is useful when you need to a unique set of records or a count of a unique set of records. In question 2, we found the count of the unique set of all combinations between two columns.
Question 5: Why use coalesce vs. case statements?
Case statements are available in most programming languages. They are useful in situations where you have a large number of conditional statements. They are a more organized and easier to interpret approach than writing a bunch of if/then statements. One common example is assigning ordinal rankings. Coalesce statements are simply shorthand for case statements. Because they require less code, they are not as legible to a non-programmer as case statements.
Coalesce statements are preferred when a programmer wants to reduce code or when users access the query regularly. Because it is harder to interpret, you would not want an infrequent user reading through lines of coalescent statements.
On the other hand, case statements are better for infrequent users because of legibility. This characteristic would also make case statements easier for debugging SQL statements.
Question 6: When would you choose left join over an inner join?
A left join is common when you have a primary data set that you want to combine with an incomplete data set. The goal is to add some new information and not override anything from the primary data set. In the below example, we used a left join to add the Store_Info table onto the 2019_Sales table. Because there is no Store_Info record for store_id 1117, an inner join would omit this record in the joined table. This is seen if you run the “Inner Join” code below and compare it to the “Left Join” output
/* Left join two tables - Includes Store_ID 1117*/
SELECT 2019_Sales.sales_id, 2019_Sales.year, 2019_Sales.country, 2019_Sales.store_id, 2019_Sales.sales, Store_Info.store_id, Store_Info.store_name, Store_Info.size
FROM 2019_Sales
LEFT JOIN Store_Info ON 2019_Sales.store_id=Store_Info.store_id;
/* Inner join two tables - Doesnt include Store_ID 1117*/
SELECT 2019_Sales.sales_id, 2019_Sales.year, 2019_Sales.country, 2019_Sales.store_id, 2019_Sales.sales, Store_Info.store_id, Store_Info.store_name, Store_Info.size
FROM 2019_Sales
INNER JOIN Store_Info ON 2019_Sales.store_id=Store_Info.store_id;
Question 7: In what cases is a subquery a bad idea?
Subqueries, or nested queries, are typically used for some calculation or conditional logic that provides data to be used in the main part of the query. Some subqueries are referred to as correlated subqueries. These occur when a subquery incorporates a column from the main query. Because correlated subqueries have a dependency on the main query, they cannot be run on their own. Therefore, they are harder to debug. Common issues include incorrect references to a table or alias (Source).
Question 8: Why is a temp table a good or bad idea?
Temp tables are useful when you need to store and view intermediate results when processing data within a batch or process. This would be helpful for any debugging or checking that a procedure is behaving as expected. However, using temp tables in stored procedure leads to adverse effects, such as multiple recompilation, transaction log use, and concurrency overhead (Source). This results in a negative performance impact.
Question 9: When are unions useful?
Unions are useful for combining columns or data sets; however, they don’t combine them like a JOIN but rather like a concatenation* of strings. You append one collection of records onto the other. While joins match based on specific IDs, unions simply add one section onto the end of another. In the below example, we can confirm we have all store_ids by using union to combine the store_id column in both tables. The union function deduplicates the dataset, while union all does not.
/* Combine all store_ids from both tables into one column*/
SELECT store_id FROM 2019_Sales
UNION
SELECT store_id FROM Store_Info
ORDER BY store_id;
*Note: A similar function in the Pandas library for Python is concat( ).
Question 10: How to filter using a join?
Self joins are when a table is joined with itself. These can be useful functions for filtering records in a single table. In the following situation, we can find all pairs of store_ids within the same country. The output yields two rows for each pair because order matters in this situation.
/* Find all pairs of store_ids in the same country*/
SELECT A.store_id AS Store_ID1, B.store_id AS Store_ID2, A.country AS Country
FROM 2019_Sales A, 2019_Sales B
WHERE A.store_id <> B.store_id
AND A.Country = B.Country
ORDER BY A.Country;
Question 11: When to use having vs. where?
Where and having statements are both used to filter results for your query; however, they appear at different spots in the order of the script. This ordering is purposeful because “where” statements filter non-aggregated columns. “Having” statements are used to filter aggregated columns or columns included in the group by clause. See the example below where we want to filter on the aggregated sums for each country. Because this is an aggregation, we cannot use the where clause. The output of this query excludes the United States sum of sales that totals 2650, which is less than 2800.
-- Using the having clauses
SELECT country, sum(sales)
FROM 2019_Sales
GROUP BY country
HAVING sum(sales) > 2800
Query Output Table
country | sum(sales) |
---|---|
Canada | 7700 |
Mexico | 17100 |
Note: Many database management systems do not support alias usage in the having clause.
Question 12: How do common table expressions help an analyst?
Common table expressions (CTE) are essentially defining a temporary table that exists only for one query. The CTE is declared before the main query. They are similar to subqueries and can be more readable than complex queries. CTEs are accompanied by a “WITH” statement at the beginning. They provide additional flexibility for complex queries, including the ability to alias fields inside the CTE and use them in later expressions. For example, you can take an aggregation of an aggregation. Inception?! In the code below, I take an average of a sum in one query.
-- Using CTEs and the WITH statement
WITH sum_store_id_sale AS
(
SELECT store_id, country, SUM(sales) AS sum_sales
FROM Sales_2019
Group By store_id, country
)
SELECT AVG(sum_sales) AS avg_sales
FROM sum_store_id_sale;
The output of this code was the average of the sum_sales column, which found the sum of sales across store_ids. Therefore, the output was 3921.43.
Note: The above code did not run in MySQL in SQLfiddle. I rebuilt the schema in PostgreSQL. This required changing the name of the 2019_Sales table to Sales_2019 and replaced all double quotes with single quotes to satisfy PostgreSQL’s rules.
Feedback
These questions had a wide range of difficulties. Some were beginner level, while others tackle more complex topics like temp tables or subqueries. After comparing your answers, where did yours differ from mine? For the answers that were different, I would dive into those a little deeper. Did we have different interpretations of the question? Was my explanation missing something or was yours incomplete? During an interview, the ideal candidate can explain the most complex SQL topics in a clear, succinct manner.
~ The Data Generalist
Data Science Career Advisor
Some additional SQL resources:
- WiseOwlTutorials – Video tutorials
- w3schools – SQL tutorials
- SQLfiddle– Can build your own schema and run queries on it online
- SQLZOO – SQL tutorials
- Hackerrank – SQL Practice – SQL quizzes