Harnessing the Power of Data Science and SQL with OLAP Functions on IBM iSeries
In the ever-evolving landscape of technology, Data Science has emerged as a cornerstone of business intelligence, strategic decision-making, and operational efficiency.
At the heart of data science lies the powerful combination of SQL and Online Analytical Processing (OLAP) functions, especially within the robust IBM iSeries (AS/400) environment.
This post explores how leveraging SQL and OLAP functions on the IBM iSeries platform can transform data analysis, offering deep insights and facilitating complex analytical queries directly on the database.
Although obviously these possibilities are common to sql engines that provide OLAP functions, in this post the focus is on the IBM iSeries platform.
Introduction to Data Science and SQL
Data science combines various statistical, analytical, and programming skills to collect, analyze, and interpret large data sets, aiming to derive actionable insights.
SQL (Structured Query Language) is a fundamental tool in the data scientist’s arsenal, allowing for the management and manipulation of relational databases. The precision, efficiency, and versatility of SQL make it indispensable for querying and analyzing data.
The Role of OLAP Functions in Data Science
OLAP functions extend the capabilities of SQL, enabling advanced data manipulation and analysis features.
These functions are integral to performing complex analytical queries, such as data summarization, aggregation, and multi-dimensional analysis.
By integrating OLAP functions within SQL queries, data scientists can enhance their data exploration and analysis capabilities, uncovering deeper insights more efficiently.
OLAP Functions on IBM iSeries (DB2 for i)
The IBM iSeries (AS/400) platform, with its DB2 for i database, supports a comprehensive suite of OLAP functions designed to facilitate sophisticated data analysis.
These functions can be categorized as follows:
1 – Aggregate Functions: Essential for performing basic to advanced aggregations, these functions help summarize data, providing a foundation for further analysis:
- SUM, AVG, MAX, MIN, COUNT: Perform basic aggregations such as total, average, maximum, minimum, and count..
- STDDEV, VAR: Calculate standard deviation and variance to understand data dispersion.
2 – Ranking Functions: Offer advanced ranking capabilities, essential for comparative analysis within data sets:
- ROW_NUMBER(): Assigns a unique sequential integer to rows within a partition of a result set, starting at 1.
- RANK(): Assigns a rank to each row within a partition of a result set, with gaps in rank values if there are ties.
- DENSE_RANK(): Similar to RANK(), but without gaps in the ranking sequence for tied ranks.
- NTILE(n): Distributes the rows in an ordered partition into a specified number of groups, n. It is useful for dividing a dataset into quantiles.
3 – Window Functions: Enable access to specific data points within a data set, facilitating complex comparisons and analyses without the need for self-joins:
- LEAD, LAG: Access data from a subsequent row (LEAD) or a previous row (LAG) without requiring a self-join.
- FIRST_VALUE, LAST_VALUE: Obtain the first or last value in a specified window.
- NTH_VALUE(): Fetches the value of a specified column at the nth position within the window.
4 – Analytical Aggregate Functions: Allow for the performance of aggregations over related sets of rows, offering a deeper analytical perspective:
- SUM() OVER, AVG() OVER: Perform aggregations over a set of rows that are somehow related to the current row.
- CUME_DIST(): Calculates the cumulative distribution of a value within a set of values.
- PERCENT_RANK(): Computes the percentage rank of a value in a set of values.
5 – Grouping Operations: Enable comprehensive data aggregation and analysis, supporting complex reporting needs with efficiency and flexibility:
- GROUPING SETS: Allows for multiple levels of aggregations in a single query, useful for generating reports with subtotals and grand totals.
- ROLLUP: Produces a result set that shows aggregates for a hierarchy of values, including subtotals and a grand total.
- CUBE: Generates all possible combinations of aggregations for a group of selected columns.
6 – Miscellaneous Functions: Enhance data manipulation capabilities, allowing for innovative data presentations and insights:
- LISTAGG(): Aggregates string data by concatenating values from multiple rows.
- GROUPING(): Identifies whether a row in an aggregate result set is a super-aggregate row.
Usage Consideration
- Performance: OLAP functions, especially when applied to large datasets, can be resource-intensive. Proper indexing, partitioning, and query optimization are essential.
- Compatibility: Ensure your system is on a recent version of IBM i (OS/400) to utilize the full range of OLAP capabilities offered by DB2 for i.
- Learning Curve: While powerful, OLAP functions can be complex to master. Invest time in understanding their nuances to fully leverage their analytical capabilities.
The integration of SQL and OLAP functions on the IBM iSeries platform offers unparalleled opportunities for data scientists to perform sophisticated data analysis and reporting.
By mastering these tools, professionals can unlock deeper insights, enhance performance, and streamline data processing workflows, ultimately driving forward the Data Science frontier within their organizations.
Well, let’s start by seeing some OLAP Functions in action!
OLAP Function examples: aggregate functions SUM, AVG, MAX, MIN, COUNT, STDDEV, VAR
Below are examples of SQL commands for IBM iSeries that demonstrate the use of OLAP (Online Analytical Processing) functions, including aggregate functions like:
- SUM, AVG, MAX, MIN, COUNT: Perform basic aggregations such as total, average, maximum, minimum, and count.
- STDDEV, VAR: Calculate standard deviation and variance to understand data dispersion.
These examples also incorporate Common Table Expressions (CTEs) and the VALUES clause to prepare test data.
WITH sales_data (sale_id, category, product_id, quantity_sold, sale_amount) AS ( VALUES (1, 'C1', 'Pa', 10, 1000), (2, 'C1', 'Pb', 20, 1500), (3, 'C1', 'Pc', 15, 1200), (4, 'C1', 'Pd', 25, 2000), (5, 'C2', 'Pe', 30, 2600), (6, 'C2', 'Pf', 10, 1200), (7, 'C2', 'Pg', 20, 1700), (8, 'C2', 'Ph', 15, 1400), (9, 'C3', 'Pi', 10, 2000), (10, 'C3', 'Pj', 20, 1900), (11, 'C3', 'Pk', 15, 1500), (12, 'C3', 'Pl', 25, 2300), (13, 'C3', 'Pm', 30, 2800), (14, 'C3', 'Pn', 30, 2200) ) SELECT category, SUM(sale_amount) AS total, AVG(sale_amount) AS average, MAX(sale_amount) AS "max", MIN(sale_amount) AS "min", Round(STDDEV(sale_amount)) AS stddev, Round(VAR(sale_amount)) AS var, COUNT(*) AS "Count" FROM sales_data GROUP BY category ORDER BY 1;
Result:
CATEGORY | TOTAL | AVERAGE | max | min | STDDEV | VAR | Count |
C1 | 5.700 | 1.425 | 2.000 | 1.000 | 377 | 141.875 | 4 |
C2 | 6.900 | 1.725 | 2.600 | 1.200 | 536 | 286.875 | 4 |
C3 | 12.700 | 2.116 | 2.800 | 1.500 | 398 | 158.056 | 6 |
Two word about Standard Deviation and Variance.
OLAP (Online Analytical Processing) functions for calculating Standard Deviation and Variance are powerful tools in data analysis, especially when working with large datasets in databases or data warehouses to support business decisions.
These functions allow analysts to better understand the distribution and variability of data, which are fundamental for statistical analysis and data-driven decision-making.
Statistical Concept
- Variance: Variance measures the dispersion indicating how the data points in a dataset deviate from the mean (average) of the set. In simple terms, it tells us how variable the data are. Variance is calculated as the average of the squared differences between each value and the dataset’s mean. In the OLAP context, variance helps understand the distribution of values across different dimensions, such as time, geography, or product categories.
- Standard Deviation: Standard deviation is the square root of variance and provides a measure of data dispersion in the original units. While variance gives an idea of data variability, standard deviation is more intuitively understandable because it is expressed in the same units as the original data. This makes it easier to assess the spread of values around the mean.
Practical Significance
- Variability Analysis: Understanding the variance and standard deviation of data can help businesses identify variability in business processes, sales, production, and other key metrics. For example, high variability in sales could indicate the need to look more closely at factors affecting such fluctuations, like seasonality, promotions, or competition.
- Decision Support: OLAP functions that calculate these statistics can support strategic decisions by providing insights into how data are distributed around industry or company averages. For instance, understanding standard deviation in delivery times can help improve logistics and customer satisfaction.
- Segmentation and Targeting: By analyzing the variance and standard deviation of customer behaviors or preferences, companies can identify market segments or customer groups with similar needs or behaviors, allowing for more targeted marketing strategies.
- Anomaly Detection: Values that significantly deviate from the mean (outliers) can be easily identified when analyzing standard deviation. This can be particularly useful for detecting data errors, fraud, or other operational anomalies.
In summary, OLAP functions for calculating variance and standard deviation are essential for any advanced data analysis, allowing analysts and decision-makers to better understand the nature and distribution of data within an organization.
This understanding can guide process optimization, product innovation, and more effective marketing strategies, among other competitive advantages.
OLAP Functions examples: Analytical Aggregate Functions SUM() OVER, AVG() OVER, CUME_DIST(), PERCENT_RANK()
Below are examples of SQL commands for IBM iSeries that demonstrate the use of OLAP (Online Analytical Processing) functions, including analytical aggregate functions, also known as Window functions or Partitoning Functions, like SUM() OVER, AVG() OVER, CUME_DIST(), and PERCENT_RANK().
These examples also incorporate Common Table Expressions (CTEs) and the VALUES clause to prepare test data.
Example 1 OLAP Function: SUM() OVER and AVG() OVER
These functions compute the sum and average of a set of rows defined by the OVER clause (explained in detail below).
WITH SalesData (PersonId, Year, Amount) AS ( VALUES (1, 2019, 1000), (1, 2020, 1500), (1, 2021, 800) , (1, 2022, 1400), (1, 2023, 1600), (2, 2020, 1500), (2, 2021, 800), (2, 2022, 1400) , (2, 2023, 1600) , (3, 2021, 800), (3, 2022, 1400), (3, 2023, 1600) ) SELECT distinct PersonId , SUM(Amount) OVER (PARTITION BY PersonId) AS Total , AVG(Amount) OVER (PARTITION BY PersonId) AS Avg , SUM(Amount) OVER (PARTITION BY 'ALL') AS total , DECIMAL(100 * (SUM(Amount) OVER (PARTITION BY PersonId)) / DECIMAL(SUM(Amount) OVER (PARTITION BY 'ALL'), 10, 3), 5, 3) AS Perc , COUNT(Year) OVER (PARTITION BY PersonId) AS CountYear FROM SalesData ORDER BY 1;
Result:
PERSONID | TOTAL | AVG | TOTAL | PERC | COUNTYEAR |
1 | 6.300 | 1.260 | 15.400 | 40,909 | 5 |
2 | 5.300 | 1.325 | 15.400 | 34,415 | 4 |
3 | 3.800 | 1.266 | 15.400 | 24,675 | 3 |
In this example, the CTE SalesData prepares test data with PersonId, Year and Amount. The SELECT statement then calculates the total, average and Perc sales by each salesperson using the SUM() OVER and AVG() OVER functions, partitioning the data by PersonId.
Two word about Windows Functions or Partitioning Functions
In IBM iSeries SQL and more generally in SQL databases, the OLAP functions such as SUM(…) OVER (…), AVG(…) OVER (…), and COUNT(…) OVER (…) are part of what’s also known as Window functions or Partitioning functions.
These functions perform calculations across a set of rows that are somehow related to the current row, allowing for complex analyses like running totals, moving averages, or cumulative statistics.
Here’s how each of these functions works in the context of our query:
SUM(…) OVER (PARTITION BY…)
The SUM(…) OVER (PARTITION BY …) function calculates the total sum of a column for each partition of the data. A partition is essentially a subset of the data based on the column(s) specified in the PARTITION BY clause. In our query, SUM(Amount) OVER (PARTITION BY PersonId) computes the total amount for each PersonId across their respective rows. This means it groups the data by PersonId and then sums the Amount for each group.
AVG(…) OVER (PARTITION BY …)
Similar to the SUM function, AVG(…) OVER (PARTITION BY …) calculates the average of a column for each partition. In our example, AVG(Amount) OVER (PARTITION BY PersonId) computes the average amount for each PersonId. This is done by partitioning the data by PersonId and calculating the average Amount for each partition.
COUNT(…) OVER (PARTITION BY…)
The COUNT(…) OVER (PARTITION BY …) function counts the number of rows in each partition. In the provided query, COUNT(Year) OVER (PARTITION BY PersonId) counts the number of years recorded for each PersonId. This effectively gives you the number of entries (or records) per person.
Special Notes on our Query: the SUM(Amount) OVER (PARTITION BY ‘ALL’) is an interesting usage that would calculating the total amount across all the data without partitioning. The value ‘ALL’ is a constant to have same value fo every rows, so no partition was done (instead ‘ALL’ we can use any constant value).
Overall, window functions like SUM OVER, AVG OVER, and COUNT OVER provide powerful tools for performing detailed and sophisticated data analysis within SQL queries, enabling data analysts and developers to derive meaningful insights from complex datasets.
Example 2 OLAP Function: CUME_DIST() OVER
This function calculates the cumulative distribution (explained in detail below) of a value within a set of values.
WITH ExamScores (StudentId, Score) AS ( VALUES (1, 75), (2, 88), (3, 92), (4, 67), (5, 81) ) SELECT StudentId, Score, CUME_DIST() OVER (ORDER BY Score) AS CumulativeDistribution FROM ExamScores ORDER BY Score;
Result:
STUDENTID | SCORE | CUMULATIVEDISTRIBUTION |
4 | 67 | 0,2 |
1 | 75 | 0,4 |
5 | 81 | 0,6 |
2 | 88 | 0,8 |
3 | 92 | 1 |
Here, the CTE ExamScores prepares test data with StudentId and Score. The SELECT statement computes the cumulative distribution of scores, ordering by the score value.
Two word about Cumulative Distribution
The CUME_DIST() function in SQL calculates the cumulative distribution of a value within a sequence of values. It is an OLAP (Online Analytical Processing) function that provides a way to compute the relative position of a specific value within a group of values, ordered by some column.
The result is a value between 0 and 1, representing the proportion of rows that have values less than or equal to the value in the current row.
Here’s what happens step by step:
- Preparation of Data: The Common Table Expression (CTE) ExamScores sets up a simple dataset of StudentId and Score.
- Ordering of Scores: The CUME_DIST() function is applied over the scores ordered by the Score column. This means that for each row, SQL Server calculates the cumulative distribution of the current score relative to all scores in the dataset.
- Calculation of Cumulative Distribution: For each score, CUME_DIST() calculates the proportion of all scores in the dataset that are less than or equal to the current score. This is done by dividing the number of score values that are less than or equal to the current score by the total number of scores in the dataset.
For example, if we look at a score of 75:
- There are 5 scores in total.
- The score of 75 is greater than or equal to 2 score (including himself, i.e. 67 and 75) and less than 3 other scores (81, 88, 92).
- The cumulative distribution for the score of 75 is calculated as the number of scores up to and including 75 (which is 2 in this case) divided by the total number of scores (which is 5), resulting in a cumulative distribution value of 0.4.
This calculation is repeated for each score in the dataset, providing a way to understand not just the absolute value of scores, but how each score compares to the distribution of all scores.
The result of this query gives you a table with the StudentId, their Score, and the CumulativeDistribution of each score, showing you the proportion of students who have scored up to and including that score.
This can be particularly useful in educational analysis, grading systems, or any scenario where understanding the distribution of values within a dataset is important.
Example 3 OLAP Functions: PERCENT_RANK() OVER
This function computes the percentile rank (explained in detail below) of each row within a partition of a result set.
WITH EmployeeSales (EmployeeId, SalesAmount) AS ( VALUES (1, 5000), (2, 7600), (3, 9400), (4, 3000), (5, 6200) ) SELECT EmployeeId, SalesAmount, PERCENT_RANK() OVER (ORDER BY SalesAmount) AS PercentileRank FROM EmployeeSales ORDER BY SalesAmount;
Result:
EMPLOYEEID | SALESAMOUNT | PERCENTILERANK |
4 | 3.000 | 0 |
1 | 5.000 | 0,25 |
5 | 6.200 | 0,5 |
2 | 7.600 | 0,75 |
3 | 9.400 | 1 |
In this example, the CTE EmployeeSales prepares test data with EmployeeId and SalesAmount. The SELECT statement calculates the percentile rank of each employee based on their sales amount.
Two word about PERCENT_RANK() function
The PERCENT_RANK() function in SQL computes the relative rank of a row within a partition of a result set, excluding the highest rank. The value returned by PERCENT_RANK() is the percentage of values that are lower than the current row’s value.
This function provides a way to understand how each row’s value compares to the others in the dataset, on a scale from 0 to 1, where 0 represents the first value in the ordered set (after sorting based on the ORDER BY clause) and values closer to 1 represent higher ranks.
Here’s how the PERCENT_RANK() function works in this specific scenario:
- Data Setup: The Common Table Expression (CTE) EmployeeSales creates a dataset of EmployeeId and SalesAmount.
- Ordering: The PERCENT_RANK() function is applied to the dataset ordered by SalesAmount. This means it calculates the percentile rank based on the sales amount of each employee.
- Calculation of Percentile Rank: For each row (i.e., each employee’s sales amount), the PERCENT_RANK() function calculates the rank of the row divided by the number of rows minus one. The formula used is: (rank – 1) / (total rows – 1). The first row after ordering gets a percentile rank of 0, and the highest value gets a percentile rank just below 1 (since it’s calculated as (N-1)/(N-1), where N is the total number of rows).
For example, considering our dataset:
- The employee with SalesAmount of 3000 (EmployeeId 4) would have the lowest sales amount, thus the lowest rank, and would receive a PercentileRank of 0 because it is the first value.
- The employee with SalesAmount of 9400 (EmployeeId 3) would have the highest sales amount. If there are 5 employees, its percentile rank would be calculated based on its position in the ordered list, which would be (4)/(5-1) = 1.0, indicating it has a higher sales amount than the rest of the dataset.
This function is particularly useful for analyzing and comparing the distribution of values within a dataset, allowing for an understanding of how each value stands in relation to the others, especially in terms of rankings and percentiles.
Conclusions
Ultimately, the integration of SQL and OLAP functions within the IBM iSeries environment heralds a transformative era for data science and business intelligence.
By leveraging the advanced capabilities of OLAP functions, organizations can gain deeper insights from their data, facilitating more informed strategic decision-making and improving operational efficiency. The IBM iSeries platform, with its robust support for SQL and OLAP operations, represents a significant tool in the arsenal of data scientists and analysts.
Through aggregate functions, classification, window operations, and analytic aggregates, iSeries enables a comprehensive analytic approach that leverages the full spectrum of data science techniques.
As we have explored through practical examples, the power of OLAP functions to perform complex queries and analysis directly on the database is not just a technical improvement; it is a strategic asset that can move companies forward in a competitive landscape.
Adopting these capabilities on the IBM iSeries platform can help organizations unlock the full potential of their data, making it not just an asset but a catalyst for innovation and growth.
Whether dealing with large data sets or complex analytical queries, the synergy of data science and SQL with OLAP functions on IBM iSeries is a testament to the platform’s enduring relevance and its critical role in the future of information-based decision making. data.
The examples shown should provide a good starting point for using OLAP functions in IBM iSeries SQL. Always remember that actual syntax and capabilities may vary slightly depending on your iSeries version and database settings, so it is always a good idea to consult IBM i-specific documentation for SQL.
Useful references