Lab – More Hive Queries and Hive Built-In functions
Objectives
This lab builds on the last lab. We will analyse the data uploaded to Hive using some HQL queries 
Deliverable
Screenshot of each query results. One for each query. Each query should be fully displayed. No need to show the entire query result. Partial display of the result is sufficient. 
- Open the Hive query editor in Cloudera
- Select all column from the salesdata table
- Select orderdate, salesamount and rowid from salesdata table and show 100 records.
- Modify the above step to show orderdate column heading as OrderDate, salesamount as Sales, and rowed as RowNum
- Create a new hive query to find OrderMonth, the total number of orders, total sales, average sales, minimum sales and maximum sales for each month. Sort so that newer sales statistics are at the top.
- Modify the query you created for question 5 to show breakdown for each category
- Use GROUPING SETS to modify the above query to show the statistics for each category and for each ordermonth.
- Create a new query to extract ordered, orderdate, quantity, rate, discountpct, QuoteAmt and QuoteAmtRound.
- QuoteAmt = quantity*rate*(1-discountpct)
- QuoteAmtRound = quantity*rate*(1-discountpct)
Select orderid, orderdate, quantity, rate, discountpct,
    quantity*rate*(1-discountpct) as QuoteAmt,
    round(quantity*rate*(1-discountpct)) as QuoteAmtRound
from salesdata
- Use rand, floor and ceiling function to simulate new data. Create a query that has a random number, salesamount, ordered, wagemargin, Random sales amount, wage margin floor and wage margin ceiling column.
select rand(), saleamount, orderid, wagemargin,
    round(saleamount*rand()) as RandSaleAmount,
    floor(wagemargin) as WageMarginFlr,
    ceiling(wagemargin) as WageMarginCl
from salesdata
- Using the datediff function find the difference between two dates. Create a query that has a ordered, productcategory, productsubcategory, orderdate, projectcompletedate, and duration.
Select orderid, productcategory, productsubcategory,
    orderdate, projectcompletedate,
    datediff(projectcompletedate, orderdate) as duration
from salesData
- Generate some statistics by month using year, month, and avg functions. Create a query that has productcategory, productsubcategory, yearofOrder, monthofOrder, and Average duration.
select
    productcategory, productsubcategory,
    year(orderdate) year, month(orderdate) month,
    avg(datediff(projectcompletedate, orderdate)) Avgduration
from salesdata
group by    
    productcategory,productsubcategory,
    year(orderdate), month(orderdate)
order by 
    3,4
- Find the last day of the month using data_sub(), to_date(), cast(), and concat() functions.
select distinct
 date_sub(to_date(
   concat(cast(year(orderdate) as string),"-",cast(month(orderdate)+1 as string),"-01")
  )
 ,1 ),
 orderdate
from salesdata
limit 100;
- Identify large sales with IF function.
select orderid, saleamount, 
if(saleamount > 5000, 1, 0) as LargeSale
from salesdata
limit 1000;
- Create sales size categories
select orderid, saleamount, 
    case 
        when saleamount > 5000 then 'large'
        when saleamount > 1000 then 'medium'
        else 'small'
    end as SalesSize
from salesdata
limit 1000;
- Perform analysis by reassigning regions. Find yearly total sales for regions - southwest(south and west), east, and central.
select  
case
    when lower(region) = 'west' then 'southwest'
    when lower(region) = 'south' then 'southwest'
    else region
end as newregion,
year(orderdate) as y,
sum(saleamount) as TotalSales
from salesdata
group by
case
    when lower(region) = 'west' then 'southwest'
    when lower(region) = 'south' then 'southwest'
    else region
end,
year(orderdate)
limit 100;
Comments
Post a Comment