Extraction Transformation and Load (ETL) – some thoughts on a large IT transfer project

In 2017 I was involved in an important work project to transfer all the records in a legacy system that was being deprecated by the vendor into another maintained system. We were in some ways fortunate because both systems had been designed by a single company and they were encouraging us to transfer. We had delayed transfer for several years already but were aware that we now had to move. The vendor did have some tools in place , had staff dedicated to such transfers and were offering favorable consultancy rates. The amount of data was not horrendous in computing terms but they were far far beyond the remit of the ability to cope with any sort of manual data correction and the system was an absolute core system upon which several departments completely depended. These were systems that all departments are in from the moment they start the work day to the end. Generally its unusual if they are down for more than 5 minutes in a month, all work pretty much stops when they stop and in no circumstances could they be down for more than a day without special dispensation and coordination to indicate to manage customer expectations.

The whole project was a success although it was challenging. Here is an outline of the steps we took. As ever order here is important in most of the steps.

Step
Inform managers of all involved sections and ensure they are on board – identify and ring fence budget

Step
Appoint project manager on vendor and client side
draw together team to perform transformation.

Step
Draft time table creation of how long it will take putting in place planning for tutorials on systems and consultancy.

Step
Request managers to put forward staff on all sides willing to be involved

Step
Identify any omissions in knowledge and start to identify how this can be remedied. Kick off and complete acquisition of said staff.

Step
Meeting with lead staff to confirm buy in. Request buy in from staff including ring fencing of holidays etc.. to ensure key staff are available at required times.

Step
Set up test systems that all individuals have access to and ensure that the old and new systems can be viewed simultaneously by individuals. Ensure that the domain specialists can identify processes that will be mirrored from the old system to the new system

Step
Give DBAs or those that will be doing data transfer access to databases of source so that they can start thinking of how they can pull out information.

Step
Training for all individuals concerned in new systems.

Step
In new system start tasking individuals with how they are going to do the simple processes – eg register a record approve a record alter a record and get reports out. If possible allow new champions to start to define things like reports.

Step
Start making up any new lookup fields compared with old lookups and also start tasking individuals with creation of reports and letter that will need to be done.

Step
Start mapping the data from old system to new system – excel spreadsheets can be used for this that show the data going from the old system and what fields they are going to go into in the new system. Divide this task up between domain users – this step needs to be done after old and new systems are on domain users machines. As part of this the applications in question should expose if possible the table and field names of the source and target fields. With the systems we were involved in this was possible both for the old and new systems.

For each form on the two systems try to identify the below

Source table.field Target table.field

Also get them to map the lookup table values if direct transfer is not possible or if alias id are used in these lookups.

Source table.field.value=Equivalent.Target table.field.value

Step
Give both mapping documents to the ETL people to allow them to start writing the queries. It is unlikely that there will be a straight transfer across from table to table. While it would be expected that field and table names will be completely different it will be expected that table structure will in certain places be different in this respect it would be good to have a really nice schema diagram of both source and target.

Step
Allow data individuals to write scripts that can be run live against present initial system – if necessary doesn’t need to be live live could copy every night and then perform on 1 day old database backend – which is what we did. This means work can go on in old system and then at a touch of a button.

Encourage DBAs to be able to run these scripts every day to ensure that running them for go live is absolutely no issue. Our scripts only took about half an hour to run so this wasn’t an issue. I was personally involved in writing the SQL for those and I had systems in place to cross tab the amount coming into each new table so I could see new records and information from the old system trickling manually into the system and then being transferred.

Step
Test data input into new system

Step
Check test data input into new system with reference to domain users.

Step
Confirm go live date ensure staff available for issues

Step
Go live to production and start all new procedures ensure staff technical and domain key players on hand to make flexible solutions to things

Step
Project review on going maintenance and improvement of new system

Step
After suitable time turn off of old system if possible.

Notepad ++ / AstroGrep / Autohotkey – 3 Useful Tools

Three useful tools for speeding up or automating tasks

Notepad ++
Text editor that has formating for programming – I often use it for editing XML documents and writing VB scripts.
Notepad ++ link

AstroGREP
Son of GREP – useful Text searcher particularly useful for identifying parameters within web files within applications. When you need to customize a web page
Astro Grep link

AutoHotKey
Automation of tasks program that allows creation of executables that can navigate browsers really very powerful
AutoHotKey link

Comparative Advantage

My opinion on fixing things is that if someone has to sit around and watch you fix something the cost of the job is not just your time in fixing it but it’s also the cost of the time of the person sitting around and watching you fix the thing and the unhappiness that they feel in being pretty helpless.

Hence another reason to empower users.

How do you truly asses the cost of a job – if you are in private business it tends to be everything that you do to undertake the work. How long it is going to take you to get to the location to do the work , how much it will cost you to get to that location. In some large businesses things like time to get to a location to fix the item tends to be totally overlooked – this can be a not insignificant amount of time even if there is no revenue cost. I don’t care who you are I’m of the opinion you want to be reducing wasted time as much as possible.

Trusting Your Gut or When Not to Use Computers.

wargamesAs good as computers are at calculating numbers they are really really bad at doing somethings that humans are just great at like making decisions based on fuzzy logic or translating things between languages. In such situations always go for the human at least until C3P0 comes along. Watson cost $3 million to build, good if you want to play trivial pursuit but ask it to choose between competing software vendors and well you will be nowhere.

That’s you numerical based ranking systems for software purchase!!!

Complex Event Processing (How Cool)

complexEventProcessing2

As part of my timing software solution I had been interested in seeing if there was a better solution to the constant pulling of information from the timing boxes. It would for instance be far better to have some sort of push mechanism where calculations were only made when new information was received.

Turns out that with the advent of big data (which to my mind is being driven by sensors) it seems to be a bit of a hot topic. Doubtless brought on by the myriad number of devices and sensors which are in the market at the moment. Microsoft have a framework called StreamInSight which is free to those that already have SQL Server 2012 licence (not sure which level)

I still have a lot of questions.
Leading on from the post I did on designing my own timing software I consider that it would be useful as a single time could come in and it could be linked up with its partner times and a lap number could be calculated that would allow the relatively easy display via pivot. A process that could occur as the information came in rather than being pulled and batched which is the way my system does it at the moment.

Not sure how I can get my hands on the framework though… Would love to try it out.

Resources here.

StreamInsight: More than Just an API

General StreamInsight article list

Building Timing Software for Athletic Races

A while back I got the opportunity to work with a friend who times athletic races. After a bit of questioning he showed me some of his equipment and how it worked. I realised I could probably put together some home made timing software and so I set about doing it. The following is an overview of the stages that I went through to create the software.

Firstly its important to understand how the hardware works. To my mind this is very much a pattern and irrespective of the software you are using it is going to have to perform the same tasks in the same order.

Pre race test and hook up
IMGP0085

Each competitor is given an RFID chip that has a unique number registered to it. (Numbers can be changed but requires specialist equipment)
A timing box is used to both power the matts (aka antennae) and act as a clock. When someone goes over the matt the chip emits the number and sends to the box. The box then places a time stamp on the chip and places in a file. A file or stream of all these strings is then pushed out of the timing box to awaiting computer.
Computer is listening and takes this information and software does the rest

In terms of the information that is produced by the box this is remakably simple – a raw hex string for each read.

My solution has three main tables
T1 – Raw times
T2 – Competitors
T3 – Chip Tag numbers.

The hard coded chip numbers are hex and each competitor is allocated a chip – they are actually being allocated this hex key but the number equates so on the chip itself will be a readable number eg 14001, 14002 the actual raw code from the box is often something like 45ab32c

What the software then does.

Step 1
Ensure exact duplicates reads are not entered generally because of multiple imports of same file.This is done by setting the full hex key as a primary key disallowing the same value to go into the Raw Times table multiple times.

Step 2
Hex to Decimal translation

Step 3
Drop times before race start time.
Simple select query.

Step 4
Match Reading to competitors – A simple join after hex to decimal translation of information from 3

Step 5
Sort times within competitor in ascending order
Data sort on dual columns

Step 6
Gate times – Competitors get multiple reads over the matt. I have set up something called a gate ( no idea whether this is a standard pattern or not but its what I call it ) Then a gate period is subjectively decided, lets say 30 seconds, (this can be altered for lapped races where you know a racer can’t complete a time in a certain time). A Query is then set up to look to the first time of a competitor and the gate period is added. The software continues looking down the times and deletes any times greater than the first gate time but less than the first gate time plus the gate period.

Step 7
Count timing points
In some races there are laps some are point to point – a lapped race can be thought of as recursive non lap race in that it is a series of non lap races where the end of one non lap race is the start of another. All times for each competitor are sorted in increasing order and an additional field is added with incremented numbers 1,2,3,4 etc. Typically there will be 2 a start and a finish. There can be one where a gun has gone off and competitors are not running across a start matt. In that instance everyone is considered as having the same start time the time of the Starting pistol and start one is given the 2 point and 1 is assumed to be a pre-set time.

Step 8
Pivot the times
Place point number as column header – name of competitor as row heading and time value as the value of the pivot.

Step 9
Calculate the lap times
Simple n+1 time minus n time

Step 10
Add up individual times to get total race time sort by any additional category eg gender age etc… sort time order and allocate prizes appropriately.

Step 11
Pick up whatever pretty report writer you can get your hands on – excel will do at a push and print out and hand to race organisers.

Step 12
Make pretty forms so you can easily change competitors add in extra times for missed times allow for edge cases like DNFs DQs penalties etc.

Sunday 11th of May I had the opportunity to test the software out by acting as lead timer on the Castle of Mey 10k the most northerly 10k in the country. Glad to say worked perfectly.

Below is a video of the 2015 race where we were in the McNicol Van

Nice when a plan comes together.

Why you and your Organisation should make Web Bespoke part of your operations.

themessage565171

Don’t get me wrong as a platform for communication I consider things like facebook and Linkedin as necessary evils for what is essentially cloud service market places but my preferences are always…

No adverts
No demands to have login prior to viewing site
As much annonymity for visitors as possible
Long form writing!!!!

I guess the thing that irks me the most is the light manipulation that occurs on these sites. They gently nudge you towards their wishes not your own. I must say as an exercise in communication I am much happier with this format where I have greater editorial control. (albeit a bit less traffic)

Its why I really love bespoke ( or as bespoke as possible ) if I could I would be designing my own operating systems.
Maybe one day.

Why?

2 words

Greater Understanding.

Leading to improvements in

1) Flexibility
2) Control
3) Ability to anticipate
4) Adaptibilty
5) Automate tasks ( Given time in certain cases I have been able to completely design out tasks )
6) Improved long term planning
7) Increased reliability
8) Faster response times
9) Reduce requirement for others time ( always something which is a complete premium )
10) Better timing
11) Massive coordination improvements ( yes why ask someone to tell you about something why not give them the ability to edit the information themselves – I can stay at home, kind of)
12) Improved motivation

It allows me to experiment which is when I learn the most and I think leads to the holy grail of increased productivty and reduced cost.

Want to increase the productivity of you and the people around you? Give them as much flexibility as possible and open up your design environment. I can’t see anything but advantages resulting from this both personally and for the organisation.

A good example of an organisation going bespoke with their web design …

http://www.theregister.co.uk

Details of their setup.

http://www.theregister.co.uk/about/company/website/

Sometimes all it takes is a walk

The other day I had a good morning.  A colleague had asked me to help him out on something and we’d just spent an hour going through some options getting pretty much nowhere, finally I told him I was out of ideas and we were going to have to leave it for just now. He thanked me but truthfully I wasn’t very happy. Things were broken and in this case broken was inoperable. I left his desk and started walking back to my office which is located in another buildings ten minutes away. Out of nowhere and almost exactly half way between both buildings I realised what the problem was. I immediately phoned my colleague and he made the required adjustments. Hey presto everything worked.

I love getting things working
and I love when you suddenly realise how to get things working.

Configuration vs Programming

I used to find configuration kind of frustrating – highly repetitive unintuitive and often changing as software versions change.

I now consider it a simple exercise in rote learning for which there is no solution except perseverance, patience and determination and I try to find out the names and numbers of the key administrators.

This change in perception means I no longer feel the kind of frustration I previously felt. I am also much more likely to refer to manuals than trying to guess my way through a menu system which was in retrospect the impatience of youth. As a result I am much more organised in documenting and keeping documentation and I think more productive as a result.

The amount of setups that are hindered by insufficient security privileges being available to the  individual doing setup must amount to millions of lost hours. Please for those giving configuration tasks to individuals build in large amounts of time for configuration. If you don’t you’ll probably just be burnt.

Programming by comparison seems gloriously imaginative and logical. Makes me think that a fundamental reason why users hate changes in Operating Systems is because of configuration. They have to re-learn quite a few sets of obscure unintuitive procedures no matter how nice the UI is, they last sorted out X years ago when they bought their previous device. As for the XP, Win 7, Win 8 debate – personally I like Win 8.1, got it on my surface. Maybe because the configuration of Win 8.1 seems like a complete doddle to the kind of obscure software packages I normally have to deal with.