SQL Azure – TSQL User Defined Function – Pull out money from NVARCHAR(MAX)

I think this one is pretty dirty but here goes.

Continuing the theme of identifying substrings in NVARCHAR(MAX) field here is a set of Functions that can be used identify and separate out money in an email body which has been uploaded into a NVARCHAR(MAX) field. looks for the pattern of 2 numbers after a dot and certain numerals before a dot and steps down from hundreds of thousands of pounds to tens of pounds. Note the order is important of the final query as each shorter pattern is a subset the next longer pattern.

First important to strip out the commas in the body as this will skip any currencies which have been formatted with commas.

UPDATE T032email
SET body = REPLACE(body,',','');

WARNING It should be noted that this is very unoptimised code and even on small sets combining the functions in a nested ISNULL SQL really takes time to run.

Firstly a function that looks for the pattern 100000.00 ie 999k to 100k.

CREATE FUNCTION [dbo].[ReturnMoneyDecimal6](@fieldtosearch as nvarchar(max))
 Returns varchar(9)
 declare @length as int
 declare @loop as int
 declare @retemoney6 as decimal(9,2)
 set @loop =1
 set @length = len(@fieldtosearch)
 while @loop<=@length-8
    if exists(select 1 where substring(@fieldtosearch,@loop,9) like '[0-9][0-9][0-9][0-9][0-9][0-9][.][0-9][0-9]')
        set @retemoney6 = substring(@fieldtosearch,@loop,9)
    set @loop=@loop+1
Return @retemoney6

Next a function that looks for the pattern 10000.00 ie 99k to 10k.

CREATE function [dbo].[ReturnMoneyDecimal5](@fieldtosearch as nvarchar(max))
 Returns varchar(8)
 declare @length as int
 declare @loop as int
 declare @retemoney5 as decimal(8,2)
 set @loop =1
 set @length = len(@fieldtosearch)
 while @loop<=@length-7
    if exists(select 1 where substring(@fieldtosearch,@loop,8) like '[0-9][0-9][0-9][0-9][0-9][.][0-9][0-9]')
        set @retemoney5 = substring(@fieldtosearch,@loop,8)
    set @loop=@loop+1
Return @retemoney5

Next a function that looks for the pattern 1000.00 ie 9k to 1k.

CREATE function [dbo].[ReturnMoneyDecimal4](@fieldtosearch as nvarchar(max))
 Returns varchar(7)
 declare @length as int
 declare @loop as int
 declare @retemoney4 as decimal(7,2)
 set @loop =1
 set @length = len(@fieldtosearch)
 while @loop<=@length-6
    if exists(select 1 where substring(@fieldtosearch,@loop,7) like '[0-9][0-9][0-9][0-9][.][0-9][0-9]')
        set @retemoney4 = substring(@fieldtosearch,@loop,7)
    set @loop=@loop+1
Return @retemoney4

Next a function that looks for the pattern 100.00 ie 999 to 100.

CREATE function [dbo].[ReturnMoneyDecimal3](@fieldtosearch as nvarchar(max))
 Returns varchar(6)
 declare @length as int
 declare @loop as int
 declare @retemoney3 as decimal(6,2)
 set @loop =1
 set @length = len(@fieldtosearch)
 while @loop<=@length-5
    if exists(select 1 where substring(@fieldtosearch,@loop,6) like '[0-9][0-9][0-9][.][0-9][0-9]')
        set @retemoney3 = substring(@fieldtosearch,@loop,6)
    set @loop=@loop+1
Return @retemoney3

Lastly a function that looks for the pattern 10.00 ie 99 to 10.

CREATE function [dbo].[ReturnMoneyDecimal2](@fieldtosearch as nvarchar(max))
 Returns varchar(5)
 declare @length as int
 declare @loop as int
 declare @retemoney2 as decimal(5,2)
 set @loop =1
 set @length = len(@fieldtosearch)
 while @loop<=@length-4
    if exists(select 1 where substring(@fieldtosearch,@loop,5) like '[0-9][0-9][.][0-9][0-9]')
        set @retemoney2 = substring(@fieldtosearch,@loop,5)
    set @loop=@loop+1
Return @retemoney2

And then here is the quite horribly inefficient query that can be used to run all functions and seems to do the job.

You know your searchstrings better than me but any text with multiple decimal substrings will be an issue!!!

Which is still a few million times quicker than any person!

SELECT dbo.T032email.pkid,
			ISNull(dbo.ReturnMoneyDecimal3(dbo.T032email.body), dbo.ReturnMoneyDecimal2(dbo.T032email.body)
				)))) as money4 from T032email;

SQL Azure – TSQL User Defined Function – Cleaning a Field of Specified Characters

Stack overflow sourced, adapted and personally tested code
How to strip all non-alphabetic characters from string in SQL Server

In an earlier post I was having issues as some email addresses I was pulling out from a field were captioned in <> using the following will remove those characters prior to identifying string patterns..

Remember this pulls them out and then compacts the resulting string this may or many not be what you are looking for. I have adapted from the Stack Overflow discussion to include characters I don’t want rid of.

CREATE Function [dbo].[RemoveNonAlphaCharacters](@Temp nvarchar(max))
Returns nvarchar(max)

    Declare @KeepValues as nvarchar(50)
    Set @KeepValues = '%[^a-z0-9/@ £$+=?.\!]%'
    While PatIndex(@KeepValues, @Temp) > 0
        Set @Temp = Stuff(@Temp, PatIndex(@KeepValues, @Temp), 1, '')

    Return @Temp

And running the example code we get..

SELECT dbo.RemoveNonAlphaCharacters('abcdefghijklmnopqrstuvwxyz1234567890 !"£$()_+=-{}[]:@~;\|<>?./') as txtCorrected

We get

WARNING Please note ^ % & and * ‘ ” – are reserved TSQL characters and including them in the function appears to break the logic that I wish to see in the pattern replacement.

SQL Azure – TSQL User Defined Function – Separate multiple emails from NVARCHAR(MAX) field

Stack overflow sourced, adapted and personally tested code
Extract email address from string using tsql

A continuation of working with strings in TSQL specifically linked to emails.

Firstly create the following Function

CREATE FUNCTION [dbo].[fnFindPatternLocation]
    @string NVARCHAR(MAX),
    @term   NVARCHAR(MAX)
        SELECT pos = Number - LEN(@term) 
        FROM (SELECT Number, Item = LTRIM(RTRIM(SUBSTRING(@string, Number, 
        CHARINDEX(@term, @string + @term, Number) - Number)))
        FROM (SELECT ROW_NUMBER() OVER (ORDER BY [object_id])
        FROM sys.all_objects) AS n(Number)
        WHERE Number > 1 AND Number <= CONVERT(INT, LEN(@string))
        AND SUBSTRING(@term + @string, Number, LEN(@term)) = @term
    ) AS y);

Then create a View of what you are interested in as follows.. Note here I am taking out the carriage return as my subsequent query doesn’t like them and in emails they frequently exist.

CREATE VIEW [dbo].[v001] as SELECT pkid, REPLACE(body, CHAR(13) + CHAR(10),' ') as body1 from t001email

Then run the newly created View through a query.

SELECT pkid, body1, pos, SUBSTRING(body,beginningOfEmail,endOfEmail-beginningOfEmail) AS email
FROM v001
CROSS APPLY (SELECT pos FROM dbo.fnFindPatternLocation(body1, '@')) AS A(pos)
CROSS APPLY (SELECT CHARINDEX(' ',body1 + ' ', pos)) AS B(endOfEmail)
CROSS APPLY (SELECT pos - CHARINDEX(' ', REVERSE(SUBSTRING(body, 1, pos))) + 2) AS C(beginningOfEmail)

Couple of things here
Multiple emails will be picked out and placed as separate records so if there a string that reads

This is a sentence with two emials first@gmail.com and a second second@gmail.com

it will return

If an email starts the field then this will NOT work after finding the @ symbol it will count forward and fail to find a space and so set space before to Null it will then return just the domain of the email. I will be looking to fix this at some point.

Secondly if the emails within the field contain contiguous special html characters such as < or > these will be picked up and inculded as if they are part of the email addresses.

We can fix this by scanning through the varchar(max) field and stripping out special characters.

NOTE : If you are working with email bodies carriage returns will also screw up the above query in which case consider running the field through some kind of replace view with similar syntax as

CREATE VIEW v002 as SELECT pkid, REPLACE(body, CHAR(13) + CHAR(10),' ') as txtBodyWithoutReturns from t001email

SQL Azure – TSQL User Defined Function – Separate Defined Length String from NVARCHAR(MAX) Field

Stack overflow sourced, adapted and personally tested code
How to get part of a string that matches with a regular expression

The following function arose out of a desire to find Eplanning Scotland planning references numbers which follow the format of 9 numerals a dash and then 3 numerals within a NVARCHAR(MAX) field type. The characteristics of the string was that it is always the same length.

In SSMS select the New Query button

Create function [dbo].[ReturnEplanningNumberFull](@fieldtosearch as nvarchar(max))
 Returns nvarchar(13)
 declare @length as int 
 declare @loop as int
 declare @reteplan as varchar(13)
 set @loop =1
 set @length = len(@fieldtosearch)

 while @loop<=@length-12
    if exists(select 1 where substring(@fieldtosearch,@loop,13) like '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][-][0-9][0-9][0-9]')
        set @reteplan = substring(@fieldtosearch,@loop,13)
    set @loop=@loop+1
Return @reteplan

And then to create a View using this you can write something like ;

CREATE VIEW v026eplanrefs AS 
dbo.ReturnEplanningNumberFull(dbo.THETABLE.FIELDCONTAINSREFERENCE) as eplanno 

I subsequently altered this to identify the first 9 digits as this is sufficent to uniquely identify eplanning records.

CREATE function [dbo].[ReturnEplanningNumberShort](@fieldtosearch as nvarchar(max))
 Returns nvarchar(9)
 declare @length as int 
 declare @loop as int
 declare @reteplanshort as nvarchar(9)
 set @loop =1
 set @length = len(@fieldtosearch)

 while @loop<=@length-8
    if exists(select 1 where substring(@fieldtosearch,@loop,9) like '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]')
        set @reteplanshort = substring(@fieldtosearch,@loop,9)
    set @loop=@loop+1
Return @reteplanshort

Postgres Command Line : psql : Create a switch to determine whether an Arc has a clockwise or anti-clockwise direction

This is a post focused around Network distance measurement using POSTGIS and routing plugins.. This discusses the use of Ordnance survey Road Network information and its preparation in using it for routing.

The Ordnance Survey open source road network layer includes roundabouts that have an attribute value of roundabout. Great but looking at them closely some of the constituent arcs are digitised in a clockwise direction while others are digitised in an anti-clockwise direction. When using dijkstra routing with weighting to prevent incorrect pathing it is necessary to ensure that networks are weighted in the correct fashion. Directional weighting only works if you know the direction and you weight appropriately. For use with directional routing in the UK roundabouts any directional weighting should prevent travel in anticlockwise direction. ST_reverse will correct incorrect direction BUT the Ordnance survey layer seems to have no attribute that consistently indicates whether an arc on a roundabout has or has not been digitised correctly. Marking lines with direction and viewing them we see the random nature of many arcs on roundabouts.

Here is Sheriff Hall Roundabout on Edinburgh City Bypass

Here is Straiton roundabout just north of the bypass

and finally Hermiston Gate roundabout again following the theme on Edinburgh city bypass

It got me thinking was there a way to determine whether arcs on roundabouts could be determined to be clockwise or anti-clockwise?

Having thought about it in my head quite a bit I determined that it probably would be possible if we knew three points and could create some kind of virtual graph with the start point being at 6 and a finish at 12 with mid points at 9 if travelling in a clockwise position and 3 if travelling in an anti-clockwise position.

I had a look around and the following post seemed to tally with the idea of three points and positive and negative relating to clockwise or anticlockwise.

Maths to determine direction of an arc

Having looked at this I set about working through the problem in Excel to see if I could get consistent results.

Firstly I created a set of directions North West South and East and placed some coordinates that could be used in calculations.

I then went forward and tested whether I could identify the direction of various arcs from these coordinates in excel using the formula identified on Stack Exchange.

Here I replaced a,b,c with b Beginning, m Middle and f Finish

And I decided to work through manually in excel to ensure that I had the linear algebra correct.

and further testing figures

So firstly I create a separate table that just shows the roundabouts

CREATE TABLE t001roundaboutsntroadlink as select id pkidt001, st_linemerge(geom) as geom from ntroadlink where formofway = 'Roundabout';

In the above I use st_linemerge to ensure that all geometry is linestring as this is necessary to use the st_startpoint and st_endpoint postgis functions.

Next I added the the required variables from stack overflow algebra to the line table

ALTER TABLE t001roundaboutsntroadlink add column bx float(8),
Add column by float(8),
Add column mx float(8),
Add column my float(8),
Add column fx float(8),
Add column fy float(8),
Add column ux float(8),
Add column uy float(8),
Add column vx float(8),
Add column vy float(8),
Add column uxtimesvy float(8),
Add column uytimesvx float(8),
Add column uxv float(8);

Next I needed to identify a b beginning, m middle and f finish point for each line that I wanted to test.

b points (beginning)

CREATE TABLE t002bpoints AS SELECT pkidt001 as pkidt001,st_startpoint(geom) as geom, st_x(st_startpoint(geom)) as bx, st_y(st_startpoint(geom)) as by from t001roundaboutsntroadlink;

m points (middle)

CREATE TABLE t002mpoints AS SELECT pkidt001 as pkidt001,st_lineinterpolatepoint(geom,0.5) as geom, st_x(st_lineinterpolatepoint(geom,0.5)) as mx, st_y(st_lineinterpolatepoint(geom,0.5)) as my from t001roundaboutsntroadlink;

f points (finish)

CREATE TABLE t002fpoints AS SELECT pkidt001 as pkidt001,st_endpoint(geom) as geom, st_x(st_endpoint(geom)) as fx, st_y(st_endpoint(geom)) as fy from t001roundaboutsntroadlink;

It was then a case of simple update queries to complete the table

update t001roundaboutsntroadlink set bx = st_x(st_startpoint(geom));
update t001roundaboutsntroadlink set by = st_y(st_startpoint(geom));
update t001roundaboutsntroadlink set mx = st_x(st_lineinterpolatepoint(geom,0.5));
update t001roundaboutsntroadlink set my = st_y(st_lineinterpolatepoint(geom,0.5));
update t001roundaboutsntroadlink set fx = st_x(st_endpoint(geom));
update t001roundaboutsntroadlink set fy = st_y(st_endpoint(geom));
update t001roundaboutsntroadlink set ux=mx-bx;
update t001roundaboutsntroadlink set uy=my-by;
update t001roundaboutsntroadlink set vx=fx-mx;
update t001roundaboutsntroadlink set vy=fy-my;
update t001roundaboutsntroadlink set uxtimesvy = ux*vy;
update t001roundaboutsntroadlink set uytimesvx= uy*vx;
update t001roundaboutsntroadlink set uxv = uxtimesvy-uytimesvx;

Labelling up the roundabouts Hermiston Gate now looks like

And Sheriff Hall Roundabout now looks like this

Compared with a correctly directed roundabout

CREATE TABLE t001roundaboutsntroadlinkcorrected AS TABLE t001roundaboutsntroadlink;

And now correct the items display as previous and see what we see.

UPDATE t001roundaboutsntroadlinkcorrected set geom = st_reverse(geom) where uxv > 0;

Sheriff hall roundabout now

and some proof that reasonable number of lines were updated.

Which is an indication that all roundabouts arcs have been corrected properly

But a zero uxv value indicates a straight line.

It should however be possible to match starts with finishes for overlying points and where a line has 0 value of uxv and its ends and finishes are not matched with adjacent opposites create a switch to reverse the direction of all lines that are incorrect compared to their neighbours thus only correcting incorrect directions. Haven’t done that in this case.