Detecting a Dial Plan Overlap
User Report & Context
All users at Site A (Santarem, Brazil) are experiencing issues making calls to Site B when calling from a Cisco IP Phone. When calling from Cisco Jabber, the call works fine.
Relevant Call Flow
SIP Phone --> CUCM --> H.323 GW --> E1 via R2-Digital signaling (Voice-Ports, not Serial!)
Initial Steps
- Collect a call example
Calling Party: (e.g. +559335552222)
Called Party: (e.g. +559198886666)
Date/Time: 3:45PM EST 7-04-19
Result: Your call cannot be completed as dialed.
- Review the user report for any relevant information as to what might be causing the problem.
-The report that the call fails from a Cisco IP Phone but works from Jabber for all users points me away from a configuration issue in, say, the CSS configurations. The expectation is that a user's Jabber and Phone have identical calling permissions, so it wouldn't make sense for one to work and the other to fail for all users.
-The same report makes me think of digit-by-digit vs en bloc dialing. The common expectation when dialing from a Cisco IP Phone is that a user will pick up the handset, get dial tone, then dial the digits one by one for the number they intend to reach. In Jabber we just type in the full number we want to dial and hit call. There is no prompt for dial tone and no digit-by-digit acceptance.
-Given the above, my presumption is that there is an overlap in the dial plan causing users dialing out to hit an undesired pattern (probably due to Urgent Priority), thus causing the call to fail.
- Reproduce the issue to get a current sample that will exist within UCM traces, and debug the egress gateway as well while we have the user on the phone. We are not able to reproduce the issue without the client (due to a lack of jumpbox access, or proper hardware/software provisioned to do so).
Now you might be saying “Woah, woah. Slow down. You’re making guesses at what’s causing the issue before you’ve even looked at anything! You’re a madman! You can’t do that!” Well, you’re right, and you’re wrong. I can, as long as I don’t let the guess distract me and become my focus. You would be correct that it can lead me down a rabbit hole and away from the real root cause. So this is where we back up and take a look at all the information we have.
Analysis and Review
So by now the client has responded with the call example (below), and I have searched our billing server for the specific CDR I need.
Calling Party: +559335552222
Called Party: +009198886543
Date/Time: Jul 4th, 14:36
Result: Your call cannot be completed as dialed.
From the CDR perspective I see some pretty critical data that confirms my initial thoughts.
- We see a “Cause Value = 1” (unallocated number) on the Destination side of the call (H.323 egress gateway). This tells me that the call made it through call processing on CUCM and was sent to the gateway, where either the VGW or the Carrier reported back that the number doesn’t exist.
- The called party presentation is a bit odd. The rightmost dialed digit is missing, presenting the called party as “+55919888654”. This immediately tells me that the call is being sent before all the digits we want to dial are collected. Because the truncation (or omission) was caused by CUCM, I expect the gateway debugs to show “+55919888654” rather than our expected “+559198886543” as the called party in the gateway’s signaling.
Now for the debugs to confirm my theory. During the troubleshooting call where we reproduced the issue, I enabled “debug voip ccapi inout” to take a look at call control and the calling/called party presentation. Here we can see the truncated called party, which points me back at CUCM (as expected) as the cause of the issue.
Jul 4 14:37:15: //-1/00DD7F5B1F01/CCAPI/cc_api_display_ie_subfields:
cc_api_call_setup_ind_common:
cisco-username=Elves Ray
----- ccCallInfo IE subfields -----
cisco-ani=559335552222
cisco-anitype=0
cisco-aniplan=0
cisco-anipi=0
cisco-anisi=1
dest=00919888654
cisco-desttype=0
cisco-destplan=0
cisco-rdie=FFFFFFFF
cisco-rdn=
cisco-rdntype=-1
cisco-rdnplan=-1
cisco-rdnpi=-1
cisco-rdnsi=-1
cisco-redirectreason=-1 fwd_final_type =0
final_redirectNumber =
hunt_group_timeout =0
Jul 4 14:37:15: //-1/00DD7F5B1F01/CCAPI/cc_api_call_setup_ind_common: Interface=0x41CA0D78, Call Info( Calling Number=559335552222,(Calling Name=)(TON=Unknown, NPI=Unknown, Screening=User, Passed, Presentation=Allowed), Called Number=00919888654(TON=Unknown, NPI=Unknown), Calling Translated=FALSE, Subscriber Type Str=Unknown, FinalDestinationFlag=TRUE, Incoming Dial-peer=100, Progress Indication=NULL(0), Calling IE Present=TRUE, Source Trkgrp Route Label=, Target Trkgrp Route Label=, CLID Transparent=FALSE), Call Id=3048
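When there are many calls to sift through, the called number can be pulled out of the debug programmatically. The following is a minimal Python sketch, assuming the "Calling Number=" / "Called Number=" field layout seen in the line above. The helper name extract_numbers is hypothetical and not part of any Cisco tooling.

```python
import re

# Shortened copy of the cc_api_call_setup_ind_common line from the debug above.
DEBUG_LINE = (
    "Call Info( Calling Number=559335552222,(Calling Name=)"
    "(TON=Unknown, NPI=Unknown, Screening=User, Passed, Presentation=Allowed), "
    "Called Number=00919888654(TON=Unknown, NPI=Unknown), "
    "Calling Translated=FALSE"
)


def extract_numbers(line):
    """Return (calling, called) from a CCAPI setup line; None for a missing field."""
    calling = re.search(r"Calling Number=(\+?\d+)", line)
    called = re.search(r"Called Number=(\+?\d+)", line)
    return (
        calling.group(1) if calling else None,
        called.group(1) if called else None,
    )


calling, called = extract_numbers(DEBUG_LINE)
print(f"calling={calling} called={called}")
```

Comparing the extracted called number against what the user says they dialed makes a missing trailing digit easy to spot.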
So now we pull traces from CUCM, right? Right. I load up Cisco RTMT and attempt to download the Cisco CallManager traces from the cluster. To my surprise, I’ve already exceeded the retention window and the SDL traces have been overwritten. However, I do have call log files that support what I originally believed the issue to be: an overlapping, shorter pattern.
C:\Users\kenop\Downloads\CCMTraces\Node3\2019-07-04_20-04-25\cm\trace\ccm\calllogs\calllogs_00001146.txt (4 hits)
Line 11448: 2019/07/04 14:37:15.142|CC|SETUP|55808087|55808096|user@domain.com|00919888654|00919888654
Line 11448: 2019/07/04 14:37:15.142|CC|SETUP|55808087|55808096| user@domain.com |00919888654|00919888654
Line 11449: 2019/07/04 14:37:15.146|CC|OFFERED|55808087|55808096| +559335552222 |00919888654|00919888654|SEP00D6FE047204|10.96.26.126
Line 11449: 2019/07/04 14:37:15.146|CC|OFFERED|55808087|55808096| 559335552222 |009198160656|009198160656|SEP00D6FE047204|10.96.26.126
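The call log entries above are pipe-delimited, so they are easy to check with a short script. This is a rough Python sketch: the field positions (calling party sixth, called party seventh) are inferred from the sample lines rather than from documentation, and parse_calllog and is_truncated are hypothetical helper names.

```python
import re
from typing import NamedTuple, Optional


class CallLogEntry(NamedTuple):
    timestamp: str
    event: str
    calling: str
    called: str


def parse_calllog(line: str) -> Optional[CallLogEntry]:
    """Parse one call log line; field positions are assumed from the samples."""
    # Drop the "Line NNNNN: " prefix that the text search tool adds.
    body = re.sub(r"^\s*Line \d+:\s*", "", line)
    fields = [f.strip() for f in body.split("|")]
    if len(fields) < 8 or fields[1] != "CC":
        return None
    return CallLogEntry(fields[0], fields[2], fields[5], fields[6])


def is_truncated(reported: str, logged: str) -> bool:
    """True when the logged number is a strict prefix of the reported one."""
    reported, logged = reported.lstrip("+"), logged.lstrip("+")
    return reported != logged and reported.startswith(logged)


SAMPLE = (
    "Line 11448: 2019/07/04 14:37:15.142|CC|SETUP|55808087|55808096|"
    "user@domain.com|00919888654|00919888654"
)
entry = parse_calllog(SAMPLE)
assert entry is not None
# The reported number comes from the client's call example.
print(entry.event, entry.called, is_truncated("+009198886543", entry.called))
```

A logged called party that is a strict prefix of the number the user reported dialing is the signature of a call routed before digit collection finished.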
We can see that even in the call logs the called party is shortened, missing the “3” at the end. By now the user is gone, and my attempts to reproduce with a test Jabber account only remind me that Jabber works, so testing this way will only give me a ‘good’ example, which is not as useful as I’d like. So my mind now goes to DNA (Dialed Number Analyzer). Can I get the answer I need via DNA? I sure can. I load up DNA and input my calling party, using the specific test SEP phone as the source.
DNA against Proper Called Party
Results Summary
Calling Party Information
Calling Party = +559335247867
Partition = PT_DN
Device CSS = CSS_DEVICE_SANTAREM_E1
Line CSS = CSS_LINE_FULL_ACCESS
AAR Group Name =
AAR CSS = CSS_AAR
Dialed Digits = 0091981606562
Match Result = RouteThisPattern
Matched Pattern Information
Pattern = 00.[1-9][1-9]9XXXXXXXX
Partition = PT_SANTAREM_E1
Time Schedule =
Called Party Number = 0091981606562
Time Zone = America/Sao_Paulo
End Device = RL_SANTAREM
Call Classification = OffNet
InterDigit Timeout = NO
Device Override = Disabled
Outside Dial Tone = NO
DNA against Actual Called Party
Results Summary
Calling Party Information
Calling Party = +559335247867
Partition = PT_DN
Device CSS = CSS_DEVICE_SANTAREM_E1
Line CSS = CSS_LINE_FULL_ACCESS
AAR Group Name =
AAR CSS = CSS_AAR
Dialed Digits = 009198160656
Match Result = RouteThisPattern
Matched Pattern Information
Pattern = 00.[1-9][1-9]XXXXXXXX
Partition = PT_SANTAREM_E1
Time Schedule =
Called Party Number = 009198160656
Time Zone = America/Sao_Paulo
End Device = RL_SANTAREM
Call Classification = OffNet
InterDigit Timeout = NO
Device Override = Disabled
Outside Dial Tone = NO
So there we go. DNA shows that we hit two different patterns depending on the called party, and upon inspection both have Urgent Priority enabled. Urgent Priority, in case you are not familiar with it, causes a call to be routed as soon as the dialed digits match the pattern, without waiting for further digits (provided the PT it exists in is contained within the dialing endpoint’s CSS). This is why, when dialing our 13-digit number digit by digit, CUCM stops at 12 digits and sends the call: the first 12 digits already match the shorter pattern. Jabber sends all 13 digits at once, so it matches the longer pattern.
Desired Pattern:
00.[1-9][1-9]9XXXXXXXX in PT_SANTAREM_E1, Urgent Priority ON
Actual Pattern:
00.[1-9][1-9]XXXXXXXX in PT_SANTAREM_E1, Urgent Priority ON
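The interaction between these two patterns can be reproduced with a toy model. The Python sketch below is a simplification under stated assumptions, not CUCM's actual digit analysis: it supports only literal digits, X, # and [..] ranges, ignores partitions and closest-match ranking, and uses the hypothetical names tokenize, fits and analyze.

```python
import re


def tokenize(pattern):
    """One regex per dialed position; '.' is only a separator and is dropped."""
    parts = re.findall(r"\[[^\]]+\]|[0-9X#*]", pattern.replace(".", ""))
    return [
        re.compile("[0-9]" if p == "X" else p if p.startswith("[") else re.escape(p))
        for p in parts
    ]


def fits(tokens, digits):
    """True if digits match the pattern position by position (prefix allowed)."""
    return len(digits) <= len(tokens) and all(
        t.fullmatch(d) for t, d in zip(tokens, digits)
    )


def analyze(patterns, digits, en_bloc=False):
    """Return (matched_pattern, digits_sent) or None.

    patterns maps a pattern string to its Urgent Priority flag.
    """
    compiled = [(p, tokenize(p), urgent) for p, urgent in patterns.items()]
    # En bloc: only the full string is analyzed. Otherwise one digit at a time.
    lengths = [len(digits)] if en_bloc else range(1, len(digits) + 1)
    for n in lengths:
        so_far = digits[:n]
        longer = any(len(t) > n and fits(t, so_far) for _, t, _ in compiled)
        for p, t, urgent in compiled:
            if len(t) == n and fits(t, so_far):
                # Route now if urgent, if nothing longer can match, or if dialing ended.
                if en_bloc or urgent or not longer or n == len(digits):
                    return p, so_far
    return None


PATTERNS = {"00.[1-9][1-9]9XXXXXXXX": True, "00.[1-9][1-9]XXXXXXXX": True}
DIALED = "0091981606562"  # the 13-digit number from the DNA test

print("IP phone:", analyze(PATTERNS, DIALED))
# IP phone: ('00.[1-9][1-9]XXXXXXXX', '009198160656')
print("Jabber:  ", analyze(PATTERNS, DIALED, en_bloc=True))
# Jabber:   ('00.[1-9][1-9]9XXXXXXXX', '0091981606562')
```

Dialed digit by digit, the call is routed on the shorter pattern at 12 digits and the 13th digit is never sent. Sent en bloc, the same number matches only the longer pattern.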
Now what this actually comes down to is a flaw in the dial plan design. Overlapping patterns like these should have been noted and mitigated during rollout. Additionally, we don’t tackle full-on dial plan redesigns unless it’s under a billable project. So now I’m in mitigation mode: how can I resolve the issue and restore expected dialing behavior without impacting the users too much? I came up with three ideas, though I recognize there may be many other answers, most of them likely more elegant. Here are the possibilities I noted:
- Option 1: Add a “#” to the end of the 00.[1-9][1-9]XXXXXXXX pattern. Users dialing the shorter 12-digit pattern in Santarem would have to press “#” when they are done dialing the number they want to reach, which changes the way they dial the 12-digit pattern.
- Option 2: Do not add a “#” to the end of the 00.[1-9][1-9]XXXXXXXX pattern, and remove Urgent Priority from it. Users would have to wait for the 15-second inter-digit timeout when dialing the 12-digit pattern in Santarem before their call is sent to the gateway. In my opinion this delay makes it the least desirable option.
- Option 3: Add a prefix, such as “9”, to the beginning of the 00.[1-9][1-9]XXXXXXXX pattern and leave Urgent Priority on. Users would have to dial a 9 at the beginning of the 12-digit number in order to hit the appropriate pattern. The 9 would be discarded “pre-dot”, so no changes would be required at the gateway. This, like option 1, would change the way users dial the 12-digit pattern only in Santarem.
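To compare the three options, here is a rough Python sketch of digit-by-digit collection under each variant. It is a toy model, not CUCM's real digit analysis. The exact form of the option 3 pattern (9.00[1-9][1-9]XXXXXXXX) is my assumption, and tokenize, fits and route are hypothetical names.

```python
import re


def tokenize(pattern):
    """One regex per dialed position; '.' is only a separator and is dropped."""
    parts = re.findall(r"\[[^\]]+\]|[0-9X#*]", pattern.replace(".", ""))
    return [
        re.compile("[0-9]" if p == "X" else p if p.startswith("[") else re.escape(p))
        for p in parts
    ]


def fits(tokens, digits):
    return len(digits) <= len(tokens) and all(
        t.fullmatch(d) for t, d in zip(tokens, digits)
    )


def route(patterns, digits):
    """Digit-by-digit dialing. Returns (pattern, waited_for_timeout) or None."""
    compiled = [(p, tokenize(p), urgent) for p, urgent in patterns.items()]
    for n in range(1, len(digits) + 1):
        so_far = digits[:n]
        longer = any(len(t) > n and fits(t, so_far) for _, t, _ in compiled)
        for p, t, urgent in compiled:
            if len(t) == n and fits(t, so_far):
                if urgent or not longer:
                    return p, False  # routed immediately
                if n == len(digits):
                    return p, True  # user stopped dialing; inter-digit timer expires
    return None


LONG = "00.[1-9][1-9]9XXXXXXXX"
OPTIONS = {
    "1: trailing #": ({LONG: True, "00.[1-9][1-9]XXXXXXXX#": True}, "009198160656#"),
    "2: no urgent priority": ({LONG: True, "00.[1-9][1-9]XXXXXXXX": False}, "009198160656"),
    "3: leading 9": ({LONG: True, "9.00[1-9][1-9]XXXXXXXX": True}, "9009198160656"),
}

for name, (patterns, short_dial) in OPTIONS.items():
    # First result: the 13-digit number. Second: how the 12-digit number is now dialed.
    print(name, route(patterns, "0091981606562"), route(patterns, short_dial))
```

In this model all three variants let the 13-digit number reach the longer pattern. Only option 2 makes callers of the 12-digit number wait for the timeout, and options 1 and 3 change what those callers have to key in.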
At this juncture we are waiting on feedback from the site on how they would like to proceed, as even if “Option X” might be the best from my perspective, the users are the ones who ultimately have to deal with the change to their dialing habits, therefore we give the option to the site and provide our own input for context.
So here’s a few questions for y’all: In what ways would you tackle this dial plan overlap? Is there a more elegant solution that I’m missing? Would you recommend redesigning the dial plan over providing a band-aid solution? Would you have thought to look elsewhere or gather any additional information?
When troubleshooting, I like to think about what is most likely causing an issue before I dive in. This gives me a framework to work from and a checklist of things to review. It also provides context for the signaling and configuration, so I can say “this signaling supports idea B, but not A and C” and narrow down the causes from there. I don’t always work this way, but I rely on it especially when I don’t have a full set of traces to tell me what happened.
If you enjoyed this, maybe learned something, or even sympathize with the random issues of the Day 2 side of life, consider following my blog or tweeting at me on Twitter (@kperryuc)!