Complex Bandwidth/Latency Issue

Discussion in 'General Networking Discussion' started by rpurinton, Feb 20, 2016.

    Hi all.

    I have an interesting situation and am seeking feedback/recommendations from networking professionals. I apologize beforehand for the lengthy post, but I feel it's necessary to explain all the factors in play.

    I work with a company that does video storage. We handle files that are often 10 to 20 GB in size. We recently did a platform upgrade/migration.

    Pre-migration:

    We had a colo space in New York City with a 1 Gbit internet connection, shared amongst all the servers in the colo.

    Post-migration:

    We moved our systems to OVH Hosting (in Montreal) on dedicated servers that each have Dual 10 GbE connections to the internet. We have 4 XenServer hosts there. OVH does not limit inbound traffic, but they limit outbound traffic to 500Mbps per server (though this is upgradable to 1, 2, or 3 Gbps by paying an extra fee).

    The App:

    We are using a JavaScript tool called FineUploader. We have it configured to break up a large file into 50 MB chunks and upload 3 chunks at a time, the idea being to improve speed by having 3 concurrent TCP streams instead of just 1.
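
    For reference, the relevant uploader config looks something like this (simplified sketch; the endpoint path is a placeholder for our real CIN upload URL):

        var uploader = new qq.FineUploader({
            element: document.getElementById('uploader'),
            request: {
                endpoint: '/upload' // placeholder for the real CIN endpoint
            },
            chunking: {
                enabled: true,
                partSize: 50 * 1024 * 1024, // 50 MB chunks
                concurrent: {
                    enabled: true // allow multiple chunks in flight at once
                }
            },
            maxConnections: 3 // 3 concurrent TCP streams per upload
        });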

    "Content Ingest Nodes"

    We use geographically dispersed servers to help reduce the latency from our clients when uploading. We have these ingest nodes in Los Angeles, Seattle, Atlanta, and New York City. The ingest nodes are currently running on VPS servers with 1 Gbit connections.

    Geolocation:

    We found traditional GeoIP databases weren't working well for us, so we implemented a system in which, when a user requests an upload or download, we trigger a traceroute from all of the Content Ingest Nodes simultaneously and check the results to determine which node has the lowest latency to the user. This system of geolocation seems to work well and succeeds in routing users to the node with the lowest latency.
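
    Conceptually the selection logic is just parallel probes and a min(); something like this Node sketch (the hostnames and the /rtt endpoint are placeholders for the real setup, where each CIN runs the traceroute and reports the measured RTT):

        const CIN_NODES = [
            'https://cin-lax.example.com',
            'https://cin-sea.example.com',
            'https://cin-atl.example.com',
            'https://cin-nyc.example.com',
        ];

        async function pickNode(clientIp) {
            // Fire a probe at every ingest node in parallel.
            const probes = CIN_NODES.map(async (node) => {
                const res = await fetch(node + '/rtt?ip=' + encodeURIComponent(clientIp));
                const body = await res.json();
                return { node: node, rttMs: body.rttMs };
            });
            // Waiting for all of them is the ~3 second stall before transfers start.
            const results = await Promise.allSettled(probes);
            const ok = results
                .filter((r) => r.status === 'fulfilled')
                .map((r) => r.value);
            ok.sort((a, b) => a.rttMs - b.rttMs);
            return ok.length ? ok[0].node : null; // lowest-latency node wins
        }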

    Uploads versus downloads:

    When a user is uploading, we temporarily store the received 50 MB chunks on the content ingest node. As each chunk is received, we add it to a queue; another server at the head end in Montreal retrieves the queued chunks over the UDT protocol (a UDP-based protocol for sending files quickly across a WAN) and appends each chunk to the file being uploaded.
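
    The head-end side boils down to "pull the next chunk, append, repeat." A simplified Node sketch (the UDT transfer is stubbed out as plain HTTPS here, since that piece depends on whatever UDT client binding is in use):

        const fs = require('fs');

        // Stand-in for the real UDT transfer; production pulls over UDT.
        async function fetchChunk(cinNode, chunkName) {
            const res = await fetch('https://' + cinNode + '/queue/' + chunkName);
            return Buffer.from(await res.arrayBuffer());
        }

        // Drain one upload's queue, appending chunks strictly in order.
        async function drainQueue(upload) {
            for (let i = 0; i < upload.totalChunks; i++) {
                const chunk = await fetchChunk(upload.cinNode, upload.id + '.part' + i);
                fs.appendFileSync(upload.targetPath, chunk);
            }
        }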

    When a user is downloading, the remote node requests that file from Montreal over the UDT protocol and then passes it along to the client (who is connected over HTTPS). This is more like a relay than a CDN because we are not storing a mirror of the files locally on the node.
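
    Stripped down, the download path is a pure pass-through pipe, basically this (HTTPS standing in for the Montreal UDT leg, and TLS on the client side omitted for brevity):

        const http = require('http');
        const https = require('https');

        http.createServer((req, res) => {
            // Fetch from the head end and stream straight back out; nothing
            // is cached locally, so every byte crosses the WAN twice.
            https.get('https://headend.example.com/files' + req.url, (upstream) => {
                res.writeHead(upstream.statusCode, upstream.headers);
                upstream.pipe(res);
            });
        }).listen(8080);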

    In the past:

    The CIN nodes made perfect sense for uploads. We only had a 1 Gbit connection to the old colo, so if multiple users were uploading simultaneously, the uploads were buffered at the CIN nodes and each user saw their upload complete quickly regardless of congestion at the head end; their chunks were queued until transferred and appended successfully. Also, with bandwidth being equal at the CIN nodes and the head end, latency was the next best determining factor for where to point the user.

    In the present:

    I'm not sure these CIN nodes are creating any improvement for us anymore, now that we have 10 Gbit connections at the head end.

    Problem:

    I don't have a good way to prove or disprove this!

    I am a network engineer with 15+ years of experience, and my gut tells me that keeping those CIN nodes in the mix is detrimental to performance. The CEO, however, believes the CIN nodes are necessary to reduce latency to the end user, and that they are a key selling point that differentiates us in the marketplace.

    There's only 1 scenario where I am sure they are detrimental:

    If a user has more than 1 Gbit of upload speed. The CIN nodes only have 1 Gbit of bandwidth, so that caps the maximum attainable upload speed, whereas if the user were uploading directly to Montreal their attainable speed could be as high as 10 Gbit.

    And there's only 1 scenario where I am sure they are beneficial:

    If a user takes a bad route to get to Montreal, but has a good route to a local CIN node, and assuming the route from Montreal to the CIN node is good, then the user will have a better experience.

    Outside of that I'm not sure how to prove my point to the CEO that the CIN nodes are 1) not helping, 2) possibly causing poorer performance, 3) an unnecessary cost.

    Here's the last of my thoughts:
    * The impact of latency should already be minimized because we are uploading with 3 simultaneous TCP streams, not just 1. This should pretty much make latency irrelevant for uploads, right? (See the back-of-the-envelope sketch after this list.)
    * The CIN nodes are on shared bandwidth, so if another client on the same VPS host is sucking up all the bandwidth, we will definitely see much lower performance.
    * We are probably causing performance issues for other customers on the VPS hosts, because during transfers we blast 500 to 1000 Mbps of UDP traffic. Our VPS host has already cut us off multiple times, believing it was a DDoS attack after seeing so much UDP, each time breaking users' uploads/downloads and hurting their quality of experience.
    * Upgrading the CINs to 1 Gbit of dedicated bandwidth would force us onto dedicated servers, which are 10 times as expensive as the VPS servers.
    * Upgrading the CINs to 10 Gbit of dedicated bandwidth is way outside the budget.
    * The VPS hosts all cap monthly transfer, whereas our hosting at OVH is unmetered 10 Gbit.
    * The downloading seems silly to me. We have to request the file over UDP only to spit it right back out over HTTPS. I'm sure this is inefficient, but I'm unsure how to prove it.
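
    On that first bullet, the back-of-the-envelope I keep coming back to: per-stream TCP throughput is bounded by window size over RTT, so 3 streams only triple the bound rather than removing it. A quick sketch:

        // Per-stream TCP throughput is bounded by window / RTT;
        // parallel streams multiply that bound but don't remove it.
        function maxThroughputMbps(windowBytes, rttMs, streams) {
            const bitsPerSec = (windowBytes * 8) / (rttMs / 1000);
            return (streams * bitsPerSec) / 1e6;
        }

        // e.g. a 64 KB window (no window scaling) at 80 ms RTT:
        console.log(maxThroughputMbps(65536, 80, 1)); // ~6.6 Mbps per stream
        console.log(maxThroughputMbps(65536, 80, 3)); // ~19.7 Mbps with 3 streams

    With window scaling, modern stacks can grow the window far beyond 64 KB, so the bound is much looser in practice, but latency never drops out of the equation entirely.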

    Personally, I'd like to get rid of all the CIN nodes and use OVH servers only. It reduces costs and would significantly reduce the complexity of shuttling the data around over UDP; we could simply write the uploaded chunks directly to the file server. We would also reduce certain risk factors, such as a VPS going offline or having its performance impacted by another VPS customer. We could do away with the traceroute system (which adds about a 3-second delay before uploads/downloads start). It would also make my job much easier, managing just the core VMs at OVH instead of an extra 20 servers worldwide across multiple vendors.

    I've already floated the idea by the CEO, but he currently doesn't agree. Those CIN nodes have been a major selling point for him in the past (even though their positive effects are unproven); customers like to hear you have servers close by. He also still believes that latency is such a huge factor in the equation that not having CIN nodes would cause uploads/downloads to slow down.

    We don't have a good way at the moment to prove via hard data that one approach is better than the other. I'm sure that if I could show hard evidence that uploads and downloads go faster without the CIN nodes, the CEO would probably let them go. His main reason for keeping them is his belief that they improve the performance of user uploads.
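
    The most direct test I can think of would be timing the same 50 MB chunk against a CIN node and against Montreal directly, from a handful of client locations. Roughly (Node sketch; both URLs are placeholders for the real upload endpoints):

        const fs = require('fs');

        async function timeUpload(url, body) {
            const start = Date.now();
            await fetch(url, { method: 'POST', body: body });
            return (Date.now() - start) / 1000; // seconds
        }

        async function main() {
            const chunk = fs.readFileSync('test-50mb.bin');
            const endpoints = [
                'https://cin-nyc.example.com/upload',  // via ingest node
                'https://headend.example.com/upload',  // direct to Montreal
            ];
            for (const url of endpoints) {
                const secs = await timeUpload(url, chunk);
                const mbps = (chunk.length * 8) / secs / 1e6;
                console.log(url + ': ' + mbps.toFixed(1) + ' Mbps');
            }
        }

        main();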

    In the absence of hard evidence, and since I'm the only network engineer on staff and the CEO isn't inclined to believe me yet, I hoped this forum might provide enough feedback to reinforce my thoughts.

    So whether or not you have anything technical to add, it would be nice to see some replies on whether you think we should keep or destroy the CIN nodes. The more replies the better!

    Thank you!
    Russ Purinton, FCNSP
     
