Multimedia
Sink or Swim: Making Data lakes a Central Pillar of Your C/ETRM
Discover the importance of modern data management, data interoperability, and security with leading energy and commodity trading experts.
September 22nd, 2023 | 44:11
Summary Keywords: Energy trading, commodity trading, data management, data interoperability, ETRM systems, ETRM/CTRM, data security, C/ETRM, AI, machine learning, energy transition, risk technology, risk systems
Transcript
Ben Hillary - Managing Director, Commodities People
Paul Kaisharis - SVP of Engineering, Molecule
Alex Whittaker - General Manager, Bonroy Petchem
Tim Kramer - Founder and CEO, CNIC Funds
Ryan Rogers - Principal, ENITE
Kari Foster - VP of Marketing, Molecule
Ben Hillary Well, hello, everyone, and welcome to today's webinar, Sink or Swim: Making Data Lakes a Central Pillar of Your C/ETRM.
My name is Ben Hillary, Managing Director of Commodities People, and I would really just like to say a huge thank you to everyone for being here with us. Really delighted to see how much this webinar has attracted the interest of the industry, with over 400 registrants from all corners of the globe and all parts of the commodities and data ecosystem.
In the next 60 minutes, we'll be deep-diving into the latest best practices in data management. We'll be exploring how data lakes can work in partnership with your C/ETRM, allowing real-time answers to really complex data queries with the goal of providing absolute trading advantage.
Recent years have seen really, really incredible advances in advanced analytics, AI, machine learning (ML), and many other forms of interpreting and actioning data. However, all this is virtually useless without a strong and effective data management strategy in place - firmly integrated and understood throughout the organization. This is what we aim to shine a light on and provide best practices for today.
We've got a truly expert speaker panel lined up to whom I'm very, very grateful for their time and input. Some of the subjects we'll be covering today include the importance of data interoperability in a shifting energy market, approaches to managing and analyzing unstructured versus structured data, best practices for optimizing data accessibility across your trading organization, and the key role your E/CTRM provider plays within your data ecosystem.
The webinar will take the format of a panel discussion, followed by Q+A. So, on
that note, throughout the webinar, please be posting your questions in the Q+A box and upvoting others of interest. Also, do make full use of the chat channel for any comments you want to share with the panel and the audience or even just to say hello and introduce yourself.
I am now delighted to pass over to Kari Foster, VP of Marketing for Molecule. Kari, the
floor is yours.
Kari Foster Thank you so much, Ben, and it's great to be here. As Ben mentioned, I'm Kari Foster. I'm the VP of Marketing at Molecule, which is the modern ETRM/CTRM platform. An ETRM/CTRM, if you're not familiar, is an energy trading or commodity trading risk management platform.
I'm really excited to be here today, introducing just a stellar panel of experts who are representing trading, risk management, and technology. And, this is really such an important topic for anyone within the trading organization who depends on data to do their jobs basically. Don't we all? And, a data management strategy is so much more than just the technology you have in place. And certainly, that's going to be covered today - but, it's how the data is structured, how it's accessed and consumed, the ways that it can be analyzed. And, that all starts with a strategy that has the end in mind.
So, without further ado, I'd love to get this really important discussion going by introducing today's panelists. First, Paul Kaisharis is my colleague, the Senior VP of Engineering here at Molecule. Tim Kramer is the Founder and CEO of CNIC Funds, and he actually used Molecule to model the prototype for their U.S. Carbon Neutral Power Futures Index ETF. Ryan Rogers is Principal at ENITE, which is a management consulting firm delivering strategic solutions to energy utilities and manufacturing. And, Alex Whittaker is General Manager at global energy trading and supply company, Bonroy Petchem.
And, I'll pass things back over to you, Ben.
Ben Hillary Excellent. Thank you, Kari. Right, drumroll. We will kick off right away with a poll for the audience, so I am launching the first poll now. And, hopefully everyone can see this.
So, the question is - and it's a single choice - what is the biggest challenge with getting better insights from your trading data? Is it data quality, data management tools, data analytics tools, overall data strategy, internal skills or knowledge, unsure, or not applicable? So, if everyone can ponder that one, and I'll end the poll in about 10 seconds.
Excellent, okay. I am ending the poll now. Okay, so interesting results. Alex, Tim, how do these results line up with your own experiences? And, from your perspectives, how do they align with the main challenges you face?
Alex Whittaker Yeah, I mean I think I can relate to those answers, like the data strategy and data quality being the main problems, but I'd say the situation is a bit of everything.
So, my experience with this is just how fragmented all the different data sources are, like the volume of data. All the different sources, all the different licenses that I need. How do I get all of my prices together, even for what is a relatively simple trading book?
So, yeah. That's what I take from this - my own struggles at the start of getting everything set up for Bonroy, I think, come through in that: data quality and data strategy. And yeah, just thinking, how do I get just basic prices in? I mean, let alone more complicated prices. That's what concerns me at the moment - overall data quality.
Tim Kramer So, for what we're seeing, the overall quality and management haven't really been a problem because we're using exchange-traded prices. And so, those things have, like, an auto-scrape, and that just hasn't been an issue.
And, the skills and knowledge - the people that we're seeing right now that are working for us, plus the people that we interface with, just amazing. You know, the younger generation and their math skills, stat skills, or, kind of, what they know and how computers iterate. It's just stunning.
But, the part that we see that's a little bit challenging still is the analytics on this. There's, you know, different math techniques and different things that people want to look at. You know, different people have, like, a different version of how they want to do a Sharpe ratio, things like that.
And then, when people try to use the data to get some more insights out of it and actually make something useful with it to try to get an edge, that's where there's just so much more information that you can tease out of the data, like cross-correlation and co-integration and things like that.
So, trying to isolate the individual components once you have the data, that's kind of been what we see as the biggest challenge.
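For readers who want to make the analytics Tim describes more concrete, here is a minimal sketch of an annualized Sharpe ratio and a lead/lag cross-correlation computed in Python with pandas. The price series, risk-free rate, and annualization convention are hypothetical and purely for illustration - as Tim notes, every shop has its own version of these calculations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
idx = pd.bdate_range("2023-01-02", periods=250)

# Hypothetical daily settlement prices for two related contracts
power = pd.Series(50 + rng.normal(0, 1.0, len(idx)).cumsum(), index=idx)
carbon = pd.Series(30 + rng.normal(0, 0.5, len(idx)).cumsum(), index=idx)

returns = pd.DataFrame({"power": power, "carbon": carbon}).pct_change().dropna()

# Annualized Sharpe ratio (one common convention; teams differ on the details)
risk_free_daily = 0.03 / 252
excess = returns["power"] - risk_free_daily
sharpe = np.sqrt(252) * excess.mean() / excess.std()

# Cross-correlation of power returns against lagged carbon returns
xcorr = {lag: returns["power"].corr(returns["carbon"].shift(lag)) for lag in range(-5, 6)}

print(f"Sharpe (power): {sharpe:.2f}")
print(pd.Series(xcorr).round(3))
```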
Ben Hillary Excellent, thank you.
Well, next question. The trading landscape itself has changed immensely in recent years. We've got factors like the energy transition, the rise of carbon markets, general shifts in technology. What has been the impact on business needs from a data perspective? Ryan, if we could start with you on that question, then we'll go to Tim, Paul, and Alex.
Ryan Rogers Certainly. So, the traditional highly structured data remains critical. You know, things like market risk management, credit risk management, compliance.
Where I think we've seen new needs and new analytical capabilities needed is in some of the renewables markets. There's a lot of much larger networks for the Internet of Things (IoT). A lot more smart grids and sensors bringing in semi-structured data. Things that are in JSON or XML.
So, there's more of a struggle to deal with that semi-structured data in the traditional data warehouses. And then, I think the emerging markets - carbon credits, environmental credits. Some of those markets, where they are on exchange-traded platforms and have structured prices, it's great. But, some of them are auctioned.
There's, you know, infrequent data points to pull in. So, I think some of those emerging markets and the need for just monitoring news feeds, market sentiment for some of those emerging markets would benefit from natural language processing (NLP), machine learning (ML), and some of those emerging capabilities.
Ben Hillary And, Tim?
Tim Kramer So, what we've kind of seen for how the landscape has changed has been sourcing and documentation of the data. So, I mean the data comes in. It's good, but everyone says, "Where'd you get that? Where did that come from? Well, what's the web link to that? Well, how can I verify that?"
So, there's been a big push - and, again, because we're registered with the SEC for what we do - there's a big push on the actual data sourcing and the documentation. So, for auditing, for, like, SOC, et cetera. And, as Ryan said, when you're taking a look at carbon, people want to be able to kind of verify things all the way back to the source to say, "Okay, does
this qualify for SFDR, article six, seven, eight, nine..." whatever it would be.
So, that would be the thing that we're seeing - the documentation and the actual sourcing. It goes all the way down, so someone can look at the actual links and verify them.
Ben Hillary Excellent. Paul, your thoughts.
Paul Kaisharis Yeah, so I guess what I would add, we've kind of touched on this already, about the volume and the variety of sources of data, and just the volume of that data has increased tremendously, you know, over time. And, it's really a big challenge to, you know, manage this volume and make sense of it, and add value to the business.
So, there's so much data coming from all different sources, and Tim touched on it - just having that connection of where that data has come from and the relevance of it. You know, I would say the technology has matured quite a bit. And, there is a question, for businesses, of how to use that technology, how to manage that volume of data, and how to make appropriate use of that data.
I think that's an impact on the business - to decide how to best, you know, utilize and leverage that. And then, of course, there's the larger question of, you know, AI - "the evil AI" - and what do we do with machine learning (ML) and large language models? You know, that's a real question businesses need to ask themselves. And, if their competitors are going to be using that technology, is that going to leave them behind?
So, I think there's some big questions around that for businesses to answer.
Ben Hillary Alex, your thoughts.
Alex Whittaker Yeah. To echo what Paul just said, that's really the sort of impact we've seen. You know, more data, more data vendors, more delivery methods, more choice, more complications, more costs, more service problems. And yeah, just the sheer growth in data and the different sources that it comes from.
And, I think it ties in quite well with this, you know, data lakes and your CTRM. What's the solution to that? Trying to get that fragmentation to come together in one place, with people who know what they're doing and how to do it, and streamlining it that way - that's something I've actually learned during the process of doing this panel, right, talking to these guys. So, it's something I'm looking at for Bonroy right now, in fact.
Ben Hillary Next question. We'll go to Ryan and Paul with this one. What is the difference between unstructured, semi-structured, and structured data?
Ryan Rogers So, structured data is the one we're all familiar with. All of the pricing, volume, and transactional data and contracts, where the sources are well known, the data is fresh, it's time-stamped, it's verified. The trade controls person is monitoring that data daily.
The semi-structured is probably the next most useful bucket of data that I've seen in my clients. So, this is time series data that has some structure but is not necessarily time-stamped, cleaned, verified. Things like meter data, SCADA data, PI data. Things that are necessary and useful for all the ancillary operations but difficult to get into a highly structured format. Or, the data source is enormous, so it's time-consuming and complicated to integrate into your highly structured data warehouse. So, that's probably the next most useful category.
Unstructured data is things like text, images, video, things that would benefit from natural language processing (NLP), machine learning (ML), especially in some of these emerging markets like we've been talking about.
Paul Kaisharis I mean, Ryan hit on it but maybe a little bit... at a not-too-technical, a little lower level. I mean, with the structured data, it's traditional relational database data. You
know, stuff that's stored in tables. You know, trading data, market data, curves, trade valuations. You know, that's considered the structured data, with rows and columns of information.
On the semi-structured side, to elaborate on that one a little bit. It's, kind of, JSON and XML type data, where you get a little bit more descriptive information about, what is that data about? For example, for Molecule on the semi-structured side - obviously Molecule has a lot of structured data with what I just mentioned, in terms of trades, market data, et cetera.
But, on semi-structured, we provide Value at Risk (VaR) calculations through a JSON structure, which is, again, a more complex structure that you can get more descriptive, more information on. Also, a lot of modern ETRM systems, of course like Molecule, have APIs that return data in that JSON structure. So, what do you do with semi-structured data like that?
And then, Ryan mentioned the unstructured, which is the documents, video, and images, and all that. So, that's how I would... I mean that's a typical categorization of those data structures.
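To make the structured versus semi-structured distinction more concrete, here is a minimal sketch that flattens a nested JSON payload - loosely modeled on the kind of VaR result Paul describes, though the shape shown here is hypothetical and not Molecule's actual API response - into the rows-and-columns form that structured, relational storage expects.

```python
import json
import pandas as pd

raw = json.loads("""
{
  "as_of": "2023-09-21",
  "portfolios": [
    {"name": "power-east", "var_95": 1250000.0,
     "components": [{"commodity": "power", "var_95": 900000.0},
                    {"commodity": "gas",   "var_95": 350000.0}]},
    {"name": "carbon",     "var_95": 400000.0,
     "components": [{"commodity": "euas",  "var_95": 400000.0}]}
  ]
}
""")

# json_normalize turns the nested, semi-structured payload into a flat table
flat = pd.json_normalize(
    raw["portfolios"],
    record_path="components",
    meta=["name", "var_95"],
    meta_prefix="portfolio_",
)
flat["as_of"] = raw["as_of"]
print(flat)
```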
Ben Hillary Following on from that, what are your approaches and your best practices
in managing unstructured, semi-structured, and structured data? And, Paul, if you want to continue.
Paul Kaisharis Sure, yeah. So, I would always say start with security first. You've always got to consider security - the security of the data. Always consider the principle of least privilege, which basically means only provide access to what a person or entity needs to do their job. So, start with security first.
You know, Alex mentioned all these different sources and different locations of the data. Bring all the data together into a centrally managed, you know, storage location. Bring that information together. And, you know, I mentioned technology maturity. I mean, take advantage of relatively low-cost cloud storage infrastructure.
There are always ways to store and bring this data together. Bringing that data together and taking advantage of that, you know, lower-cost cloud storage is, I think, something to consider there.
Ben Hillary Tim, your thoughts on that.
Tim Kramer Those guys pretty much hit all the points. I got nothing of substance to add, but thank you.
Ben Hillary All good. Ryan, do you have anything to add, in terms of how the best practices of managing those data types?
Ryan Rogers I think Paul hit the big ones.
The only thing I would add is, maybe, governance. You know, data governance, policies, ownership. You know, policies on quality checks, data lifecycle management, just kind of ownership, health and hygiene, when it gets purged, where it came from, documentation. But, Paul had the big ones - security and access.
Ben Hillary Excellent. Okay, then. On this subject, let's move on to our second poll. But, before we do that, I want to remind the audience that we will be taking your questions towards the end of this discussion, so do keep on putting them into the Q+A and upvoting any within there that are of interest.
So, moving on to our next poll. Okay, I'm launching the next poll. So, everyone should see that now. Are you currently using or considering implementing a data lake? Single choice. Yes, I'm using one; I'm considering implementing one; No, I've got no plans to; or not applicable.
So, audience, keep on throwing your answers in there, and I'll close the poll in five seconds. Five, four, three, two, one. Ending the poll. Sharing results.
So, I'm not sure if that's as expected or a surprise for the panel. I guess, it's kind of what I would have expected.
Ryan Rogers I'm actually surprised 24% are actively using. That's a little higher than I expected.
Ben Hillary Good to see. Good audience. Okay, excellent.
So, next question - data interoperability. It's a term that is talked about a lot in the context of a data management strategy. But, what does that mean, and how does a data lake enable better data interoperability?
Paul, if we could go to you with that one, please.
Paul Kaisharis Sure, probably best to first start with, kind of, the definition of data interoperability, and I just happen to have the definition handy, so I'll just read it.
So, really data interoperability - the pure definition of it is - the ability of systems and services that create, exchange, and consume data to have clear shared expectations for the contents, context, and meaning of that data.
So, it's for shared systems to create some type of meaning, you know, of that information. They come from different sources, with different types of data attributes, but how do you provide a common meaning about that? So, interoperability provides, like I said, meaning and context to the data. It allows disparate data to be organized and cataloged. And, having an implementation around data interoperability also relies on good metadata management.
And, to the second part of the question - how does a data lake support that? It really is about, kind of, that metadata management. The cataloging of the data, really automating that, and providing that type of meaning.
I mean, we'll talk more about this later. But, in terms of what these kinds of technologies allow - bringing all these different structures of data together - how do you provide that meaning to it? It's really that metadata layer, which provides information about the data, that you can use to catalog, describe, and organize that information. That's really, you know, the idea behind data interoperability and what data lakes do to support that.
Ryan Rogers Right, I'll pick up on the second half of that question - how data lakes enable this. And, I think there are four, kind of, key elements through which a data lake enables it.
That's flexibility. So, being able to pull in disparate data sources without predefining the schemas and the speed you're able to do that. Scalability. So, a lot of these new cloud services do allow much cheaper and larger storage. And then, making the disparate data sources centrally located, so there's one place to go. And then, making that central location queryable.
So, some of these advanced tools that are emerging are enabling users to query some of that structured, semi-structured, and unstructured data together.
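As an illustration of the "centrally located and queryable" point, here is a minimal sketch of one SQL engine reading a structured Parquet table alongside a semi-structured JSON file in place. The file names, columns, and values are hypothetical, and DuckDB is used only as one example of the newer query-in-place tools the panel alludes to.

```python
import json
import duckdb
import pandas as pd

# Create tiny stand-in files for the two shapes of data (hypothetical content)
pd.DataFrame({
    "trade_id": ["T-1001", "T-1002"],
    "commodity": ["power", "gas"],
    "quantity": [50.0, 10000.0],
}).to_parquet("trades.parquet")                         # structured

with open("marks.json", "w") as f:
    json.dump([{"commodity": "power", "price": 62.5},
               {"commodity": "gas", "price": 2.8}], f)  # semi-structured

# One engine, one query, across both formats - no upfront warehouse schema
result = duckdb.sql("""
    SELECT t.trade_id, t.commodity, t.quantity, m.price AS settle_price
    FROM read_parquet('trades.parquet') AS t
    JOIN read_json_auto('marks.json')   AS m USING (commodity)
""").df()
print(result)
```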
Ben Hillary Thank you. Next question, I'd like to take this one to Tim. What role does the ETRM play in the data management strategy? What are the benefits of using a data lake in partnership with your ETRM? Because you've got a real-life experience here.
Tim Kramer Sure, so some background information to, kind of, give you the context for the answer here. And, that is electricity is the most consumed commodity in the U.S. on a retail notional basis. But, it wasn't in any index; it wasn't in any ETF, any mutual fund, nothing.
So, what we did is, we created the first ever Carbon Neutral Electricity Index, and then we partnered with ICE, Intercontinental Exchange. We published the index in January, and then in mid-May, we launched an ETF on the New York Stock Exchange. The ticker is AMPD. So, given that, we had to develop the index from scratch because nothing existed. So, what we did is we actually used the risk system for this.
And, we were kind of adamant that we wanted to develop the product inside of the risk system. And then, we wanted to use that to basically run and manage the product. And, we also wanted to use it to market the product, so that when people had any questions, they could say, "Where'd you get that?" Here it is, right? So, when it came time for the development of this, we used the risk system because we had to figure out what the optimal setup for the index was. You're trying to figure out what the best risk-adjusted returns are, so you're looking at all sorts of variables, like roll windows, future tenor, future selection, weights, collateral, et cetera.
And so, you have all those different pieces, and that's a heck of a lot to try to digest and figure out what the right thing is. And, so, I mean if you do that in Excel or some other database, you're gonna mash F9 and get a white screen for about five minutes. And, maybe you get an answer, and maybe you don't.
But, if you do that in a risk system, which allows you to customize those inputs, you get an answer right away. And, it just makes the optimization a lot easier. So, the benefits of that are on the development part of this. It just saves you a lot of time. It's more reliable, and it just looks more professional when you present that to people. So, when we walked up to ICE to partner with them, and we showed them the risk system and how we developed inside the risk system there, they were just, you know, "Okay, this is great. Let's go."
In terms of the ongoing management, then, since the product is up and running, you know, the obvious things are performance and P&L, but then you get a lot of questions about, you know, risk metrics, and then you get questions about PCA - principal component analysis, or portfolio component analysis. So, okay, what percentage of returns came from this kind of electricity futures? What percentage came from, you know, carbon allowances?
And so, you want to be able to tease that out and have that in the risk system, and not have to keep going out, downloading data, beating up Excel, and trying to create all these bespoke reports that somebody may or may not actually pay attention to.
And then, when it comes time for the marketing, it doesn't matter who you walk in and talk to when you market it. Whatever you have, they want to see something different. If you say, "Look, you know, here's the Sharpe ratio." They go, "Oh well, you know, we use information ratio here." "Oh, you know, we use Sortino ratio here." So, it helps if you have a risk system that has those things in it, so you can give them those numbers real-time and live - and they also vary the time frame they want to see.
So, a risk system that can give you all those portfolio metrics basically in real-time as you're walking in and talking to a prospective client, that helps. And then, you're always going to get these bespoke requests. Like, they want to see, you know, what's the correlation; or this product that product; what's the different time frame for the correlations?
And then, they sometimes want to have the data exported, so they can run it in their own systems. And so, having that function inside of a centralized risk system is invaluable, so that's kind of how we tackled this. And, that's why having everything from soup to nuts on a risk system was valuable to us because there's no handoff problems. There's no data leakage, and you can respond in real-time. And, it just makes everything look, you know, more credible and more professional.
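For readers unfamiliar with the ratio zoo Tim mentions, here is a minimal sketch of serving "whichever metric they ask for, over whichever window" - in this case an annualized Sortino ratio and an information ratio. The return series, benchmark, and conventions (annualization factor, minimum acceptable return) are hypothetical and vary from shop to shop.

```python
import numpy as np
import pandas as pd

def sortino(returns: pd.Series, mar: float = 0.0, periods: int = 252) -> float:
    """Annualized Sortino ratio against a minimum acceptable return (MAR)."""
    excess = returns - mar / periods
    downside = excess[excess < 0]
    return np.sqrt(periods) * excess.mean() / downside.std()

def information_ratio(returns: pd.Series, benchmark: pd.Series, periods: int = 252) -> float:
    """Annualized information ratio versus a benchmark return series."""
    active = returns - benchmark
    return np.sqrt(periods) * active.mean() / active.std()

# Hypothetical daily returns for a fund and a benchmark
rng = np.random.default_rng(7)
idx = pd.bdate_range("2022-09-21", periods=500)
fund = pd.Series(rng.normal(0.0005, 0.010, len(idx)), index=idx)
bench = pd.Series(rng.normal(0.0003, 0.009, len(idx)), index=idx)

# Answer for whatever lookback window the prospect asks about
cutoff = idx[-1] - pd.Timedelta(days=180)
print("Sortino (180d):", round(sortino(fund.loc[cutoff:]), 2))
print("Information ratio (180d):", round(information_ratio(fund.loc[cutoff:], bench.loc[cutoff:]), 2))
```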
Ben Hillary That's excellent.
Well, next, let's go back to Ryan and Paul. What are the benefits of a data lake, and how do you prevent it from becoming - a term I love - a data swamp?
Ryan Rogers Well, the benefits are some of those we've mentioned already. The scalability, being able to incorporate enormous amounts of disparate data. Flexibility, dealing with queries of structured, semi-structured, unstructured data.
Lowered costs that a lot of these new cloud tools offer. Speed - speed in incorporating new unstructured data sets or semi-structured data sets, rather than going through the project lifecycle of, you know, ETL - extraction, transformation, and loading.
And then, the advanced analytics capabilities, and this is all, you know, fairly new to the ETRM/CTRM market. But, all the machine learning (ML) and some of the older tools, like natural language processing (NLP), enable more advanced analytics.
And, on preventing a data swamp - some of the practices we've talked about: governance policies around data cataloging, metadata management, access, lifecycle management - not just for the retention, but when it'll be deleted, lowering that attack surface, purging all data when it's not needed anymore.
Monitoring it so somebody still needs to own the quality of the data, even though it is not going through that ETL process. And then, documentation - what it is, where it came from, how it should be used.
Paul Kaisharis Yeah, what I'd add on the benefits side, one of the key benefits of some of the more modern data lake technologies is that you can just bring the raw data in.
Traditionally, historically, there's been a lot of complication, cost, and delay in having to do complex extract, transform, and load (ETL) operations in batch mode. So, you know, being able to bring that data in its raw format is one benefit, and Ryan touched on all the other ones.
I did want to add one thing about, really, kind of, the ETRM responsibility. I think, related to data lakes and providing the necessary data, one of the responsibilities of an ETRM system, one like Molecule, is to be able to provide that data in real-time, not through a delayed, slow batch mechanism.
I don't think an ETRM system necessarily needs to be the one to provide the data lake technology because there are companies out there that do that better than I think ETRM companies do. But, it is a responsibility to make sure they get that data out fast, to be able to feed into the data lake and into the analytics that we talked about.
On how to prevent the swamp - which is, you know, another definition: an unorganized pool of data that's difficult to use - it's really directly related to the data interoperability that we talked about earlier.
I mean, the metadata layer - being able to effectively use that metadata layer and provide that meaning to the data. Being able to, you know, properly catalog and keep track of the data and who has access to it.
A lot of these newer technologies provide auto-detection of schemas - a schema being descriptive information about that data. So, you know, using that information and auto-detecting the schema of that data provides that cataloging of the information.
And, data quality - we mentioned this already - is an important aspect of this because even though the machines and technology can do these things, there still needs to be some level of a business user or subject matter expert involved in the workflow to make sure that the data that's going in is good.
You know, having bad data - that feeds the swamp. But, you've got to have some kind of workflow governance to make sure subject matter experts can filter that data a bit.
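The two ideas Paul pairs here - let the tooling auto-detect a schema from raw input, but keep a human-owned quality gate in front of the lake - might look something like the following minimal sketch. The file names, column names, and thresholds are hypothetical, and pyarrow is just one example of a library that infers schemas from semi-structured files.

```python
import json
import pyarrow as pa
import pyarrow.json as paj
import pyarrow.parquet as pq

# Stand-in for a raw, semi-structured drop from a vendor (hypothetical content)
with open("incoming_marks.jsonl", "w") as f:
    for rec in [{"commodity": "power", "delivery_date": "2023-10-01", "price": 62.5},
                {"commodity": "gas",   "delivery_date": "2023-10-01", "price": 2.8}]:
        f.write(json.dumps(rec) + "\n")

# Schema is auto-detected from the raw input - no upfront table design needed
table = paj.read_json("incoming_marks.jsonl")
print(table.schema)

# A lightweight quality gate a subject-matter expert could own
df = table.to_pandas()
problems = []
if df["price"].isna().mean() > 0.01:
    problems.append("more than 1% of prices missing")
if (df["price"] <= 0).any():
    problems.append("non-positive prices present")
if df.duplicated(subset=["commodity", "delivery_date"]).any():
    problems.append("duplicate commodity/delivery_date rows")

if problems:
    # Route to a review queue instead of silently landing bad data in the lake
    raise ValueError("quality gate failed: " + "; ".join(problems))

pq.write_table(pa.Table.from_pandas(df), "lake_marks_2023-10-01.parquet")
```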
Ben Hillary Thank you.
Next question I'm sure is one which many listeners on the webinar can sympathize with and probably many of our panelists, also. What are best practices for actually ensuring the right data is accessible, in a secure manner, to the right roles within the trading organization?
So, Ryan, if we could start with you on that, and then we'll go to Alex, Paul, and Tim.
Ryan Rogers Certainly. So, on the security side, encryption, obviously. But then, also multi-factor authentication for the lake.
I think beyond that, it's controlling the access. Role-based access, just like people are used to with their ETRM or CTRM systems. But, role-based access for the data lake.
And then, audits. So, having audit trails of who accessed what data and when, and then regular audits of that access.
Paul Kaisharis And then - oh, sorry! I wasn't next. I think it's over to...
Ben Hillary Alex, over to you.
Alex Whittaker Yeah, I mean - I think for me - it's about communication, about actually identifying what is the right data. How is it used? How important is it?
I think one of the things I've learned at a young company, or small companies, is that it's probably best practice to actually have some sort of data specialist, or a specific role for a data expert, quite early on. That's definitely something I'm considering at the moment.
Who would then really - again, you're sort of looking to this person to centralize all of that data in one place. And, with that information and communication coming into one place, one person has the full picture and can understand those fiddly issues, like consistency and things like that. And, you know, where you're having to make sure prices are all done at half seven, or if there's a dog leg between half four and half seven. Things like that.
So, I think having a specific person in charge of that, specific data expert I think would be best practice. But, the usual things with technology: communication, taking your time to get the details lined up, and actually understanding what you're doing, why, and how because there's
a huge return on investment in the time you spend doing that early on.
Paul Kaisharis Alright, we've hit on a lot of these points, but I'm gonna hone in specifically around security. You know, I mentioned least privilege. You know, start with, don't give access out that you don't need to.
Encryption is incredibly important. You know, also encryption at rest. You know, what's on storage. Make sure that whatever's on storage is encrypted. And, in motion, over the wire - anything that travels across the network. You need to make sure that that data is encrypted.
A lot of cloud providers talk about shared responsibility models, which basically means they do their part, but you also need to do your part. So, even though the cloud providers have a tremendous level of security they've put in place - and the SOC compliance, et cetera - as a company, as a product company, whatever you are, you need to make sure you do your part on that side of it.
And then, we talked about strong authentication and authorization. You know, authentication with two-factor authentication. You need to make sure that you're authenticating the people, so you know who they are. Authorization - Ryan mentioned role-based access control. Controlling what they can see based on roles.
And, one we haven't really mentioned is geofencing, where you can actually control who has access to that data based on location. So, if you're vacationing in Costa Rica, maybe you shouldn't get access to the data. There are things like that to consider.
And then, you make sure you've got the security controls to cover everything: structured, unstructured, and semi-structured. And, we talked about the metadata, you know. You can use the metadata in these systems to help define, to implement those security controls. So, we've talked a lot about the tools and technologies out there. You still have to be very smart about how you use them.
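As a purely conceptual sketch of a few of the controls discussed above - least-privilege, role-based access and an audit trail on every read - the following fragment shows the shape of the idea. The roles, dataset grants, and logging destination are hypothetical; in practice these controls would typically live in the lake platform, the cloud provider's IAM, or the identity provider rather than in application code.

```python
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("data_access_audit")

# Grant each role only the datasets it needs (least privilege)
ROLE_GRANTS = {
    "risk_analyst": {"trades", "curves", "var_results"},
    "scheduler":    {"nominations", "meter_data"},
    "auditor":      {"trades", "var_results", "access_logs"},
}

def read_dataset(user: str, role: str, dataset: str) -> str:
    """Check the role's grant, log the attempt, and only then return data."""
    allowed = dataset in ROLE_GRANTS.get(role, set())
    audit_log.info(
        "user=%s role=%s dataset=%s allowed=%s at=%s",
        user, role, dataset, allowed, datetime.now(timezone.utc).isoformat(),
    )
    if not allowed:
        raise PermissionError(f"role '{role}' has no grant on '{dataset}'")
    # ... fetch from encrypted storage here ...
    return f"(contents of {dataset})"

print(read_dataset("alex", "risk_analyst", "var_results"))
```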
Ben Hillary Tim, your thoughts.
Tim Kramer Not so much anything I can add around like the data access part, but it's more like what people do with that access. And so, we would have issues with, you know, people would still tend to want to run their own stuff inside of Excel.
And then, you'll have a scenario where you walk in in the morning and someone says, "Oh no, we have a compliance issue. You know, last night we had a VaR or a position blow out, and we've got to report this to the CFTC in September of '25." And you're like, "Okay." So, you spend two hours tracking it down, and it was part of a spread trade. And, there were no issues. And so, there's two hours of my life I'm not getting back, and there really isn't a problem.
So, what we try to do is, you know, we don't discourage people from trying to do their own work and figure things out. But, we want people to make sure they use the risk system and use it the right way.
And, if it's lacking something or there's an improvement, then make sure you get that into the risk system, rather than having people just take the data - half-baked data - and try to run their own reports. And then, you know, things that may or may not be useful come out of that.
Ben Hillary Thank you. Next question.
There was no way we were going to be able to get through a webinar without having a question on this subject. So, the question to everyone is, do you see AI or machine learning (ML) playing a greater role in data management within energy and commodity trading?
Tim, if we could go back to you on that question, please.
Tim Kramer Yeah, sure. So, we talked about how we developed our index and then our ETF, AMPD, and I think in the past - I may get this wrong - but in the past, I believe three months, you've seen 27 different new funds come out that are using AI.
And so, in the commodity space, what they're doing right now is they're saying, "Okay, I'm going to look at all the different commodities. And, when I need to roll the commodities, I'm going to use AI to select which ones are the best." And so, in order to implement that and get that into your risk system, that's an entirely new thing that people are trying to get up to speed on.
And, it's one of those things where, when you go to backtest, it looks great. But, you know, how does that work going forward? The fact that you're actually, you know, interacting with the marketplace. Did you change what would have happened in the past? So, that's kind of the things people are looking at.
It's not just the AI and then the optimization of something in a new fund, but can you go to the next level of that and say, "Okay, the AI did or did not change what the backtest looks like."
Ben Hillary Excellent. Alex, let's hear your thoughts.
Alex Whittaker Ah, yes. In terms of data management and energy trading, I think there'll definitely be a role for AI and machine learning (ML). I think also people need to be very careful about focusing on the problem they're actually trying to solve.
I think, often with technology, people can get carried away with solution-based design. They forget why they're doing something. So, just focusing on that balance between problem-based design and solution-based design, and just having a clear goal in mind and sticking to it, I think, is important.
Because, ultimately, you want to make sure that you are helping yourself. Getting the productivity gains that you should be getting from something like AI, rather than, I don't know, getting caught up in a sale and just sort of getting a bit carried away.
I think, over maybe the last 10 years or so, a lot of people have used technology and not gotten the benefits from it that they should have. So, in this next wave of AI and machine learning (ML), if people focus on delivery and results, then, perhaps, they might learn from some of the mistakes they made in the, sort of, first wave of technology coming through energy trading.
Ben Hillary And, Paul, your thoughts.
Paul Kaisharis Yeah. I would say, for sure - though exactly how that's all going to play out and be done... I think there's still work to be done to figure all that out.
But, I think, in particular, around the private use of this technology. I don't think any company wants to put their data out there and then have OpenAI and ChatGPT learn from their data. But, what you're seeing now - OpenAI came out with an announcement just a few days ago about ChatGPT for business.
So, now companies can create these large language models on their private data. McKinsey has a technology that they've been promoting internally to help their consultants. So, you already have these big enterprises that are starting to use it - use this type of machine learning (ML), AI, large language models - on their private data. So, I think that, to me... yes, I see that, particularly around private data.
And then, of course, all the things we're talking about with commodity trading systems. They need to be in the play and feed all this to be able to help businesses that play in this space. I think that's a key role of systems like this.
Ryan Rogers So, part of the reason I was surprised by the poll - that almost a quarter of people are using data lakes - is I have seen them used, but mostly at the very largest of my clients, the vertically integrated ones. And, it hasn't really trickled down into the medium and smaller shops yet.
But, those companies that are - where I've interacted with them, they are hiring incredibly bright data scientists. They're starting to hire up data engineers. They have, you know, an army of very smart people focused on this.
And then, a little bit... it suffers from the problem Alex mentioned of, you know, solution-based design rather than problem-based. But, I think there is enormous potential, especially - I mostly work in the financial side, too - but in physical commodity trading. There are enormous amounts of semi-structured data that are critical to the operation. And, these aren't emerging needs. Think of schedulers, for example.
So, if you have crude product schedulers or NGL schedulers in your shop, every single one of them has their own unique tool - usually Excel - where they're doing all their supply-demand balancing. And, it's impossible to standardize all of that into one enterprise supply-demand forecast.
And so, all of that incredibly valuable data that your schedulers are accurately managing is not in an enterprise system. And, that's something that machine learning (ML) or data lakes could begin tackling intelligently - data that is hard to design a structured system, a structured scheduling tool with ETRM capability, for, right?
Don't get me wrong, all the ETRM systems will capture that after the fact, like an accounting approach. You know, what did you nominate? Then, you go put it in the system. What did you move? Then, you could put it in the system. What was actualized? Then, you go put it in the system.
But, on their spreadsheets, they have the day-to-day forecasts of their supply-demand balance. And, that is something that would be very valuable to tackle, and machine learning's possibly a candidate for that.
Ben Hillary Thank you.
Well, there's one more question from me, and then we will move into questions from the audience. I see we've got a few questions from the audience already. So, audience, do have a look into the Q+A box. Upvote any which are of interest and add your own.
So, final question for me. I'll address this to everyone. We've seen an evolution over the years from data warehouses to data lakes. What's the next step in this evolution? This is the crystal ball question.
Paul, if we start with you, please.
Paul Kaisharis Sure. I mean, kind of, talking about the evolution real quick but not too long. Basically, we started from departmental use cases of data warehouses. So, it's structured data, and, you know, good use for the finance department, risk department, et cetera.
And then, we saw the evolution to big data. But, the tooling was very complex and expensive to run. And then, you know, cloud providers came up, but the tooling was still very expensive. But now, we're seeing lower costs in cloud providers, less complex tooling to provide access through common things that people use, like Python or SQL or Excel.
So, with that evolution - I mean, we've kind of hit this already around the tools that enable it - the evolution is really around what's going to enable modern data science and the development of private large language models and data analytics. I mean, that's, to me, what's really the driver that's next.
It's: okay, the tools and technologies are there, the data is there. Put the right governance in place. Have good data quality. But now, you need to be able to, you know, use that information to do the data analytics that you really need on a large data set, versus very small, departmental-level data.
Ben Hillary Ryan, your thoughts.
Ryan Rogers Yeah, I think there's probably some big ones and then smaller ones.
I think the big ones are maybe incorporating the data warehouses into the data lake - combining them. You're always going to have a need for highly structured data for risk management and compliance. But, placing that within the data lake means the same users can also access SCADA data, PI data, all of the other useful, less structured or scrutinized data.
And then, I think some of the opportunities with machine learning (ML) and AI, you know, beyond things like pulling in physical scheduling data and making use of previously inaccessible data, are things like - machine learning (ML) can probably start tagging the data, inspecting the quality of the data. You can probably automate some of that work that was very laborious for humans.
Ben Hillary Excellent. Tim?
Tim Kramer Yeah, I think Ryan nailed the two things.
The first would be - I'll just call it one-stop shopping. And so, people right now want instantaneous answers. So, they want, you know, the data warehouse, the data lake, the risk system. They want all that stuff. Instant access. And, they don't really care where it comes from as long as you're able to document it. So, as long as you can grant that instant access, that's what they want.
And then, the second thing is the integration of AI. With that, that's moving so fast, and there's so many demands right now for, "Okay I see the vanilla product. I want an AI product on top of that. What's that look like?" And, they want that now.
Ben Hillary Excellent, Alex.
Alex Whittaker Yeah, I think the next step in this evolution ought to be a focus on the essentials and on getting the current technology working.
I mean, we've gotten to data lakes. Let's start focusing on delivering results, getting the most out of what we have at the moment, learning that in detail, and then you'll be in a position to start adding to that when the next technological advancements come along.
But, if you don't stop and actually get what's here now, working for you right now today, then you're just going to end up in a mess in a sort of never-ending hamster wheel, basically.
Ben Hillary Well, we've now got about 10 minutes to take some questions from the audience, so I see four rather interesting ones already.
I'll address these to the panel, and please, panel, do jump in with your thoughts. Firstly, from Ali Saliq. Hi, Ali. Hope you're doing well.
Paul mentioned consolidating your data in one place, but what is your experience with data lakes versus regulatory requirements on data? And, where is it stored? How do you manage the data in, let's say, three regions - U.S., UK, EU - versus the regulatory requirements in these regions?
Great question. Who wants to take a stab at that?
Ryan Rogers Without going into too much detail, I do know some of the data lake providers and cloud providers do offer geo-partitioning. I haven't practiced that myself, but I know that you can geo-partition portions of the data.
Paul Kaisharis Yeah, I'll add to that. I mean, I know we actually had to deal with this.
Well, first, to answer the question. Yeah, you can't ignore those regulatory requirements, right? They're there, and you have to account for them. Molecule, for example, we're running in the U.S. and Europe now.
So, we actually had to set up instances of our system in the U.S. and Europe, and data can't be shared between those because of those regulatory requirements. I mean some of the things that are available for the data lake to bring the data together, there's still multi-tenant, kind of, capabilities you can lay on top of it.
But certainly, most likely because of the regulatory requirements, you can't store data from Europe in a U.S. data center. You're not going to be able to do that, so you're going to have to make sure you separate that data, where, you know, part of it's running in the U.S. and part of it's running in Europe because of those regulatory requirements.
Now, I guess maybe the question is, you're losing some of that insight of that consolidated data. But, I don't think you can ignore those regulatory requirements though.
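One simple pattern for the separation Paul describes - keeping EU-resident data out of U.S. storage while still treating both regions as part of one logical lake - is to tag every record with its residency and route writes accordingly. The sketch below is hypothetical (bucket names, tag field, records); real deployments would also lean on the cloud provider's region and policy controls rather than application code alone.

```python
# Hypothetical region-specific storage targets
REGION_BUCKETS = {
    "us": "s3://ctrm-lake-us-east-1",
    "eu": "s3://ctrm-lake-eu-central-1",
}

def destination_for(record: dict) -> str:
    """Route each record to the store for its declared data-residency region."""
    region = record.get("data_residency")
    if region not in REGION_BUCKETS:
        raise ValueError(f"record {record.get('id')} has no valid residency tag")
    return REGION_BUCKETS[region]

trades = [
    {"id": "T-1001", "data_residency": "eu", "commodity": "ttf_gas"},
    {"id": "T-1002", "data_residency": "us", "commodity": "pjm_power"},
]
for trade in trades:
    print(trade["id"], "->", destination_for(trade))
```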
Ben Hillary Thank you for that.
Next question, Jo Hollington. Good to see you, as well, Jo. How easy is SSO to embed into the data cells (i.e. to ensure only licensed users get the data)?
Paul Kaisharis Yeah, it's kind of a technology question, so I can take that.
I mean, SSO - I assume, Jo, you're talking about single sign-on type technologies, and you reference, like, data calls. Most modern systems provide single sign-on capabilities, where you have an identity provider, like Okta or even AWS, that can authenticate you - establish who you are. And then, the systems themselves provide the authorization layer - what can you actually do with that information?
So, most of these systems we're talking about - Molecule, for one - can also support SSO-type functionality. And then, the underlying technologies implement the security controls, on top of that, to control access to the data. I think that's what you're referring to, but let me know if that wasn't correct.
Ben Hillary Then, next question from Tiffany Maine. How would you break down the data management vendor landscape? Are there good end-to-end solutions, or is it better to develop your own best-of-breed?
Paul Kaisharis I can... I'll start and put in my two cents.
So, you know, I don't think anyone's like won it yet. But, if you look at the major players that are sitting out there, we hear a lot about Snowflake, which has been around for a while, that provides a data lake and provides these things that we're talking about. I'm not promoting Snowflake or any of these technologies, but that's what's out there.
Databricks is another one that's getting a lot of attention out there, in terms of what that landscape is, as well. Obviously, you've got to consider the cloud providers, like AWS, that sort of provide these kind of solutions.
Personally, I don't think I would embark on building your own best-of-breed kind of thing. I think that's getting into some pretty complex territory. The costs and the resources required to do some of these things are really high - unless, yes, you're a big, large enterprise that feels like it has the resources to do it.
But, I would recommend more taking advantage of the tooling that's available out there. I'll give you an example of one piece of necessary tooling we use at Molecule - again, we're not a data lake provider, of course. But, we use data streaming technology - Apache Kafka - and we use that technology to stream data, in real time, to whatever destination it needs to go to, to feed these technologies and solutions we're talking about.
Again, I wouldn't embark on, kind of, building my own, but I would definitely look at what the landscape is out there. But, I don't think anybody's won yet.
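To give readers a feel for the streaming pattern Paul describes, here is a minimal sketch of publishing an ETRM event to a Kafka topic so downstream consumers - a data lake among them - can pick it up in near real time. The broker address, topic name, and payload are hypothetical and not Molecule's actual pipeline; the example uses the open-source confluent-kafka Python client.

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Hypothetical valuation event emitted by the ETRM
valuation_event = {
    "trade_id": "T-1001",
    "as_of": "2023-09-21",
    "mtm": 125000.0,
    "currency": "USD",
}

def on_delivery(err, msg):
    # Called once the broker confirms (or rejects) the message
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()} [partition {msg.partition()}]")

producer.produce(
    topic="etrm.valuations",
    key=valuation_event["trade_id"],
    value=json.dumps(valuation_event),
    on_delivery=on_delivery,
)
producer.flush()
```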
Alex Whittaker Oh, yeah. I think I agree with what Paul's saying. It's an interesting question this, as well, and you know, I've spent quite a lot of time looking at Bonroy's technology infrastructure - how we would increase that, invest in that as the company grows and things.
And yeah, I don't really know any data management vendors at all. Like, a CTRM would barely work without data and things. And yeah, you'd think actually looking at this, and these discussions that they ought to go hand-in-hand. And yet, they're very separate.
And, as Paul says, it's difficult to identify any data management vendor that is particularly strong as, you know, the one in ETRM/CTRM, energy and commodities. So, it's something I'm interested to find out more about and to actually talk to some of these vendors about. Because, again, I think it's a path that I'll be going down quite soon.
Ryan Rogers I would agree with Paul on not going down the best-of-breed path, just in general. But, I also would avoid the other extreme of the all-in-one vendors. Without naming names, you know, some of the legacy, 90s behemoths offer all-in-one solutions.
But, you know, it's hard to imagine you're hiring smart data scientists and smart data engineers, and then locking yourself into some antiquated infrastructure for all of these emerging capabilities. So, I'd probably split the baby, like Paul was mentioning. Go with the specialized vendors, things like Snowflake.
And then really, it starts with the people you hire. If you don't hire the right data scientists who hire the right data engineers and make the right decisions, it doesn't really matter who your vendors are.
Ben Hillary Excellent.
I've just had an interesting question come in from Stephen Nemo. Are the ETRM vendors going to begin embracing existing data schema standards, such as FpML or the ISDA CDM?
Anyone want to give that one a crack?
Paul Kaisharis I mean, I'm not as familiar with these particular standards.
But, I guess what I would say is, most of these are, kind of, integration-type schema standards. And, I would say, from an ETRM provider's perspective, to support these standards, we wouldn't initially change our core systems to do this - what we have in place is really extensions.
So, you know, I mentioned these data streaming technologies, where we can take any type of data in our ETRM system: market data, valuations data, whatever it is. And, we can transform that to any type of other format. And, we can take one data source, have multiple destinations, have multiple schemas we can support.
And, how that data is delivered - do ETRM systems themselves, you know, support those standards in the core? No. But - and I can only speak for Molecule, for us - what we have done is build the ability to support those standards, and multiple standards, without really a big lift on our side to get the data out of our system.
We just need to change the transformation piece at the end, you know, to support those schema configurations. But, again, I'm not as familiar with what those standards are though.
Ben Hillary Very good.
Well, we've got one additional question from Tiffany Maine. To an extent, I think it's been covered in Ali's question. But, perhaps, there's an additional angle on, sort of, data sovereignty, data ownership.
Have the panel faced challenges when it comes to multi-regional data interoperability? How about data sovereignty issues?
That one has been quite well covered in the previous question. Yeah, I think geo-partitioning probably solves that - yeah, agreed.
Okay, well that actually brings us very, very neatly to nearly the end of time. So, at this point, I would like to hand it back to you, Kari.
Kari Foster Great, thank you so much, Ben. Huge thanks to all of our panelists today and the
expertise and advice that you brought into this discussion. Very interesting discussion.
And, you may have seen as you were registering for this webinar, a mention of something called Bigbang. And, in fact, this is Molecule's forthcoming data-lake-as-a-service platform, which is an add-on and works in tandem with Molecule.
The launch of that product is imminent, which we're very excited about. And, we're actually planning a webinar for Bigbang in October. So, please be on the lookout for that from Molecule in the coming weeks.
So, I'll end my promotional bit there, but thank you all for coming. I really appreciate it, and I will pass things back over to Ben to close out the webinar.
Ben Hillary Lovely, thank you, Kari.
So, yeah. Just huge thanks to our panel for their insights today and to you, the audience, for joining. The webinar recording will be sent via email to you all in the next two days. If you found this of interest, do please share it with your colleagues and with your wider network.
If Kari, myself, or any of the panel can be of any assistance, drop us a line or connect via LinkedIn. So yeah, from my side again, many thanks. Audience, panel, you've been fantastic, and I wish you all an excellent day or evening ahead. Thank you.