Ben Miller, Recursion Pharmaceuticals | Splunk .conf 2017

>> Announcer: Live from Washington, DC, it’s theCube, covering .conf2017. Brought to you by Splunk.

>> Welcome back inside the Walter Washington Convention Center. We’re at .conf2017 in Washington, DC, the nation’s capital, and it is alive and well and thriving. A little warm out there, almost 90 degrees. But a hot topic inside here, Dave.

>> There’s a lot of heat in this city. (laughter)

>> A lot of hot air.

>> Yeah, absolutely.

>> We’ll just leave it at that. Politics aside, of course. Joining us is Ben Miller, who is Director of High Throughput Screening at Recursion Pharmaceuticals. Ben, thanks for being with us here on theCube. We appreciate the time. I have many questions, so first off, let’s talk about the company, what you do, and then what high throughput screening means, and how that operation comes into play when you have this great nexus of biology and engineering that you’ve brought together.
>> Recursion Pharmaceuticals is treating drug discovery as a facial recognition problem. We’re applying machine-learning concepts to biological images to help detect what types of drugs can rescue what types of diseases. We’re one of the few companies that is both generating and analyzing our own data. As the director of the high throughput screening group, what I do is generate images for our data science teams to analyze, and that means growing human cells up in massive quantities, perturbing them with different types of disease reagents that cause their morphology to change, and then photographing them in the presence of compounds and in the absence of compounds. So we can see which compounds cause these disease states to revert more to a normal state for the cell.

>> Okay, HTS then … Walk us through that if you would.

>> HTS is a general term that’s used in the pharmaceutical industry to denote an assay that is executed at very large scale and in parallel. We tend to work on the order of multiples of 384 experiments per plate. We’re looking at hundreds of thousands of images per plate, and we’re looking at hundreds of plates per week. So when we say high throughput, we mean 6-10 terabytes of data per day.
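To put that scale in rough perspective, here is a back-of-the-envelope sketch in Python. The 384-well plates and the “hundreds of plates per week” echo Ben’s description; the field counts, channel counts, and per-image file size are illustrative assumptions, not figures from the interview.

```python
# Back-of-the-envelope check on the scale described above. The 384-well
# format and "hundreds of plates per week" come from the interview; the
# fields, channels, and per-image size are illustrative assumptions.

PLATES_PER_WEEK = 500                # "hundreds of plates per week"
WELLS_PER_PLATE = 384                # 384-well microplates
FIELDS_PER_WELL = 4                  # assumed imaging sites per well
CHANNELS_PER_FIELD = 6               # assumed fluorescence channels
BYTES_PER_IMAGE = 2048 * 2048 * 2    # assumed 16-bit 2048x2048 TIFF (~8 MiB)

images_per_plate = WELLS_PER_PLATE * FIELDS_PER_WELL * CHANNELS_PER_FIELD
images_per_day = PLATES_PER_WEEK / 7 * images_per_plate
terabytes_per_day = images_per_day * BYTES_PER_IMAGE / 1e12

print(f"{images_per_plate:,} images per plate")
print(f"{images_per_day:,.0f} images per day")
print(f"~{terabytes_per_day:.1f} TB of raw image data per day")
```

With those assumed inputs the estimate lands in the same ballpark as the 6-10 terabytes per day Ben quotes.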
>> Just extraordinary amounts of data. And the mission, as we understand it: you’re looking at very rare genetic diseases, and your goal is to find cures for these over the next 15-20 years. Up to 100 of them, so that’s why you’re going through these multiple examinations of vast amounts of data. Human data.

>> Yeah, there’s been a trend in the pharmaceutical industry over recent years where the number of dollars spent per drug developed is increasing. It now takes over one billion dollars to bring a drug to market, and every year it costs more. We believe we can change that by operating at a massively parallel scale and also analyzing image data at a truly deep level, looking at thousands of different features per image instead of just a single feature in the image.

>> That business has this vicious cycle going on, and you guys are trying to break it.

>> Yes, exactly.

>> So what’s the state of facial recognition been? I’ve had mixed reviews about it. Because I rave about it, I go, “Oh my God, Facebook tagged me again, it must be really good.” And then others have told me, “Well, it’s not really as reliable as you might think.” What has your experience been?

>> The only experience I’ve had with facial recognition has been like yours, on Facebook and things like that. What we’re doing is looking more at cellular recognition, being able to see differences in these cellular morphologies. I think there are some unique challenges when you’re looking at images of thousands of cells, versus images of a single person’s face.
>> Okay, so you’ve taken that concept down to the cell level, and it’s highly accurate, presumably.

>> It’s highly reproducible is what I would say, yeah.

>> So it takes some work to be accurate, and once you get it there you can reproduce that, is that right? How does the sequence work?

>> Yes, so there are two sides to the coin. One is how consistently we can produce these images, and the other is how consistently those images represent the disease state. My focus is on making the images as consistent as they can be, while realizing that the disease states are all unique. So from our perspective, we’re looking at thousands of different features in each image, and figuring out how consistent those features are from image to image.
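One way to picture that consistency check is a per-feature coefficient of variation across replicate wells that should look identical. The sketch below uses a synthetic feature matrix; real features would come from the image analysis pipeline.

```python
import numpy as np

# Sketch of a consistency check: for replicate wells that should look the
# same, compute each feature's coefficient of variation (CV). The feature
# matrix here is synthetic; real values would come from image analysis.
rng = np.random.default_rng(2)
replicates = rng.normal(loc=5.0, scale=0.4, size=(24, 1_000))  # 24 replicate images x 1,000 features

cv = replicates.std(axis=0) / replicates.mean(axis=0)
print(f"median CV across features: {np.median(cv):.2%}")
print(f"features with CV > 15%:    {(cv > 0.15).sum()}")
```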
>> So paint a picture of your data stack, if you will. Infrastructure on up to the apps, and where Splunk fits in.

>> Sure. So I guess you could say that our data stack actually begins at hospitals around the world, where human cells are collected from various medical waste samples. We culture those up, perturb them with different reagents, add different potential drugs back to them, and then photograph them. So at the beginning of our stack we’ve got biological agents that are mixed together, and then photographs are generated. Those photographs are actually .tif files, and we have thousands and thousands of them. They’re all uploaded into Amazon Web Services, their S3 system. We spin up a near-infinite number of virtual computers to process all of that image data within a couple of hours, and then produce a result: this drug makes this disease model look more like a healthy cell and doesn’t have other side effects. We’re really reducing those thousands of dimensions in our image down to two: how much does it look like a healthy cell, and how much does it just look different than it should.
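Recursion’s actual analysis pipeline is proprietary, but the general idea of collapsing thousands of image features into two numbers can be sketched with synthetic data. Everything below, from the feature vectors to the scoring choices, is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 1_000  # stand-in for "thousands of features per image"

# Synthetic per-image feature vectors (cell size, texture, intensity, ...).
healthy = rng.normal(0.0, 1.0, size=(200, n_features))      # healthy controls
disease = healthy[:50] + rng.normal(0.8, 0.2, n_features)   # untreated disease model
treated = healthy[:50] + rng.normal(0.3, 0.2, n_features)   # disease model + candidate drug

# Z-score everything against the healthy-control distribution.
mu, sigma = healthy.mean(axis=0), healthy.std(axis=0)
axis = ((disease - mu) / sigma).mean(axis=0)
axis /= np.linalg.norm(axis)                                 # unit "disease direction"
disease_level = (((disease - mu) / sigma) @ axis).mean()

def two_scores(batch):
    """Collapse each feature vector to two numbers:
    rescue     - 1 means the image sits with healthy controls, 0 with disease
    off_target - how far it drifts in directions unrelated to the disease
    """
    z = (batch - mu) / sigma
    along = z @ axis
    rescue = 1.0 - along / disease_level
    off_target = np.linalg.norm(z - np.outer(along, axis), axis=1)
    return rescue, off_target

rescue, off_target = two_scores(treated)
print(f"mean rescue score:     {rescue.mean():.2f}")
print(f"mean off-target drift: {off_target.mean():.1f}")
```

A compound that pushes the rescue score toward 1 while keeping the off-target drift low is the kind of hit this sort of screen is looking for.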
>> And where does Splunk fit into that stack?

>> All of those instruments that are generating that data are equipped with Splunk forwarders. So Splunk is pulling all of our operational data from the laboratory together, and marrying it up with the image analysis that comes from our proprietary data analysis system. So by looking at the data that we’re generating, how many cells we’re counting, how bright the intensity of the image is, and comparing that back to which dispenser we used, how long the plates sat at room temperature, et cetera, we can figure out how to optimize our production process so that we get reliable data.
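A hedged sketch of what that kind of correlation might look like from Python using the Splunk SDK. The index, sourcetypes, and field names (plate_id, cell_count, dispenser_id, and so on) are hypothetical placeholders, not Recursion’s actual schema.

```python
import splunklib.client as client
import splunklib.results as results

# Connect to a Splunk instance; host and credentials are placeholders.
service = client.connect(
    host="splunk.example.internal", port=8089,
    username="admin", password="changeme")

# Hypothetical search: join per-plate image QC metrics from the analysis
# pipeline with operational events forwarded from the lab instruments,
# then compare average cell counts and intensity across dispensers.
query = """
search index=lab sourcetype=image_qc
| join plate_id [search index=lab sourcetype=instrument_ops]
| stats avg(cell_count) AS avg_cells avg(mean_intensity) AS avg_intensity
        by dispenser_id
| sort - avg_cells
""".strip()

for row in results.ResultsReader(service.jobs.oneshot(query)):
    if isinstance(row, dict):        # skip informational messages
        print(row)
```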
>> It’s essentially storing machine data in the Splunk data store. And then do you have an image database for …?

>> Yeah. And the image database is incredibly large. I wouldn’t even guess at the current size.

>> Dave: And what is it? Is it something on Amazon, an Amazon service?

>> Yeah. So right now all of our image data is stored on AWS.
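On the ingest side, here is a minimal boto3 sketch of what pushing one plate’s TIFFs into S3 might look like. The bucket name, key layout, and local directory structure are invented for illustration, not Recursion’s actual storage scheme.

```python
import pathlib
import boto3

s3 = boto3.client("s3")
BUCKET = "example-screening-images"           # hypothetical bucket name

# One plate's worth of raw microscope images on the acquisition machine.
plate_dir = pathlib.Path("/data/plates/PLATE-000123")

for tif in sorted(plate_dir.glob("*.tif")):
    # e.g. raw/PLATE-000123/A01_site1_ch2.tif
    key = f"raw/{plate_dir.name}/{tif.name}"
    s3.upload_file(str(tif), BUCKET, key)
    print("uploaded", key)
```

A predictable key prefix like this lets a fleet of downstream workers list one plate’s images and fan out over them in parallel.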
>> This is one of those interviews, Dave, where the subject matter kind of trumps the technology, because I want to know how it works. But you need the technology, obviously, to drive it. So I’m trying to figure out, “Alright, so you’re taking human cells, you’re taking snapshots in time, and then looking at how they react to certain perturbations.” But how does that picture of maybe one person’s cell reacting to a reagent compare to another person’s? How does your data analysis provide you with some insight, because Dave’s DNA is different from my DNA, different from everybody in this building, so ultimately how are you combing through all of that data to make sense of it?

>> That’s true. Everybody has a unique genetic fingerprint, but everybody is susceptible to the same sets of major diseases. By looking at these images, and really that’s the billion-dollar question, how representative are these individual cellular images of the general human population? And the effects that we see at a cellular level, will they translate into human populations? We’re very close to clinical trials on several compounds, but that’s when we will really find out how much proof there is in this concept.
>> Okay. You can’t really predict … Do you have a timeframe, or is it just sort of, “Keep going, keep getting funding until you reach the answer”? Is it like survive until you thrive?

>> I personally don’t maintain that kind of timeline. My role is within the laboratory, producing the data as quickly as we can. We do have a goal of treating 100 different diseases in the next 10 years. And it’s really early days; we’re about two and a half years into that goal. It seems like we’re on track, but there’s still a lot of work to be done between now and then.

>> So it’s all cloud, right? And then Splunk is throughout that stack, as we talked about. How do you envision, or do you envision, using it differently? Are you trying to get more out of the Splunk platform? What do you want to see from Splunk?

>> That’s a good question. I think right now we’re really using the rudimentary, basic features of Splunk. Their database-connect app and their Machine Learning Toolkit are both pretty foundational to the work that we do. But right now a lot of our data models are one-time use. We do a particular analysis to find the root cause of a particular problem, we learn that, and that’s the last time we use that model. Continuous implementation of data models is something that is high on my list to do, as well as just ingesting more and more data. We’re still fairly siloed. Our temperature and humidity data is separate from our machine data, and bringing that all into Splunk is on the list.
>> Why are your models disposable? It sounds like it’s not done on purpose; it’s more of some kind of infrastructure barrier?

>> We’re really at the cutting edge of technology right now, and we’re learning a lot of things that people haven’t learned, that in retrospect are obvious. To figure out the true cause of a particular situation, a data model or a machine-learning model is really valuable, but once you know that key salient fact, you don’t need to keep track of it over time. You don’t need to keep relearning that when your tire pressure is low, your car gets fewer miles to the gallon.

>> David: You have the answer.

>> Right. But there are a lot of problems like that in our field that have not been discovered yet.

>> I inferred from your answer that you do see the potential to have some kind of ongoing model evolution. For new use cases?

>> In the extreme situation, we have a set of hundreds of operational parameters that go into producing this image of cells, and then we have thousands of cellular features that are extracted from that image. There’s a machine-learning problem there: what are the optimal parameters to extract the optimal information? That whole process could be automated to the point where we’re using machine learning to optimize our assay. To me that’s the future of what we want to do.
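A rough sketch of that idea with scikit-learn: fit a model from operational parameters to an image-quality outcome and read off which parameters matter most. The parameter names, the synthetic data, and the quality metric are all invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 500

# Hypothetical operational parameters logged for each plate.
ops = pd.DataFrame({
    "room_temp_minutes": rng.uniform(0, 120, n),
    "cell_seed_density": rng.uniform(800, 1600, n),
    "reagent_lot":       rng.integers(0, 5, n),
    "dispenser_id":      rng.integers(0, 8, n),
})

# Pretend image quality mostly depends on time at room temperature and on
# how far seeding density strays from a sweet spot, plus noise.
quality = (1.0
           - 0.004 * ops["room_temp_minutes"]
           - 0.0002 * (ops["cell_seed_density"] - 1200).abs()
           + rng.normal(0, 0.05, n))

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(ops, quality)

# Which knobs are worth tuning first?
for name, importance in sorted(zip(ops.columns, model.feature_importances_),
                               key=lambda kv: -kv[1]):
    print(f"{name:20s} {importance:.2f}")
```

In a real loop the target would come from the image QC features themselves, and the suggested settings would feed back into the next round of plates.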
>> Were you with Recursion when they brought in Splunk?

>> Yeah.

>> You were. Did you look at alternatives? Did you look at maybe rolling your own with open source? Is that even feasible? Wonder if you could talk about that.

>> I had already been introduced to Splunk at my previous job, and at that previous company, before I heard of Splunk, I was starting to roll my own. I was writing a ton of Perl scripts and all of these regular expressions, and searching network drives to pull log files together. And I thought that maybe there would be a good business model behind that.

>> You were building Splunk. (laughter)

>> And then I found Splunk, and those guys were so far ahead of the things I was trying to do on my own in a lab. So for me it was a no-brainer. But our software engineering team is really dedicated to open source platforms whenever possible. They evaluated the ELK Stack, and some of us had used Sumo Logic and things like that. But for me, Splunk had the right license model and I could get off the ground really, really rapidly with it.

>> What about the license model was attractive to you?

>> Unlimited users, and only paying for the data that we ingest. The ability to democratize that data, so that everybody in the lab can go in and view it and I don’t have to worry about how many accounts I’m creating. That was really powerful.

>> Dave: So you like the pricing model.

>> Yeah.
>> Some users have chirped about the pricing; I saw some Wall Street concerns about the pricing. The guys that we’ve talked to on theCube today have said they like the pricing model, that there’s value there. And you’re sort of confirming that.

>> Ben: Yeah.

>> You’re not concerned about the exponential growth of your data causing your license fees to go through the roof?

>> In the laboratory, the image data that we’re generating is growing exponentially, but the operational parameter data is growing more linearly.

>> Dave: So it’s under control, basically.

>> Yeah, for our needs it is.

>> Dave: You’re not paying for the images, you’re paying for the metadata around that.

>> Yeah.
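A quick illustrative calculation of why that keeps the license manageable: the raw TIFFs stay in S3, and only comparatively small operational events are indexed. Every number below is an assumption for the sketch.

```python
# Illustrative only: why Splunk ingest stays small even as image volume explodes.
PLATES_PER_DAY = 70                          # assumed
IMAGE_BYTES_PER_PLATE = 9_216 * 8 * 2**20    # ~9,216 TIFFs x ~8 MiB each (assumed)
EVENT_BYTES_PER_PLATE = 5_000 * 500          # ~5,000 log events x ~500 bytes (assumed)

image_gb = PLATES_PER_DAY * IMAGE_BYTES_PER_PLATE / 1e9
event_gb = PLATES_PER_DAY * EVENT_BYTES_PER_PLATE / 1e9

print(f"raw images:         ~{image_gb:,.0f} GB/day (lives in S3, not indexed)")
print(f"operational events: ~{event_gb:.2f} GB/day (what the license meters)")
```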
>> Well, it’s a fascinating proposition, it really is. We’re very eager to keep up with this, keep track, and see the progress. Good luck with that, and we look forward to having you back on theCube to monitor that progress, alright, Ben?

>> Great. Very good, thank you so much.

>> Ben Miller joining us from Salt Lake City, good to have you here. Back with more on theCube in just a bit. You’re watching our live coverage of .conf2017. (upbeat innovative music)
